This document describes an HTML2Presentation tool that was designed to summarize HTML papers from the CHI 96 conference proceedings. It parses the HTML, divides it into sections and subsections, extracts text and images, and generates a LaTeX presentation PDF. The tool's parser grabs tags and text, classifies sections, and outputs a JSON array. The summarizer takes text and limits summaries to 4-5 sentences. A PDF generator takes the JSON and combines duplicate sections. A web interface is provided for easy use, and potential use cases include automatically generating paper presentations and summarizing blogs. Feedback on the tool is welcome.
2. Introduction
This tool was designed to summarize the HTML versions of the papers published in
the proceedings of CHI 96 - Conference on Human Factors in Computing
Systems, 1996: http://sigchi.org/chi96/proceedings/papers.htm
Because we wrote our own parser for the HTML source, it is not very generic
and may not work for inputs other than the papers in these proceedings.
Since this is just a proof-of-concept application, do not expect much error
handling. We do try to provide basic error messages when something
fails.
3. How does it work?
First, we parse the HTML of the paper, so as to distinguish between HTML
tags and the actual text.
Next, we divide the paper into sections and subsections based on the headings
in the paper. For instance, the text under the first <h1> becomes section 1, the
text under the first <h2> becomes section 1.1, and so on.
Now, we extract the actual text for each subsection and ignore any other tags
like <div>, <span>, etc.
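The parsing steps above can be sketched as follows. This is only a minimal illustration built on Python's standard `html.parser` module, not the tool's actual parser; the class name `SectionParser`, the function `parse_sections`, and the dictionary keys are hypothetical, and only <h1> and <h2> are handled.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Splits an HTML document into numbered sections using <h1>/<h2> headings."""

    def __init__(self):
        super().__init__()
        self.sections = []       # list of {"number", "title", "text"} dicts
        self._counters = [0, 0]  # current <h1> and <h2> counters
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            # A new top-level section resets the subsection counter.
            self._counters = [self._counters[0] + 1, 0]
            number = str(self._counters[0])
        elif tag == "h2":
            self._counters[1] += 1
            number = f"{self._counters[0]}.{self._counters[1]}"
        else:
            return  # <div>, <span>, etc. are ignored; only their text is kept
        self.sections.append({"number": number, "title": "", "text": ""})
        self._in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading is dropped in this sketch
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        key = "title" if self._in_heading else "text"
        current = self.sections[-1]
        current[key] = (current[key] + " " + text).strip()


def parse_sections(html):
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Calling `parse_sections(page_source)` returns one dictionary per heading, numbered "1", "1.1", "2", and so on, with the plain text that follows the heading.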
4. How does it work?
We pass the extracted plain text for each subsection to the summarizer so
that we get a brief summary of each subsection. The size of the summary
is limited to about 4-5 sentences.
Along with the text, we also extract relevant images and tables from the
paper and insert them into the presentation under relevant sections.
Once we have the heading for each section and the content under it from the
parser, we just need to pass the appropriate arguments to Latexslides
(a Python tool), which generates the presentation in LaTeX.
5. How does it work?
Finally, we obtain the presentation in .pdf format from the LaTeX source
using pdflatex.
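As a rough illustration of the last two steps, the sketch below writes plain Beamer source by hand and then calls pdflatex. It does not use the Latexslides API, so it is a stand-in for that step rather than a description of it; the helper names `escape_latex`, `build_beamer`, and `compile_pdf` are hypothetical, and `compile_pdf` assumes pdflatex and the Beamer class are installed.

```python
import subprocess
from pathlib import Path

_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape characters that have a special meaning in LaTeX."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def build_beamer(title, sections):
    """Return Beamer source with one frame per (heading, sentences) pair."""
    lines = [
        r"\documentclass{beamer}",
        r"\title{" + escape_latex(title) + "}",
        r"\begin{document}",
        r"\frame{\titlepage}",
    ]
    for heading, sentences in sections:
        lines.append(r"\begin{frame}{" + escape_latex(heading) + "}")
        if sentences:  # an empty itemize environment is a LaTeX error
            lines.append(r"\begin{itemize}")
            lines.extend(r"\item " + escape_latex(s) for s in sentences)
            lines.append(r"\end{itemize}")
        lines.append(r"\end{frame}")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


def compile_pdf(tex_source, out_dir, name="slides"):
    """Write the .tex file and run pdflatex on it; returns the PDF path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    tex_path = out_dir / f"{name}.tex"
    tex_path.write_text(tex_source, encoding="utf-8")
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode",
         f"-output-directory={out_dir}", str(tex_path)],
        check=True,
    )
    return out_dir / f"{name}.pdf"
```

Keeping the source generation separate from the pdflatex call means the `.tex` text can be inspected or edited before it is compiled.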
6. Features of the parser
Grabs all the tags and their relevant text in a well-formatted HTML page.
Classifies the text into proper sections.
Intelligently detects the type of tag and assigns the proper attribute to it.
Output is a well-formatted JSON array, so that it can be used
independently in other applications.
It also takes care of hyperlinks and includes them in the appropriate
section/subsection.
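To make the hyperlink and JSON points concrete, here is a small sketch that files each link under the most recent heading and prints the result as a JSON array. The field names `heading` and `links` are assumptions chosen for illustration; the tool's real JSON format is not shown in these slides.

```python
import json
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Records each <a href> under the most recent <h1>/<h2> heading."""

    def __init__(self):
        super().__init__()
        self.sections = []
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2"):
            self.sections.append({"heading": "", "links": []})
            self._in_heading = True
        elif tag == "a" and self.sections:
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href, such as <a name="...">
                self.sections[-1]["links"].append(href)

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self._in_heading = False

    def handle_data(self, data):
        if self._in_heading:
            self.sections[-1]["heading"] += data.strip()


def links_as_json(html):
    """Return a JSON array with one object per heading and its links."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return json.dumps(collector.sections, indent=2)
```

Because the output is ordinary JSON, any other application can load it with `json.loads` without depending on the parser itself.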
7. Features of the summarizer
The summarizer takes a blob of text as input and outputs an array of
sentences that summarizes the given text.
The summarizer also takes the maximum number of sentences to be
returned as output, so it's flexible in this regard.
Calculates the importance of a sentence by comparing it with all other
sentences in the given text and assigning appropriate weights to all the
words.
Since we're using stop-words from NLTK (Natural Language Toolkit), it can be
extended to other natural languages for which NLTK provides stop-word lists.
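One way to read the scoring idea above is sketched below: each word is weighted by how often it occurs in the text, and a sentence scores higher the more weighted words it shares with the other sentences. This is an assumed interpretation, not the tool's actual formula. The small `STOP_WORDS` set is a placeholder so the example runs without downloads; NLTK's `stopwords.words("english")` could be used instead.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; NLTK's stopwords corpus could be swapped in.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and",
              "it", "on", "for", "with", "that", "this"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased words of a sentence with stop-words removed."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower())
            if w not in STOP_WORDS]


def summarize(text, max_sentences=5):
    """Return up to max_sentences sentences, in their original order."""
    sentences = split_sentences(text)
    words = [content_words(s) for s in sentences]
    # Word weight = number of occurrences in the whole text.
    weights = Counter(w for ws in words for w in ws)

    def score(i):
        # Compare sentence i with every other sentence and
        # sum the weights of the words they share.
        mine = set(words[i])
        total = 0
        for j, other in enumerate(words):
            if j != i:
                total += sum(weights[w] for w in mine & set(other))
        return total

    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in keep]
```

For example, `summarize(section_text, max_sentences=4)` returns at most four sentences; a sentence that shares no content words with the rest of the text scores zero and is the first to be dropped.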
8. Features of the PDF Generator
It takes a JSON file as input, so it is independent of the other modules and
can work with any JSON input, the only condition being that the JSON file must
follow our standard format.
If a particular section appears several times in the input with different
contents, it automatically combines their content into one section.
It generates the `tex` file before converting it to PDF, so the user can
download the PDF or edit the `tex` file directly.
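The merging of repeated sections could look roughly like this. The input layout (a list of objects with `heading` and `content` keys) is a hypothetical stand-in for the standard format mentioned above, which these slides do not spell out.

```python
import json


def merge_duplicate_sections(sections):
    """Combine entries that share a heading, keeping first-appearance order."""
    merged = {}
    for entry in sections:
        heading = entry["heading"]
        if heading not in merged:
            merged[heading] = {"heading": heading, "content": []}
        merged[heading]["content"].extend(entry["content"])
    return list(merged.values())


# Hypothetical input: "Introduction" appears twice with different content.
raw = json.loads("""
[
  {"heading": "Introduction", "content": ["First point."]},
  {"heading": "Method", "content": ["Parse the HTML."]},
  {"heading": "Introduction", "content": ["Second point."]}
]
""")
merged = merge_duplicate_sections(raw)
print(json.dumps(merged, indent=2))
```

After merging, "Introduction" appears once and holds both points, so it would become a single section in the presentation.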
9. Web Interface
We've also developed a web interface to use this tool in an easy manner:
http://web.iiit.ac.in/~chandan.singh/html2presentation/
10. Possible Use Cases
Can be used to automatically generate presentations of one's paper.
Can also be tweaked easily to summarize blogs.
Since the code is written in a modular manner, more modules can easily be
added or removed to enhance the user interface.
11. Thank You!
Any feedback/suggestions are welcome.
You can contact us via our contact page:
http://web.iiit.ac.in/~chandan.singh/html2presentation/team/