HTML2Presentation
IRE Major Project
- Chandan Singh
- Harsh Vardhan Shukla
- Nehal J Wani
- Rahul Patidar
Introduction
 This tool was designed to summarize the HTML versions of the papers published in
the proceedings of CHI 96 - Conference on Human Factors in Computing
Systems, 1996: http://sigchi.org/chi96/proceedings/papers.htm
 Since we wrote our own parser for the HTML source, we realize
that it is not very generic and may not work for inputs other than the ones
in these proceedings.
 Since this is just a proof-of-concept application, don't expect much error
handling. We do try to provide some basic error messages when something
fails.
How does it work?
 First, we parse the HTML of the paper to distinguish between HTML
tags and the actual text.
 Next, we divide the paper into sections and subsections based on the headings in
the paper. For instance, the text under the first <h1> becomes section 1, the text
under the first <h2> becomes section 1.1, and so on.
 Then we extract the actual text for each subsection and ignore any other tags,
such as <div> and <span>.
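
The numbering step above can be sketched roughly as follows. This is a minimal illustration built on Python's standard `html.parser`, not our actual parser; the names `SectionParser` and `split_sections` and the output keys (`number`, `title`, `text`) are assumptions made for this example.

```python
from html.parser import HTMLParser


class SectionParser(HTMLParser):
    """Split a page into numbered sections keyed on <h1>/<h2> headings."""

    def __init__(self):
        super().__init__()
        self.sections = []       # list of {"number", "title", "text"} dicts
        self.h1 = 0              # running top-level section counter
        self.h2 = 0              # running subsection counter
        self.in_heading = False  # True while inside an <h1> or <h2>

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.h1 += 1
            self.h2 = 0
            number = str(self.h1)
        elif tag == "h2":
            self.h2 += 1
            number = f"{self.h1}.{self.h2}"
        else:
            return  # <div>, <span>, etc. are ignored; only their text is kept
        self.sections.append({"number": number, "title": "", "text": ""})
        self.in_heading = True

    def handle_endtag(self, tag):
        if tag in ("h1", "h2"):
            self.in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # text before the first heading is dropped
        key = "title" if self.in_heading else "text"
        self.sections[-1][key] += data


def split_sections(html):
    """Return the sections of `html` with whitespace normalized."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    for section in parser.sections:
        section["title"] = " ".join(section["title"].split())
        section["text"] = " ".join(section["text"].split())
    return parser.sections
```

Calling `split_sections(page_source)` returns a list that can be dumped with `json.dumps`, which matches the idea of handing a JSON array to the later stages.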
 We pass the extracted plain text of each subsection to the summarizer to
get a brief summary of that subsection. The summary should be limited
to about 4-5 sentences.
 Along with the text, we also extract relevant images and tables from the
paper and insert them into the presentation under the relevant sections.
 Once we have the heading for each section and the content under it from the parser,
we just need to pass the appropriate arguments to Latexslides (a Python tool),
which generates the presentation in LaTeX.
 Finally, we obtain the presentation in .pdf format from the LaTeX source
using pdflatex.
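
As a rough illustration of the last two steps, the sketch below writes Beamer source by hand and then calls pdflatex. It deliberately does not use the Latexslides API, and the function names (`escape_latex`, `build_beamer`, `write_pdf`) and the section keys (`title`, `summary`) are assumptions for this example only.

```python
import subprocess
from pathlib import Path

# Characters that must be escaped before plain text is placed in LaTeX source.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def escape_latex(text):
    """Escape LaTeX special characters one character at a time."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def build_beamer(title, sections):
    """Return Beamer source with one frame per section (hypothetical layout)."""
    lines = [
        r"\documentclass{beamer}",
        r"\title{%s}" % escape_latex(title),
        r"\begin{document}",
        r"\frame{\titlepage}",
    ]
    for sec in sections:
        lines.append(r"\begin{frame}{%s}" % escape_latex(sec["title"]))
        if sec["summary"]:  # an empty itemize would not compile
            lines.append(r"\begin{itemize}")
            for sentence in sec["summary"]:
                lines.append(r"\item " + escape_latex(sentence))
            lines.append(r"\end{itemize}")
        lines.append(r"\end{frame}")
    lines.append(r"\end{document}")
    return "\n".join(lines) + "\n"


def write_pdf(tex_source, out_dir, name="slides"):
    """Write the .tex file, then run pdflatex on it (must be installed)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    tex_path = out / f"{name}.tex"
    tex_path.write_text(tex_source, encoding="utf-8")
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode",
         f"-output-directory={out}", str(tex_path)],
        check=True,
    )
    return out / f"{name}.pdf"
```

Keeping the `.tex` generation separate from the pdflatex call means the LaTeX source stays available for manual edits before compiling.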
Features of the parser
 Grabs all the tags and their relevant text in a well-formed HTML page.
 Classifies the text into proper sections.
 Detects the type of each tag and assigns the proper attribute to it.
 Output is produced as a well-formatted JSON array, so that it can be used
independently in other applications.
 It also takes care of hyperlinks and includes them in the appropriate
section/subsection.
Features of the summarizer
 The summarizer takes a blob of text as input and outputs an array of
sentences that summarize the given text.
 It also takes the maximum number of sentences to return as input,
so it is flexible in this regard.
 Calculates the importance of a sentence by comparing it with all the other
sentences in the given text and assigning appropriate weights to the
words.
 Since we're using stop-words from nltk (Natural Language Toolkit), it can be
extended to other natural languages for which nltk provides a stop-word list.
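
One way to read the scoring idea above is sketched below. This is an assumed, simplified scheme (word-overlap between sentence pairs), not necessarily the exact weighting our summarizer uses; the small `STOP_WORDS` set is a stand-in for the nltk stop-word corpus, and the function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list; a stand-in for the nltk stop-word corpus.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and",
              "it", "on", "for", "with", "that", "this"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lower-cased words of the sentence with stop-words removed."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return {w for w in words if w not in STOP_WORDS}


def summarize(text, max_sentences=5):
    """Return up to max_sentences sentences, kept in their original order."""
    sentences = split_sentences(text)
    words = [content_words(s) for s in sentences]
    scores = []
    for i, wi in enumerate(words):
        score = 0.0
        for j, wj in enumerate(words):
            if i != j and wi and wj:
                # Shared words, normalized by the average sentence size.
                score += len(wi & wj) / ((len(wi) + len(wj)) / 2)
        scores.append(score)
    # Take the highest-scoring sentences, then restore document order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:max_sentences]
    return [sentences[i] for i in sorted(top)]
```

A sentence that shares many content words with the rest of the text scores high, while an off-topic sentence scores near zero and is dropped first.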
Features of the PDF Generator
 It takes a JSON file as input, so it is independent in the sense that it can work with
any JSON input, the only condition being that the JSON file must follow our
standard format.
 If a particular section appears several times in the input with different
contents, it automatically combines their content into one section.
 It generates the `tex` file before converting it to PDF, so the user can download
the PDF or edit the `tex` file directly.
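
The merging of repeated sections could look roughly like this. The JSON layout used here (a list of objects with `section` and `content` keys) is a hypothetical stand-in, since the standard format is not spelled out on this slide, and `merge_sections` is a name made up for the example.

```python
import json


def merge_sections(entries):
    """Combine entries that share a section name, keeping first-seen order."""
    merged = {}  # dicts preserve insertion order
    for entry in entries:
        merged.setdefault(entry["section"], []).extend(entry["content"])
    return [{"section": name, "content": content}
            for name, content in merged.items()]


# Hypothetical input in which "Introduction" appears twice.
raw = """[
  {"section": "Introduction", "content": ["First point."]},
  {"section": "Method", "content": ["A step."]},
  {"section": "Introduction", "content": ["Second point."]}
]"""
combined = merge_sections(json.loads(raw))
```

Here `combined` holds two sections, with both Introduction entries gathered under the first one.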
Web Interface
 We've also developed a web interface that makes the tool easy to use:
http://web.iiit.ac.in/~chandan.singh/html2presentation/
Possible Use Cases
 Can be used to automatically generate presentations of one's paper.
 Can also be tweaked easily to summarize blogs.
 Since the code is written in a modular manner, more modules can easily be
added or removed to enhance the user interface.
Thank You!
 Any feedback/suggestions are welcome.
 You can contact us via our contact page:
http://web.iiit.ac.in/~chandan.singh/html2presentation/team/
