Prepared By:- Group No. 27
Ashrith Jalagam(201202126)
Shefali Soni(201405619)
Aditya Lunawat(201405559)
Mentored By : Litton J Kurisinkel
• Document Summarizer is a platform that generates
summaries with pre-defined summarizers and picks the
most relevant one by passing each summary to a model.
• The relevancy of a text with respect to Computer
Science is scored with a Word2Vec model, which identifies the
most relevant summary.
• Pre-built tools such as Apache Tika and Word2Vec
have been used to build the platform. The
platform can be reused and extended by other developers.
• With several summarizers available, it is difficult to judge
which one best suits a given scenario.
• The platform's ability to test different summarizers
against a domain helps developers make that
choice.
• This is achieved by rating the summaries on
their relevancy to the domain.
• Crawl the data, create a corpus related to the
Computer Science domain, and train a model on it using
the Word2Vec tool.
• Given a URL or file, extract the textual content and create
a summary with each of the different summarizers.
• Pass the summaries one by one to the Word2Vec
model to get the relevancy of each summary with
respect to Computer Science.
Document Summarizer
Corpus Creation
Text Extraction
Summary Generation
Relevancy Calculation
• Define a crawler that crawls the Dmoz
website and collects the desired keywords.
• Fetch the Wikipedia pages for all of these keywords and
store their text in a text file, which is the corpus of our
system.
• The Wikipedia pages are fetched and parsed with the
Apache Tika tool.
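The step from corpus file to model could be sketched roughly as follows, assuming the gensim library (version 4 or later) is used for Word2Vec. The function names, the file paths passed in, and the hyperparameters are illustrative assumptions, not the project's actual settings.

```python
import re


def corpus_sentences(text):
    """Split raw corpus text into lowercase token lists, one per sentence."""
    sentences = []
    for raw in re.split(r"(?<=[.!?])\s+", text):
        tokens = re.findall(r"[a-z0-9]+", raw.lower())
        if tokens:
            sentences.append(tokens)
    return sentences


def train_model(corpus_path, model_path):
    """Train a Word2Vec model on the corpus file and save it to model_path."""
    # Imported lazily because gensim is a third-party dependency.
    from gensim.models import Word2Vec

    with open(corpus_path, encoding="utf-8") as handle:
        sentences = corpus_sentences(handle.read())
    # Placeholder hyperparameters (assumption), to be tuned for the corpus.
    model = Word2Vec(sentences=sentences, vector_size=100, window=5,
                     min_count=2, workers=2)
    model.save(model_path)
    return model
```

Word2Vec expects an iterable of token lists, which is why the corpus is split into sentences before training.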
• Input to the system can
be a URL or any type of file,
such as PDF, Excel, ODT, or ODP.
These files must
be converted to plain text before
the summarizers can process them.
This is done with the Apache Tika tool: read the
input from either the URL or the file, pass it to the
Apache Tika API, collect the output stream, and write it to a
file.
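A minimal sketch of this extraction step, assuming the third-party `tika` Python bindings (which need Java and start a local Tika server) rather than whatever Tika interface the project actually used. `extract_text` and `normalize_whitespace` are hypothetical helper names.

```python
import re


def normalize_whitespace(text):
    """Collapse the runs of whitespace that extraction tends to leave behind."""
    return re.sub(r"\s+", " ", text or "").strip()


def extract_text(source, out_path):
    """Extract text from a file path or URL with Tika and write it to out_path."""
    # Imported lazily because it needs the tika package and a Java runtime.
    from tika import parser

    # from_file returns a dict that includes "metadata" and "content";
    # "content" may be None when no text could be extracted.
    parsed = parser.from_file(source)
    text = normalize_whitespace(parsed.get("content"))
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(text)
    return text
```

Writing the cleaned text to a file keeps the summarizers independent of the original input format.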
• Four different summarizers were used to generate a
summary for each parsed text document/URL.
• Summarizer 1: Tokenizes the
given document and splits it into sentences, then
ranks each sentence according to the
TF-IDF model.
• Summarizer 2: Similar to the
previous one, but with a minimum and a maximum threshold;
only sentences that lie in that
range are considered.
• Summarizers 3/4: These have a
built-in tokenizer and stemmer and use NLTK to
rank the final sentences.
• Summarizer 5: The Open Text
Summarizer. It gives the most
relevant results, depending on the summary ratio
provided to it as input.
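The TF-IDF ranking behind Summarizers 1 and 2 could look roughly like the sketch below. It is not the project's code: it assumes each sentence counts as one "document" for the IDF term, and that the minimum and maximum thresholds of Summarizer 2 apply to the sentence score. All function names are illustrative.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_scores(sentences):
    """Score each sentence by the mean TF-IDF weight of its distinct tokens."""
    token_lists = [tokenize(s) for s in sentences]
    n = len(token_lists)
    # Number of sentences in which each token appears.
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    scores = []
    for toks in token_lists:
        if not toks:
            scores.append(0.0)
            continue
        tf = Counter(toks)
        total = sum((count / len(toks)) * math.log(n / doc_freq[tok])
                    for tok, count in tf.items())
        scores.append(total / len(tf))
    return scores


def summarize(text, top_n=2, min_score=None, max_score=None):
    """Return the top_n highest-scoring sentences in their original order.

    If min_score or max_score is given, sentences scoring outside that
    range are dropped first (the assumed Summarizer 2 behaviour).
    """
    sentences = split_sentences(text)
    scored = list(enumerate(tfidf_scores(sentences)))
    if min_score is not None:
        scored = [(i, s) for i, s in scored if s >= min_score]
    if max_score is not None:
        scored = [(i, s) for i, s in scored if s <= max_score]
    best = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]
    return [sentences[i] for i, _ in sorted(best)]
```

Calling `summarize(text, top_n=3)` corresponds to the plain ranking, while passing `min_score` and `max_score` restricts the candidates to a score band before ranking.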
• A set of summarizers is already available in
the system, and more can be added to
the framework.
• The user chooses among the available summarizers and
generates the summary.
• The summaries are then forwarded to the model
for relevancy calculation.
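One simple way to make summarizers pluggable is a registry keyed by name, sketched below. The names `SUMMARIZERS`, `register`, `run_all`, and the toy `lead_summarizer` are assumptions made for illustration; the source does not describe how the framework registers summarizers.

```python
SUMMARIZERS = {}


def register(name):
    """Decorator that adds a summarizer function to the registry under name."""
    def decorator(func):
        SUMMARIZERS[name] = func
        return func
    return decorator


@register("lead")
def lead_summarizer(text, ratio=0.5):
    """Toy summarizer: keep the leading fraction of the sentences."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    keep = max(1, round(len(sentences) * ratio))
    return ". ".join(sentences[:keep]) + "."


def run_all(text, names=None):
    """Run the chosen summarizers (all by default) and return {name: summary}."""
    chosen = names if names is not None else list(SUMMARIZERS)
    return {name: SUMMARIZERS[name](text) for name in chosen}
```

Adding a new summarizer then only requires decorating a function that takes text and returns a summary; the resulting dictionary can be passed on for relevancy calculation.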
• The input to the model is the textual
summary from each summarizer. The summaries are passed
to the model one by one.
• Based on certain parameters, the model outputs
a relevancy factor for each
summary.
• Based on this factor, the user decides which
summary best suits the domain.
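The slides do not say how the relevancy factor is computed. One plausible approach, shown here purely as an assumption, is the cosine similarity between the average word vector of a summary and the average vector of a few domain terms. The two-dimensional `TOY_VECTORS` below are hypothetical stand-ins for the vectors of a trained Word2Vec model.

```python
import math


def mean_vector(tokens, vectors):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def relevancy(summary, vectors, domain_terms):
    """Cosine similarity between the summary and the domain centroid."""
    summary_vec = mean_vector(summary.lower().split(), vectors)
    domain_vec = mean_vector(domain_terms, vectors)
    if summary_vec is None or domain_vec is None:
        return 0.0
    return cosine(summary_vec, domain_vec)


# Hypothetical embeddings standing in for a trained Word2Vec model.
TOY_VECTORS = {
    "algorithm": [0.9, 0.1], "compiler": [0.8, 0.2],
    "recipe": [0.1, 0.9], "garlic": [0.2, 0.8],
}
summaries = {"s1": "compiler algorithm", "s2": "garlic recipe"}
scores = {name: relevancy(text, TOY_VECTORS, ["algorithm", "compiler"])
          for name, text in summaries.items()}
best = max(scores, key=scores.get)  # summary with the highest relevancy factor
```

With a real model, the toy dictionary would be replaced by a lookup into the trained word vectors, and the user would compare the scores to pick a summarizer.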
• News feed (relevancy based on the searched category):
analyse the news and display only
a summary of each item rather than the
whole content.
• A platform for researchers working
on summarization, who can add new features to
this project.
• The project has been developed as a platform into
which new summarizers can easily be added.
• Developers can easily decide which summarizer
works best for their domain by testing the summarizers on
their data and calculating the relevance factor.
• Developers no longer need to worry about the file
format: any type of file or URL can be given to
the platform.
• Open URL directory for Computer Science
(http://www.dmoz.org/Computers/Computer_Science)
• Word2Vec model
Link: http://radimrehurek.com/gensim/index.html
• Summarizers
• http://glowingpython.blogspot.in/2014/09/text-summarization-with-nltk.html
• https://pypi.python.org/pypi/sumy/0.3.0
• http://pythonwise.blogspot.in/2008/01/simple-text-summarizer.html