Internet News Retrieval
Marco Masetti (grubert)
masetti@linux.it
Overview
• Not a very technical talk
• (...wait while nerds move to next room...)
• I will present a complete solution => focus on architecture
• Two months from inception to deployment
• A good example of using Perl in business applications, drawing heavily on CPAN
Some background
• From 2006 to 2008 we developed a distributed solution for tracking advertisements, built completely in Perl.
• In June we had the idea of applying those results to the online news delivery market...
Some (two) months ago...
Had a look at this: www.newsnow.co.uk
www.newsnow.co.uk
• 24/7 coverage of 33932+ sources in 20 languages from 141 countries
  • TV news websites
  • Online magazines and newswires
• Delivery options:
  • Within minutes of publication, on a fully-branded secure Client Portal
  • Searchable 30-day archive and 'drill-down' facility (with Client Portal)
• Search options:
  • Match articles only when given keywords occur within the same sentence, clause, paragraph or article
  • Reject articles that come from the wrong sources, are in the wrong subject areas, or that specify irrelevant keywords or phrases
  • Match 1, 10 or 100s of keywords and phrases simultaneously
www.newsnow.co.uk
...They do a lot of things...
...We started 2 months ago so => much less...
...Still...
...We do it better...
SoftNews
• Distributed acquisition
• Grabbing phases (fetching, filtering, comparing, transforming) strictly decoupled
• Built on top of very powerful CPAN libraries
• A stitch-and-glue delivery portal that already offers enhanced features
SoftNews: main goals
Acquisition
• Topic-oriented acquisition
• Scalable
• High accuracy (negligible false positives)
Storing
• Fast text indexing of a massive data collection
• NLP/text processing techniques (stemming, positive/negative mentions, ...)
Delivery
• Pluggable, customizable services
• Tag search, text highlighting
• Visual aids (tag clouds, trend graphs, ...)
SoftNews: domains tackled
• European Soccer Championship (EURO 2008)
• US Elections
SoftNews: main issues
• Many sites monitored at fixed intervals
  • Polling time must be respected in time-critical domains
  • Should run with limited hardware/network resources
• Large number of documents
  • Need fast indexing for retrieval
  • Provide the user with tools to conveniently navigate the text collection
Workflow
Acquiring => Storing => Delivering
Acquisition
[Diagram: a fetcher takes sites from a circular queue; the fetcher, filter, comparer and transformer stages (Fetching => Filtering => Comparing => Transforming) each have their own config and pass archives along. Output labels: artifacts, metadata, archive.]
CPAN modules shown: YAML, WWW::Mechanize, HTML::Parser, XML::Simple, Net::FTP, Archive::Zip, Log::Log4perl, KinoSearch, Cache::Cache, Digest::MD5, GD, Imager, SWF
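The "sites circular queue" in the diagram can be pictured as a rotating list: the fetcher takes the site at the head and the site goes back to the tail. The following is only an illustrative sketch in Python, not the Perl code of the actual system; the site names and the helpers `make_site_queue` and `next_site` are hypothetical.

```python
from collections import deque


def make_site_queue(sites):
    """Build a circular queue from an iterable of site names."""
    return deque(sites)


def next_site(queue):
    """Return the site at the head and rotate it to the tail."""
    site = queue[0]
    queue.rotate(-1)  # head moves to the tail, the next site becomes the head
    return site


# Hypothetical sites, for illustration only.
queue = make_site_queue(["site-a.example", "site-b.example", "site-c.example"])
visited = [next_site(queue) for _ in range(5)]
print(visited)
```

Because the queue never empties, the fetcher can keep calling `next_site` forever; how the real fetcher paces its visits to respect the polling interval is not described in the slides.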
SoftNews: Grabbing
• Look for something (=> Processor)
• Reject rubbish (=> Filter handlers)
• Remember what it already has (=> Comparer)
Relies on the MediaCampaign internet grabber architecture
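One plausible way for a comparer to "remember what it already has" is to keep a digest of every document it has seen. The module list on the acquisition slide includes an MD5 digest module and a cache module, but the slides do not say how they are used, so the sketch below is an assumption. It is written in Python for illustration, and the `Comparer` class and its `is_new` method are hypothetical names.

```python
import hashlib


class Comparer:
    """Remembers MD5 digests of content already seen (hypothetical helper)."""

    def __init__(self):
        self._seen = set()

    def is_new(self, content):
        """Return True the first time a given content string is offered."""
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


comparer = Comparer()
first = comparer.is_new("Breaking: polls open")
second = comparer.is_new("Breaking: polls open")
print(first, second)
```

Storing only the digest keeps the memory footprint small, which fits the "lightweight" label the deck gives to the comparing process. A persistent store would be needed to survive restarts; this sketch keeps everything in memory.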
Acquisition: time constraints
Fetching process:
• Strict time constraints
• Network latency
Comparing and filtering processes:
• Loose time constraints
• Lightweight
Transforming process:
• Loose time constraints
• May be a heavyweight process
• Currently applied only to Flash animations
SoftNews: acq deployment opts
Go simple: one processing chain for each polling interval
[Diagram: one chain per polling interval (10 mins, 12 hours, ...); each chain is Fetcher, Comparer, Filter, Transform, with queues between the stages]
(US Polls news) ~150 web sites, 1 month: > 300.000 ...... ~
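The chain in the diagram (Fetcher, Comparer, Filter, Transform, with queues in between) can be sketched as stages that only communicate through queues. This is an illustration in Python under stated assumptions, not the authors' Perl implementation: the stage bodies (`fetch`, `compare`, `keep`, `transform`) are hypothetical stand-ins, and everything runs in one process for simplicity.

```python
from queue import Queue


def run_stage(stage, inbox, outbox):
    """Drain inbox, apply stage to each item, and queue the non-None results."""
    while not inbox.empty():
        result = stage(inbox.get())
        if result is not None:
            outbox.put(result)


def run_chain(urls):
    """Push urls through Fetcher => Comparer => Filter => Transform."""
    seen_bodies = set()

    def fetch(url):
        # Stand-in for a real HTTP fetch.
        return {"url": url, "body": "body of " + url}

    def compare(doc):
        # Drop documents whose body has already been seen.
        if doc["body"] in seen_bodies:
            return None
        seen_bodies.add(doc["body"])
        return doc

    def keep(doc):
        # Stand-in filter: keep only URLs that mention "news".
        return doc if "news" in doc["url"] else None

    def transform(doc):
        # Stand-in transformation: upper-case the body.
        return {"url": doc["url"], "body": doc["body"].upper()}

    stages = [fetch, compare, keep, transform]
    queues = [Queue() for _ in range(len(stages) + 1)]
    for url in urls:
        queues[0].put(url)
    for stage, inbox, outbox in zip(stages, queues, queues[1:]):
        run_stage(stage, inbox, outbox)

    results = []
    while not queues[-1].empty():
        results.append(queues[-1].get())
    return results


docs = run_chain(["news.example/a", "blog.example/b", "news.example/a"])
print(docs)
```

Since each stage touches only its inbox and outbox, the stages could be moved to separate processes or machines without changing their logic, which is the point of keeping the grabbing phases strictly decoupled.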
SoftNews: acq deployment opts
Massive acquisition: remote distributed fetchers/filters...
[Diagram: Fetcher, Comparer, Filter and Transformer nodes, plus further remote Fetcher ... Filter nodes]
Configuring and monitoring acquisition...
• A control GUI is provided that controls all activities.
Modules shown: YAML, Prima
Filtering...
• Word-pattern based retrieval
  • The more words provided, the more accurate the results will be
• The need for speed
  • A faster search means more pages processed
• Fully configurable
  • Deals with different topics and different web page layouts
Exploits KinoSearch ranking features
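The idea of word-pattern based filtering can be illustrated with a very small scoring function: the more of the configured topic words a page contains, the higher its score. This Python sketch is not KinoSearch ranking and not the authors' filter; the functions `keyword_score` and `accept`, the keyword list and the threshold value are all hypothetical.

```python
import re


def keyword_score(text, keywords):
    """Fraction of keywords that occur in text as whole words, ignoring case."""
    if not keywords:
        return 0.0
    words = set(re.findall(r"\w+", text.lower()))
    hits = sum(1 for keyword in keywords if keyword.lower() in words)
    return hits / len(keywords)


def accept(text, keywords, threshold=0.5):
    """Keep a page when enough of the topic keywords are present."""
    return keyword_score(text, keywords) >= threshold


# Hypothetical topic configuration, for illustration only.
topic = ["obama", "election", "senate", "poll"]
score = keyword_score("Obama leads in the latest election poll", topic)
print(score, accept("Football results from last night", topic))
```

A longer keyword list makes a chance match on one or two words less likely to pass the threshold, which is one way to read the slide's claim that more words give more accurate results.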
KinoSearch
• What is KinoSearch?
  • Text search engine library
  • A specialized and lightweight DBMS good for one thing: fast search, ranked by relevance
  • Loose port of Apache Lucene
KinoSearch: features
• Can handle millions of documents
• Assigns each document a score, based on the keywords found
• Advanced features:
  • Normalizer: case-insensitive search
    Horses => horses
  • Tokenizer: splits text into tokens
    ...shoots and leaves => shoots|leaves
  • Stemmer: normalizes word endings
    horse, horses, horsing, horsed => hors
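The three analysis steps can be mimicked with a toy pipeline. This is a Python illustration only, not KinoSearch's API: the suffix rules are a deliberately naive stand-in for a real stemming algorithm, and the stop-word step is an assumption added to explain why "and" disappears in the tokenizer example above. All function names and word lists are hypothetical.

```python
import re

STOP_WORDS = {"and", "the", "a", "of"}    # tiny hypothetical stop list
SUFFIXES = ("ing", "ed", "es", "s", "e")  # toy rules, not a real stemmer


def normalize(text):
    """Lower-case the text so that search becomes case-insensitive."""
    return text.lower()


def tokenize(text):
    """Split text into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text)


def remove_stop_words(tokens):
    """Drop very common words that carry little meaning."""
    return [token for token in tokens if token not in STOP_WORDS]


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Run the whole chain: normalize, tokenize, drop stop words, stem."""
    return [stem(t) for t in remove_stop_words(tokenize(normalize(text)))]


tokens = remove_stop_words(tokenize(normalize("...shoots and leaves")))
stems = {stem(word) for word in ["horse", "horses", "horsing", "horsed"]}
print(tokens, stems)
```

Applying the same chain to both the indexed documents and the query terms is what lets a search for "Horses" match a document that only contains "horse".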
Storing...
[Diagram: a loader reads the archives and feeds the stores (KinoSearch DBMS, MySQL DBMS, KinoSearch DBMS)]
CPAN modules shown: YAML, Archive::Zip, XML::Simple, KinoSearch, Lingua::EN::Keywords, Lingua::EN::Tagger, Log::Log4perl
Delivery...
• Built on top of Eadt, an MDE platform
• Took us 3 days from design to deployment...
• Let's have a look!
Conclusions
• CPAN is full of gems
• Perl proved to be the best solution for spidering, text processing, indexing, ...
• Some (sane) Perl hacking on holiday may not be too bad...
Thank you!
