1) The document describes SoftNews, a distributed solution for acquiring online news articles, and compares it with NewsNow, a service covering 33932+ sources in 20 languages.
2) SoftNews uses Perl modules to fetch, filter, compare, transform and store large numbers of news articles at set intervals while respecting time constraints. The articles are indexed using KinoSearch for fast retrieval.
3) A control GUI allows configuration and monitoring of the acquisition process. Delivered content is presented through a "stitch-and-glue" portal that provides enhanced search, tagging and visualization features for the large text collection.
2. Overview
Not too technical a talk
(...wait while nerds move to next room...)
I will present a complete solution =>
focus on architecture
Two months from inception to
deployment
A good example of using Perl in business
applications, drawing heavily on CPAN
3. Some background
In years 2006-2008 we developed a
distributed solution for tracking
advertisements, built completely in Perl.
In June we had the idea to apply the results
to the online news delivery market...
5. www.newsnow.co.uk
24/7 coverage of 33932+ sources in 20 languages from 141
countries
TV news websites
Online magazines and newswires
Delivery options:
Within minutes of publication, on a fully-branded secure Client
Portal
Searchable 30-day archive and 'drill-down' facility (with Client
Portal)
Search options
Match articles only when given keywords occur within the same
sentence, clause, paragraph or article
Reject articles that come from the wrong sources, are in the
wrong subject areas, or that specify irrelevant keywords or
phrases
Match 1, 10 or 100s of keywords and phrases simultaneously
6. www.newsnow.co.uk
...They do a lot of things...
...We started 2 months ago so => much
less...
...Still...
...We do it better...
7. SoftNews
Distributed acquisition
Grabbing Phases (fetching, filtering,
comparing, transforming) strictly
decoupled
Builds on very powerful CPAN
libraries
A stitch-and-glue delivery portal that
already offers enhanced features
8. SoftNews: main goals
Acquisition / Storing / Delivery
Topic oriented acquisition
Scalable
High accuracy (negligible false positives)
Fast text indexing of massive data
collection
NLP/text processing techniques (stemming,
positive/negative mentions, ...)
Pluggable, customizable services
Tag search, text highlight
Visual aids (tag clouds, trend graphs, ...)
10. SoftNews: main issues
Many sites monitored at fixed intervals
Polling time must be respected in time-critical
domains
Should run with limited hardware/network
resources
Large number of documents
need fast indexing for retrieval
provide the user with tools to conveniently
navigate the text collection
13. SoftNews: Grabbing
Look for something (=> Processor)
Reject rubbish (=> Filter handlers)
Remember what it already has (=> Comparer)
Rely on MediaCampaign internet grabber
architecture
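The three grabbing roles above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the SoftNews Perl code: the function names, the keyword test in the processor, the word-count filter and the SHA-1 digest used by the comparer are all assumptions made for illustration.

```python
import hashlib

def processor(pages, keywords):
    """Look for something: yield pages that mention at least one keyword."""
    for url, text in pages:
        lowered = text.lower()
        if any(k.lower() in lowered for k in keywords):
            yield url, text

def make_filter(min_words):
    """Reject rubbish: hypothetical handler that drops very short pages."""
    def handler(url, text):
        return len(text.split()) >= min_words
    return handler

class Comparer:
    """Remember what it already has, keyed by a digest of the page text."""
    def __init__(self):
        self.seen = set()

    def is_new(self, text):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in self.seen:
            return False
        self.seen.add(digest)
        return True

def grab(pages, keywords, filters, comparer):
    """Return the URLs that pass the processor, every filter and the comparer."""
    accepted = []
    for url, text in processor(pages, keywords):
        if not all(f(url, text) for f in filters):
            continue
        if comparer.is_new(text):
            accepted.append(url)
    return accepted
```

Keeping each role behind its own small interface is what lets the phases be configured and deployed independently.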
14. Acquisition: time constraints
Fetching process:
Strict time constraints
Network latency
Comparing and filtering processes
Loose time constraints
Lightweight
Transforming process:
Loose time constraints
May be a heavyweight process
Currently applied only for Flash animations
15. SoftNews: acquisition deployment options
Go simple: One processing chain for each polling
interval
10 mins:  Fetcher => Queue => Comparer => Queue => Filter => Queue => Transform
12 hours: Fetcher => Queue => Comparer => Queue => Filter => Queue => Transform
...
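One way to picture a single processing chain is as four workers connected by queues, so that a slow phase never blocks the time-critical fetcher. The sketch below is in Python and only illustrates the idea: `stage`, `run_chain` and the four placeholder phases are hypothetical names, and a real deployment would start one such chain per polling interval from a scheduler, which is not shown.

```python
import queue
import threading

STOP = object()  # sentinel telling a stage that no more items will arrive

def stage(work, inbox, outbox):
    """Run one phase: take items from inbox, pass non-None results on."""
    while True:
        item = inbox.get()
        if item is STOP:
            outbox.put(STOP)
            return
        result = work(item)
        if result is not None:
            outbox.put(result)

def run_chain(urls, phases):
    """Connect the phases with queues, feed the URLs in, collect the output."""
    queues = [queue.Queue() for _ in range(len(phases) + 1)]
    threads = [
        threading.Thread(target=stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(phases)
    ]
    for t in threads:
        t.start()
    for url in urls:
        queues[0].put(url)
    queues[0].put(STOP)
    for t in threads:
        t.join()
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            return results
        results.append(item)

# Placeholder phases (assumptions): no network access, trivial logic.
def fetch(url):
    return (url, "text for " + url)

def make_comparer():
    seen = set()
    def compare(doc):
        if doc[0] in seen:
            return None  # already acquired, drop it
        seen.add(doc[0])
        return doc
    return compare

def keep_news(doc):
    return doc if doc[0].startswith("news/") else None

def transform(doc):
    return doc[0].upper()
```

Calling `run_chain(urls, [fetch, make_comparer(), keep_news, transform])` runs the four phases concurrently, with three queues between them plus one input and one output queue.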
(US Polls news)
~150 web sites 1 month: > 300.000 ...... ~
18. Filtering...
Word-pattern based retrieval
The more words provided, the more accurate the results will be
The need for speed
More pages processed with a faster search
Fully configurable
Deal with different topics and different web page layouts
Exploits KinoSearch ranking
features
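The idea of word-pattern filtering can be sketched as follows. This is a deliberately simple Python illustration and not KinoSearch's ranking: the score is just the share of pattern words found in the page, and both the `score` and `accept` names and the 0.5 threshold are assumptions.

```python
import re

def score(text, pattern_words):
    """Fraction of the pattern words that occur in the text (0.0 to 1.0)."""
    tokens = set(re.findall(r"\w+", text.lower()))
    wanted = {w.lower() for w in pattern_words}
    if not wanted:
        return 0.0
    return len(wanted & tokens) / len(wanted)

def accept(text, pattern_words, threshold=0.5):
    """Keep a page only if enough of the topic's words are present."""
    return score(text, pattern_words) >= threshold
```

With a longer word list, a page has to match more of the topic vocabulary to reach the threshold, which is one reading of why more words give more accurate results.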
19. KinoSearch
What is KinoSearch
Text search engine library
A specialized and lightweight DBMS good for one thing:
fast search, ranked by relevance
Loose port of Apache Lucene
20. KinoSearch: features
Can handle millions of documents
Assigns each document a score, based on found
keywords
Advanced features
Normalizer
Case-insensitive-search
Horses => horses
Tokenizer
Split text into tokens
shoots and leaves => shoots|leaves
Stemmer
Normalize word endings
horse, horses, horsing, horsed => hors
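The normalizer, tokenizer and stemmer steps can be mimicked with a toy analysis chain. The Python sketch below does not use KinoSearch and does not reproduce its stemming algorithm: the stop-word list and the suffix list are assumptions chosen only so that the slide's examples come out the same.

```python
import re

STOPWORDS = {"and", "the", "a", "of"}        # assumed, tiny stop-word list
SUFFIXES = ("ing", "ed", "es", "s", "e")     # toy rules, not a real stemmer

def normalize(text):
    """Case-insensitive search: fold everything to lower case."""
    return text.lower()

def tokenize(text):
    """Split normalized text into word tokens, dropping stop words."""
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOPWORDS]

def stem(token):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def analyze(text):
    """Full chain: normalize, tokenize, then stem each token."""
    return [stem(t) for t in tokenize(normalize(text))]
```

Applying the same chain to documents at indexing time and to queries at search time is what lets a search for "horses" match a page that only says "horse".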
22. Delivery...
Builds on Eadt, an MDE platform
Took us 3 days from design to deployment...
Let's have a look!
23. Conclusions
CPAN is full of gems
Perl proved to be the best solution for
spidering, text processing, indexing,...
Some (sane) Perl hacking on holiday
may not be too bad...
Thank you!