Realtime stream and 
realtime search engine 
Sang Song 
fastcatsearch.org 
1
About Me 
 www.linkedin.com/profile/view?id=295484775 
 facebook.com/songaal 
 swsong@websqrd.com 
2
Agenda 
 Introduction 
 Search Architecture 
 Realtime Indexing 
3
Introduction 
4
Goal 
 Like Splunk 
 Indexing streaming log data 
 Search log data in real-time 
5
Big data 
 Data sets too large and complex for a traditional database 
 Difficult to process using traditional data-processing applications 
 3Vs 
 Volume : Large quantity of data 
 Variety : Diverse set of data 
 Velocity : Speed of data 
Source: Wikipedia 
6
About Fastcatsearch 
 Distributed system 
 Fast indexing 
 Fast queries 
 Popular keyword 
 GS certification 
 70+ references 
 Open source 
 Multi-platform 
 Easy web management tool 
 Dictionary management 
 Plugin extension 
7
Reference 
8
History 
 Fastcatsearch v1 (2010-2011) 
 Single machine 
 <150 QPS 
 Fastcatsearch v2 (2013-Now) 
 Distributed system 
 Multi collection result aggregation 
 200+ QPS 
 Fastcatsearch v3 (alpha) 
 Realtime indexing/searching 
 Schema-free 
 Shard/replica 
 Geospatial search 
9
Search Architecture 
10
11
Realtime Indexing 
12
Store log data 
 HDFS 
 Write-once static files 
 Flume 
 Collecting, aggregating, and moving large amounts of log data 
13
14
Flume config 
agent1.sources = r1 
agent1.sinks = hdfssink 
agent1.channels = c1 
agent1.sources.r1.type = netcat 
agent1.sources.r1.bind = localhost 
agent1.sources.r1.port = 44443 
agent1.sinks.hdfssink.type = hdfs 
agent1.sinks.hdfssink.hdfs.path = hdfs://192.168.189.173:9000/flume/events 
agent1.sinks.hdfssink.hdfs.fileType = SequenceFile #DataStream 
agent1.sinks.hdfssink.hdfs.writeFormat = Text 
agent1.sinks.hdfssink.hdfs.batchSize = 10 
agent1.channels.c1.type = memory 
agent1.channels.c1.capacity = 1000 
agent1.channels.c1.transactionCapacity = 100 
agent1.sources.r1.channels = c1 
agent1.sinks.hdfssink.channel = c1 
$ ./flume-ng agent -f /home/swsong/flume/conf/flume.conf -n agent1 
15
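A quick way to verify that the netcat source above is receiving events is to write a few newline-terminated lines to its port. A minimal sketch in Java, assuming the agent is running on localhost:44443 as configured (a plain nc/telnet session works just as well):

import java.io.PrintWriter;
import java.net.Socket;

public class FlumeTestClient {
    public static void main(String[] args) throws Exception {
        // The netcat source reads newline-terminated events from this socket.
        try (Socket socket = new Socket("localhost", 44443);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            for (int i = 0; i < 10; i++) {
                out.println("test log event " + i + " at " + System.currentTimeMillis());
            }
        }
    }
}

Each line becomes one Flume event and ends up in the HDFS sink's SequenceFiles.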
Flume append? 
16
Fastcatsearch 
HDFS Indexer 
Merger 
Segments 
Searcher 
Index File 
Issues 
- Segment file commit 
- Doc deletion 
17
Import using Flume 
import java.net.URI;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// List the SequenceFiles that Flume has written under dirPath.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uriPath), conf);
FileStatus[] status = fs.listStatus(new Path(dirPath));
for (int i = 0; i < status.length; i++) {
    SequenceFile.Reader.Option opt = SequenceFile.Reader.file(status[i].getPath());
    try (SequenceFile.Reader reader = new SequenceFile.Reader(conf, opt)) {
        // Instantiate key/value holders of the types recorded in the file header.
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        while (reader.next(key, value)) {
            Map<String, Object> parsedEvent = parseEvent(key.toString(), value.toString());
            if (parsedEvent != null) {
                eventQueue.add(parsedEvent);
            }
        }
    }
}
18
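parseEvent above is application-specific and not shown in the deck. A hypothetical sketch of what it might look like inside the importer class, assuming each event body is a tab-separated timestamp/level/message log line (the field names are illustrative, not Fastcatsearch's):

import java.util.HashMap;
import java.util.Map;

// Hypothetical: split a raw Flume event into indexable fields.
// Assumes the body is "timestamp<TAB>level<TAB>message"; real formats will differ.
static Map<String, Object> parseEvent(String key, String body) {
    String[] parts = body.split("\t", 3);
    if (parts.length < 3) {
        return null; // malformed line: skipped by the null check above
    }
    Map<String, Object> event = new HashMap<>();
    event.put("id", key);
    event.put("timestamp", parts[0]);
    event.put("level", parts[1]);
    event.put("message", parts[2]);
    return event;
}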
Making index segment 
 Index has multiple segments 
 Document writer 
 Index writer 
 Search index writer 
 Field index writer 
 Group index writer 
19
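A rough sketch of how one added document can fan out to the writers listed above. The class and interface names here are hypothetical, not Fastcatsearch's actual API:

import java.util.List;
import java.util.Map;

// Hypothetical sketch: a segment writer fans each document out to the
// per-index writers named on the slide (search, field, group).
interface IndexWriter {
    void index(int docId, Map<String, Object> doc);
}

class SegmentWriter {
    private final List<IndexWriter> writers; // search, field, and group index writers
    private int nextDocId = 0;               // segment-local document id

    SegmentWriter(List<IndexWriter> writers) {
        this.writers = writers;
    }

    void addDocument(Map<String, Object> doc) {
        int docId = nextDocId++; // the document writer would store the raw doc here
        for (IndexWriter w : writers) {
            w.index(docId, doc); // each writer builds its own index structure
        }
    }
}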
Segment commit issue 
 Update / delete documents 
 No in-place updates 
 An update = append the new version + delete the old one 
 Deletions apply to previous segments 
 Old docs are marked as deleted, not physically removed 
20
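A common way to implement mark-as-deleted without touching immutable segment files is a per-segment bitset that searchers consult at query time. A minimal sketch of the general technique (not Fastcatsearch's internals), assuming segment-local doc ids:

import java.util.BitSet;

// Sketch: deletion marks for one immutable segment.
class DeletionMarks {
    private final BitSet deleted = new BitSet();

    // An update appends the new version to the live segment and marks
    // the old doc id in its previous segment as deleted.
    void markDeleted(int docId) {
        deleted.set(docId);
    }

    // Searchers skip hits whose doc id is marked deleted.
    boolean isLive(int docId) {
        return !deleted.get(docId);
    }
}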
Segment merge issue 
 Performance 
 ~2(n+m) in time and space: read segments of size n and m, write the merged result (see the sketch below) 
 Size compaction - deleted docs are removed 
segment #1 segment #2 segment #3 
segment #4 
merge to new segment 
21
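To illustrate the ~2(n+m) cost and the compaction step: a merge reads both segments sequentially and writes one new segment, dropping deleted docs along the way. A sketch over sorted doc-id lists under those assumptions (real merges also rewrite postings and remap doc ids):

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch: merge two segments' sorted doc-id lists into one new segment,
// skipping entries marked deleted. Reads n+m ids and writes at most n+m,
// hence roughly 2(n+m) work in time and space.
static List<Integer> mergeSegments(List<Integer> segA, BitSet deletedA,
                                   List<Integer> segB, BitSet deletedB) {
    List<Integer> merged = new ArrayList<>(segA.size() + segB.size());
    for (int id : segA) {
        if (!deletedA.get(id)) merged.add(id);
    }
    for (int id : segB) {
        if (!deletedB.get(id)) merged.add(id); // ids would be remapped in a real merge
    }
    return merged;
}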
Segment merge issue 
 Why merge? 
 Segment count grows fast 
 Search index = Search all leaf segments in turn 
 Document deletion 
22
Inverted Indexing 
[Diagram] Posting index: term1 | term3 | term5 | term7 
Postings: term1→posting1 | term2→posting2 | term3→posting3 | term4→posting4 | term5→posting5 | term6→posting6 
Good for sequential writing to disk 
23
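The point of the layout above is that sorted term-to-postings data can be streamed to disk in a single sequential pass. A minimal sketch, assuming the postings are buffered in memory per term (the file format here is illustrative):

import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: flush an in-memory inverted index to disk in one sequential pass.
// TreeMap keeps terms sorted, so terms and postings are written in order,
// which is the sequential-write-friendly layout the slide describes.
static void flushIndex(TreeMap<String, List<Integer>> postings, String path) throws IOException {
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            out.writeUTF(entry.getKey());          // term
            out.writeInt(entry.getValue().size()); // posting count
            for (int docId : entry.getValue()) {
                out.writeInt(docId);               // doc ids, ascending
            }
        }
    }
}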
Inverted Indexing 
How about a B-tree? 
[Diagram] B-tree of blocks held in memory, flushed to a file on disk 
Flushing causes a large amount of random writes to disk 
24
Search in realtime 
seg #1 seg #2 seg #3 seg #4 
1. Newly created segment 
Searchable data 
25
Search in realtime 
seg #1 seg #2 seg #3 seg #4 
2. Merge segments 
Searchable data 
26
Search in realtime 
seg #1 seg #2 seg #3 seg #4 seg #5 
3. New merged segment 
4. Remove the merged-away segments 
Searchable data 
27
Search in realtime 
Searchable data 
seg #1 seg #5 
5. Searching data 
28
Search in realtime 
Searchable data 
seg #1 seg #5 
seg #6 
Newly created segment 
This process repeats constantly 
29
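One common way to run this cycle while queries keep flowing is to publish the current segment set as an atomically swapped, immutable snapshot: searchers read a consistent list while the indexer installs new or merged segments. A sketch of the idea (hypothetical names, not Fastcatsearch's actual classes):

import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Sketch: searchers always see a consistent snapshot of the segment set,
// while the indexer swaps in new/merged segment lists atomically.
class SegmentRegistry {
    private final AtomicReference<List<String>> segments =
            new AtomicReference<>(List.of());

    // Indexer thread: publish a new immutable segment list,
    // e.g. after adding seg #5 and removing seg #2-#4.
    void publish(List<String> newSegments) {
        segments.set(List.copyOf(newSegments));
    }

    // Searcher threads: take a snapshot and search each segment in turn.
    List<String> snapshot() {
        return segments.get();
    }
}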
Visualization 
 Lucene's merge visualization 
 http://www.youtube.com/watch?v=ojcpvIY3QgA 
 Python script + Python Imaging Library (PIL) + MEncoder 
30
Questions? 
31
Learn More 
 http://fastcatsearch.org/ 
 https://www.facebook.com/groups/fastcatsearch/ 
32
