This document describes building a real-time search engine for log data. Flume collects streaming log data and writes it to HDFS files; Fastcatsearch indexes those files in real time by creating index segments, merging them, and removing outdated segments, so the data becomes searchable as it arrives. The goal is fast indexing and querying of large, continuous log data streams, similar to Splunk.
5. Goal
Like Splunk
Index streaming log data
Search log data in real time
6. Big data
Data sets too large and complex for a traditional database
Difficult to process with traditional data processing tools
3Vs
Volume : Large quantity of data
Variety : Diverse set of data
Velocity : Speed at which data arrives
Source : Wikipedia
7. About Fastcatsearch
Distributed system
Fast indexing
Fast queries
Popular keyword
GS certification
70+ references
Open source
Multi-platform
Easy web management tool
Dictionary management
Plugin extension
9. History
Fastcatsearch v1 (2010-2011)
Single machine
<150 QPS
Fastcatsearch v2 (2013-Now)
Distributed system
Multi collection result aggregation
200+ queries per second
Fastcatsearch v3 (alpha)
Realtime indexing/searching
Schema-free
Shard/replica
Geospatial search
17. Fastcatsearch
HDFS Indexer
Merger
Segment Searcher
Index File
Issue
- Segment file commit
- Doc deletion
18. Import using Flume
import java.net.URI;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Read the SequenceFiles that Flume wrote to HDFS and push parsed
// events into the indexing queue.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uriPath), conf);
FileStatus[] status = fs.listStatus(new Path(dirPath));
for (int i = 0; i < status.length; i++) {
    // The reader option must point at the current file, so it is
    // created inside the loop.
    SequenceFile.Reader.Option opt = SequenceFile.Reader.file(status[i].getPath());
    SequenceFile.Reader reader = new SequenceFile.Reader(conf, opt);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
        Map<String, Object> parsedEvent = parseEvent(key.toString(), value.toString());
        if (parsedEvent != null) {
            eventQueue.add(parsedEvent);
        }
    }
    reader.close();
}
19. Making index segment
Index has multiple segments
Document writer
Index writer
Search index writer
Field index writer
Group index writer
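The writer pipeline above can be sketched as follows. The `IndexWriter` interface, `SegmentWriter` class, and method names here are hypothetical stand-ins, not Fastcatsearch's actual API; the point is that one pass over the documents feeds every index writer (search, field, group) for the segment being built.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the writer pipeline: a segment writer assigns
// segment-local doc ids and fans each document out to all index writers.
interface IndexWriter {
    void write(int docId, Map<String, Object> document);
    void close();
}

class SegmentWriter {
    private final List<IndexWriter> writers; // search, field, group index writers
    private int nextDocId = 0;

    SegmentWriter(List<IndexWriter> writers) {
        this.writers = writers;
    }

    // One pass over the data builds all indexes of the segment.
    int addDocument(Map<String, Object> document) {
        int docId = nextDocId++;
        for (IndexWriter w : writers) {
            w.write(docId, document);
        }
        return docId;
    }

    void commit() {
        for (IndexWriter w : writers) {
            w.close();
        }
    }
}
```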
20. Segment commit issue
Update / Delete documents
No in-place updates
An update = append the new version + delete the old one
Deletions against previous segments are not physical
Deleted documents are only marked as deleted
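A minimal sketch of the mark-as-deleted scheme, assuming each sealed segment keeps a bitset of deleted doc ids (the class and method names are illustrative, not Fastcatsearch's actual code):

```java
import java.util.BitSet;

// Each sealed segment keeps a bitset of deleted doc ids: a delete flips
// a bit instead of rewriting the segment file, and searches skip
// documents whose bit is set.
class SegmentDeletes {
    private final BitSet deleted = new BitSet();

    // An "update" is modeled as: append the new version to the newest
    // segment, then mark the old version deleted here.
    void markDeleted(int docId) {
        deleted.set(docId);
    }

    boolean isLive(int docId) {
        return !deleted.get(docId);
    }

    int deletedCount() {
        return deleted.cardinality();
    }
}
```

The marked documents are physically reclaimed later, during a segment merge.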
21. Segment merge issue
Performance
Roughly 2(n + m) in time and space : read n + m docs, write up to n + m docs
Size compaction : deleted docs are physically removed during merge
(Diagram: segment #1, segment #2, and segment #3 merge into a new segment #4)
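The compacting merge can be illustrated with an in-memory model (the types here are hypothetical; real segments live on disk): a single sequential pass over both inputs, O(n + m), that drops documents marked as deleted, which is the size compaction mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative two-segment merge: copy live documents from both inputs
// into one new segment, skipping tombstoned docs.
class SegmentMerger {
    static List<String> merge(List<String> segA, boolean[] deletedA,
                              List<String> segB, boolean[] deletedB) {
        List<String> merged = new ArrayList<>(segA.size() + segB.size());
        for (int i = 0; i < segA.size(); i++) {
            if (!deletedA[i]) merged.add(segA.get(i)); // skip deleted docs
        }
        for (int i = 0; i < segB.size(); i++) {
            if (!deletedB[i]) merged.add(segB.get(i));
        }
        return merged; // the new segment; the old ones can now be removed
    }
}
```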
22. Segment merge issue
Why merge?
Segment count grows quickly
A search must visit every live segment in turn
Deleted documents are only reclaimed by merging
23. Inverted Indexing
(Diagram: a sparse posting index of term1, term3, term5, term7 points into the postings file)
Postings : (term1, posting1) (term2, posting2) (term3, posting3) (term4, posting4) (term5, posting5) (term6, posting6)
Good for sequential writing to disk
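A sketch of why this layout suits sequential writes, using an assumed in-memory builder (not Fastcatsearch's actual code): postings are accumulated per term, then flushed in sorted term order, so the postings file is written with append-only sequential I/O.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Postings are built per term in memory; a TreeMap keeps terms sorted,
// so the final flush is one sequential sweep instead of random writes.
class InvertedIndexBuilder {
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    void addDocument(int docId, String[] terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    // Stand-in for writing the postings file: emits "term:doc,doc,..."
    // lines in term order, i.e. sequential disk writes in a real system.
    List<String> flush() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey()).append(':');
            for (int i = 0; i < e.getValue().size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(e.getValue().get(i));
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```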
24. Inverted Indexing
What about a B-tree?
(Diagram: a B-tree of blocks; upper levels cached in memory, lower levels stored in the file)
A flush writes much of the data as random writes to disk
25. Search in realtime
seg #1 seg #2 seg #3 seg #4
1. Newly created segment
Searchable data
26. Search in realtime
seg #1 seg #2 seg #3 seg #4
2. Merge segments
Searchable data
27. Search in realtime
seg #1 seg #2 seg #3 seg #4 seg #5
3. New merged segment
4. Remove old segments
Searchable data
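The lifecycle in these three slides can be sketched as an atomic swap of an immutable segment list (a hypothetical structure, assuming a single indexing thread updates the set): searches always see a consistent snapshot, and merging never blocks them.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// The searchable data is an immutable list of segment names held behind
// an atomic reference; adding a segment or swapping merged-for-old
// segments replaces the whole list in one step.
class SegmentSet {
    private final AtomicReference<List<String>> live =
            new AtomicReference<>(List.of());

    List<String> searchable() {
        return live.get(); // snapshot used for one query
    }

    // Step 1: a newly created segment becomes searchable immediately.
    void addSegment(String segment) {
        List<String> next = new java.util.ArrayList<>(live.get());
        next.add(segment);
        live.set(List.copyOf(next));
    }

    // Steps 2-4: merged segments are replaced by the new segment and the
    // old ones removed, in a single swap (single indexing thread assumed).
    void replace(List<String> mergedAway, String newSegment) {
        List<String> next = new java.util.ArrayList<>(live.get());
        next.removeAll(mergedAway);
        next.add(newSegment);
        live.set(List.copyOf(next));
    }
}
```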