This document describes building a real-time search engine for log data. Flume collects streaming log data and writes it to HDFS files; Fastcatsearch indexes those files in real time by creating index segments, merging them, and removing outdated segments, so the data becomes searchable as it arrives. The goal is fast indexing and querying of large, continuous log data streams, similar to Splunk.
5. Goal
Like Splunk
Index streaming log data
Search log data in real time
6. Big data
Data sets too large and complex for a traditional database
Difficult to process with traditional data processing tools
3Vs
Volume : Large quantity of data
Variety : Diverse set of data
Velocity : Speed at which data arrives
Source : Wikipedia
7. About Fastcatsearch
Distributed system
Fast indexing
Fast queries
Popular keyword
GS certification
70+ references
Open source
Multi-platform
Easy web management tool
Dictionary management
Plugin extension
9. History
Fastcatsearch v1 (2010-2011)
Single machine
<150 QPS
Fastcatsearch v2 (2013-Now)
Distributed system
Multi collection result aggregation
200+ queries per second
Fastcatsearch v3 (alpha)
Realtime indexing/searching
Schema-free
Shard/replica
Geospatial search
17. Fastcatsearch
HDFS Indexer
Merger
Segment Searcher
Index File
Issue
- Segment file commit
- Doc deletion
18. Import using Flume
import java.net.URI;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Read the SequenceFiles that Flume wrote to HDFS and push parsed
// events into the indexing queue.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(uriPath), conf);
FileStatus[] status = fs.listStatus(new Path(dirPath));
for (int i = 0; i < status.length; i++) {
    // The reader option must point at the current file, so it is
    // created inside the loop.
    SequenceFile.Reader.Option opt = SequenceFile.Reader.file(status[i].getPath());
    SequenceFile.Reader reader = new SequenceFile.Reader(conf, opt);
    Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
    Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
    while (reader.next(key, value)) {
        Map<String, Object> parsedEvent = parseEvent(key.toString(), value.toString());
        if (parsedEvent != null) {
            eventQueue.add(parsedEvent);
        }
    }
    reader.close();
}
19. Making index segment
Index has multiple segments
Document writer
Index writer
Search index writer
Field index writer
Group index writer
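The writer pipeline above can be sketched as follows. The `IndexWriter` interface, `SegmentWriter` class, and method names here are hypothetical stand-ins, not Fastcatsearch's actual API; the point is that one pass over the documents feeds every index writer (search, field, group) for the segment being built.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the writer pipeline: a segment writer assigns
// segment-local doc ids and fans each document out to all index writers.
interface IndexWriter {
    void write(int docId, Map<String, Object> document);
    void close();
}

class SegmentWriter {
    private final List<IndexWriter> writers; // search, field, group index writers
    private int nextDocId = 0;

    SegmentWriter(List<IndexWriter> writers) {
        this.writers = writers;
    }

    // One pass over the data builds all indexes of the segment.
    int addDocument(Map<String, Object> document) {
        int docId = nextDocId++;
        for (IndexWriter w : writers) {
            w.write(docId, document);
        }
        return docId;
    }

    void commit() {
        for (IndexWriter w : writers) {
            w.close();
        }
    }
}
```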
20. Segment commit issue
Update / Delete documents
No in-place updates
An update = append the new version + delete the old one
Deletions against previous segments are not physical
Deleted documents are only marked as deleted
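A minimal sketch of the mark-as-deleted scheme, assuming each sealed segment keeps a bitset of deleted doc ids (the class and method names are illustrative, not Fastcatsearch's actual code):

```java
import java.util.BitSet;

// Each sealed segment keeps a bitset of deleted doc ids: a delete flips
// a bit instead of rewriting the segment file, and searches skip
// documents whose bit is set.
class SegmentDeletes {
    private final BitSet deleted = new BitSet();

    // An "update" is modeled as: append the new version to the newest
    // segment, then mark the old version deleted here.
    void markDeleted(int docId) {
        deleted.set(docId);
    }

    boolean isLive(int docId) {
        return !deleted.get(docId);
    }

    int deletedCount() {
        return deleted.cardinality();
    }
}
```

The marked documents are physically reclaimed later, during a segment merge.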
21. Segment merge issue
Performance
Roughly 2(n + m) in time and space : read n + m docs, write up to n + m docs
Size compaction : deleted docs are physically removed during merge
(Diagram: segment #1, segment #2, and segment #3 merge into a new segment #4)
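The compacting merge can be illustrated with an in-memory model (the types here are hypothetical; real segments live on disk): a single sequential pass over both inputs, O(n + m), that drops documents marked as deleted, which is the size compaction mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative two-segment merge: copy live documents from both inputs
// into one new segment, skipping tombstoned docs.
class SegmentMerger {
    static List<String> merge(List<String> segA, boolean[] deletedA,
                              List<String> segB, boolean[] deletedB) {
        List<String> merged = new ArrayList<>(segA.size() + segB.size());
        for (int i = 0; i < segA.size(); i++) {
            if (!deletedA[i]) merged.add(segA.get(i)); // skip deleted docs
        }
        for (int i = 0; i < segB.size(); i++) {
            if (!deletedB[i]) merged.add(segB.get(i));
        }
        return merged; // the new segment; the old ones can now be removed
    }
}
```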
22. Segment merge issue
Why merge?
Segment count grows quickly
A search must visit every live segment in turn
Deleted documents are only reclaimed by merging
23. Inverted Indexing
(Diagram: a sparse posting index of term1, term3, term5, term7 points into the postings file)
Postings : (term1, posting1) (term2, posting2) (term3, posting3) (term4, posting4) (term5, posting5) (term6, posting6)
Good for sequential writing to disk
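A sketch of why this layout suits sequential writes, using an assumed in-memory builder (not Fastcatsearch's actual code): postings are accumulated per term, then flushed in sorted term order, so the postings file is written with append-only sequential I/O.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Postings are built per term in memory; a TreeMap keeps terms sorted,
// so the final flush is one sequential sweep instead of random writes.
class InvertedIndexBuilder {
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    void addDocument(int docId, String[] terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    // Stand-in for writing the postings file: emits "term:doc,doc,..."
    // lines in term order, i.e. sequential disk writes in a real system.
    List<String> flush() {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey()).append(':');
            for (int i = 0; i < e.getValue().size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(e.getValue().get(i));
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```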
24. Inverted Indexing
What about a B-tree?
(Diagram: a B-tree of blocks; upper levels cached in memory, lower levels stored in the file)
A flush writes much of the data as random writes to disk
25. Search in realtime
seg #1 seg #2 seg #3 seg #4
1. Newly created segment
Searchable data
26. Search in realtime
seg #1 seg #2 seg #3 seg #4
2. Merge segments
Searchable data
27. Search in realtime
seg #1 seg #2 seg #3 seg #4 seg #5
3. New merged segment
4. Remove old segments
Searchable data
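The lifecycle in these three slides can be sketched as an atomic swap of an immutable segment list (a hypothetical structure, assuming a single indexing thread updates the set): searches always see a consistent snapshot, and merging never blocks them.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// The searchable data is an immutable list of segment names held behind
// an atomic reference; adding a segment or swapping merged-for-old
// segments replaces the whole list in one step.
class SegmentSet {
    private final AtomicReference<List<String>> live =
            new AtomicReference<>(List.of());

    List<String> searchable() {
        return live.get(); // snapshot used for one query
    }

    // Step 1: a newly created segment becomes searchable immediately.
    void addSegment(String segment) {
        List<String> next = new java.util.ArrayList<>(live.get());
        next.add(segment);
        live.set(List.copyOf(next));
    }

    // Steps 2-4: merged segments are replaced by the new segment and the
    // old ones removed, in a single swap (single indexing thread assumed).
    void replace(List<String> mergedAway, String newSegment) {
        List<String> next = new java.util.ArrayList<>(live.get());
        next.removeAll(mergedAway);
        next.add(newSegment);
        live.set(List.copyOf(next));
    }
}
```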