際際滷

際際滷Share a Scribd company logo
Flag Aligned Word (FAW)-based Integers Compression
Presented by
Md Mostofa Kamal Rasel
Outline
 Introduction
 Related Works
 Motivation
 Problem Statement
 Proposed Technique
 Mostofa Kamal Rasel, and Young-Koo Lee, "An efficient subgraphs compression-based technique for reducing the I/O cost of join-based graph mining
algorithms, 7th International Conference on Emerging Databases Technologies, Applications, and Theory (EDB), Busan, South Korea, 2017.
 Mostofa Kamal Rasel and Young-Koo Lee, On-the-fly Output Compression for Join-based Graph Mining Algorithms , IEEE Access, IEEE (SCIE, IF: 4.098).
Introduction
 Sources of big graphs are very common
now-a-days
 Social networks
 Web graphs
 Biological/ecological networks
 Big graphs are represented using bitmap,
adjacency list or edge list
 Requires enormous space to store
 RDBMS also uses bitmap or inverted list to
represent indices, which are large
 Need to Compress these large data for 
 Reducing memory footprint
 Efficient computation
Related Works
 Word Aligned Hybrid (WAH) Compression [1]
 Represents a list of integers in bitmaps
 Divides bitmap into 31-bits groups
 Encode the groups into 0-Fill, 1-Fill, and literal words
 Limitation
 Sometimes size of encoded data exceeds originals !!
 Run-length based compression, so random access not possible
 Roaring Bitmap Compression [2]
 Divides a list of integers into a range of 216
 Each group is identified by most significant 16-bits
 Remove the most significant 16-bits from the member of each group
 Two type of groups
 Array container: cardinality <= 2048
 Represents using 16-bit integers
 Bitmap container: cardinality > 2048
 Represents using 216
bits long bitmap
 Limitation
 Data type are not same before and after compression
1. Kesheng Wu, E. J. Otoo and A. Shoshani, "Compressing bitmap indexes for faster search operations," Proceedings 14th International Conference on Scientific and Statistical Database
Management, Edinburgh, Scotland, UK, 2002.
2. S. Chambi, D. Lemire, O. Kaser, and R. Godin, Better bitmap performance with roaring bitmaps, Software: Practice and Experience, vol. 46, pp. 709719, 2016.
Motivation
Reuse the memory of input data to store encoded data
Allocated memory
to holds array of
integers
Allocated
memory to
holds encoded
data
Traditional Encoding
Algorithm
 New memory allocation for holding
encoded data comes with cost
Allocated memory
to holds array of
integers
Encoding
Algorithm
 Put an encoded word in-place of an input word,
which is already considered for encoding
Access related encoded data only
 = [37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228]  O( )
37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228
Encoded Word, 0 Encoded Word, 1 Encoded Word, 2 Encoded Word, 3
Binary search to search an item
 O( 4)
Problem Statement
 Encode a list of integers, so that 
 Final output requires less space into memory
 Never exceeds the size of original data
 Does not require additional memory space to store the output
 Allows random access on encoded data
 Allows partial decoding
Proposed Technique
 S-word: represents a single integer
 M-word: represents multiple integers, where each high bit represents a single integer
 R-word: represents the run of consecutive integers,
 where   ≠ > word  length
 FM-word: represents a single integer and indicates that next word is a M-word.
 First bit high is set to 1
 FR-word: represents a single integer and indicates that next word is a R-word.
 First two bits are set to 1
37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228
Encoded Word, 0 Encoded Word, 1 Encoded Word, 2 Encoded Word, 3
 Encode multiple integers into a single word
 Introduce a flag word in front of an encoded word
 Use an integer (from data) as a flag word to avoid overflow
FAW-based Encoding
Flag, 
 Word length,  = 
  = [37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228, 250, 251,  , 350]
1 1 1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
45    1 = 7
46    1 = 8
65    1 = 27
< 
Encoded Word
Integers Encoded Words Word Type
Reused
Location
37 1000000000000000000000000100101 FM-word 0
45, 46, 65 0000000110000000000000000010000 M-word 1
FAW-based Encoding
Flag, 
 Word length,  = 
  = [37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228, 250, 251,  , 350]
1
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Encoded Word
Integers Encoded Words Word Type
Reused
Location
37 1000000000000000000000000100101 FM-word 0
45, 46, 65 0000000110000000000000000010000 M-word 1
112 0000000000000000000000001110000 S-word 2
117 0000000000000000000000001110101 S-word 3
117    1 = 4 <
FAW-based Encoding
Flag, 
 Word length,  = 
  = [37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228, 250, 251,  , 350]
Integers Encoded Words Word Type
Reused
Location
37 1000000000000000000000000100101 FM-word 0
45, 46, 65 0000000110000000000000000010000 M-word 1
112 0000000000000000000000001110000 S-word 2
117 0000000000000000000000001110101 S-word 3
201 1000000000000000000000011001001 FM-word 4
207, 211, 225, 228 0000100010000000000001001000000 M-word 5
250 1100000000000000000000011111010 FR-word 6
251, , 350 0000000000000000000000001100100 R-word 7
251    1 = 0
252    1 = 1
. . .
350    1 = 99
> 
Run of consecutive integers is greater than word-length
Thank You

More Related Content

Flag aligned word (faw) based compression

  • 1. Flag Aligned Word (FAW)-based Integers Compression Presented by Md Mostofa Kamal Rasel
  • 2. Outline Introduction Related Works Motivation Problem Statement Proposed Technique Mostofa Kamal Rasel, and Young-Koo Lee, "An efficient subgraphs compression-based technique for reducing the I/O cost of join-based graph mining algorithms, 7th International Conference on Emerging Databases Technologies, Applications, and Theory (EDB), Busan, South Korea, 2017. Mostofa Kamal Rasel and Young-Koo Lee, On-the-fly Output Compression for Join-based Graph Mining Algorithms , IEEE Access, IEEE (SCIE, IF: 4.098).
  • 3. Introduction Sources of big graphs are very common now-a-days Social networks Web graphs Biological/ecological networks Big graphs are represented using bitmap, adjacency list or edge list Requires enormous space to store RDBMS also uses bitmap or inverted list to represent indices, which are large Need to Compress these large data for Reducing memory footprint Efficient computation
  • 4. Related Works Word Aligned Hybrid (WAH) Compression [1] Represents a list of integers in bitmaps Divides bitmap into 31-bits groups Encode the groups into 0-Fill, 1-Fill, and literal words Limitation Sometimes size of encoded data exceeds originals !! Run-length based compression, so random access not possible Roaring Bitmap Compression [2] Divides a list of integers into a range of 216 Each group is identified by most significant 16-bits Remove the most significant 16-bits from the member of each group Two type of groups Array container: cardinality <= 2048 Represents using 16-bit integers Bitmap container: cardinality > 2048 Represents using 216 bits long bitmap Limitation Data type are not same before and after compression 1. Kesheng Wu, E. J. Otoo and A. Shoshani, "Compressing bitmap indexes for faster search operations," Proceedings 14th International Conference on Scientific and Statistical Database Management, Edinburgh, Scotland, UK, 2002. 2. S. Chambi, D. Lemire, O. Kaser, and R. Godin, Better bitmap performance with roaring bitmaps, Software: Practice and Experience, vol. 46, pp. 709719, 2016.
  • 5. Motivation Reuse the memory of input data to store encoded data Allocated memory to holds array of integers Allocated memory to holds encoded data Traditional Encoding Algorithm New memory allocation for holding encoded data comes with cost Allocated memory to holds array of integers Encoding Algorithm Put an encoded word in-place of an input word, which is already considered for encoding Access related encoded data only = [37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228] O( ) 37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228 Encoded Word, 0 Encoded Word, 1 Encoded Word, 2 Encoded Word, 3 Binary search to search an item O( 4)
  • 6. Problem Statement Encode a list of integers, so that Final output requires less space into memory Never exceeds the size of original data Does not require additional memory space to store the output Allows random access on encoded data Allows partial decoding
  • 7. Proposed Technique S-word: represents a single integer M-word: represents multiple integers, where each high bit represents a single integer R-word: represents the run of consecutive integers, where ≠ > word length FM-word: represents a single integer and indicates that next word is a M-word. First bit high is set to 1 FR-word: represents a single integer and indicates that next word is a R-word. First two bits are set to 1 37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228 Encoded Word, 0 Encoded Word, 1 Encoded Word, 2 Encoded Word, 3 Encode multiple integers into a single word Introduce a flag word in front of an encoded word Use an integer (from data) as a flag word to avoid overflow
  • 8. FAW-based Encoding Flag, Word length, = = [37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228, 250, 251, , 350] 1 1 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 45 1 = 7 46 1 = 8 65 1 = 27 < Encoded Word Integers Encoded Words Word Type Reused Location 37 1000000000000000000000000100101 FM-word 0 45, 46, 65 0000000110000000000000000010000 M-word 1
  • 9. FAW-based Encoding Flag, Word length, = = [37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228, 250, 251, , 350] 1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 Encoded Word Integers Encoded Words Word Type Reused Location 37 1000000000000000000000000100101 FM-word 0 45, 46, 65 0000000110000000000000000010000 M-word 1 112 0000000000000000000000001110000 S-word 2 117 0000000000000000000000001110101 S-word 3 117 1 = 4 <
  • 10. FAW-based Encoding Flag, Word length, = = [37, 45, 46, 65, 112, 117, 201, 207, 211, 225, 228, 250, 251, , 350] Integers Encoded Words Word Type Reused Location 37 1000000000000000000000000100101 FM-word 0 45, 46, 65 0000000110000000000000000010000 M-word 1 112 0000000000000000000000001110000 S-word 2 117 0000000000000000000000001110101 S-word 3 201 1000000000000000000000011001001 FM-word 4 207, 211, 225, 228 0000100010000000000001001000000 M-word 5 250 1100000000000000000000011111010 FR-word 6 251, , 350 0000000000000000000000001100100 R-word 7 251 1 = 0 252 1 = 1 . . . 350 1 = 99 > Run of consecutive integers is greater than word-length

Editor's Notes

  1. How presentation will benefit audience: Adult learners are more interested in a subject if they know how or why it is important to them. Presenters level of expertise in the subject: Briefly state your credentials in this area, or explain why participants should listen to you.