The document proposes leveraging the efficiencies of bitmap operations and vertical data formats to perform association rule mining in a distributed environment. It describes how a horizontal transactional dataset can be converted to a vertical bitmap structure by mapping each item to its transactions in a single pass. This vertical bitmap structure then facilitates efficient mining of frequent itemsets. The algorithm is amenable to distributed processing by mapping each item index to separate processes for parallel mining, followed by result aggregation. Further optimizations like using C instead of Java and bitmap compression techniques are suggested to improve performance.
1 of 29
Downloaded 18 times
More Related Content
Horizontal format data mining with extended bitmaps
2. Question?
Is it possible to leverage benefits of
vertical data formats in combination
with efficiencies of bitmap operations
to mine association rules in a
distributed environment.
3. Association Rule Mining??
Finding Interesting Relationships
between the variables.
Finding the subset that is common to a
chosen minimum number of the
itemsets from the set of itemsets.
Pattern Recognition.
Explained By Market Basket Analysis.
4. Sample (Toy ) Data
Set
TID Item IDs
T100 I1, I2, I5
T200 I2, I4
T300 I1, I2
T400 I2, I5
5. Apriori
Fundamental Algorithm for Association
Rule Mining.
Mines frequent patterns from a horizontal
data format which represents the items
categorized into particular transactions.
i-th stage identifies all frequent i-element
sets.
Two steps:
> Candidate generation.
> Candidate counting.
6. Vertical Form
Transactions categorized into particular items.
Vertical format data mining only has to parse
the dataset once to get the itemsets.
For the itemset generation from the 2nd
itemset it only needs to refer the previous
itemset.
Eliminates parsing through the dataset each
time to count the frequency of itemsets, for
each round.
More efficient than its horizontal form.
7. BitMaps
Compactly store individual bits.
Exploit bit-level parallelism effectively.
0s and 1s.
1 indicates existence.
8. Combined?
Algorithm takes a horizontal data set.
With a one pass of the data set
construct a bit map based data
structure.
This structure is in vertical format.
The structure facilitates efficient mining
of association rules.
9. Sample (Toy ) Data
Set
TID Item IDs
T100 I1, I2, I5
T200 I2, I4
T300 I1, I2
T400 I2, I5
10. Sample (Toy ) Data
Set
TID Item IDs
T100 I1, I2, I5
Horizontal
Format T200 I2, I4
T300 I1, I2
T400 I2, I5
26. Insights
The algorithm performs better than
Apriori in most scenarios.
Data structure generation dominates
the total time in most cases.
As an aside
Can this be made to a distributed
mining algorithm?
27. Turns out this can be done rather easily.
Algorithm lends to map reduce like
distributed processing..
Each master array index is self
contained.. I1 I2 I5
2
1 1
1 0
So can be mined in parallel.
Data structure generation Map phase
Result accumulation -> Reduce phase
28. What Does Future Hold?
Make this distributed.
Java not the best of options. Use C so
we can control memory allocations the
way we want.
Experiment with bitmap compression
techniques.