Introduction to Accumulo
Mario Pastorelli
mario.pastorelli@teralytics.ch
March 7, 2016

History
To accommodate its need to analyze large amounts of data on commodity hardware, Google developed three main distributed systems:
- GFS: a distributed filesystem
- MapReduce: distributed data processing
- BigTable: a distributed storage system for structured data
Accumulo is an open-source implementation of the BigTable design.

Distributed Structured Data
Structured data should be:
- distributed, for parallel processing
- indexed, for fast retrieval (structured means that it has some kind of primary key)
- tabular, for easy processing of complex data; each row can potentially have many columns
Databases offer indexes and tables but don't scale without significant effort.
Key-value stores can easily be distributed, but they have limited index support over keys and don't support a tabular format out of the box.

Accumulo
Accumulo is a key-value store with support for tabular data:
- keys are column identifiers, i.e. each key uniquely identifies one column of one row
- a row is composed of multiple key-value pairs grouped by the prefix of the key, the row id

Example

EMAIL                     NAME    LASTNAME  COMPANY
olismith85@gmail.com      Olivia  Smith     Winsystems
emily.brown@facebook.com  Emily   Brown     Jones Inc.

KEY (composed of row id and column id)  VALUE
olismith85@gmail.com      NAME          Olivia
olismith85@gmail.com      LASTNAME      Smith
olismith85@gmail.com      COMPANY       Winsystems
emily.brown@facebook.com  NAME          Emily
emily.brown@facebook.com  LASTNAME      Brown
emily.brown@facebook.com  COMPANY       Jones Inc.
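
The mapping between the two tables above can be sketched in a few lines of plain Python. This is only a model of the idea, not the Accumulo client API; the function names `row_to_entries` and `entries_to_rows` are made up for this illustration.

```python
# Illustrative model only: plain Python, not the Accumulo client API.

def row_to_entries(row_id, columns):
    """Flatten one tabular row into ((row id, column id), value) pairs."""
    return [((row_id, column_id), value) for column_id, value in columns.items()]


def entries_to_rows(entries):
    """Regroup key-value pairs into rows, using the row-id prefix of each key."""
    rows = {}
    for (row_id, column_id), value in sorted(entries):
        rows.setdefault(row_id, {})[column_id] = value
    return rows


# The two rows from the example table.
table = {
    "olismith85@gmail.com": {
        "NAME": "Olivia", "LASTNAME": "Smith", "COMPANY": "Winsystems",
    },
    "emily.brown@facebook.com": {
        "NAME": "Emily", "LASTNAME": "Brown", "COMPANY": "Jones Inc.",
    },
}

# Six key-value pairs, one per cell of the table.
entries = [
    entry
    for row_id, columns in table.items()
    for entry in row_to_entries(row_id, columns)
]
```

Sorting `entries` places all the cells of one row next to each other, which is why grouping by the row-id prefix is enough to rebuild the table.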
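
Composite Keys
Keys in Accumulo are composite and have the following components:
- row id: the row the key belongs to
- column family: the column group the key belongs to
- column qualifier: the column id
- column visibility: who can access this column
- timestamp: the version of the key
A single key-value is stored as:

+---------------------------------------------------------+-------+
|                           KEY                           | VALUE |
+--------+------------------------------------+-----------+       |
| row id |               column               | timestamp |       |
|        +--------+-----------+---------------+           |       |
|        | family | qualifier | visibility    |           |       |
+--------+--------+-----------+---------------+-----------+-------+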
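
As a rough illustration of how the components of a composite key interact, the sketch below models a key as a Python tuple and sorts entries the way Accumulo orders them: ascending by row id, family, qualifier and visibility, and descending by timestamp so that the newest version of a cell comes first. The family name `info`, the visibility label `public` and the value `NewCo` are invented for the example, and Python string comparison stands in for Accumulo's byte-wise comparison (the two agree for ASCII data).

```python
from typing import NamedTuple


class Key(NamedTuple):
    """Plain-Python stand-in for a composite key (not the Accumulo Key class)."""
    row: str
    family: str
    qualifier: str
    visibility: str
    timestamp: int


def sort_key(key):
    # Ascending on every component except the timestamp, which is
    # negated so that newer versions sort before older ones.
    return (key.row, key.family, key.qualifier, key.visibility, -key.timestamp)


# Hypothetical entries: the COMPANY cell has two versions.
entries = [
    (Key("olismith85@gmail.com", "info", "COMPANY", "public", 1), "Winsystems"),
    (Key("olismith85@gmail.com", "info", "NAME", "public", 1), "Olivia"),
    (Key("olismith85@gmail.com", "info", "COMPANY", "public", 2), "NewCo"),
]

ordered = sorted(entries, key=lambda entry: sort_key(entry[0]))
values_in_order = [value for _, value in ordered]
```

With this ordering, a reader that wants only the latest version of each cell can keep the first entry it meets for every (row, family, qualifier, visibility) combination.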

Accumulo features
- range queries: keys are stored in lexicographical order, which makes it possible to query semantically close data together
  - e.g. temporal data can be stored such that aggregating over neighboring days is local and fast
- fast: with a proper key schema a query can take milliseconds
- scalable: designed to store huge amounts of data over multiple tables
- built-in cache for recently queried data
- many others, such as bulk imports, iterators, fault tolerance, large rows, multiple-batch queries, testing utilities (mocks, miniclusters) . . .
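
The range-query idea can be sketched with a sorted list and binary search. This is a toy model in plain Python, not Accumulo code: `range_scan` and the daily counts are invented for the illustration, and the keys are ISO dates so that lexicographic order matches chronological order.

```python
import bisect


def range_scan(sorted_entries, start, end):
    """Return the entries whose key lies in the half-open range [start, end)."""
    # Building the key list is O(n); a real store keeps keys indexed already.
    keys = [key for key, _ in sorted_entries]
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return sorted_entries[lo:hi]


# Hypothetical per-day counts, kept sorted by key as in a sorted key-value store.
daily_counts = sorted([
    ("2016-03-12", 9),
    ("2016-02-28", 7),
    ("2016-03-06", 5),
    ("2016-03-05", 3),
    ("2016-03-07", 4),
])

# Neighboring days are adjacent in key order, so aggregating them
# touches one contiguous slice.
selected = range_scan(daily_counts, "2016-03-05", "2016-03-08")
total = sum(count for _, count in selected)
```

Because the three selected days sit next to each other in key order, the aggregation reads one contiguous slice instead of looking up each day separately.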

Example
We want to store and analyze tweets from all around the world.

Example: Tweets analysis
A tweet has the following (simplified) fields:
- coordinate: geospatial information composed of longitude and latitude
- created_at: UTC time of the tweet
- id: unique identifier of the tweet
- user information, such as
  - user.id: unique identifier of the user
  - user.screen_name: user name
  - . . .
- entities, such as hashtags, urls . . .
- text: tweet content
- . . .
How do we store this data in Accumulo?

Example: Tweets analysis
There is no single way to do it; it depends on the query.
Two good practices:
- work with denormalized data
- specialize tables for each kind of query
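
One way to read the two practices above is that a single tweet gets written several times, once per table, with a row id chosen for the query that table serves. The sketch below is a plain-Python illustration under that assumption; the table names `tweets_by_user` and `tweets_by_hashtag`, the `|` separator and the sample tweet are all hypothetical and do not come from the slides.

```python
def specialized_writes(tweet):
    """Return {table name: [(row id, value), ...]} for one tweet.

    The tweet text is copied into every table (denormalized), so each
    query can be answered from a single table without joins.
    """
    stamp = tweet["created_at"]
    return {
        # Serves "all tweets of a user": rows are grouped by user id.
        "tweets_by_user": [
            (f'{tweet["user_id"]}|{stamp}|{tweet["id"]}', tweet["text"]),
        ],
        # Serves "all tweets with a hashtag": one row per hashtag.
        "tweets_by_hashtag": [
            (f'{tag}|{stamp}|{tweet["id"]}', tweet["text"])
            for tag in tweet["hashtags"]
        ],
    }


# Hypothetical tweet using the simplified fields.
tweet = {
    "id": 1001,
    "user_id": 42,
    "created_at": "2016-03-07T10:00:00Z",
    "hashtags": ["accumulo", "bigdata"],
    "text": "Hello Accumulo",
}

writes = specialized_writes(tweet)
```

The price of this layout is extra storage and extra writes per tweet, which is the usual trade-off when data is denormalized.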

Example: Twitter User Timeline
Schema:

ROW ID                     FAMILY      QUALIFIER  VISIBILITY  TIMESTAMP  VALUE
user.id + created_at + id  coordinate                                    lon/lat
                           entities    hashtags                          hashtags
                           entities    urls                              urls
                           text                                          text

Easy to process the entire timeline, or a time interval, for the same user.
Not good for other kinds of analysis:
- find all the tweets with a given hashtag
- find all the tweets in New York
- . . .
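
To make the timeline schema concrete, here is a minimal sketch of how such a row id could be encoded and how a time interval for one user then becomes a single range scan. It is plain Python rather than Accumulo code, and the details are assumptions made for the illustration: the zero-byte separator, the 20-digit zero padding, ISO 8601 UTC timestamps, and the names `timeline_row_id` and `user_interval`.

```python
import bisect

SEP = "\x00"  # assumed separator; sorts before every printable character


def timeline_row_id(user_id, created_at, tweet_id):
    """Build 'user.id + created_at + id' as one lexicographically sortable string."""
    # Zero-pad the numbers so that string order matches numeric order
    # (unpadded, "7" would sort after "42").
    return f"{user_id:020d}{SEP}{created_at}{SEP}{tweet_id:020d}"


def user_interval(sorted_row_ids, user_id, start, end):
    """Return one user's row ids with start <= created_at < end."""
    lo_key = f"{user_id:020d}{SEP}{start}"
    hi_key = f"{user_id:020d}{SEP}{end}"
    lo = bisect.bisect_left(sorted_row_ids, lo_key)
    hi = bisect.bisect_left(sorted_row_ids, hi_key)
    return sorted_row_ids[lo:hi]


# Hypothetical tweets: (user id, created_at, tweet id).
tweets = [
    (42, "2016-02-20T08:00:00Z", 1),
    (42, "2016-03-07T10:00:00Z", 2),
    (42, "2016-03-09T12:30:00Z", 3),
    (42, "2016-04-02T00:00:00Z", 4),
    (7, "2016-03-08T09:00:00Z", 5),
]
row_ids = sorted(timeline_row_id(*tweet) for tweet in tweets)

# All tweets of user 42 in March 2016: one contiguous slice of the table.
march = user_interval(row_ids, 42, "2016-03-01", "2016-04-01")
```

The same encoding shows why the hashtag and location queries are a poor fit: tweets sharing a hashtag are scattered across all users, so answering those queries from this table would mean scanning every row.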

Summary
- Accumulo is great for storing large amounts of structured data
- Accumulo is good for interactive queries as well as batch-oriented queries
- Accumulo is a low-level system
  - NoSQL (that's not good!), which means there is no high-level language to query the data
  - a lot of flexibility, which can easily backfire

Thank you
Questions?