Lambda Architecture
Use Case: Mayo Clinic
FEBRUARY 2015
Altan Khendup, Leader, UDA Architecture COE
Background of Lambda Architecture
Background
• Reference architecture for Big Data systems
• Designed by Nathan Marz (Twitter)
• Defined as a system that runs arbitrary functions on arbitrary data
• query = function(all data)
Design Principles
• Human fault-tolerant, Immutability, Computable
Lambda Layers
• Batch - Contains the immutable, constantly growing master dataset.
• Speed - Deals only with new data and compensates for the high-latency updates of the serving layer.
• Serving - Loads and exposes the combined view of the data so that it can be queried.
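The three layers and the idea that query = function(all data) can be illustrated with a minimal Python sketch. The event format, the counting function, and all names here are hypothetical and are not taken from the Mayo Clinic system; the sketch only shows how a batch view and a real-time view might be merged at query time.

```python
from collections import Counter

# Hypothetical event format: (patient_id, document_type). Not from the slides.
# The master dataset is append-only; existing records are never modified.
master_dataset = [
    ("p1", "note"),
    ("p2", "lab"),
    ("p1", "lab"),
]


def count_by_patient(events):
    """The function applied to the data: number of documents per patient."""
    return Counter(patient for patient, _ in events)


# Batch layer: recompute the view from the entire master dataset.
batch_view = count_by_patient(master_dataset)

# Speed layer: covers only events that arrived after the last batch run.
recent_events = [("p1", "note"), ("p3", "note")]
realtime_view = count_by_patient(recent_events)


def query(patient_id):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch_view[patient_id] + realtime_view[patient_id]


print(query("p1"))
print(query("p3"))
```

Because the master dataset is never modified, the batch view can be thrown away and recomputed with a different function at any time, which is one way to read the "human fault-tolerant" principle.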
Overview of Lambda Architecture
© 2014 Teradata
USE CASE: MAYO CLINIC
Mayo Clinic History
• Every year, more than a million people from all 50 states and nearly 150 countries come for care
• Dozens of locations in several states with major campuses in Rochester, Minn.; Scottsdale and Phoenix, Ariz.; and Jacksonville, Fla.
• Mayo Clinic Rochester, Minn. recognized as the top hospital in the nation for 2014-2015 by U.S. News & World Report
Why Big Data?
Challenges in Medical Data
• Health data tends to be wide, not deep
• New data types are becoming more important
  - Unstructured
  - Real-time streaming
• Generally a challenge to move from retrospective BI viewing to event-based and predictive analytics
• Multiple layers
  - Lots of events, data
• Complex
  - Lots of different languages and data structures
• Difficult to maintain
  - Lots of moving pieces/components/technologies
  - Lots of changes in the business
Data Discovery
• Many Big Data stories start with data discovery
  - The Data Lake, etc.
• But data discovery is not predictable!
• Mayo Clinic needed to define a real operational need that a Big Data technology stack could fulfill
Project
• Optimize an existing Natural Language Processing pipeline in support of critical Colorectal Surgery
  - (Move to tens of thousands of documents processed)
• Replace an existing free-text search facility used by Clinical Web Service for colorectal cancer
  - (Move search to milliseconds)
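As a rough illustration of why an indexed search engine can answer free-text queries so much faster than scanning documents, here is a minimal inverted-index sketch in Python. It is not the Elasticsearch API and not the project's code; the sample documents, the tokenizer, and the function names are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query_text):
    """Return ids of documents containing every term in the query."""
    terms = tokenize(query_text)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical clinical snippets, invented for this example.
documents = {
    "d1": "Colorectal surgery follow-up note",
    "d2": "Routine lab results",
    "d3": "Surgery consult for colorectal cancer",
}
index = build_index(documents)
print(sorted(search(index, "colorectal surgery")))
```

The index is built once as documents arrive, so each query costs a few set lookups and intersections rather than a pass over every document.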
Overall Architecture
Operational Statistics
• Current Storm throughput up to 1.5 million documents per hour
• Average of 140,000 HL7 messages actually processed per day with average latency of 60 milliseconds from ingest to persistence
• Average of 50,000 documents passed through annotators per day versus 5,000 historically
• Actual annotations of documents up to 6 times faster than previously accomplished
• Free-text search use cases that took over 30 minutes on old infrastructure completing in milliseconds in ElasticSearch
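To compare the figures on the Operational Statistics slide, it can help to put them on a common scale. The short Python calculation below uses only the numbers quoted on the slide; the variable names are illustrative, and the utilization figure assumes the peak hourly rate could be sustained for a full day, which the slide does not claim.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

# Figures quoted on the slide.
peak_docs_per_hour = 1_500_000
hl7_messages_per_day = 140_000
annotated_per_day_now = 50_000
annotated_per_day_before = 5_000

# Peak Storm throughput expressed per second (about 417).
peak_docs_per_second = peak_docs_per_hour / SECONDS_PER_HOUR

# Average HL7 load expressed per second (about 1.6).
avg_hl7_per_second = hl7_messages_per_day / SECONDS_PER_DAY

# Growth in daily annotation volume (10x).
annotation_volume_ratio = annotated_per_day_now / annotated_per_day_before

# Share of the (assumed sustained) peak rate used by the average HL7 load.
utilization = avg_hl7_per_second / peak_docs_per_second

print(f"{peak_docs_per_second:.0f} docs/s peak")
print(f"{avg_hl7_per_second:.2f} HL7 messages/s on average")
print(f"{annotation_volume_ratio:.0f}x daily annotation volume")
print(f"{utilization:.2%} of peak rate")
```

Under that assumption the average HL7 load is well under one percent of peak throughput, which suggests the daily volume reflects demand rather than processing capacity.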
Summary
Benefits
• An architecturally-driven, internally-owned technology stack that blends:
  - An event-based/real-time processing fabric
  - A multi-destination distillation hub
  - A foundation for Classic BI delivery techniques
  - A foundation for Services-based delivery techniques
  - A serendipitous discovery environment
• Mutually supportive components that combine in delivering novel clinical solutions
• Data continuity
  - Historical data can be assessed as algorithms change over time
Thank you! We're Hiring!
thinkbigcareers.teradata.com
Altan Khendup (@madmongol)
Altan.khendup@teradata.com
Ron Bodkin (@ronbodkin)
Ron.bodkin@thinkbiganalytics.com

Editor's Notes

  • #3: Lambda = architectural pattern to talk about the complexity of dealing with real-time and historical datasets
    - Overall use: Prescriptive/Predictive uses rely on some dimension of real-time
    - Use cases
      - CPG: consumer goods looking at what customers are doing in real-time and making adjustments
      - Medical: real-time medical sensors and treatment and labs for critical patient care
      - Financial: credit risk and transaction fraud
      - Manufacturers: IoT/Telematics getting information from their plants and logistics, cross-referencing to inventory, and making adjustments to supply chain
  • #4: General architecture that covers how Lambda works overall; able to address real-time and historical data
    - Layers
      - Speed: real-time/current data streams; Spark, Storm, etc.
      - Batch: historical data layer
      - Serving: ability to take the current and historical data, merge the results, and provide them to the organization
    - Real-world experience/strategy
      - Do not tackle all of the data but rather necessary segments of business functionality called queries
      - Data can be tackled per query, hence the idea of query-focused datasets (QFDs)
      - Allows for more focused results/faster speed gains
  • #8: Data Discovery & Analytics
  • #11: HL7
    - Actual processing based on pull requests from users, not actual processing power
    - HL7 are large XML-based documents
    - Much larger than, say, JSON or others (roughly 800k-900k in size)
    - Contains significant data related to medical information
    - End goal: an architecturally-driven, internally-owned technology stack that blends:
      - An event-based processing fabric
      - A real-time processing framework
      - A multi-destination distillation hub
      - Classic BI delivery techniques
      - Services-based delivery techniques
      - A serendipitous discovery environment
    - Mutually supportive components that combine in delivering novel clinical solutions.
  • #12: Challenges; cross-industry similarities
  • #13: We're hiring! Ron Bodkin + my contact