Asanka Padmakumara
Business Intelligence Consultant
 Blog: asankap.wordpress.com
 LinkedIn: linkedin.com/in/asankapadmakumara
 Twitter: @asanka_e
 Facebook: facebook.com/asankapk
Move Your On-Prem Data to a Lake in the Clouds
Agenda
 Where are we right now?
 Why do we need a Data Lake?
 What is Azure Data Lake?
 How do we get there?
 Demo
 Q & A
Where are we right now?
What are the challenges?
 Limited storage
 Limited processing power
 High hardware cost
 High maintenance cost
 No disaster recovery
 Availability and reliability issues
 Scalability issues
 Security
 Solution: Azure Data Lake
What is Azure Data Lake?
 Highly scalable data storage and analytics service
 Intended for big data storage and analysis
 A faster and more efficient solution than on-prem data centers
 Three services:
 Azure Data Lake Analytics
 Azure Data Lake Storage
 Azure HDInsight (managed clusters)
Azure Data Lake Architecture
Azure Data Lake Store
 Built for Hadoop
 Compatible with most components in the Hadoop ecosystem
 WebHDFS API (see the Python sketch after this list)
 Unlimited storage, petabyte-scale files
 Performance-tuned for big data analytics
 High throughput and IOPS
 Parts of a file are stored on multiple servers, enabling parallel reads
 Enterprise-ready: highly available and secure
 All data in one place
 Any data in its native format
 No schema, no prior processing
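Because the store exposes a WebHDFS-compatible REST endpoint, any HTTP client can browse it. A minimal Python sketch, assuming an ADLS Gen1 account named "mylake" and an Azure AD bearer token already in hand (both are placeholders):

import requests

# Placeholder account name and AAD bearer token.
STORE = 'mylake'
TOKEN = '<aad-bearer-token>'

# LISTSTATUS is the standard WebHDFS operation for listing a directory.
url = 'https://{0}.azuredatalakestore.net/webhdfs/v1/?op=LISTSTATUS'.format(STORE)
response = requests.get(url, headers={'Authorization': 'Bearer ' + TOKEN})
print(response.json())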
Optimized for Big Data Analytics
 Multiple copies of the same file to improve read performance
 Locally redundant (multiple copies of data in one Azure region)
 Parallel reading and writing
 Configurable throughput
 No limits on file size or total storage
Secure Data in Azure Data Lake Store
 Authentication (see the Python sketch after this list)
 Azure Active Directory, with all AAD features
 End-user authentication or service-to-service authentication
 Access control
 POSIX-style permissions: Read, Write, Execute
 ACLs can be enabled on the root folder, on subfolders, and on individual files
 Encryption
 Encryption at rest
 Encryption in transit (HTTPS)
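To make the authentication and access-control story concrete, here is a minimal sketch using the azure-datalake-store Python package; the tenant, application, account, and folder names are illustrative placeholders:

from azure.datalake.store import core, lib

# Service-to-service authentication against Azure Active Directory
# with an AAD application (all identifiers are placeholders).
token = lib.auth(tenant_id='<tenant-id>',
                 client_id='<application-id>',
                 client_secret='<application-secret>')

adls = core.AzureDLFileSystem(token, store_name='mylake')

# Apply POSIX-style permission bits (read/write/execute) to a folder.
adls.chmod('/clickstream', '750')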
How to ingest data to Azure Data Lake Store
 Small data sets
 Azure Portal
 Azure PowerShell
 Azure Cross-Platform CLI 2.0
 Data Lake Tools for Visual Studio
 Streamed data
 Azure Stream Analytics
 Azure HDInsight Storm
 Data Lake Store .NET SDK
 Relational data
 Apache Sqoop
 Azure Data Factory
 Large data sets
 Azure PowerShell
 Azure Cross-Platform CLI 2.0
 Azure Data Lake Store .NET SDK (see the upload sketch after this list)
 Azure Data Factory
 Really large data sets
 Azure ExpressRoute
 Azure Import/Export service
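As an example of programmatic ingestion, a bulk upload with the azure-datalake-store Python SDK could look like this sketch (account, credentials, and paths are placeholders):

from azure.datalake.store import core, lib, multithread

# Authenticate and connect to the store (placeholder identifiers).
token = lib.auth(tenant_id='<tenant-id>',
                 client_id='<application-id>',
                 client_secret='<application-secret>')
adls = core.AzureDLFileSystem(token, store_name='mylake')

# Multi-threaded upload: the local file is split into chunks and
# written to the store in parallel.
multithread.ADLUploader(adls, lpath='C:\\data\\sales.csv',
                        rpath='/raw/sales.csv', nthreads=16, overwrite=True)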
How is it different from Azure Blob Storage?
 Purpose
 Data Lake Store: optimized storage for big data analytics workloads
 Blob Storage: general-purpose storage
 Use case
 Data Lake Store: batch, interactive, and streaming analytics and machine learning data, such as log files, IoT data, click streams, and large datasets
 Blob Storage: any type of text or binary data, such as application back ends, backup data, media storage for streaming, and general-purpose data
 Key concepts
 Data Lake Store: contains folders, which in turn contain data stored as files
 Blob Storage: contains containers, which in turn hold data in the form of blobs
 Size limits
 Data Lake Store: no limits on account size, file size, or number of files
 Blob Storage: 500 TiB
 Geo-redundancy
 Data Lake Store: locally redundant (multiple copies of data in one Azure region)
 Blob Storage: locally redundant (LRS), globally redundant (GRS), or read-access globally redundant (RA-GRS)
Azure Data Lake Analytics
 Massive processing power
 Adjustable parallelism
 No server, VM, or cluster to maintain
 Pay per job (see the submission sketch after this list)
 Use existing .NET, R, and Python libraries
 New language: U-SQL
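As a sketch of paying per job with adjustable parallelism, the (since-retired) azure-mgmt-datalake-analytics Python SDK could submit a U-SQL job as below; the account name, credentials, and script are placeholders, and the exact SDK surface varied by version:

import uuid
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.datalake.analytics.job import DataLakeAnalyticsJobManagementClient
from azure.mgmt.datalake.analytics.job.models import JobInformation, USqlJobProperties

# AAD service-principal credentials (placeholders).
credentials = ServicePrincipalCredentials(client_id='<application-id>',
                                          secret='<application-secret>',
                                          tenant='<tenant-id>')

job_client = DataLakeAnalyticsJobManagementClient(credentials,
                                                  'azuredatalakeanalytics.net')

# A trivial U-SQL script; parallelism is chosen per job and billing follows the job.
script = ('@rows = SELECT * FROM (VALUES ("a", 1)) AS T(name, id); '
          'OUTPUT @rows TO "/output/demo.csv" USING Outputters.Csv();')

job_info = JobInformation(name='demo-job', type='USql',
                          degree_of_parallelism=2,
                          properties=USqlJobProperties(script=script))
job_client.job.create('<adla-account>', str(uuid.uuid4()), job_info)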
U-SQL
 A combination of the declarative logic of SQL and the procedural logic of C#
 Case sensitive
 Schema on read
U-SQL example

// A cleaned-up version of the slide's U-SQL snippet. @MatchData is assumed
// to come from a prior EXTRACT over IPL cricket data; the file path and
// schema below are illustrative placeholders.
@MatchData =
    EXTRACT IPLYear int, Bowler string, ExtraRuns string, ExtraType string
    FROM "/raw/ipl_deliveries.csv"
    USING Extractors.Csv(skipFirstNRows : 1);

@ExtraRuns =
    SELECT IPLYear, Bowler,
           SUM(string.IsNullOrWhiteSpace(ExtraRuns) ? 0 : Convert.ToInt32(ExtraRuns)) AS ExtraRuns,
           ExtraType
    FROM @MatchData
    GROUP BY IPLYear, Bowler, ExtraType;
How do we get there? Azure Data Factory
Your feedback is essential to me!
Demo/ Q&A
Pricing
 Pay-as-you-go
 1 TB of storage for one month: $39.94
 Monthly commitment packages
 1 TB of storage for one month: $35
 Usage-based pricing: https://azure.microsoft.com/en-us/pricing/details/data-lake-store/
 Write operations (per 10,000): $0.05
 Read operations (per 10,000): $0.004
 Delete operations: Free
 Transaction size limit: No limit
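For context, these figures are mutually consistent: the pay-as-you-go price corresponds to roughly $0.039 per GB per month (1,024 GB × $0.039 ≈ $39.94), and the commitment package works out to about $0.034 per GB per month.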


Editor's Notes

  • #5: On-prem: lots of data, limited space, servers to maintain, and lots of processing power needed.
  • #6: Grow hardware on demand and upgrade instantly. Availability and reliability: multiple copies of data; downtime for maintenance and hardware failures causes business issues. Increase or decrease hardware on demand. Ability to fail fast: if a project fails, no hardware investment is stranded. Ability to move to the latest technologies. Scalability: on-prem scaling takes time.
  • #9: An Apache Hadoop file system compatible with the Hadoop Distributed File System (HDFS); works with applications that support WebHDFS. Three copies of data within a single region. IOPS: input/output operations per second.
  • #10: Automatically optimized for any throughput.
  • #14: 250 AUs max; 1 AU = 2 CPU cores and 6 GB of RAM. Pay-as-you-go price: 1 AU for 1 hour is $2. Monthly commitment: 100 AUs for $100.
  • #15: Declarative logic and procedural logic: SQL to query, C# to customize. Case sensitive, with C# data types and C# comparison semantics. Some commonly used SQL keywords, including WHILE, UPDATE, and MERGE, are not supported in U-SQL.
  • #16: A cloud integration service. Workflows are called pipelines, and pipelines contain activities. The integration runtime can be self-hosted. Activities: copy data, run SSIS packages, execute stored procedures, execute U-SQL queries. Price: based on the number of activity runs and data movement hours, or on an SSIS runtime priced by VM size and time.