With the boom in data, in both volume and complexity, the trend is to move data to the cloud. Where and how do we do this? Azure gives you the answer. In this session, I will give you an introduction to Azure Data Lake and Azure Data Factory, and explain why they are a good fit for this type of problem. You will learn how large datasets can be stored in the cloud, and how you can transport your data to this store. The session will briefly cover Azure Data Lake as the modern warehouse for data in the cloud, ...
Move your on-prem data to a Lake in the Cloud
5. What are the challenges?
Limited storage
Limited processing power
High hardware cost
High maintenance cost
No disaster recovery
Availability and reliability issues
Scalability issues
Security
Solution: Azure Data Lake
6. What is Azure Data Lake?
Highly scalable data storage and analytics service
Intended for big data storage and analysis
A faster and more efficient solution than on-prem data centers
Three services:
Azure Data Lake Analytics
Azure Data Lake Store
Azure HDInsight (managed clusters)
8. Azure Data Lake Store
Built for Hadoop
Compatible with most components in the Hadoop ecosystem
WebHDFS API
Unlimited storage, petabyte-scale files
Performance-tuned for big data analytics
High throughput and IOPS
Parts of a file are stored on multiple servers, enabling parallel reads
Enterprise-ready: highly available and secure
All data in one place
Any data in its native format
No schema, no prior processing required
9. Optimized for Big Data Analytics
Multiple copies of the same file to improve read performance
Locally redundant (multiple copies of the data in one Azure region)
Parallel reading and writing
Configurable throughput
No limit on file size or total storage
10. Secure Data in Azure Data Lake Store
Authentication
Azure Active Directory
All AAD features
End-user authentication or Service-to-service authentication
Access Control
POSIX-style permissions
Read, Write, Execute
ACLs can be enabled on the root folder, on subfolders, and on individual files.
Encryption
Encryption at rest
Encryption in transit (HTTPS)
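The authentication and access control above can also be exercised from code. Below is a minimal sketch, assuming the azure-datalake-store Python package and an Azure AD service principal; the tenant, client and store names are placeholders and are not from the original deck:

from azure.datalake.store import core, lib

# Service-to-service authentication with an Azure AD service principal
# (tenant ID, client ID, client secret and store name are placeholders).
token = lib.auth(tenant_id='<tenant-id>',
                 client_id='<app-client-id>',
                 client_secret='<app-client-secret>')

adls = core.AzureDLFileSystem(token, store_name='<adls-account-name>')

# POSIX-style permissions: owner rwx, group r-x, no access for others.
adls.chmod('/raw', '0750')

# Inspect the folder's metadata, including its effective permissions.
print(adls.info('/raw'))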
11. How to ingest data into Azure Data Lake Store
Small Data Sets
Azure Portal
Azure PowerShell
Azure Cross Platform CLI 2.0
Data Lake Tools For Visual Studio
Streamed data
Azure Stream Analytics
Azure HDInsight Storm
Data Lake Store .NET SDK
Relational data
Apache Sqoop
Azure Data Factory
Large data sets
Azure PowerShell
Azure Cross Platform CLI 2.0
Azure Data Lake Store .NET SDK
Azure Data Factory
Really Large Data Sets
Azure ExpressRoute
Azure Import/Export service
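To make the programmatic options above concrete, here is a minimal upload sketch. The deck lists the .NET SDK; this sketch assumes the equivalent azure-datalake-store Python package, and the account names and paths are placeholders:

from azure.datalake.store import core, lib, multithread

# Authenticate and connect to the Data Lake Store account (placeholders).
token = lib.auth(tenant_id='<tenant-id>',
                 client_id='<app-client-id>',
                 client_secret='<app-client-secret>')
adls = core.AzureDLFileSystem(token, store_name='<adls-account-name>')

# Upload a local file; several threads read and write different parts
# of the file in parallel.
multithread.ADLUploader(adls,
                        rpath='/raw/matches.csv',   # destination path in the store
                        lpath='matches.csv',        # local source file
                        nthreads=16,
                        overwrite=True)

# List the destination folder to confirm the upload.
print(adls.ls('/raw'))

The same call works for larger files too, but for the really large data sets the ExpressRoute and Import/Export options listed above are usually the more practical route.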
12. How does it differ from Azure Blob Storage?
Purpose
  Azure Data Lake Store: optimized storage for big data analytics workloads
  Azure Blob Storage: general-purpose storage
Use case
  Azure Data Lake Store: batch, interactive and streaming analytics, and machine-learning data such as log files, IoT data, click streams and large datasets
  Azure Blob Storage: any type of text or binary data, such as application back ends, backup data, media storage for streaming, and general-purpose data
Key concepts
  Azure Data Lake Store: contains folders, which in turn contain data stored as files
  Azure Blob Storage: contains containers, which in turn hold data in the form of blobs
Size limits
  Azure Data Lake Store: no limits on account size, file size or number of files
  Azure Blob Storage: 500 TiB (per storage account)
Geo-redundancy
  Azure Data Lake Store: locally redundant (multiple copies of data in one Azure region)
  Azure Blob Storage: locally redundant (LRS), globally redundant (GRS), read-access globally redundant (RA-GRS)
13. Azure Data Lake Analytics
Massive processing power
Adjustable parallelism
No servers, VMs or clusters to maintain
Pay per job
Use existing .NET, R and Python libraries
New language: U-SQL
14. C# + SQL = U-SQL
A combination of the declarative logic of SQL and the procedural logic of C#
Case sensitive
Schema on read
Example U-SQL:
@ExtraRuns =
    SELECT IPLYear,
           Bowler,
           SUM(string.IsNullOrWhiteSpace(ExtraRuns) ? 0 : Convert.ToInt32(ExtraRuns)) AS ExtraRuns,
           ExtraType
    FROM @MatchData
    GROUP BY IPLYear, Bowler, ExtraType;
18. Pricing
Pay-as-you-go
  1 TB of storage for a month = $39.94
Monthly commitment packages
  1 TB of storage for a month = $35
Usage-based charges (https://azure.microsoft.com/en-us/pricing/details/data-lake-store/):
  Write operations (per 10,000): $0.05
  Read operations (per 10,000): $0.004
  Delete operations: free
  Transaction size limit: no limit
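As a rough worked example from the rates above: storing 1 TB for a month costs $39.94 pay-as-you-go versus $35 with a monthly commitment, and 1,000,000 write operations cost 100 × $0.05 = $5.00, while 1,000,000 read operations cost 100 × $0.004 = $0.40.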
Editor's Notes
#5: On-prem:
Lots of data
Limited space
Servers to maintain yourself
A lot of processing power needed
#6: Grow hardware on demand; upgrade instantly.
Availability and reliability: multiple copies of data; on-prem downtime for maintenance and hardware failures causes business issues.
Increase or decrease hardware on demand.
Ability to fail fast: if a project fails, there is no hardware investment to write off.
Ability to move onto the latest technologies.
Scalability: on-prem takes time to scale.
#9: An Apache Hadoop file system, compatible with the Hadoop Distributed File System (HDFS).
Works with applications that support WebHDFS.
3 copies of the data within a single region.
IOPS: input/output operations per second.
#14: 250 AUs max.
1 AU = 2 CPU cores, 6 GB RAM.
Pay-as-you-go price: 1 AU for 1 hour = $2.
Monthly commitment: 100 AU-hours for $100.
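For example, at the pay-as-you-go rate a job that runs with 10 AUs for 30 minutes consumes 5 AU-hours and costs about $10 (5 × $2).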
#15: Declarative logic plus procedural logic: SQL to query, C# to customize.
Case sensitive, with C# data types and C# comparison semantics.
Some commonly used SQL keywords, including WHILE, UPDATE and MERGE, are not supported in U-SQL.
#16: Azure Data Factory is a cloud integration service.
Workflows are called pipelines; pipelines contain activities.
The integration runtime can be self-hosted.
Activities: copy data, run SSIS packages, execute stored procedures, execute U-SQL queries.
Price: based on the number of activity runs and data movement hours, or on the SSIS runtime's VM size and running time.