The document discusses data integration and data warehousing. It notes that information systems are fragmented across different domains and represent data inconsistently. A data warehouse provides a single, integrated view of data from multiple sources by collecting and combining information in one location. There are two main approaches for data integration - a query-driven method that integrates data on demand, and a data warehouse approach that integrates data in advance for direct querying and analysis. The data warehouse approach has seen more industry adoption.
1 of 21
Downloaded 12 times
More Related Content
Introduction to Data Warehousing
2. 2
Heterogeneities are everywhere
Different interfaces
Different data representations
Duplicate and inconsistent information
Personal
Databases
Digital Libraries
Scientific Databases
World
Wide
Web
3. Vertical fragmentation of informational
systems (vertical stove pipes)
Result of application (user)-driven
development of operational systems
3
Sales Planning
Stock Mngmt
...
Suppliers
Num. Control
Debt Mngmt
...
Inventory
...
Sales Administration Finance Manufacturing ...
4. Personal
Databases
4
Integration System
World
Wide
Web
Digital Libraries Scientific Databases
揃 Collects and combines information
揃 Provides integrated view, uniform user interface
揃 Supports sharing
7. 即 Delay in query processing
即 Slow or unavailable information sources
即 Complex filtering and integration
即 Inefficient and potentially expensive for
frequent queries
即 Competes with local processing at sources
即 Hasnt caught on in industry
7
8. 8
Clients
DDaattaa
WWaarreehhoouussee
Integration System
. . .
揃 Information
integrated in
advance
揃 Stored in wh for
direct querying
and analysis
Extractor/
Monitor
. . .
Metadata
Extractor/
Monitor
Extractor/
Monitor
Source Source Source
9. High query performance
But not necessarily most current information
Doesnt interfere with local processing at sources
Complex queries at warehouse
OLTP at information sources
Information copied at warehouse
Can modify, annotate, summarize, restructure, etc.
Can store historical information
Security, no auditing
Has caught on in industry
9
10. Query-driven approach still better for
Rapidly changing information
Rapidly changing information sources
Truly vast amounts of data from large numbers of
sources
Clients with unpredictable needs
10
11. A data warehouse is simply a single, complete,
and consistent store of data obtained from a
variety of sources and made available to end
users in a way they can understand and use it
in a business context.
-- Barry Devlin, IBM Consultant
11
12. A DW is a
subject-oriented,
integrated,
time-varying,
non-volatile
collection of data that is used primarily in
organizational decision making.
-- W.H. Inmon, Building the Data Warehouse,
1992
12
13. Stored collection of diverse data
A solution to data integration problem
Single repository of information
Subject-oriented
Organized by subject, not by application
Used for analysis, data mining, etc.
Optimized differently from transaction-oriented
db
User interface aimed at executive
13
14. Large volume of data (Gb, Tb)
Non-volatile
Historical
Time attributes are important
Updates infrequent
May be append-only
Examples
All transactions ever at Sainsburys
Complete client histories at insurance firm
LSE financial information and portfolios
14
16. Single-layer
Every data element is stored once only
Virtual warehouse
Two-layer
Real-time + derived data
Most commonly used approach in
industry today
Informational
systems
16
Operational
systems
Real-time data
Operational
systems
Informational
systems
Derived Data
Real-time data
17. Transformation of real-time data to derived
data really requires two steps
View level
Particular informational
needs
Reconciled Data Physical Implementation
17
Operational
systems
Informational
systems
Derived Data
Real-time data
of the Data Warehouse
18. (1) How to get information into warehouse
Data warehousing
(2) What to do with data once its in warehouse
Warehouse DBMS
Both rich research areas
Industry has focused on (2)
18
20. 20
揃 OLTP: On Line Transaction Processing
- Describes processing at operational sites
揃 OLAP: On Line Analytical Processing
- Describes processing at warehouse
21. Standard DB (OLTP)
Mostly updates
Many small
transactions
Mb - Gb of data
Current snapshot
Index/hash on p.k.
Raw data
Thousands of users
(e.g., clerical users)
21
Warehouse (OLAP)
揃 Mostly reads
揃 Queries are long and complex
揃 Gb - Tb of data
揃 History
揃 Lots of scans
揃 Summarized, reconciled data
揃 Hundreds of users (e.g.,
decision-makers, analysts)