際際滷

際際滷Share a Scribd company logo
Introduction to Data Warehousing
2 
Heterogeneities are everywhere 
 Different interfaces 
 Different data representations 
 Duplicate and inconsistent information 
Personal 
Databases 
Digital Libraries 
Scientific Databases 
World 
Wide 
Web
 Vertical fragmentation of informational 
systems (vertical stove pipes) 
 Result of application (user)-driven 
development of operational systems 
3 
Sales Planning 
Stock Mngmt 
... 
Suppliers 
Num. Control 
Debt Mngmt 
... 
Inventory 
... 
Sales Administration Finance Manufacturing ...
Personal 
Databases 
4 
Integration System 
World 
Wide 
Web 
Digital Libraries Scientific Databases 
揃 Collects and combines information 
揃 Provides integrated view, uniform user interface 
揃 Supports sharing
5 
揃 Two Approaches: 
- Query-Driven (Lazy) 
- Warehouse (Eager) 
? 
Source Source
6 
揃 Query-driven (lazy, on-demand) 
Clients 
Integration System 
Wrapper Wrapper Wrapper 
. . . 
. . . 
Metadata 
Source Source Source
即 Delay in query processing 
即 Slow or unavailable information sources 
即 Complex filtering and integration 
即 Inefficient and potentially expensive for 
frequent queries 
即 Competes with local processing at sources 
即 Hasnt caught on in industry 
7
8 
Clients 
DDaattaa 
WWaarreehhoouussee 
Integration System 
. . . 
揃 Information 
integrated in 
advance 
揃 Stored in wh for 
direct querying 
and analysis 
Extractor/ 
Monitor 
. . . 
Metadata 
Extractor/ 
Monitor 
Extractor/ 
Monitor 
Source Source Source
 High query performance 
 But not necessarily most current information 
 Doesnt interfere with local processing at sources 
 Complex queries at warehouse 
 OLTP at information sources 
 Information copied at warehouse 
 Can modify, annotate, summarize, restructure, etc. 
 Can store historical information 
 Security, no auditing 
 Has caught on in industry 
9
 Query-driven approach still better for 
 Rapidly changing information 
 Rapidly changing information sources 
 Truly vast amounts of data from large numbers of 
sources 
 Clients with unpredictable needs 
10
A data warehouse is simply a single, complete, 
and consistent store of data obtained from a 
variety of sources and made available to end 
users in a way they can understand and use it 
in a business context. 
-- Barry Devlin, IBM Consultant 
11
A DW is a 
 subject-oriented, 
 integrated, 
 time-varying, 
 non-volatile 
collection of data that is used primarily in 
organizational decision making. 
-- W.H. Inmon, Building the Data Warehouse, 
1992 
12
 Stored collection of diverse data 
 A solution to data integration problem 
 Single repository of information 
 Subject-oriented 
 Organized by subject, not by application 
 Used for analysis, data mining, etc. 
 Optimized differently from transaction-oriented 
db 
 User interface aimed at executive 
13
 Large volume of data (Gb, Tb) 
 Non-volatile 
 Historical 
 Time attributes are important 
 Updates infrequent 
 May be append-only 
 Examples 
 All transactions ever at Sainsburys 
 Complete client histories at insurance firm 
 LSE financial information and portfolios 
14
15 
Client Client 
Extractor/ 
Monitor 
Warehouse 
Extractor/ 
Monitor 
Extractor/ 
Monitor 
Integrator 
Design Phase 
Maintenance 
Loading 
... 
Metadata 
Optimization 
QQuueerryy & & A Annaalylyssisis
 Single-layer 
 Every data element is stored once only 
 Virtual warehouse 
 Two-layer 
 Real-time + derived data 
 Most commonly used approach in 
industry today 
Informational 
systems 
16 
Operational 
systems 
Real-time data 
Operational 
systems 
Informational 
systems 
Derived Data 
Real-time data
 Transformation of real-time data to derived 
data really requires two steps 
View level 
Particular informational 
needs 
Reconciled Data Physical Implementation 
17 
Operational 
systems 
Informational 
systems 
Derived Data 
Real-time data 
of the Data Warehouse
(1) How to get information into warehouse 
Data warehousing 
(2) What to do with data once its in warehouse 
Warehouse DBMS 
 Both rich research areas 
 Industry has focused on (2) 
18
 Warehouse Design 
 Extraction 
 Wrappers, monitors (change detectors) 
 Integration 
 Cleansing & merging 
 Warehousing specification & Maintenance 
 Optimizations 
 Miscellaneous (e.g., evolution) 
19
20 
揃 OLTP: On Line Transaction Processing 
- Describes processing at operational sites 
揃 OLAP: On Line Analytical Processing 
- Describes processing at warehouse
Standard DB (OLTP) 
 Mostly updates 
 Many small 
transactions 
 Mb - Gb of data 
 Current snapshot 
 Index/hash on p.k. 
 Raw data 
 Thousands of users 
(e.g., clerical users) 
21 
Warehouse (OLAP) 
揃 Mostly reads 
揃 Queries are long and complex 
揃 Gb - Tb of data 
揃 History 
揃 Lots of scans 
揃 Summarized, reconciled data 
揃 Hundreds of users (e.g., 
decision-makers, analysts)

More Related Content

Introduction to Data Warehousing

  • 2. 2 Heterogeneities are everywhere Different interfaces Different data representations Duplicate and inconsistent information Personal Databases Digital Libraries Scientific Databases World Wide Web
  • 3. Vertical fragmentation of informational systems (vertical stove pipes) Result of application (user)-driven development of operational systems 3 Sales Planning Stock Mngmt ... Suppliers Num. Control Debt Mngmt ... Inventory ... Sales Administration Finance Manufacturing ...
  • 4. Personal Databases 4 Integration System World Wide Web Digital Libraries Scientific Databases 揃 Collects and combines information 揃 Provides integrated view, uniform user interface 揃 Supports sharing
  • 5. 5 揃 Two Approaches: - Query-Driven (Lazy) - Warehouse (Eager) ? Source Source
  • 6. 6 揃 Query-driven (lazy, on-demand) Clients Integration System Wrapper Wrapper Wrapper . . . . . . Metadata Source Source Source
  • 7. 即 Delay in query processing 即 Slow or unavailable information sources 即 Complex filtering and integration 即 Inefficient and potentially expensive for frequent queries 即 Competes with local processing at sources 即 Hasnt caught on in industry 7
  • 8. 8 Clients DDaattaa WWaarreehhoouussee Integration System . . . 揃 Information integrated in advance 揃 Stored in wh for direct querying and analysis Extractor/ Monitor . . . Metadata Extractor/ Monitor Extractor/ Monitor Source Source Source
  • 9. High query performance But not necessarily most current information Doesnt interfere with local processing at sources Complex queries at warehouse OLTP at information sources Information copied at warehouse Can modify, annotate, summarize, restructure, etc. Can store historical information Security, no auditing Has caught on in industry 9
  • 10. Query-driven approach still better for Rapidly changing information Rapidly changing information sources Truly vast amounts of data from large numbers of sources Clients with unpredictable needs 10
  • 11. A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context. -- Barry Devlin, IBM Consultant 11
  • 12. A DW is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making. -- W.H. Inmon, Building the Data Warehouse, 1992 12
  • 13. Stored collection of diverse data A solution to data integration problem Single repository of information Subject-oriented Organized by subject, not by application Used for analysis, data mining, etc. Optimized differently from transaction-oriented db User interface aimed at executive 13
  • 14. Large volume of data (Gb, Tb) Non-volatile Historical Time attributes are important Updates infrequent May be append-only Examples All transactions ever at Sainsburys Complete client histories at insurance firm LSE financial information and portfolios 14
  • 15. 15 Client Client Extractor/ Monitor Warehouse Extractor/ Monitor Extractor/ Monitor Integrator Design Phase Maintenance Loading ... Metadata Optimization QQuueerryy & & A Annaalylyssisis
  • 16. Single-layer Every data element is stored once only Virtual warehouse Two-layer Real-time + derived data Most commonly used approach in industry today Informational systems 16 Operational systems Real-time data Operational systems Informational systems Derived Data Real-time data
  • 17. Transformation of real-time data to derived data really requires two steps View level Particular informational needs Reconciled Data Physical Implementation 17 Operational systems Informational systems Derived Data Real-time data of the Data Warehouse
  • 18. (1) How to get information into warehouse Data warehousing (2) What to do with data once its in warehouse Warehouse DBMS Both rich research areas Industry has focused on (2) 18
  • 19. Warehouse Design Extraction Wrappers, monitors (change detectors) Integration Cleansing & merging Warehousing specification & Maintenance Optimizations Miscellaneous (e.g., evolution) 19
  • 20. 20 揃 OLTP: On Line Transaction Processing - Describes processing at operational sites 揃 OLAP: On Line Analytical Processing - Describes processing at warehouse
  • 21. Standard DB (OLTP) Mostly updates Many small transactions Mb - Gb of data Current snapshot Index/hash on p.k. Raw data Thousands of users (e.g., clerical users) 21 Warehouse (OLAP) 揃 Mostly reads 揃 Queries are long and complex 揃 Gb - Tb of data 揃 History 揃 Lots of scans 揃 Summarized, reconciled data 揃 Hundreds of users (e.g., decision-makers, analysts)