This document provides an overview of data mining using SQL Server. It discusses what data mining is, the data mining process, and key concepts and terminology. The technical platform for data mining in SQL Server is presented, including the server mining architecture with Analysis Services. The main steps for building a data mining model are outlined. Various data mining tasks and approaches for working with data mining in SQL Server are also highlighted, including using DMX queries. Finally, a demo of the data mining process using DMX is proposed.
3. 3
Agenda
Data vs. Information
What is Data Mining?
Technical Platform
Data Mining Process Overview
Key Concepts and Terminology
DEMO: Data Mining Process in Detail Using DMX
9. 9
Data Mining
Technologies for analysis of data and discovery of
(very) hidden patterns
Uses a combination of statistics, probability analysis
and database technologies
Fairly young (<20 years old) but clever algorithms
developed through database research
10. 10
What does Data Mining Do?
Explores
Your Data
Finds
Patterns -
Trends
Performs
Predictions
11. 11
Data Mining Tasks
Classification
Ph但n loa味i, x棚p ha味ng
Regression
H担i quy
Segmentation
Ph但n khuc
Association
Li棚n k棚t
Sequence Analysis
Ph但n tich chu担i, day
13. 13
Data acquisition and
integration from
multiple sources
Data transformation
and synthesis
Knowledge and
pattern detection
through Data Mining
Data enrichment with
logic rules and
hierarchical views
Data presentation
and distribution
Data publishing for
mass recipients
Integrate Analyze Report
SQL Server
We Need More Than Just Database Engine
15. 15
Server Mining Architecture
Analysis Services
Server
Mining Model
Data Mining Algorithm Data
Source
Excel/Visio/SSRS/Your App
OLE DB/ADOMD/XMLA
Deploy
BIDS
Excel
Visio
SSMS
App
Data
17. 17
Mining Model Mining ModelMining Model
Mining Process
DM EngineDM Engine
Training data
Data to be
predictedMining Model
With
predictions
18. 18
Steps for Building a DM Model
1. Model Creation
Define columns for cases: visually (BIDS), using DMX, or from PMML
2. Model Training
Feed lots of data from a real database, or from a system log
Congratulations! We now have a model
3. Model Testing
Test on sample data to check predictions.
Testing data must be different from training
If we get nonsense, adjust the algorithm, its parameters, model design, or even
data
4. Model Use (Exploration and Prediction)
Use the model on new data to predict outcomes
19. 19
Many Approaches
Work the way you like:
Database experts and SQL veterans:
Write queries in DMX (similar to T-SQL)
Everyone else:
Use Business Intelligence Development Studio (BIDS) rich GUI
included with SSAS
Hosted in Visual Studio (included!)
You dont have to program click-click instead
Use Excel 2007 with Data Mining Add-Ins
The Data Mining tab has everything you need
Table Analysis tab is easier but simplified
21. 21
Mining Structure
Describes data to be mined
Columns from a data source and their:
Data Type
Content Type
Contains Mining Models
Often we build several different models in one structure
Holds training data, known as Cases (if required)
Holds testing data, known as Holdout (in SQL 2008)
23. 23
Data Mining Model
Container of patterns discovered by a Data Mining
Algorithm amongst the training Cases
A table containing patterns
Expressed by visualisers
Specifies usage of columns already defined in the
Mining Structure
24. 24
Cases: The Things We Study
Case set of columns (attributes) you want to analyse
Age, Gender, Region, Annual Spending
Case Key unique ID of a case
26. 26
Data Mining Extensions
DMX
T-SQL for Data Mining
Easy! Like scripting for IT Pros
Two types of statements:
Data Definition
CREATE, ALTER, EXPORT, IMPORT, DROP
Data Manipulation
INSERT INTO, SELECT, DELETE
27. 27
DMX Just Like T-SQL
CREATE MINING MODEL CreditRisk
(CustID LONG KEY,
Gender TEXT DISCRETE,
Income LONG CONTINUOUS,
Profession TEXT DISCRETE,
Risk TEXT DISCRETE PREDICT)
USING Microsoft_Decision_Trees
INSERT INTO CreditRisk
(CustId, Gender, Income, Profession,
Risk)
Select
CustomerID, Gender, Income,
Profession,Risk
From Customers
Select NewCustomers.CustomerID, CreditRisk.Risk,
PredictProbability(CreditRisk.Risk)
FROM CreditRisk PREDICTION JOIN NewCustomers
ON CreditRisk.Gender=NewCustomer.Gender
AND CreditRisk.Income=NewCustomer.Income
AND CreditRisk.Profession=NewCustomer.Profession
28. 28
Demos Steps
1
Create Mining Structure
2
Create Mining Model
3
Process Mining Model
4
Test Model
5
Execute Prediction
#10: Statistics: thongkeProbability: xacsuatLa mot linhvuc con khatre, duocphattriencach day chua den 20 namnhungcacthuattoanduocphattrienkharo rang
#12: Gi畉i thu畉t ph但n lo畉i (Classification Algorithm) d畛 o叩n ra m畛t ho畉c nhi畛u gi叩 tr畛 bi畉n r畛i r畉c, d畛a tr棚n c叩c thu畛c t鱈nh kh叩c c畛a t畉p d畛 li畛u. i畛n h狸nh l gi畉i thu畉t C但y Quy畉t 畛nh Microsoft Decision Trees Algorithm.Gi畉i thu畉t 畛 qui (Regression Algorithm) d畛 o叩n m畛t ho畉c nhi畛u bi畉n gi叩 tr畛 li棚n t畛c, nh動 l畛i nhu畉n v gi叩 tr畛 thua l畛, d畛a tr棚n c叩c thu畛c t鱈nh d畛 li畛u kh叩c trong t畉p d畛 li畛u. i畛n h狸nh l gi畉i thu畉t chu畛i th畛i gian Microsoft Time Series Algorithm.Gi畉i thu畉t ph但n o畉n (Segmentation Algorithm) ph但n chia d畛 li畛u thnh nhi畛u nh坦m g畛m c叩c thnh ph畉n c坦 thu畛c t鱈nh t動董ng t畛 nhau. Gi畉i thu畉t i畛n h狸nh l Microsoft Clustering Algorithm.Gi畉i thu畉t t動董ng quan (Assocication Algorithm) t狸m s畛 t動董ng quan gi畛a c叩c thu畛c t鱈nh trong c畛ng t畉p d畛 li畛u. 畛ng d畛ng ph畛 bi畉n nh畉t c畛a gi畉i thu畉t ny l x但y d畛ng c叩c lu畉t t動董ng quan, ph但n t鱈ch gi畛 hng. Gi畉i thu畉t i畛n h狸nh lo畉i gi畉i thu畉t ny l Microsoft Assocciation AlgorithmGi畉i thu畉t ph但n t鱈ch tuy畉n t鱈nh (Sequence Analysis Allgorithm) t畛ng k畉t c叩c chu畛i ho畉c m畉ng d畛 li畛u trong t畉p d畛 li畛u. i畛n h狸nh cho lo畉i gi畉i thu畉t ny l Microsoft Sequence Clustering Algorithm
#16: SSMS: SQL Server Management StudioC担ng cu味 棚 ta味oracac Mining Model. Cacc担ng cu味 動董味c Microsoft cungc但pg担m co: Business Inteligence Development Studio, Excel, Visio, SQL Server Management Studio. Saukhita味oracac Mining Model, c但nphaitri棚nkhail棚nh棚味 th担ng Analysis Services (A.S). Analysis Service la n董iv但味nhanh, quanly cac Model.L動u y rngcac Model saukhi動董味ctri棚nkhail棚n A.S chi la cac Model r担ng. 棚 co th棚 動avaos動 du味ng, c但nphai qua m担味t qua trinhgo味i la Training Model (ho味c Process Model). Vi th棚 c但n棚nthanhph但nth動 3 o la Data Source. Data source la n董ich動ad動 li棚味uc但nthi棚tchovi棚味c Training Model va ca qua trinh Test Model. Vi th棚 c但nphai chia l動董味ng Data thanh 2 ph但nri棚ngbi棚味t棚 phu味c vu味 cho 2 tac vu味 tr棚n.Thanhph但nth動 4, o la cac動ngdu味ngkhaithaccac Mining Model a 動董味cx但yd動味ng. Cac動ngdu味ng co th棚 la cacph但nm棚m動董味c Microsoft cungc但pnh動 Excel, Visio ho味c動ngdu味ng do ng動董idungx但yd動味ng. Cac動ngdu味ngnayg董id動 li棚味ucuaminhxu担ng Analysis Service va nh但味nphanh担i la k棚t qua cua qua trinh Data Mining tr董 la味i.
#22: Mining Structures (Analysis Services - Data Mining)The mining structure defines the data from which mining models are built: it specifies the source data view, the number and type of columns, and an optional partition into training and testing sets. A single mining structure can support multiple mining models that share the same domain. The following diagram illustrates the relationship of the data mining structure to the data source, and to its constituent data mining models.
#23: The mining structure in the diagram is based on a data source that contains multiple tables, joined on the CustomerID field. One table contains information about customers, such as the geographical region, age, income and gender, while the related nested table contains multiple rows of additional information about each customer, such as products the customer has purchased. The diagram shows that multiple models can be built on one mining structure, and that the models can use different columns from the structure. Model 1 Uses CustomerID, Income, Age, Region, and filters the data on Region.Model 2 Uses CustomerID, Income, Age, Region and filters the data on Age.Model 3 Uses CustomerID, Age, Gender, and the nested table, with no filter.Because the models use different columns for input, and because two of the models additionally restrict the data that is used in the model by applying a filter, the models might have very different results even though they are based on the same data. Note that the CustomerID column is required in all models because it is the only available column that can be used as the case key.This section explains the basic architecture of data mining structures. For more information about how to create, manage, modify, or view data mining structures, see Managing Data Mining Structures and Models.
#24: A data mining model gets data from a mining structure and then analyzes that data by using a data mining algorithm. The mining structure and mining model are separate objects. The mining structure stores information that defines the data source. A mining model stores information derived from statistical processing of the data, such as the patterns found as a result of analysis. A mining model is empty until the data provided by the mining structure has been processed and analyzed. After a mining model has been processed, it contains metadata, results, and bindings back to the mining structure.