
1
DTS304TC: Machine Learning
Lecture 5: Building Machine Learning System
Dr Kang Dang
D-5032, Taicang Campus
Kang.Dang@xjtlu.edu.cn
Tel: 88973341

2
Machine Learning Pipeline
Machine learning involves a comprehensive
workflow, not just training models.

3
Q & A
In practical machine learning roles, what percentage of time do you
think is typically spent on data preparation and feature engineering?
(A) 20%
(B) 40%
(C) 60%
(D) 80%

4
Data Preparation and Feature Engineering
The features you use influence more than everything else the result. No algorithm
alone, to my knowledge, can supplement the information gain given by
correct feature engineering.
- Luca Massaron

5
Q&A
• How would you handle missing values in a table? Fill with zeros or use
other methods? What issues might arise from filling with zeros?

6
Different types of missing values
• 3 Main Types of Missing Data | Do THIS Before Handling Missing Values! - YouTube

7
Missing Value Imputation
MISSING COMPLETELY AT RANDOM
MCAR (Missing Completely at Random) means the missing data is random and doesn't
depend on anything else. For example, a survey answer is accidentally skipped, or a
person simply chooses not to answer a question.
Handling: Mean/Median/Mode Imputation, Random Sample Imputation.
MISSING AT RANDOM
MAR (Missing at Random) means the missing data depends on other observed information.
For example, people with higher incomes might be less likely to skip questions about
financial spending than those with lower incomes.
Handling: MissForest, to impute values for the missing entries.
MISSING NOT AT RANDOM
MNAR (Missing Not at Random) means the missing data is related to hidden factors. For
example, people who have cheated might avoid answering a survey question about
cheating.
Handling: almost impossible to handle.

8
Mean/Median/Mode Imputation
• Missing Data Nature: Confirmed as Missing Completely at
Random (MCAR).
• Extent of Missing Data: Limited to a maximum of 5% per
variable.
• Imputation Technique for Categorical Variables: Utilize
mode imputation for the most frequent category.
• Imputation Data Source: Calculate mean, median, or mode
exclusively from the training dataset to prevent data
leakage and maintain validation/test set integrity.
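A minimal sketch of the last bullet, using only the Python standard library. The column names and values are hypothetical toy data, and `None` is assumed to mark a missing entry. The fill value is computed once from the training column and then reused on the test column.

```python
from statistics import mean, mode

# Hypothetical toy data; None marks a missing entry.
train_age = [22, 35, None, 41, 30]
test_age = [None, 28]
train_city = ["Suzhou", None, "Suzhou", "Taicang", "Suzhou"]


def fit_fill_value(column, strategy):
    """Compute the fill value from the observed entries of a training column."""
    observed = [v for v in column if v is not None]
    return strategy(observed)


def impute(column, fill_value):
    """Replace missing entries with a previously fitted fill value."""
    return [fill_value if v is None else v for v in column]


age_fill = fit_fill_value(train_age, mean)    # numeric column: mean of training data
city_fill = fit_fill_value(train_city, mode)  # categorical column: most frequent value

train_age_imputed = impute(train_age, age_fill)
test_age_imputed = impute(test_age, age_fill)  # the training mean is reused, not refitted
train_city_imputed = impute(train_city, city_fill)
```

Fitting the statistic on the test column instead would let test information leak into preprocessing, which is the leakage the slide warns about.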

9
Regression Imputation - MissForest
• Another great application of Random Forest!
• Assumes the data are Missing At Random (MAR).
• Uses information from the entire dataset for imputation,
making the imputed values more accurate than
simple mean/median/mode imputation.

10
Regression Imputation - MissForest
Iterative Approach:
1. First, fill missing values with a simple method (e.g., the mean).
2. Pick one column with missing data, use the available data to train a Random Forest model, and predict the
missing values.
3. Move to the next column and repeat the process.
4. Continue this cycle until the imputed values stop changing significantly or after 5-6 rounds.

11
MissForest vs Zero or Mean Imputation
• If computational resources are not a
limitation, prefer MissForest over simple
imputations like zero or mean, which can
distort the dataset's original distribution.

12
Q & A
Suppose I train a KNN classifier without scaling the
features. For instance, one feature ranges from -1000 to 1000,
while another ranges from -0.001 to 0.001.
What potential issues could arise?

13
Feature Scaling Examples - KNN
Without normalization, the nearest neighbors are determined almost entirely by the feature
with the larger range (x2), leading to incorrect classification.
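A small illustration of this effect, with made-up points: x1 is assumed to span roughly 0 to 1 and x2 roughly 0 to 1000. The query is very close to point B along x1, yet the unscaled Euclidean distance is dominated by x2 and picks A.

```python
import math

# Hypothetical training points (x1, x2, label) and an unlabeled query point.
train = [(0.10, 500.0, "A"), (0.90, 510.0, "B")]
query = (0.85, 495.0)

# Assumed feature ranges used for min-max scaling.
X1_RANGE = (0.0, 1.0)
X2_RANGE = (0.0, 1000.0)


def nearest_label(points, q):
    """Return the label of the training point closest to q (1-nearest neighbor)."""
    return min(points, key=lambda p: math.dist(p[:2], q))[2]


def min_max(value, bounds):
    lo, hi = bounds
    return (value - lo) / (hi - lo)


def scale(point):
    """Scale both features to [0, 1]; keep the label, if any, unchanged."""
    return (min_max(point[0], X1_RANGE), min_max(point[1], X2_RANGE)) + tuple(point[2:])


raw_label = nearest_label(train, query)
scaled_label = nearest_label([scale(p) for p in train], scale(query))
```

Here `raw_label` is "A" while `scaled_label` is "B": once both features share a range, the large difference along x1 is no longer drowned out.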

14
Feature Scaling Examples - KNN
Feature scaling can lead to a completely different model in terms of
its decision boundary.

15
Feature Scaling
• Use when different numeric features have different scales
(different ranges of values)
• Features with much higher values may overpower the others
• Goal: bring them all within the same range
• Especially important for the following models:
  • KNN: Distances depend mainly on the feature with larger values
  • SVMs: (kernelized) dot products are also based on distances
  • Linear models: Feature scale affects regularization. Scaled features also converge faster!

16
Feature Scaling
Standard Scaler:
Standardizes each feature to zero mean and unit standard deviation; it does not by itself make the data Gaussian.
Formula: x_scaled = (x - mean) / std_dev
Works best when the feature is roughly normally distributed.
Min-Max Scaler:
Scales features to a given range, often [0, 1].
Transforms all data points proportionally within the range.
Formula: x_scaled = (x - x_min) / (x_max - x_min)
Use for scaling within a bounded range.
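Both formulas can be checked on a small example. This is a sketch with made-up values; it uses the population standard deviation (`statistics.pstdev`), which is an assumption about which variant of `std_dev` is meant.

```python
from statistics import mean, pstdev

values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical feature column

# Standard scaling: subtract the mean, divide by the standard deviation.
mu = mean(values)       # 5.0
sigma = pstdev(values)  # 2.0 (population standard deviation)
standardized = [(x - mu) / sigma for x in values]

# Min-max scaling: map the smallest value to 0 and the largest to 1.
x_min, x_max = min(values), max(values)
min_max_scaled = [(x - x_min) / (x_max - x_min) for x in values]
```

After standard scaling the column has mean 0 and standard deviation 1; after min-max scaling every value lies in [0, 1]. As with imputation, the statistics should be computed on the training data and reused on validation and test data.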

17
But how to handle feature scaling with
outliers?
Question: What is the median? What is the 75th percentile?
Robust Scaler: Reduces the influence of outliers on scaling.
• Centers using the median and scales using the IQR.
• x_scaled = (x - median) / IQR
• Use when outliers are present and need to be mitigated.
• IQR Calculation: IQR = Q3 - Q1 (the difference between the 75th percentile (Q3) and the 25th
percentile (Q1) in a dataset)
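A short sketch of robust scaling on a made-up column with one extreme outlier, compared against min-max scaling. The quartiles are computed with `statistics.quantiles` using the "inclusive" method, which is one of several percentile conventions and is an assumption here.

```python
from statistics import median, quantiles

values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 1000.0]  # 1000.0 is an outlier

# Quartiles: Q1 is the 25th percentile, Q3 is the 75th percentile.
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
center = median(values)

robust_scaled = [(x - center) / iqr for x in values]

# Min-max scaling on the same data, for comparison.
x_min, x_max = min(values), max(values)
min_max_scaled = [(x - x_min) / (x_max - x_min) for x in values]
```

With robust scaling the eight ordinary values stay spread between -1 and 0.75, while min-max scaling squeezes them all below 0.01 because the outlier defines the range.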

18
Q & A
• Suppose you have a dataset with categorical features, such as 'dog'
and 'cat'. Logistic regression, however, cannot directly handle
categorical features.
• To make these features compatible with the model, we might encode
'dog' as '0' and 'cat' as '1'. Is this a good approach? Why or why not?

19
Categorical Feature Encoding
• Ordinal encoding
  • For example, "Jan, Feb, Mar, Apr"
  • Simply assigns an integer value to each category in the order they are
  encountered
  • Only really useful if there exists a natural order in the categories
  • Model will consider one category to be "higher" or "closer" to another
20
Categorical Feature Encoding - One-Hot
Encoding
• One-hot encoding (dummy encoding)
  • For example, "Cat, Dog, ..."
  • Simply adds a new 0/1 feature for every category, having 1 (hot) if the sample has that category
  • Can explode if a feature has lots of values, causing issues with high dimensionality
  • What if the test set contains a new category not seen in the training data?
    • Either ignore it (just use all 0's in the row), or handle it manually (e.g. imputation)
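A minimal sketch of one-hot encoding that follows the "ignore" option for unseen categories. The helper names and the toy categories are hypothetical; the category list is learned from the training values only.

```python
def fit_categories(train_values):
    """Learn the category list (in sorted order) from the training data only."""
    return sorted(set(train_values))


def one_hot(value, categories):
    """Return one 0/1 feature per known category; unseen values give all zeros."""
    return [1 if value == c else 0 for c in categories]


categories = fit_categories(["cat", "dog", "dog", "bird"])  # ['bird', 'cat', 'dog']

dog_row = one_hot("dog", categories)    # known category
fish_row = one_hot("fish", categories)  # category not seen during training
```

`dog_row` has a single 1 in the "dog" position, while `fish_row` is all zeros. The number of new columns equals the number of distinct training categories, which is why high-cardinality features can blow up the dimensionality.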

21
Model Validation Scheme
• Always evaluate models as if they are predicting future data
• We do not have access to future data, so we pretend that some data
is hidden
• Simplest way: the holdout (simple train-val-test split) if the dataset is
sufficiently large
  • Randomly split data (and corresponding labels) into training, validation and test sets (e.g.
  60%-20%-20%)
  • Train (fit) a model on the training data and tweak it on the validation data,
  then score on the test data
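A minimal sketch of a random 60%-20%-20% holdout split using only the standard library. The function name, the default fractions and the fixed seed are assumptions for illustration.

```python
import random


def train_val_test_split(samples, labels, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices once, then cut them into train / validation / test parts."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible

    n_test = int(len(indices) * test_frac)
    n_val = int(len(indices) * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def pick(idx):
        # Samples and labels are selected with the same indices, so they stay aligned.
        return [samples[i] for i in idx], [labels[i] for i in idx]

    return pick(train_idx), pick(val_idx), pick(test_idx)


# Hypothetical data: ten samples whose label is ten times the sample value.
samples = list(range(10))
labels = [s * 10 for s in samples]
(train_x, train_y), (val_x, val_y), (test_x, test_y) = train_val_test_split(samples, labels)
```

The model is fitted on the training part, tuned on the validation part, and scored once on the test part.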

22
Q & A
• What are the issues with a simple train-val-test split when the dataset is really
small?

23
K-Fold Cross Validation
• Each random split can yield very different
models (and scores)
  • e.g. all easy (or hard) examples could end up in
  the test set
• Split data into k equal-sized parts, called folds
  • Create k splits, each time using a different fold
  as the test set
• Compute k evaluation scores, aggregate
afterwards (e.g. take the mean)
  • Examine the score variance to see
  how sensitive (unstable) models are
• Large k gives better estimates (more training
data), but is expensive
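A minimal sketch of k-fold splitting and score aggregation. The "model" is a majority-class baseline chosen only to keep the example self-contained; the function names, labels and seed are hypothetical.

```python
import random
from statistics import mean, mode, pstdev


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal-sized parts
    for i in range(k):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train_idx, folds[i]


def majority_baseline_accuracy(labels, train_idx, test_idx):
    """Stand-in for 'train a model, then score it': predict the majority training label."""
    majority = mode(labels[i] for i in train_idx)
    return mean(1.0 if labels[i] == majority else 0.0 for i in test_idx)


labels = [0] * 6 + [1] * 4  # hypothetical binary labels
scores = [
    majority_baseline_accuracy(labels, train_idx, test_idx)
    for train_idx, test_idx in k_fold_indices(len(labels), k=5)
]
mean_score = mean(scores)      # aggregated estimate
score_spread = pstdev(scores)  # how much the score varies across folds
```

A large `score_spread` relative to `mean_score` signals that the estimate depends heavily on which samples land in the test fold.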

24
K-Fold Cross Validation for Hyperparameter
Tuning
• After obtaining the best hyperparameters (models)
using cross-validation, we can further evaluate the
chosen model on separate test data
• In our coursework: we use a simple train-val-test split
for simplicity, but you can also try this as an
additional technique

25
K-Fold Cross Validation for Model Ensembling
• We can create a model ensemble using K-Fold Cross
Validation
• One of the most commonly used tricks on Kaggle

26
Model Evaluation
• We have a positive and a
negative class
• Two different kinds of errors:
  • False Positive: model predicts
  positive while the true label is
  negative
  • False Negative: model predicts
  negative while the true label is
  positive

27
Q&A
• Suppose someone has cancer but was not diagnosed (missed
detection).
• Suppose someone was healthy but was diagnosed with cancer (false
detection).
• What are the consequences? Which situation is more serious?

28
Binary Model Evaluation - Confusion Matrix
• We can represent all predictions (correct
and incorrect) in a confusion matrix
  • n by n array (n is the number of classes)
  • Rows correspond to true classes, columns to
  predicted classes
  • Count how often samples belonging to a
  class C are classified as C or any other class.
  • For binary classification, we label these true
  negative (TN), true positive (TP), false
  negative (FN), false positive (FP)
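A minimal sketch of building a confusion matrix with rows as true classes and columns as predicted classes. The labels are made-up, and classes are assumed to be encoded as integers 0..n-1 with 1 as the positive class.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """matrix[t][p] counts samples of true class t that were predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical binary labels: 1 = positive, 0 = negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

matrix = confusion_matrix(y_true, y_pred, n_classes=2)
tn, fp = matrix[0]  # true class 0: predicted 0 (TN), predicted 1 (FP)
fn, tp = matrix[1]  # true class 1: predicted 0 (FN), predicted 1 (TP)
```

For this data the matrix is [[3, 1], [1, 3]]: three true negatives, one false positive, one false negative and three true positives.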

29
Binary Model Evaluation - Precision, Recall
and F1
• Precision: use when the goal is to
limit FPs
  • Clinical trials: you only want to test
  drugs that really work
  • Search engines: you want to avoid bad
  search results
• Recall: use when the goal is to
limit FNs
  • Cancer diagnosis: you don't want
  to miss a serious disease
  • Search engines: you don't want to
  omit important hits
• F1-score: trades off precision
and recall:
F1 = 2 * precision * recall / (precision + recall)
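The three metrics follow directly from the confusion-matrix counts: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 is their harmonic mean. A minimal sketch with made-up counts; returning 0.0 when a denominator is zero is a convention assumed here.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from binary confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

Here precision is 0.8 but recall is only 0.5, and F1 (about 0.615) sits closer to the weaker of the two, which is why it is used as a trade-off between them.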

30
Multi-class Evaluations
• Train models per class: one class is viewed
as positive and the other(s) as negative; then
calculate metrics per class to get a
per-class evaluation score.
• Micro-averaging: count total TP, FP, TN,
FN (every sample equally important)
• Macro-averaging: average of the
scores obtained on each class
  • Preferable for imbalanced classes (if all
  classes are equally important)
  • Macro-averaged recall is also
  called balanced accuracy
• Weighted averaging: average of the per-class
scores, weighted by the number of samples in each class
31
Summary
• We discussed various feature engineering techniques, including feature
scaling, missing value imputation, outlier handling and categorical
feature encoding
• We discussed the model selection and evaluation procedure, specifically
cross-validation and evaluation metrics.

