3. 3
Q & A
In practical machine learning roles, what percentage of time do you
think is typically spent on data preparation and feature engineering?
(A) 20%
(B) 40%
(C) 60%
(D) 80%
4. 4
Data Preparation and Feature Engineering
The features you use influence more than everything else the result. No algorithm
alone, to my knowledge, can supplement the information gain given by
correct feature engineering.
- Luca Massaron
5. 5
Q&A
- How would you handle missing values in a table? Fill with zeros or use other methods? What issues might arise from filling with zeros?
6. 6
Different types of missing values
- Video: "3 Main Types of Missing Data | Do THIS Before Handling Missing Values!" (YouTube)
7. 7
Missing Value Imputation
MISSING COMPLETELY AT RANDOM (MCAR)
The missingness is random and does not depend on anything else, observed or unobserved. For example, survey answers are accidentally skipped, or a person skips a question for reasons unrelated to the data.
Handling: mean/median/mode imputation, random sample imputation.

MISSING AT RANDOM (MAR)
The missingness depends on other observed information. For example, people with higher incomes might be less likely to skip questions about financial spending than those with lower incomes.
Handling: model-based imputation such as MissForest.

MISSING NOT AT RANDOM (MNAR)
The missingness is related to hidden (unobserved) factors, often the missing value itself. For example, people who have cheated might avoid answering a survey question about cheating.
Handling: almost impossible to handle reliably.
8. 8
Mean/Median/Mode Imputation
- Nature of the missing data: use only when the data is confirmed to be Missing Completely at Random (MCAR).
- Extent of missing data: limited to a maximum of 5% per variable.
- Categorical variables: use mode imputation (fill with the most frequent category).
- Imputation data source: calculate the mean, median, or mode exclusively from the training dataset to prevent data leakage and keep the validation/test sets intact.
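The "training data only" rule above can be sketched with scikit-learn's `SimpleImputer`. This is a minimal illustration, not part of the original slides: the toy arrays are made up, and the choice of `strategy="mean"` is an assumption (use `"median"` or `"most_frequent"` as appropriate).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one numeric feature with missing entries.
X_train = np.array([[1.0], [2.0], [np.nan], [3.0]])
X_test = np.array([[np.nan], [10.0]])

# Learn the fill value from the training split only.
imputer = SimpleImputer(strategy="mean")
X_train_filled = imputer.fit_transform(X_train)

# Reuse the training statistic on the test split; do not refit here,
# otherwise test information leaks into the preprocessing step.
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)    # mean of the observed training values
print(X_test_filled.ravel())  # missing test entry filled with the training mean
```

The test value 10.0 never influences the fill value, which is the point of fitting on the training split alone.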
9. 9
Regression Imputation - MissForest
- Another great application of Random Forest!
- Assumes the data is Missing At Random (MAR).
- Uses information from all the other features for imputation, which typically makes the imputed values more accurate than simple mean/median/mode imputation.
10. 10
Regression Imputation - MissForest
Iterative approach:
1. First, fill missing values with a simple method (e.g., the mean).
2. Pick one column with missing data, use the available data to train a Random Forest model, and predict the missing values.
3. Move to the next column and repeat the process.
4. Continue this cycle until the missing values stop changing significantly or after 5-6 rounds.
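The iterative loop above can be approximated with scikit-learn's `IterativeImputer` using a random forest as the per-column estimator. This is a MissForest-style sketch under stated assumptions, not the reference MissForest implementation: the synthetic data, the 10% missing rate, `n_estimators=50`, and `max_iter=5` are all illustrative choices.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (hypothetical): the third column depends on the first two,
# so the other columns carry information about its missing entries.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(200, 3))
X_full[:, 2] = X_full[:, 0] + 0.5 * X_full[:, 1]

# Knock out roughly 10% of the entries at random.
X_missing = X_full.copy()
mask = rng.random(X_missing.shape) < 0.1
X_missing[mask] = np.nan

imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    initial_strategy="mean",  # step 1: start from a simple mean fill
    max_iter=5,               # step 4: stop after a few rounds at most
    random_state=0,
)
# Steps 2-3: each column with missing values is modelled from the others in turn.
X_imputed = imputer.fit_transform(X_missing)

print(np.isnan(X_imputed).sum())  # no missing values remain
```

Observed entries are left untouched; only the masked positions are replaced by model predictions.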
11. 11
MissForest vs Zero or Mean Imputation
- If computational resources are not a limitation, prefer MissForest over simple imputations like zero or mean, which can distort the dataset's original distribution.
12. 12
Q & A
Suppose I train a KNN classifier without scaling the features. For instance, one feature ranges from -1000 to 1000, while another ranges from -0.001 to 0.001.
What potential issues could arise?
13. 13
Feature Scaling Examples - KNN
Without normalization, the nearest neighbors are determined almost entirely by the feature with the larger range (x2), which can lead to incorrect classification.
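A small numeric sketch makes the effect concrete. The points and feature ranges below are invented for illustration (x1 in roughly [-0.001, 0.001], x2 in roughly [-1000, 1000]), and dividing by the range is just one simple way to rescale.

```python
import numpy as np

# Hypothetical points: columns are (x1, x2).
query = np.array([0.0010, 0.0])
a = np.array([-0.0010, 5.0])   # opposite end of x1's range, close in x2
b = np.array([0.0010, 50.0])   # identical in x1, farther away in x2

def euclidean(p, q):
    """Plain Euclidean distance, as used by a default KNN."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Raw features: x2 dominates, so `a` looks like the nearest neighbor.
raw_a = euclidean(query, a)
raw_b = euclidean(query, b)

# Divide each feature by its (assumed) range so both contribute comparably.
ranges = np.array([0.002, 2000.0])
scaled_a = euclidean(query / ranges, a / ranges)
scaled_b = euclidean(query / ranges, b / ranges)

print("raw:   ", raw_a, raw_b)
print("scaled:", scaled_a, scaled_b)
```

On the raw features the nearest neighbor is `a`; after rescaling it is `b`, because the difference in x1 is no longer drowned out by x2.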
14. 14
Feature Scaling Examples - KNN
Feature scaling can lead to a completely different model in terms of its decision boundary.
15. 15
Feature Scaling
- Use when different numeric features have different scales (different ranges of values).
- Features with much higher values may overpower the others.
- Goal: bring them all within the same range.
- Especially important for the following models:
  - KNN: distances depend mainly on the feature with larger values.
  - SVMs: (kernelized) dot products are also based on distances.
  - Linear models: feature scale affects regularization, and scaled features help gradient-based solvers converge faster.
16. 16
Feature Scaling
Standard Scaler:
- Standardizes each feature to mean 0 and standard deviation 1. This shifts and rescales the data; it does not turn it into a Gaussian distribution.
- Formula: x_scaled = (x - mean) / std_dev
- Use when the data distribution is assumed to be roughly normal.
Min-Max Scaler:
- Scales features to a given range, often [0, 1].
- Transforms all data points proportionally within the range.
- Formula: x_scaled = (x - x_min) / (x_max - x_min)
- Use for scaling within a bounded range.
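Both formulas can be tried on a single made-up feature. This is a plain-Python sketch; the function names are illustrative, and the population standard deviation is an assumption about which std_dev the slide means.

```python
from statistics import mean, pstdev

def standard_scale(values):
    """x_scaled = (x - mean) / std_dev, using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max_scale(values):
    """x_scaled = (x - x_min) / (x_max - x_min), mapping the data onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]   # hypothetical feature
standardized = standard_scale(heights_cm)          # mean 0, standard deviation 1
rescaled = min_max_scale(heights_cm)               # smallest -> 0, largest -> 1
```

In practice the mean, standard deviation, minimum and maximum would be learned on the training split and reused on validation and test data.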
17. 17
But how to handle feature scaling with outliers?
Question: What is the median? What is the 75th percentile?
Robust Scaler: Reduces the influence of outliers on scaling.
- Centers using the median and scales using the IQR.
- x_scaled = (x - median) / IQR
- Use when outliers are present and need to be mitigated.
- IQR Calculation: IQR = Q3 - Q1 (the difference between the 75th percentile (Q3) and the 25th percentile (Q1) in a dataset)
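A short sketch of the robust scaling formula on made-up data with one outlier. The quartiles are taken with the standard library's "inclusive" method, which is one of several common percentile conventions, so other tools may give slightly different Q1 and Q3 on small samples.

```python
from statistics import median, quantiles

def robust_scale(values):
    """x_scaled = (x - median) / IQR, with IQR = Q3 - Q1."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(v - med) / (q3 - q1) for v in values]

incomes = [1, 2, 3, 4, 100]        # hypothetical feature; 100 is an outlier
robust = robust_scale(incomes)     # median = 3, Q1 = 2, Q3 = 4, IQR = 2
```

The four ordinary values land between -1 and 0.5, while the outlier stays far away at 48.5. With min-max scaling the same outlier would squeeze the ordinary values into roughly the bottom 3% of the [0, 1] range.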
18. 18
Q & A
- Suppose you have a dataset with categorical features, such as 'dog' and 'cat'. Logistic regression, however, cannot directly handle categorical features.
- To make these features compatible with the model, we might encode 'dog' as '0' and 'cat' as '1'. Is this a good approach? Why or why not?
19. 19
Categorical Feature Encoding
- Ordinal encoding
  - For example, "Jan, Feb, Mar, Apr"
  - Simply assigns an integer value to each category, in the order the categories are encountered
  - Only really useful if there exists a natural order in the categories
  - The model will consider one category to be 'higher' than or 'closer' to another
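A minimal sketch of ordinal encoding by order of first appearance, using the month example (the helper name `fit_ordinal` is hypothetical):

```python
def fit_ordinal(categories):
    """Map each category to an integer, in the order it is first encountered."""
    mapping = {}
    for category in categories:
        mapping.setdefault(category, len(mapping))
    return mapping

months = ["Jan", "Feb", "Mar", "Apr", "Feb", "Jan"]
mapping = fit_ordinal(months)                 # Jan -> 0, Feb -> 1, Mar -> 2, Apr -> 3
encoded = [mapping[m] for m in months]
```

The integers only reflect the calendar order here because the months happen to appear in that order. For categories without a natural order, the model would still treat 3 as 'higher' than 0.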
20. 20
Categorical Feature Encoding - One-Hot Encoding
- One-hot encoding (dummy encoding)
  - For example, "Cat, Dog, ..."
  - Simply adds a new 0/1 feature for every category, set to 1 (hot) if the sample has that category
  - Can explode if a feature has lots of values, causing issues with high dimensionality
  - What if the test set contains a new category not seen in the training data?
    - Either ignore it (just use all 0's in the row), or handle it manually (e.g. imputation)
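A small sketch of one-hot encoding, including the "ignore" strategy for unseen categories. The category list is learned from made-up training values, and the helper names are illustrative.

```python
def fit_one_hot(train_values):
    """Learn the list of known categories from the training data."""
    return sorted(set(train_values))

def one_hot(value, categories):
    """One 0/1 feature per known category; an unseen value yields all zeros."""
    return [1 if value == c else 0 for c in categories]

categories = fit_one_hot(["cat", "dog", "cat", "bird"])   # ['bird', 'cat', 'dog']
dog_row = one_hot("dog", categories)
unseen_row = one_hot("hamster", categories)               # not seen in training
```

The number of output columns equals the number of distinct training categories, which is why features with many values blow up the dimensionality.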
21. 21
Model Validation Scheme
- Always evaluate models as if they are predicting future data
- We do not have access to future data, so we pretend that some data is hidden
- Simplest way: the holdout (simple train-val-test split), if the dataset is sufficiently large
  - Randomly split the data (and corresponding labels) into training, validation and test sets (e.g. 60%-20%-20%)
  - Train (fit) a model on the training data and tweak it on the validation data, then score on the test data
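A minimal sketch of a random 60%-20%-20% holdout split in plain Python. The function name, the fixed seed and the toy sample list are assumptions for illustration.

```python
import random

def train_val_test_split(samples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of the samples, then cut it into train / val / test parts."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)     # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

samples = list(range(100))                    # stand-in for (features, label) pairs
train, val, test = train_val_test_split(samples)
```

Keeping features and labels together in each sample ensures they stay aligned after shuffling.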
22. 22
Q & A
- What are the issues with a simple train-val-test split when the dataset is really small?
23. 23
K-Fold Cross Validation
- Each random split can yield very different models (and scores)
- e.g. all easy (or hard) examples could end up in the test set
- Split data into k equal-sized parts, called folds
- Create k splits, each time using a different fold as the test set
- Compute k evaluation scores, aggregate afterwards (e.g. take the mean)
- Examine the score variance to see how sensitive (unstable) models are
- Large k gives better estimates (more training data), but is expensive
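The fold construction can be sketched as an index generator. The function name is hypothetical, and the data is assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

splits = list(k_fold_indices(n_samples=10, k=5))
first_train, first_test = splits[0]
```

Each sample appears in exactly one test fold. A model would be fitted on `train_idx` and scored on `test_idx` for every split, and the k scores then averaged.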
24. 24
K-Fold Cross Validation for Hyperparameter Tuning
- After we have obtained the best hyperparameters (models) using cross validation, we can further evaluate the chosen model on separate test data
- In our coursework: we use a simple train-val-test split for simplicity, but you can also try this as an additional technique
25. 25
K-Fold Cross Validation for Model Ensembling
- We can create a model ensemble using K-Fold Cross Validation
- One of the most commonly used tricks on Kaggle
26. 26
Model Evaluation
- We have a positive and a negative class
- Two different kinds of errors:
  - False Positive: model predicts positive while the true label is negative
  - False Negative: model predicts negative while the true label is positive
27. 27
Q&A
- Suppose someone has cancer but was not diagnosed (missed detection).
- Suppose someone was healthy but was diagnosed with cancer (false detection).
- What are the consequences? Which situation is more serious?
28. 28
Binary Model Evaluation - Confusion Matrix
- We can represent all predictions (correct and incorrect) in a confusion matrix
- n by n array (n is the number of classes)
- Rows correspond to true classes, columns to predicted classes
- Count how often samples belonging to a class C are classified as C or any other class.
- For binary classification, we label these true negative (TN), true positive (TP), false negative (FN), false positive (FP)
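A minimal sketch of building the matrix with rows as true classes and columns as predicted classes. The labels are made up, and class 1 is assumed to be the positive class.

```python
def confusion_matrix(y_true, y_pred, n_classes=2):
    """matrix[t][p] counts samples of true class t that were predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]    # hypothetical ground-truth labels
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]    # hypothetical model predictions
cm = confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = cm              # row 0 = true negative class, row 1 = true positive class
```

The same function works for more than two classes by passing a larger `n_classes`, provided the labels are integers starting at 0.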
29. 29
Binary Model Evaluation - Precision, Recall and F1
- Precision: use when the goal is to limit FPs
  - Clinical trials: you only want to test drugs that really work
  - Search engines: you want to avoid bad search results
- Recall: use when the goal is to limit FNs
  - Cancer diagnosis: you don't want to miss a serious disease
  - Search engines: you don't want to omit important hits
- F1-score: trades off precision and recall: F1 = 2 * precision * recall / (precision + recall)
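The three metrics follow directly from the confusion-matrix counts, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). A short sketch with made-up counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and their harmonic mean (F1) from raw counts."""
    precision = tp / (tp + fp)       # of all predicted positives, how many were right
    recall = tp / (tp + fn)          # of all actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
```

With these counts precision is 0.8 and recall is 0.5. F1 is about 0.62, below the arithmetic mean of 0.65, because the harmonic mean is pulled towards the weaker of the two.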
30. 30
Multi-class Evaluations
- Train models per class: one class is viewed as positive and the other(s) as negative; calculating the metrics per class then gives a per-class evaluation score.
- Micro-averaging: count total TP, FP, TN, FN (every sample equally important)
- Macro-averaging: average of scores obtained on each class
  - Preferable for imbalanced classes (if all classes are equally important)
  - Macro-averaged recall is also called balanced accuracy
- Weighted averaging: average of the per-class scores, weighted by the number of samples in each class
Summary
- We discussed various feature engineering techniques, including feature scaling, missing value imputation, outlier handling and categorical feature encoding
- We discussed the model selection and evaluation procedure, specifically cross-validation and evaluation metrics.
Editor's Notes
#2: In today's lecture, we'll explore the workflow of a typical machine learning system. It covers preprocessing of raw data (feature scaling, encoding, discretization, label imbalance correction, feature selection, dimensionality reduction) and learning and evaluation (acknowledging algorithm biases, model selection, data splitting, and ultimately prediction). Together these form an integrated pipeline that prepares data for learning and optimizes model performance.
#4: What is a feature, and why do we need to engineer it? Basically, all machine learning algorithms use some input data to create outputs. This input data comprises features, which are usually in the form of structured columns. Algorithms require features with specific characteristics to work properly. This is where the need for feature engineering arises. I think feature engineering efforts mainly have two goals:
Preparing the proper input dataset, compatible with the machine learning algorithm requirements.
Improving the performance of machine learning models.
According to a survey in Forbes, data scientists spend 80% of their time on data preparation.
#7: In data analysis, dealing with missing data is a common challenge, and the approach to imputation often depends on the nature of the missingness. With Missing Completely at Random (MCAR), the absence of data is independent of both observed and unobserved data, akin to survey responses being accidentally skipped during data entry or respondents choosing to leave a question blank without any systematic bias. Simple methods like imputing the mean, median, or mode, or using a random sample, can be effective for MCAR since the missingness does not introduce a systemic distortion. In contrast, Missing at Random (MAR) occurs when the propensity for missing data is related to other, observed data. For instance, a pattern where one gender omits answers to questions about parental leave more frequently than the other. In such cases, more sophisticated techniques like MissForest can be employed, which predict missing values using patterns found in other variables. However, when dealing with Missing Not at Random (MNAR), the problem becomes more complex, as the missing data is related to factors not captured in the dataset. For example, individuals who have been unfaithful may avoid questions about fidelity. Addressing MNAR effectively is particularly challenging because the very nature of the missing data is obscured by unobserved influences, resisting standard imputation methods.
#9: Assuming data is Missing at Random (MAR), one can utilize more sophisticated imputation methods that leverage the entire dataset, potentially resulting in greater predictive accuracy than would be achieved with simple mean, median, or mode imputation.
#10: An iterative approach begins by filling in missing values with a basic imputation, such as the mean of the observed values. The dataset is then split: a portion with complete data is used for training, while the subset with previously imputed values is treated as the target for prediction. A Random Forest algorithm, known for its robustness, is applied to predict the missing values, which are then updated with these predictions. This cycle is repeated, progressively refining the quality of the data with each iteration, until the imputations stabilize and no significant changes occur, or until a predetermined number of iterations is reached. This iterative process ensures that each round of imputation benefits from the enhanced patterns and relationships uncovered in the data from the previous round.
#13: Without normalization of feature scales, machine learning algorithms that rely on distance calculations, such as K-Nearest Neighbors (KNN), can be significantly biased. In cases where one feature (x2) has a much larger range than the others, the nearest neighbors are determined almost entirely by x2. This happens because the larger-scaled features overpower the smaller ones, skewing the distance metric in favor of the larger ranges.
Consequently, this leads to incorrect classification or prediction results, as the model essentially overlooks the contributions of features with smaller ranges. Normalization ensures that each feature contributes equally to the distance computations, thereby preventing the axis with the larger range from disproportionately determining the nearest neighbors.
#15: When your data has numeric features that vary widely in scale, some features with higher numerical values might dominate over others during the modeling process, skewing the results. The aim is to level the playing field by bringing every feature into the same range of values. This step is particularly crucial for models like K-Nearest Neighbors (KNN), where the calculation of distances can be heavily biased towards the feature with the larger scale. Similarly, Support Vector Machines (SVMs) rely on dot products when kernelized, which again depend on the distances between data points. Even in linear models, the scale of features can influence how regularization is applied. Normalizing or standardizing these features ensures that each one contributes equally to the model's performance, allowing for a more accurate and fair analysis.
#16: In data preprocessing, the Standard Scaler standardizes features so that each has mean 0 and standard deviation 1, using the transformation x_scaled = (x - mean) / std_dev. It shifts and rescales the data without changing the shape of its distribution, and it is particularly effective when the data is assumed to follow a normal distribution. On the other hand, the Min-Max Scaler adjusts features to fall within a specific range, typically between 0 and 1, according to the formula x_scaled = (x - x_min) / (x_max - x_min), ensuring all values are proportionately adjusted within this bounded interval, which is ideal when scaling needs to adhere to a predefined range.
#17: The Robust Scaler, however, is designed to be insensitive to outliers by using the median and the interquartile range (IQR) for centering and scaling, respectively, as expressed by x_scaled = (x - median) / IQR, where IQR is the range between the 75th and 25th percentiles. This scaler is useful in datasets where outliers could skew the scaling process.
The median is the middle number when you put all your numbers in order. The 25th percentile is the value below which one-quarter of the numbers fall.
#19: Ordinal encoding and one-hot encoding are two methods for converting categorical data into numerical form for machine learning models. Ordinal encoding assigns an integer to each category based on the order they are encountered, which is beneficial if the categories have a natural ranking; for instance, the months "Jan, Feb, Mar, Apr" could be encoded as 1, 2, 3, 4, respectively. However, this technique implies a hierarchy where some categories are considered 'higher' or 'closer' to others, which may not always be appropriate. On the other hand, one-hot encoding creates a new binary feature for each category, which is set to 1 (hot) if the sample belongs to that category, and 0 otherwise, as seen with categories like "Cat, Dog." While this method avoids implying any order, it can result in a high number of features, especially if the categorical variable has many unique values, leading to high dimensionality problems. Additionally, if new categories appear in the test set that weren't present in the training data, they must be either ignored or handled manually, such as through imputation, to ensure the model can process them.