Data clustering, data deduction and data visualization. Using advanced skills to encode free-format articles and cluster the data with pre-trained LLM models.
3. Contents
o Orange Workflow Overview.
o Visualizing data from input data file.
o Analyzing data with regression models and decision trees.
o Analyzing data with deep learning models.
o Text processing and classification.
4. Weekly Learning Outcomes
1. Understand the data workflow in Orange.
2. Inspect data with Orange visualisations.
3. Develop predictive models in Orange, including regression, decision trees and deep learning models.
4. Perform textual data analysis with Orange.
5. Assess the quality of various prediction methods.
5. Required Reading
• Orange Visual Programming Documentation (Release 3). Orange (2021).
https://buildmedia.readthedocs.org/media/pdf/orange-visual-programming/latest/orange-visual-programming.pdf
  o Chapter 1: sections 1.2, 1.3 (subsections 1-2), 1.4, 1.5. Chapter 2: sections 2.1 (subsections 1-10, 34), 2.2 (subsections 1, 3, 4, 5, 16), 2.3 (subsections 5, 9, 10, 13) and 2.4 (subsections 2, 4, 6).
• AJDA. (2017, August 4). Text Analysis: New Features. Orangedatamining.Com.
https://orangedatamining.com/blog/2017/08/04/text-analysis-new-features/
Recommended Readings
• Zupan, D. (2018, May). Introduction to Data Mining: Working notes for the hands-on course with Orange Data Mining. University of Ljubljana. https://file.biolab.si/notes/2018-05-intro-to-datamining-notes.pdf
  o Lessons: 1, 2, 3, 4, 5, 6, 7, 8, 10, 14, 17, 31, 32.
• Foong, N. W. (2019, August 7). Data Science Made Easy: Interactive Data Visualization using Orange. Medium, Towardsdatascience.com. https://towardsdatascience.com/data-science-made-easy-interactive-data-visualization-using-orange-de8d5f6b7f2b
• Analytics Vidhya, A. (2017, September 7). Building Machine Learning Model is fun using Orange. Analytics Vidhya.
https://www.analyticsvidhya.com/blog/2017/09/building-machine-learning-model-fun-using-orange/
• Foong, N. W. (2019b, August 29). Data Science Made Easy: Image Analytics using Orange. Medium, Towardsdatascience.com.
https://towardsdatascience.com/data-science-made-easy-image-analytics-using-orange-ad4af375ca7a
This presentation draws mainly on the above resources.
6. Recommended Videos
• Getting Started with Orange 01: Welcome to Orange (2015, December 21). [Video]. YouTube.
https://www.youtube.com/channel/UClKKWBe2SCAEyv7ZNGhIe4g
• Orange Data Mining tool. (2016, May 5). [Video]. YouTube.
https://www.youtube.com/watch?v=rrsRBSCHDXw
• Getting Started with Orange 16: Text Preprocessing (2017, June 20). [Video]. YouTube.
https://www.youtube.com/watch?v=V70UwJZWkZ8
• Text Mining: Twitter Data Analysis (2020, August 4). [Video]. YouTube.
https://www.youtube.com/watch?v=HDkI6G4slzQ
• Getting Started with Orange 18: Text Classification (2017, June 28). [Video]. YouTube.
https://www.youtube.com/watch?v=zO_zwKZCULo
8. Orange workflow
“Orange is a component-based data mining software. It includes a range of data visualization, exploration, preprocessing and modeling techniques. It can be used through a nice and intuitive user interface or, for more advanced users, as a module for the Python programming language.” (Orange official GitHub page).
9. Orange workflow
• The core principle of Orange is visual programming.
• The basic processing units of any data manipulation in Orange are called widgets.
• Each analytical step/action is contained within a widget.
• Widgets communicate by sending information along a communication channel; the output from one widget is used as input to another.
• A workflow is the sequence of steps/actions that is performed to accomplish a particular task.
• Widgets are placed on the canvas and connected into an analytical workflow.
• An Orange analytical workflow is executed from left to right and never passes data backwards.
• Orange workflows consist of components that:
  o Read data
  o Process data
  o Visualize data
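The read, process, visualize chain described above can be mimicked in plain Python to make the left-to-right data flow concrete. This is only an illustrative sketch, not Orange's API: the function names `read`, `process` and `visualize`, the inline data and the column names are all assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical inline data standing in for an input file.
RAW = "species,petal_length\nsetosa,1.4\nsetosa,1.3\nvirginica,5.5\n"

def read(text):
    """'Read' step: parse delimited text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def process(rows, min_length):
    """'Process' step: keep rows whose petal length is at least min_length."""
    return [r for r in rows if float(r["petal_length"]) >= min_length]

def visualize(rows):
    """'Visualize' step: render a one-line text bar chart of species counts."""
    counts = Counter(r["species"] for r in rows)
    return ", ".join(f"{name}: {'#' * n}" for name, n in sorted(counts.items()))

# Data flows strictly left to right: read -> process -> visualize.
rows = read(RAW)
selected = process(rows, min_length=1.4)
summary = visualize(selected)
print(summary)
```

Each function only consumes the output of the step before it, which mirrors how a widget receives its input from the widget on its left.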
10. Simple Orange workflow - Files and Data Tables
• File widget: reads the data.
• Data Table widget: a viewer that shows the data in a spreadsheet. It passes onwards only the selection.
• The data is always available in the File widget.
Workflow with two connected widgets
11. Simple Orange workflow - Files and Data Tables
• Most Orange workflows start with the File widget.
• Orange can import comma-delimited, tab-delimited and .xlsx data files, from a local file or from a URL.
Example:
The File widget reads the data, which is sent to both the Data Table and the Box Plot widgets.
12. Workflows with subsets
• Visualizations in Orange are interactive, which means the user can select data instances from the plot and pass them downstream.
Example:
• Selecting subsets
  o Step 1: Place the File widget on the canvas.
  o Step 2: Connect Scatter Plot to it.
  o Step 3: Click and drag a rectangle around a subset of points.
  o Step 4: Connect Data Table to Scatter Plot → Data Table will show the selected points.
• Highlighting workflows
  o Connect Data Table to Scatter Plot.
  o Select a subset of points from the Data Table → Scatter Plot will highlight the selected points.
13. Workflows - data exploration
• The Feature Statistics widget provides a quick way to inspect and find interesting features in a given data set.
• Example: Heart-disease data exploration
  o Select a subset of potentially interesting features, or simply select the features we want to keep.
  o The widget outputs a new data set with only the selected features.
14. Workflows with Models
• Predictive models are evaluated in the Test and Score widget.
• Test and Score accepts several inputs:
1. Data (data set for evaluating models).
2. Learners (algorithms to use for training the model).
3. Optional preprocessor (for normalization or feature selection).
15. Workflows with Models
The Test and Score widget does two things:
1. It shows evaluation results (results of testing different classification/regression algorithms).
2. It outputs evaluation results, which can be used by other widgets for analysing the performance of classifiers, such as Confusion Matrix.
Sampling settings (e.g., cross-validation or another train-and-test procedure).
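To make the cross-validation sampling option more concrete, here is a minimal plain-Python sketch of k-fold evaluation. It is not how Test and Score is implemented: the helper names `k_fold_indices` and `majority_learner`, the interleaved fold assignment (no shuffling or stratification) and the toy data are all assumptions chosen for brevity.

```python
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds by taking every k-th index."""
    return [list(range(i, n, k)) for i in range(k)]

def majority_learner(train_labels):
    """Trivial learner: the returned model always predicts the most common label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: most_common

def cross_validate(data, labels, k, learner):
    """Return the fraction of held-out instances that were predicted correctly."""
    n = len(data)
    correct = 0
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        # Train only on instances outside the current test fold.
        train_labels = [labels[i] for i in range(n) if i not in held_out]
        model = learner(train_labels)
        correct += sum(model(data[i]) == labels[i] for i in test_idx)
    return correct / n

# Hypothetical data set: 9 instances, 6 labelled "yes" and 3 labelled "no".
data = list(range(9))
labels = ["yes"] * 6 + ["no"] * 3
accuracy = cross_validate(data, labels, k=3, learner=majority_learner)
print(round(accuracy, 3))
```

Every instance is used for testing exactly once, and the model that predicts it never saw it during training.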
16. Workflows with Models - Evaluation
• Confusion Matrix widget: shows proportions between the predicted and actual class.
Inputs:
• Evaluation results: results of testing classification algorithms.
Outputs:
• Selected Data: data subset selected from the confusion matrix.
• Data: data with additional information on whether a data instance was selected.
The test results are fed into the Confusion Matrix, where we can observe how many instances were misclassified and in which way.
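As a rough illustration of what a confusion matrix contains, the sketch below counts (actual, predicted) pairs in plain Python. The labels and predictions are made up for the example, and `confusion_matrix` is a hypothetical helper, not an Orange function.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))

# Hypothetical test results for five instances.
actual = ["cat", "cat", "dog", "dog", "dog"]
predicted = ["cat", "dog", "dog", "dog", "cat"]

matrix = confusion_matrix(actual, predicted)
labels = sorted(set(actual) | set(predicted))

# Rows are actual classes, columns are predicted classes.
for a in labels:
    print(a, [matrix[(a, p)] for p in labels])

# Indices of misclassified instances, analogous to selecting
# off-diagonal cells and passing the data onwards.
misclassified = [i for i, (a, p) in enumerate(zip(actual, predicted)) if a != p]
print(misclassified)
```

The diagonal cells hold correct predictions; every off-diagonal cell shows one specific kind of error.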
17. Workflows with Models
• Predictions on new data are made in the Predictions widget:
  o The training data is first passed to the model.
  o Once the model is trained, it is passed to Predictions.
  o The Predictions widget also needs data to predict on (test data), which is passed as a second input.
18. Workflows with Models
• Predictive models can be saved and reused in different Orange workflows.
• To save a model:
1. Models first require data for training.
2. They output a trained model, which can be saved with the Save Model widget in the pickle format.
• A trained model can be loaded and used in Predictions and/or elsewhere.
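The save-and-reuse idea can be sketched with Python's standard `pickle` module. This is an assumption-laden toy, not Orange's Save Model code: the "model" is just a dictionary holding a threshold, and `train`, `predict` and the file name `model.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

def train(values):
    """Toy 'training': store the mean of the training values as a threshold."""
    return {"kind": "threshold", "threshold": sum(values) / len(values)}

def predict(model, value):
    """Classify a value as 'high' or 'low' relative to the stored threshold."""
    return "high" if value >= model["threshold"] else "low"

model = train([1.0, 2.0, 3.0, 6.0])  # the threshold becomes 3.0

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "model.pkl")  # hypothetical file name
    with open(path, "wb") as handle:
        pickle.dump(model, handle)      # save the trained model
    with open(path, "rb") as handle:
        restored = pickle.load(handle)  # load it again, no retraining needed

print(predict(restored, 4.5))
```

Unpickling can execute arbitrary code, so model files should only be loaded from sources you trust.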
23. Visualizations in Orange
• Exercise: Build a simple workflow with File and Scatter Plot for the Iris dataset.
The scatter plot shows petal width (x-axis) against petal length (y-axis) for the three species of Iris flowers. Petal length increases roughly linearly with petal width.
24. Visualizations in Orange
• Scatter Plot supports zooming in and out of part of the plot and manual selection of data instances.
• Example: Explorative data analysis using the Iris dataset.
1. Select data instances from a rectangular region on the Scatter Plot.
2. Send them to the Data Table widget.
3. Explore the relationship between any two variables.
25. Visualizations in Orange
• Basic data exploration from an input data file.
Example: heart_disease
How does the data look?
26. Cont…
Example: heart_disease
Explore the data: do standard visualizations tell us anything interesting?
• The Box Plot widget is most commonly used immediately after the File widget to observe the statistical properties of a dataset and discover any anomalies (such as duplicated values, outliers, …).
• The Scatter Plot widget provides a 2-dimensional scatter plot visualization for continuous attributes.
• The Distributions widget displays the value distribution of discrete or continuous attributes.
27. Cont…
Example: heart_disease
Box plot for attribute 'age' grouped by 'gender'
Max HR decreases with age.
Distribution of 'chest pain' with columns split by 'gender'
28. Cont…
Example: heart_disease
• Data can also be split by the value of a feature, and each part analysed separately.
• Split data by gender: use the Select Rows widget.
Choose the female patients in the Select Rows widget.
Selecting data instances works well together with visualising data distributions when exploring the data.
29. Visualizations in Orange
• Reports
  o Reports make it possible to trace back analytical steps, because each report segment saves the workflow state at which it was created.
  o Reports can be saved in .html, .pdf or .report format.
31. Analyzing data with decision trees
• The decision tree is one of the oldest, but still popular, machine learning methods.
• Decision trees workflow example
• Decision trees in Orange do not use any data pre-processing.
32. Analyzing data with decision trees
Tree Viewer
• This widget can be used for visualizing decision trees.
• To enable explorative data analysis, select a node; the widget then outputs the data associated with that node.
• If both the viewer and Tree are open, any re-run of the tree induction algorithm will immediately affect the visualization.
Explore how the parameters of the decision tree algorithm influence the structure of the resulting tree.
33. Analyzing data with decision trees
Tree parameters:
• Induce binary tree: build a binary tree (split into two child nodes).
• Min. number of instances in leaves: if checked, the algorithm will never construct a split which would put fewer than the specified number of training examples into any of the branches.
• Do not split subsets smaller than: forbids the algorithm to split nodes with fewer than the given number of instances.
• Limit the maximal tree depth: limits the depth of the classification tree to the specified number of node levels.
• Stop when majority reaches [%]: stop splitting the nodes after a specified majority threshold is reached.
34. Analyzing data with decision trees
Example: Using sailing data, predict the conditions under which a skipper friend went sailing
Load the data → build a tree → visualize it in the Tree Viewer
35. Analyzing data with decision trees
Example: Using sailing data, predict the conditions under which a skipper friend went sailing
• Trees place the most useful feature at the root.
• The most useful feature is the one that splits the data into the two purest possible subsets.
• These are then split further, again by the most informative features.
• This process of breaking up the data subsets into smaller ones repeats until we reach subsets where all data belongs to the same class.
• These subsets are represented by leaf nodes in strong blue or red.
• The process of data splitting can also terminate when it runs out of data instances or out of useful features (the two leaf nodes in white).
36. Analyzing data with decision trees
Example: Using sailing data, predict the conditions under which a skipper friend went sailing
According to the decision tree results:
• It looks like this skipper is a social person; as soon as there’s company, the probability of her sailing increases.
• When joined by a smaller group of individuals, there is no sailing if there is rain (Thunderstorms? Too dangerous?)
• When she has a smaller company, but the boat at her disposal is big, there is no sailing either.
37. Analyzing data with decision trees
Example: Using sailing data, predict the conditions under which a skipper friend went sailing
• What are the most useful features?
• Rank widget: estimates the quality of data features and ranks them according to how much information they carry.
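One common way to score how much information a feature carries is information gain, the reduction in class entropy after splitting on that feature. The sketch below ranks two features on a tiny made-up table; the rows are hypothetical and are not the sailing data set, and the helpers `entropy` and `information_gain` are written for this example rather than taken from Orange.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, feature, target):
    """Entropy of the target minus the weighted entropy after splitting on feature."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r[target] for r in rows if r[feature] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical data: whether someone went sailing, by company and outlook.
rows = [
    {"company": "big", "outlook": "sunny", "sail": "yes"},
    {"company": "big", "outlook": "rainy", "sail": "yes"},
    {"company": "none", "outlook": "sunny", "sail": "no"},
    {"company": "none", "outlook": "rainy", "sail": "no"},
]

features = ["company", "outlook"]
scores = {f: information_gain(rows, f, "sail") for f in features}
ranked = sorted(features, key=scores.get, reverse=True)
print(ranked, scores)
```

In this toy table "company" separates the classes perfectly, so it would be the natural choice for the root of a tree, while "outlook" carries no information about the class.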
38. Analyzing data with decision trees
Model inspection example
• To inspect a model, combine the Tree and Scatter Plot widgets to display instances taken from a chosen decision tree node.
Iris dataset: Model inspection example
39. Analyzing data with decision trees
• Decision trees also work for regression tasks.
Decision tree for housing dataset example
40. Prediction models
• On its own, the Predictions widget shows the data but makes no predictions.
• To analyse data with a prediction model, a predictive model is needed.
• The Predictions widget uses the model to make predictions about the data and shows them in the table.
• Example:
1. The data is fed into the model widget to infer a predictive model.
2. The Predictions widget gets the data from the File widget and also a predictive model from the model widget.
The connection from the model widget is a channel that carries a model.
41. Analyzing data with regression models
Two regression models are available:
• Linear Regression works with continuous data.
• Logistic Regression only works for classification tasks (it learns a Logistic Regression model from the data).
Example: Demonstrate prediction results with logistic regression on the hayes-roth dataset.
Training:
1. First load the hayes-roth dataset in the File widget.
2. Pass the data to the Logistic Regression model for training.
3. Pass the trained model to the Predictions widget.
Testing: predict class values on a new dataset
1. Load hayes-roth_test in the second File widget.
2. Connect it to Predictions.
3. Observe class values predicted with Logistic Regression directly in Predictions.
42. Analyzing data with regression models
Example: Predict Iris flower type using Logistic Regression.
43. Analyzing data with regression models
• The Linear Regression widget constructs a learner/predictor that learns a linear function from its input data.
• The model can identify the relationship between a predictor x and the response variable y.
• Linear regression works only on regression tasks.
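The relationship between a single predictor x and a response y can be illustrated with ordinary least squares in a few lines of plain Python. This is a simplified sketch with made-up numbers and one predictor only; `fit_line` is a hypothetical helper and does not reflect the options of the Linear Regression widget.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5.0 + intercept  # predict y for a new x
print(slope, intercept, prediction)
```

The learned function is the "model"; predicting for a new x simply evaluates that function.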
44. Analyzing data with regression models
• Example: Train a Linear Regression model on the housing dataset and evaluate its performance in Test and Score.
46. Analyzing data with deep learning models
• The Neural Network widget is a multi-layer perceptron (MLP) algorithm with backpropagation.
Inputs
• Data: input dataset
• Preprocessor: preprocessing method(s)
Outputs
• Learner: multi-layer perceptron learning algorithm
• Model: trained model
A neural network with 3 layers can be defined as 2, 3, 2 (neurons per hidden layer).
47. Analyzing data with deep learning models
• Neural Network uses default preprocessing when no other preprocessors are given. It executes them in the following order:
1. Removes instances with unknown target values
2. Continuizes categorical variables (with one-hot encoding)
3. Removes empty columns
4. Imputes missing values with mean values
5. Normalizes the data by centering to the mean and scaling to a standard deviation of 1
• To remove default preprocessing, connect an empty Preprocess widget to the learner.
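Several of the default preprocessing steps listed above can be sketched in plain Python to show what they do to the data. This is an illustration under assumptions, not Orange's implementation: the helper names and the toy columns are hypothetical, the population standard deviation is used for scaling, and the "remove empty columns" step is omitted for brevity.

```python
from statistics import mean, pstdev

def drop_unknown_targets(rows, targets):
    """Step 1: remove instances whose target value is unknown (None)."""
    kept = [(r, t) for r, t in zip(rows, targets) if t is not None]
    return [r for r, _ in kept], [t for _, t in kept]

def one_hot(values):
    """Step 2: turn a categorical column into one 0/1 column per category."""
    categories = sorted(set(values))
    return [[1.0 if v == c else 0.0 for c in categories] for v in values]

def impute_mean(values):
    """Step 4: replace missing values (None) with the mean of the known ones."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def standardize(values):
    """Step 5: center to mean 0 and scale to standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical data: (gender, age) rows with one missing age and one missing target.
rows = [("m", 30.0), ("f", None), ("m", 50.0), ("f", 61.0)]
targets = ["yes", "no", "yes", None]

rows, targets = drop_unknown_targets(rows, targets)
encoded = one_hot([gender for gender, _ in rows])
imputed = impute_mean([age for _, age in rows])
scaled = standardize(imputed)
print(encoded, imputed, scaled)
```

After these steps every column is numeric, has no missing values and is on a comparable scale, which is what a multi-layer perceptron needs as input.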
48. Analyzing data with deep learning models
Example: Neural Network workflow for a classification task on the Iris data.
49. Analyzing data with deep learning models
Example: Neural Network workflow for a prediction task on the Iris data.
1. Input the Neural Network prediction model into Predictions.
2. Observe the predicted values.
50. Analyzing data with deep learning models
Example: image analytics workflow on a domestic animal image dataset using the Image Analytics add-on.
1. Import the image data via the Import Images widget.
2. Display all of the loaded images using the Image Viewer widget.
3. For image data analysis, the Image Embedding widget must be used, because classification and regression tasks accept data in the form of numbers.
The most important parameter of the Image Embedding widget is the Embedder. Supported deep network embedders: SqueezeNet, Inception v3, VGG-16, VGG-19, Painters, DeepLoc.
51. Analyzing data with deep learning models
The Image Embedding widget converts images to vectors of numbers.
52. Analyzing data with deep learning models
• The Image Grid widget displays images from a dataset in a similarity grid, such that images with similar content are placed closer to each other.
• The Image Grid widget can be used for image comparison, while looking for similarities or discrepancies between selected data instances.
53. Analyzing data with deep learning models
Example: Workflow for image analytics for a classification task on the animal image dataset.
1. Pass the output of Image Embedding to Test and Score.
2. Use a Neural Network learner with 3 layers of 10 neurons each.
3. Input the learner into Test and Score.
4. Observe the predicted values.
54. Analyzing data with deep learning models
Example: Workflow for image analytics for a classification task on the animal image dataset.
The model achieved 94% accuracy.
Misclassification: the model predicted the image as a cat instead of a dog.
55. Analyzing data with deep learning models
Example: Workflow for image analytics for a classification task on the animal image dataset.
Use the Image Viewer to investigate the misclassified example.
Select the misclassified example.
Error justification: in the misclassified image, the dog looks like a cat.
57. Textual data analysis
• Orange supports textual data analysis through the Text add-on.
• Common text widgets:
  o Preprocess Text: preprocesses text (e.g., removing stopwords, lowercasing, …).
  o Corpus Viewer: views corpus content.
  o Sentiment Analysis: enables basic sentiment analysis of corpora.
  o Topic Modelling: uncovers the hidden thematic structure in a corpus.
  o Word Cloud: displays word frequency.
• Typical textual data analysis workflows:
58. Textual data analysis
Typical text preprocessing workflow example:
This workflow uses simple preprocessing for creating tokens from documents:
1. It applies lowercase.
2. It splits text into words.
3. It removes frequent stopwords.
Results of preprocessing can be observed in a Word Cloud.
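The three preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the Text add-on's implementation: the stopword list is a tiny made-up sample, the tokenizer is a simple regular expression, and the two documents are invented for the example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "and", "of", "in", "was"}

def preprocess(document):
    """Lowercase the text, split it into words, and drop stopwords."""
    lowered = document.lower()                 # 1. apply lowercase
    words = re.findall(r"[a-z]+", lowered)     # 2. split text into words
    return [w for w in words if w not in STOPWORDS]  # 3. remove stopwords

# Hypothetical two-document corpus.
docs = ["The King and the Queen", "A frog was in the well of the king"]
tokens = [preprocess(d) for d in docs]

# Word frequencies are the numbers a word cloud would visualize.
counts = Counter(word for doc in tokens for word in doc)
print(tokens)
print(counts.most_common(3))
```

Without the stopword step, "the" would dominate the counts and hide the words that actually characterise the documents.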
59. Textual data analysis
Typical Sentiment workflow example:
Yellow represents a high, positive score, while blue represents a low, negative score.
60. Textual data analysis
Typical Topic modelling workflow example:
Topic Modelling colors words by their weights: positive weights are colored green and negative weights red.
Uncover latent topics in the data.
61. Textual data analysis
• Tweets are a valuable source of information for social scientists, marketing managers, linguists, economists, and so on.
Twitter Data Analysis workflow example:
62. Text classification
• Predictive models can be used to classify documents by authorship, type, sentiment and so on.
Text classification workflow example:
• Data: Grimm tales data
• Task: classify documents by the topic of the tale.
• Prediction models: Logistic Regression and Decision Tree.
Load the Grimm tales data
63. Text classification
Text classification workflow example:
Given two tales of different classes, Logistic Regression can correctly distinguish between them in over 90% of cases. Better than the Decision Tree!
65. Predictive model quality
Example: Explore the performance of different predictive models on the Iris dataset.
Logistic Regression outperforms the other classifiers.
66. Main Reference
• Orange Visual Programming Documentation (Release 3). Orange (2021).
https://buildmedia.readthedocs.org/media/pdf/orange-visual-programming/latest/orange-visual-programming.pdf
• AJDA. (2017, August 4). Text Analysis: New Features. Orangedatamining.Com.
https://orangedatamining.com/blog/2017/08/04/text-analysis-new-features/
• Zupan, D. (2018, May). Introduction to Data Mining: Working notes for the hands-on course with Orange Data Mining. University of Ljubljana. https://file.biolab.si/notes/2018-05-intro-to-datamining-notes.pdf
• Foong, N. W. (2019, August 7). Data Science Made Easy: Interactive Data Visualization using Orange [Post]. Medium, Towardsdatascience.com. https://towardsdatascience.com/data-science-made-easy-interactive-data-visualization-using-orange-de8d5f6b7f2b
• Analytics Vidhya, A. (2017, September 7). Building Machine Learning Model is fun using Orange. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2017/09/building-machine-learning-model-fun-using-orange/
• Orange Data Mining. Workflows, accessed 10 August 2021.
https://orangedatamining.com/workflows/
• Foong, N. W. (2019b, August 29). Data Science Made Easy: Image Analytics using Orange. Medium, Towardsdatascience.com. https://towardsdatascience.com/data-science-made-easy-image-analytics-using-orange-ad4af375ca7a
This presentation draws mainly on the above resources.
67. Weekly self-review exercises
• Download Orange and understand its workflow:
  o https://orangedatamining.com/getting-started/
  o https://orangedatamining.com/workflows/
• Hands-on practice using Orange for data analysis, visualisation, developing predictive models and textual data analysis: https://orangedatamining.com/blog/