2. Outline
● Deep Learning
● AutoEncoder
● Convolutional Neural Network (CNN)
● Recurrent Neural Network (RNN)
● Long Short Term Memory (LSTM)
● Gated Recurrent Unit (GRU)
● Attention Mechanism
● Few NLP Applications
3. Few key terms to start with
● Neurons
● Layers
○ Input, Output and Hidden
● Activation functions
○ Sigmoid, Tanh, ReLU
● Softmax
● Weight matrices
○ Input → Hidden, Hidden → Hidden, Hidden → Output
● Backpropagation
○ Optimizers
■ Gradient Descent (GD), Stochastic Gradient Descent (SGD), Adam etc.
○ Error (Loss) functions
■ Mean-Squared Error, Cross-Entropy etc.
○ Gradient of error
○ Passes: Forward pass and Backward pass
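A minimal NumPy sketch of the activation functions and softmax listed above (purely illustrative; not taken from the slides):

import numpy as np

def sigmoid(x):
    # Squashes inputs to the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Squashes inputs to the range (-1, 1)
    return np.tanh(x)

def relu(x):
    # Zeroes out negative inputs
    return np.maximum(0.0, x)

def softmax(z):
    # Converts a score vector into a probability distribution
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))          # approx. [0.659 0.242 0.099]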
4. History of Neural Network and Deep learning
● Neural Network and Perceptron learning algorithm: [McCulloch and Pitts
(1943), Rosenblatt (1957)]
● Backpropagation: Rumelhart, Hinton and Williams, 1986
○ Theoretically, a neural network can have any number of hidden layers.
○ But, in practice, networks rarely had more than one hidden layer.
■ Computational issue: Limited computing power
■ Algorithmic issues: Vanishing gradient and exploding gradient.
● Beginning of Deep learning: Late 1990’s and early 2000’s
○ Solutions:
■ Computational issue: Advanced computing hardware such as GPUs and TPUs
■ Algorithmic issues
● Pre-training (e.g., AutoEncoder, RBM)
● Better architectures (e.g., LSTM)
● Better activation functions (e.g., ReLU)
5. Deep Learning vs Machine Learning Paradigm
● The main advantage of deep learning-based approaches is trainable features, i.e., the network extracts relevant features on its own during training.
● Requires minimal human intervention.
6. Why Deep Learning?
● Recall that an artificial neural network tries to mimic the functionality of the brain.
● In the brain, computations happen in layers.
● View of representation
○ As we go up in the network, we get higher-level representations → assists in performing more complex tasks.
7. Why Deep Architectures were hard to train?
● General weight-update rule (see the sketch below)
● For the lower layers in a deep architecture
○ δj will vanish if it is less than 1
○ δj will explode if it is more than 1
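A sketch of the standard backpropagation quantities the slide refers to, in assumed notation (the slide's own equation image is not reproduced in this text). For a hidden unit j with activation f and incoming activation o_i:

\Delta w_{ij} = \eta \, \delta_j \, o_i, \qquad \delta_j = f'(\mathrm{net}_j) \sum_{k} \delta_k \, w_{jk}

For a lower layer, δj is therefore a product of many f'(·) and weight factors accumulated over all the layers above it; repeated factors below 1 drive it towards zero (vanishing gradient), and repeated factors above 1 blow it up (exploding gradient).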
14. Deep Learning Architectures
● Convolutional neural network (CNN)
○ Aims to extract the local spatial features
● Recurrent neural network (RNN)
○ Exploits the sequential information of a sentence (sentence is a sequence of words).
16. Convolutional Neural Networks (CNN)
● A CNN consists of a series (≥ 1) of convolution and pooling layers.
● Convolutional operation extracts the feature representations from the input
data.
○ Shares the convolution filters over different spatial locations, in a quest to extract location-invariant features from the input.
○ Shape and weights of the convolution filter determine the features to be extracted from the
input data.
○ In general, multiple filters of different shapes are used to ensure the diversity in the extracted
features.
● The pooling operation extracts the most relevant features from the convolved features, similar to downsampling in image processing (see the sketch below).
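A minimal PyTorch sketch of one convolution + 1-max pooling stage over a sentence matrix; all names and sizes here are illustrative assumptions, not taken from the slides:

import torch
import torch.nn as nn

# A batch of "sentences": 8 examples, 100-dim word embeddings, 20 words each,
# arranged as (batch, embedding_dim, seq_len) for Conv1d.
x = torch.randn(8, 100, 20)

# Filters spanning 3 consecutive words; the same weights slide over every
# position, which is what gives location-invariant features.
conv = nn.Conv1d(in_channels=100, out_channels=32, kernel_size=3)
feature_maps = torch.relu(conv(x))        # (8, 32, 18)

# 1-max pooling over time keeps only the strongest response of each filter.
pooled, _ = feature_maps.max(dim=2)       # (8, 32)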
19. Recurrent Neural Network (RNN)
● A neural network with feedback connections
● Enables the network to do temporal processing
● Good at learning sequences
● Acts as a memory unit
20. RNN - Example 1
Part-of-speech tagging:
● Given a sentence X, tag each word with its corresponding grammatical class.
23. How to train RNNs?
● Typical FFN
○ Backpropagation algorithm
● RNNs
○ A variant of the backpropagation algorithm, namely Back-Propagation Through Time (BPTT).
24. BackPropagation Through Time (BPTT)
Error for an instance = sum of errors at each time step of the instance
Gradient of the error = sum of the gradients of the per-time-step errors (sketched below)
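In the usual BPTT notation (assuming hidden state h_t, output o_t, target y_t and shared recurrent weights W), this reads:

E = \sum_{t} E_t(o_t, y_t), \qquad
\frac{\partial E}{\partial W} = \sum_{t} \frac{\partial E_t}{\partial W}
= \sum_{t} \sum_{k=0}^{t} \frac{\partial E_t}{\partial o_t}\,
\frac{\partial o_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,
\frac{\partial h_k}{\partial W}

The inner sum appears because W is shared across all time steps, so the error at time t depends on every earlier hidden state.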
36. Different views
[Figure: the same RNN drawn in different views (View 1 - View 4) - as a single recurrent unit with input Xt, output Ot, input weights U and recurrent (feedback) weights W, and as the network unrolled over time steps X0, X1, X2 producing outputs O0, O1, O2, with the initial output O-1 = 0.]
38. Usage
● Depends on the problem that we aim to solve.
● Typically good for sequence processing.
● Useful when some sort of memorization is required.
39. Bit reverse problem
● Problem definition:
○ Problem 1: Reverse a binary digit.
■ 0 → 1 and 1 → 0
○ Problem 2: Reverse a sequence of binary digits.
■ 0 1 0 1 0 0 1 → 1 0 1 0 1 1 0
■ Sequence: Fixed or Variable length
○ Problem 3: Reverse a sequence of bits over time.
■ 0 1 0 1 0 0 1 → 1 0 1 0 1 1 0
○ Problem 4: Reverse a bit if the current i/p and previous o/p are the same (a sketch follows the example below).
Input sequence 1 1 0 0 1 0 0 0 1 1
Output sequence 1 0 1 0 1 0 1 0 1 0
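A plain-Python sketch of the Problem 4 rule, only to make the task definition concrete; the initial "previous output" is assumed to be 0, which reproduces the example above:

def reverse_if_same(bits, prev_out=0):
    # Flip the current input bit when it equals the previous output bit.
    outputs = []
    for b in bits:
        out = 1 - b if b == prev_out else b
        outputs.append(out)
        prev_out = out
    return outputs

print(reverse_if_same([1, 1, 0, 0, 1, 0, 0, 0, 1, 1]))
# -> [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]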
40. Data
Let:
● Problem 1
○ I/p dimension: 1 bit; O/p dimension: 1 bit
● Problem 2
○ Fixed
■ I/p dimension: 10 bits; O/p dimension: 10 bits
○ Variable: pad each sequence up to the max sequence length (10)
■ Padding value: -1
■ I/p dimension: 10 bits; O/p dimension: 10 bits
● Problem 3 & 4
○ Dimension of each element of I/p (X): 1 bit
○ Dimension of each element of O/p (O): 1 bit
○ Sequence length: 10
41. Network Architecture
No. of I/p neurons = I/p dimension; No. of O/p neurons = O/p dimension
● Problem 1
○ I/p neurons = 1
○ O/p neurons = 1
● Problem 2 (Fixed & Variable)
○ I/p neurons = 10
○ O/p neurons = 10
● Problem 3
○ I/p neurons = 1
○ O/p neurons = 1
○ Seq len = 10 (Xt = X0, …, X10; Ot = O0, …, O10)
● Problem 4
○ I/p neurons = 1
○ O/p neurons = 1
○ Seq len = 10
[Figure: recurrent network diagrams for each problem, showing input weights U, recurrent weights W, inputs X and outputs O (initial output O-1 = 0) unrolled over the sequence. A code sketch of the Problem 3 setup follows.]
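A hedged PyTorch sketch matching the Problem 3 setup above (1 input neuron, 1 output neuron, sequence length 10); the hidden size and the training details are illustrative assumptions:

import torch
import torch.nn as nn

class BitReverser(nn.Module):
    def __init__(self, hidden_size=8):           # hidden size is an assumption
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)      # 1 output neuron per time step

    def forward(self, x):                         # x: (batch, 10, 1)
        h, _ = self.rnn(x)                        # h: (batch, 10, hidden_size)
        return torch.sigmoid(self.out(h))         # (batch, 10, 1): one bit per step

x = torch.randint(0, 2, (4, 10, 1)).float()       # 4 sequences of 10 bits
y = 1.0 - x                                       # Problem 3 target: reversed bits
model = BitReverser()
loss = nn.BCELoss()(model(x), y)                  # train with e.g. Adam on this loss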
42. Different configurations of RNNs
[Figure: different RNN input/output configurations illustrated with example applications - Image Captioning, Sentiment Analysis, Machine Translation, Language Modelling.]
46. Vanishing/Exploding gradients
● Cue word for the prediction
○ Example 1: sky → clouds [3 units apart]
○ Example 2: Hindi → India [9 units apart]
● As the sequence length increases, it becomes hard for RNNs to learn "long-term dependencies."
○ Vanishing gradients: If weights are small, the gradient shrinks exponentially and the network stops learning.
○ Exploding gradients: If weights are large, the gradient grows exponentially and the weights fluctuate and become unstable.
53. Gated Recurrent Unit (GRU) [Cho et al. (2014)]
● A variant of the simple RNN (Vanilla RNN)
● Similar to LSTM
○ Whatever an LSTM can do, a GRU can also do.
● Differences
○ The cell state and hidden state are merged together
○ Two gates
■ Reset gate: similar to the forget gate
■ Update gate: similar to the input gate
○ No output gate
○ The cell/hidden state is completely exposed to subsequent units.
● A GRU has fewer parameters to learn and is relatively efficient in terms of computation (the standard equations are sketched below).
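For reference, the standard GRU equations from Cho et al. (2014), written as a sketch (bias terms omitted; some formulations swap the roles of z_t and 1 − z_t):

z_t = \sigma(W_z x_t + U_z h_{t-1}) \quad \text{(update gate)}
r_t = \sigma(W_r x_t + U_r h_{t-1}) \quad \text{(reset gate)}
\tilde{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))
h_t = z_t \odot \tilde{h}_t + (1 - z_t) \odot h_{t-1}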
56. NLP hierarchy
● Like deep learning, NLP happens in layers!
● Each task receives features from the previous (lower-level) task and processes them to produce its own output, and so on.
59. RNN/LSTM/GRU for Sequence Labelling
Part-of-speech tagging:
● Given a sentence X, tag each word with its corresponding grammatical class (a sketch follows).
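A minimal PyTorch sketch of an LSTM-based tagger for this setting; the vocabulary size, embedding dimension, hidden size and tag-set size are placeholder assumptions:

import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=128, n_tags=45):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_tags)       # one tag distribution per word

    def forward(self, word_ids):                   # word_ids: (batch, seq_len)
        h, _ = self.lstm(self.emb(word_ids))       # (batch, seq_len, hidden)
        return self.cls(h)                         # logits: (batch, seq_len, n_tags)

# Each position is classified independently; training uses a cross-entropy
# loss between the per-word logits and the gold PoS tags.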
60. CNN for Sequence Labelling
● Sentence matrix
● Pad the sentence so that the output sequence length matches the input length (sketched below)
○ Pad length = filter_size - 1
○ Evenly distribute the padding at the start and end of the sequence.
● Apply convolution filters
● Classification
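A sketch of this padding scheme with assumed PyTorch names and sizes:

import torch
import torch.nn.functional as F

emb = torch.randn(1, 100, 7)             # (batch, emb_dim, seq_len): a 7-word sentence
filter_size = 3
pad = filter_size - 1                    # total padding so output length == input length
left, right = pad // 2, pad - pad // 2   # distribute evenly at the start and end
padded = F.pad(emb, (left, right))       # (1, 100, 7 + pad)

conv = torch.nn.Conv1d(100, 64, kernel_size=filter_size)
features = torch.relu(conv(padded))      # (1, 64, 7): one feature vector per word
# A classifier over each position's feature vector then predicts that word's label.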
62. RNN/LSTM/GRU for Sentence Classification
Sentiment Classification:
● Given a sentence X, identify the expressed sentiment.
63. CNN for Sentence Classification
Zhang, Y. and Wallace, B.; A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification; In Proceedings of the 8th International Joint Conference on Natural Language Processing (IJCNLP 2017); pages 253-263; Taipei, Taiwan; 2017.
1. Sentence matrix
a. embeddings of words
2. Convolution filters
a. Total 6 filters; Two each of size
2, 3 & 4.
b. 1 feature map for each filter
3. Pooling
a. 1-max pooling
4. Concatenate the max-pooled vector
5. Classification
a. Softmax
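A hedged PyTorch sketch of the configuration in steps 1-5 above (two filters each of widths 2, 3 and 4, 1-max pooling, concatenation, softmax); the embedding dimension and class count are assumptions:

import torch
import torch.nn as nn

class SentenceCNN(nn.Module):
    def __init__(self, emb_dim=300, n_classes=2):
        super().__init__()
        # Two feature maps for each filter width 2, 3 and 4 -> 6 in total.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, 2, kernel_size=k) for k in (2, 3, 4)])
        self.cls = nn.Linear(6, n_classes)

    def forward(self, x):                         # x: (batch, emb_dim, seq_len)
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        z = torch.cat(pooled, dim=1)              # (batch, 6) after 1-max pooling
        return torch.softmax(self.cls(z), dim=1)  # class probabilities

x = torch.randn(4, 300, 9)                        # 4 sentences, 9 words each
print(SentenceCNN()(x).shape)                     # torch.Size([4, 2])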
66. Why sequence transformation is required?
● For many applications, the lengths of the I/p and O/p are not necessarily the same, e.g., Machine Translation, Summarization, Question Answering, etc.
● For many applications, the length of the O/p is not known in advance.
● Non-monotone mapping: reordering of words.
● Applications for which sequence transformation is not required
○ PoS tagging,
○ Named Entity Recognition
○ ....
67. Encode-Decode paradigm
[Figure: encoder-decoder. The encoder reads the source sentence "Ram eats mango <eos>"; the decoder generates the Hindi translation "राम आम खाता है <eos>".]
● English-Hindi Machine Translation
○ Source sentence: 3 words
○ Target sentence: 4 words
○ Second word of the source sentence maps to 3rd & 4th words of the target sentence.
○ Third word of the source sentence maps to 2nd word of the target sentence
68. Problems with Encode-Decode paradigm
● Encoding transforms the entire sentence into a single vector.
● Decoding process uses this sentence representation for predicting the output.
○ Quality of prediction depends upon the quality of sentence embeddings.
● After a few time steps, the decoding process may not properly use the sentence representation due to the long-term dependency problem.
69. Solutions
● To improve the quality of predictions we can
○ Improve the quality of sentence embeddings ‘OR’
○ Present the source sentence representation for prediction at each time step. ‘OR’
○ Present the RELEVANT source sentence representation for prediction at each time step.
■ Encode - Attend - Decode (Attention mechanism)
70. Attention Mechanism
● Represent the source sentence by the set of output vectors from the
encoder.
● Each output vector (OV) at time t is a contextual representation of the input
at time t.
Ram eats mango <eos>
OV1 OV2 OV3 OV4
71. Attention Mechanism
● Each of these output vectors (OVs) may not be equally relevant during
decoding process at time t.
● Weighted average of the output vectors can resolve the relevancy.
○ Assign more weights to an output vector that needs more attention during decoding at time t.
● The weighted-average context vector (CV) will be the input to the decoder along with the sentence representation (a small sketch follows).
○ CVi = ∑j aij · OVj, where aij is the attention weight of the j-th OV
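A minimal sketch of this weighted average, with assumed shapes:

import torch

OV = torch.randn(4, 128)           # 4 encoder output vectors (one per source token)
scores = torch.randn(4)            # relevance scores for the current decoding step i
a = torch.softmax(scores, dim=0)   # attention weights a_ij, summing to 1
CV = (a.unsqueeze(1) * OV).sum(0)  # CV_i = sum_j a_ij * OV_j  -> shape (128,)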
82. ● Word2vec [Mikolov et al., 2013]
○ Contextual model
○ Two variants
■ Skip-gram
■ Continuous Bag-of-Words (CBOW)
● GloVe [Pennington et al., 2014]
○ Co-occurrence matrix
○ Matrix factorization
● FastText [Bojanowski et al., 2016]
○ Similar to word2vec
○ Works on the sub-word level
● Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al., 2018]
○ Based on the Transformer model
● Embeddings from Language Models (ELMo) [Peters et al., 2018]
○ Contextual
■ The representation for each word depends on the entire context in which it is used.
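An illustrative gensim sketch of the first two embedding styles above (parameter names assume gensim >= 4.0; the corpus is a toy placeholder):

from gensim.models import Word2Vec, FastText

corpus = [["ram", "eats", "mango"], ["deep", "learning", "for", "nlp"]]

# sg=1 selects the Skip-gram variant; sg=0 selects Continuous Bag-of-Words.
w2v = Word2Vec(sentences=corpus, vector_size=50, window=2, min_count=1, sg=1)

# FastText builds vectors from sub-word character n-grams.
ft = FastText(sentences=corpus, vector_size=50, window=2, min_count=1)

print(w2v.wv["mango"].shape)    # (50,)
print(ft.wv["mangoes"].shape)   # sub-words allow embedding unseen word forms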
83. Few good reads..
● Denny Britz; Recurrent Neural Networks Tutorial, Part 1-4
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
● Andrej Karpathy; The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
● Chris Olah; Understanding LSTM Networks
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
85. Workshop on
AI for Computational Social Systems (ACSS)
Sunday, 9th Feb 2020
(http://lcs2.iiitd.edu.in/acss2020/)
Organizer
Laboratory for Computational Social Systems (LCS2) @ IIIT Delhi.
Registration Fee
Rs. 200/-