Convolutional and Recurrent
Neural Networks
Outline
● Deep Learning
● AutoEncoder
● Convolutional Neural Network (CNN)
● Recurrent Neural Network (RNN)
● Long Short Term Memory (LSTM)
● Gated Recurrent Unit (GRU)
● Attention Mechanism
● Few NLP Applications
2
Few key terms to start with
● Neurons
● Layers
○ Input, Output and Hidden
● Activation functions
○ Sigmoid, Tanh, Relu
● Softmax
● Weight matrices
○ Input → Hidden, Hidden → Hidden, Hidden → Output
● Backpropagation
○ Optimizers
■ Gradient Descent (GD), Stochastic Gradient Descent (SGD), Adam etc.
○ Error (Loss) functions
■ Mean-Squared Error, Cross-Entropy etc.
○ Gradient of error
○ Passes: Forward pass and Backward pass
3
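All of these terms show up together in even the smallest trainable network. The following is a minimal PyTorch sketch (an illustration added here, not part of the original slides) wiring them up end to end:

```python
# Minimal sketch (assumed PyTorch setup, not from the slides): layers, an
# activation function, a loss, an optimizer, and the forward/backward passes.
import torch
import torch.nn as nn

model = nn.Sequential(            # Input -> Hidden -> Output weight matrices
    nn.Linear(4, 8),              # input layer -> hidden layer
    nn.ReLU(),                    # activation function
    nn.Linear(8, 3),              # hidden layer -> output layer (logits)
)
loss_fn = nn.CrossEntropyLoss()   # cross-entropy error (applies softmax internally)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # or Adam, etc.

x = torch.randn(16, 4)            # a batch of 16 input vectors
y = torch.randint(0, 3, (16,))    # gold class labels

logits = model(x)                 # forward pass
loss = loss_fn(logits, y)         # error (loss)
loss.backward()                   # backward pass: gradient of the error
optimizer.step()                  # weight update (one gradient-descent step)
optimizer.zero_grad()
```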
History of Neural Network and Deep learning
● Neural Network and Perceptron learning algorithm: [McCulloch and Pitts
(1943), Rosenblatt (1957)]
● Backpropagation: Rumelhart, Hinton and Williams, 1986
○ Theoretically, a neural network can have any number of hidden layers.
○ But, in practice, networks rarely had more than one hidden layer.
■ Computational issue: Limited computing power
■ Algorithmic issues: Vanishing and exploding gradients.
● Beginning of Deep learning: Late 1990’s and early 2000’s
○ Solutions:
■ Computational issue: Advanced computing hardware such as GPUs and TPUs
■ Algorithmic issues
● Pre-training (e.g., AutoEncoder, RBM)
● Better architectures (e.g., LSTM)
● Better activation functions (e.g., Relu) 4
Deep Learning vs Machine Learning Paradigm
● The main advantage of deep learning based approaches is trainable features, i.e., the
network extracts relevant features on its own during training.
● Requires minimal human intervention.
5
Why Deep Learning?
● Recall, artificial neural network tries to
mimic the functionality of a brain.
● In brain, computations happen in layers.
● View of representation
○ As we go up in the network, we get higher-level
representations, which assist in performing more
complex tasks.
6
Why Deep Architectures were hard to train?
● General weight-update rule
● For lower-layers in deep architecture
○ δj will vanish if it is less than 1
○ δj will explode if it is more than 1
7
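The update rule itself appears as an equation image in the original slide; a standard form of the backpropagation weight-update rule, given here as a sketch under the usual notation (η learning rate, o_i output of unit i, net_j net input of unit j), is:

```latex
% Standard backprop weight-update rule (a sketch; the slide's own equation is an image)
\Delta w_{ij} = -\,\eta\,\frac{\partial E}{\partial w_{ij}} = \eta\,\delta_j\,o_i,
\qquad
\delta_j =
\begin{cases}
f'(net_j)\,(t_j - o_j) & \text{output layer}\\[4pt]
f'(net_j)\,\sum_{k}\delta_k\,w_{jk} & \text{hidden layers}
\end{cases}
```

For lower layers, δj is a product of many such factors, which is why it shrinks or blows up as the depth grows.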
Layer-wise pre-training
8
AutoEncoder
9
AutoEncoder: Layer 1
10
z = f(x), where z ≈ x
AutoEncoder: Layer 2
11
Weights frozen
AutoEncoder: Layer 3
12
Weights frozen
AutoEncoder: Pre-trained network
13
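A minimal sketch (an assumption about the setup, not the slides' exact network) of this greedy layer-wise pre-training in PyTorch: train layer 1 to reconstruct x, freeze it, train layer 2 on layer 1's codes, then stack the encoders into the pre-trained network.

```python
# Greedy layer-wise autoencoder pre-training (sketch, PyTorch assumed).
import torch
import torch.nn as nn

def pretrain_layer(encoder, decoder, data, epochs=10, lr=1e-2):
    """Train encoder/decoder to reconstruct `data`; return the trained encoder."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    opt = torch.optim.SGD(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        z = encoder(data)              # z = f(x)
        x_hat = decoder(z)             # reconstruction, x_hat ≈ x
        loss = loss_fn(x_hat, data)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder

x = torch.randn(256, 20)               # toy unlabeled data

enc1 = nn.Sequential(nn.Linear(20, 10), nn.Sigmoid())
pretrain_layer(enc1, nn.Linear(10, 20), x)
for p in enc1.parameters():            # weights frozen
    p.requires_grad = False

enc2 = nn.Sequential(nn.Linear(10, 5), nn.Sigmoid())
pretrain_layer(enc2, nn.Linear(5, 10), enc1(x).detach())

# Pre-trained network: stacked encoders plus a task-specific output layer,
# fine-tuned afterwards with ordinary backpropagation on labeled data.
model = nn.Sequential(enc1, enc2, nn.Linear(5, 3))
```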
Deep Learning Architectures
● Convolutional neural network (CNN)
○ Aims to extract the local spatial features
● Recurrent neural network (RNN)
○ Exploits the sequential information of a sentence (sentence is a sequence of words).
14
Convolutional Neural Network
LeCun and Bengio (1995)
15
Convolutional Neural Networks (CNN)
● A CNN consists of a series (≥ 1) of convolution and pooling layers.
● Convolutional operation extracts the feature representations from the input
data.
○ Shares the convolution filters across different spatial locations in order to extract
location-invariant features from the input.
○ Shape and weights of the convolution filter determine the features to be extracted from the
input data.
○ In general, multiple filters of different shapes are used to ensure the diversity in the extracted
features.
● Pooling operation extracts the most relevant features from the convoluted
features. Similar to downsampling in image-processing.
16
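A small sketch (assumed PyTorch code, not from the slides) of this convolution + pooling pipeline applied to a sequence of word embeddings:

```python
# Convolution filters of different widths plus 1-max pooling over a sentence matrix.
import torch
import torch.nn as nn

batch, seq_len, emb_dim = 2, 7, 50
x = torch.randn(batch, seq_len, emb_dim)        # sentence as a matrix of embeddings

# Multiple filters of different widths, shared across all positions in the sequence.
convs = nn.ModuleList([
    nn.Conv1d(in_channels=emb_dim, out_channels=4, kernel_size=k)
    for k in (2, 3, 4)
])

x = x.transpose(1, 2)                            # Conv1d expects (batch, channels, length)
feature_maps = [torch.relu(conv(x)) for conv in convs]

# Pooling keeps the most relevant feature from each map (similar to downsampling).
pooled = [fm.max(dim=2).values for fm in feature_maps]   # each: (batch, 4)
features = torch.cat(pooled, dim=1)                      # (batch, 12)
```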
CNN
17
Recurrent Neural Network (RNN)
18
Recurrent Neural Network (RNN)
● A neural network with feedback connections
● Enable networks to do temporal processing
● Good at learning sequences
● Acts as a memory unit
19
RNN - Example 1
Part-of-speech tagging:
● Given a sentence X, tag each word with its corresponding grammatical class.
20
RNN - Example 2
21
Training of RNNs
22
How to train RNNs?
● Typical FFN
○ Backpropagation algorithm
● RNNs
○ A variant of the backpropagation algorithm, namely Back-Propagation Through Time (BPTT).
23
BackPropagation Through Time (BPTT)
Error for an instance = Sum of errors at each time step of the instance
Gradient of error
24
BackPropagation Through Time (BPTT)
For V
For W (Similarly for U)
25
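The gradient expressions for V, W and U are images in the original slides; a standard sketch of the BPTT gradients, assuming the usual notation (U: input weights, V: output weights, W: recurrent weights, h_t the hidden state and ŷ_t the output at step t), is:

```latex
% BPTT gradient sketch (assumed standard notation; the slide's own equations are images).
% Total error is the sum of per-step errors:
E = \sum_{t} E_t
% V touches only the output at step t:
\frac{\partial E}{\partial V} = \sum_t \frac{\partial E_t}{\partial \hat{y}_t}\,
                                        \frac{\partial \hat{y}_t}{\partial V}
% W (and similarly U) also influences all earlier hidden states, so the chain rule sums over k <= t:
\frac{\partial E}{\partial W} = \sum_t \sum_{k=0}^{t}
  \frac{\partial E_t}{\partial \hat{y}_t}\,
  \frac{\partial \hat{y}_t}{\partial h_t}
  \left(\prod_{j=k+1}^{t}\frac{\partial h_j}{\partial h_{j-1}}\right)
  \frac{\partial h_k}{\partial W}
```

The product of Jacobians ∂h_j/∂h_{j-1} is exactly the term that shrinks or blows up over long spans, which leads to the vanishing/exploding-gradient issue discussed later.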
Visualization of RNN through
Feed-Forward Neural Network
26
Problem, Data and Network Architecture
● Problem:
○ I/p sequence (X): X0, X1, …, XT
○ O/p sequence (O): O0, O1, …, OT
● Representation of data:
○ I/p dimension: 4
■ X0 → 0 1 1 0
○ O/p dimension: 3
■ O0 → 0 0 1
● Network Architecture
○ Number of neurons at I/p layer: 4
○ Number of neurons at O/p layer: 3
○ Do we need hidden layers?
■ If yes, number of neurons at each hidden layer
27
Network @ t = 0
(figure: input X0 mapped to output O0 through weights U)
28
Network @ t = 1
(figure: X0 → O0 and X1 → O1, each through the same weights U)
29
Network @ t = 1
(figure: as above, with recurrent weights W connecting O0 to the computation of O1)
30
Network @ t = 1
O1 = f(W·O0 + U·X1) = f([W, U] · [O0, X1])
31
Network @ t = 2
O2 = f(W·O1 + U·X2) = f([W, U] · [O1, X2])
32
Complete Network
(figure: the full unrolled network over t = 0, 1, 2 with shared weights U and W, and O-1 = 0)
33
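A minimal numpy sketch (shapes assumed from the earlier slide, not the slides' exact network) of the recurrence Ot = f(W·Ot-1 + U·Xt) unrolled over time:

```python
# Unrolled RNN forward pass, O_t = f(W.O_{t-1} + U.X_t), as a numpy sketch.
# Shapes follow the earlier slide: 4 input neurons, 3 output neurons, no hidden layer.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

T, in_dim, out_dim = 3, 4, 3
X = np.random.randint(0, 2, size=(T, in_dim)).astype(float)   # X_0 ... X_{T-1}

U = np.random.randn(out_dim, in_dim) * 0.1    # input -> output weights
W = np.random.randn(out_dim, out_dim) * 0.1   # recurrent (output -> output) weights

O_prev = np.zeros(out_dim)                    # O_{-1} = 0
outputs = []
for t in range(T):
    O_t = sigmoid(W @ O_prev + U @ X[t])      # the same U and W are reused at every step
    outputs.append(O_t)
    O_prev = O_t
```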
Different views of the network
(figure: View 1 — the network built up step by step, with inputs X0, X1, X2, outputs O0, O1, O2, shared weights U and W, and O-1 = 0)
34
Different views
(figure: View 1 as above; View 2 — the same computation drawn as an unrolled chain O0 → O1 → O2 with shared weights U and W, and O-1 = 0)
35
Different views
(figure: Views 1 and 2 as above; View 3 — the unrolled chain with inputs X0, X1, X2 and outputs O0, O1, O2; View 4 — the compact recurrent form, a single unit taking Xt and Ot-1 and producing Ot)
36
When to use RNNs
37
Usage
● Depends on the problems that we aim to solve.
● Typically good for sequence processing.
● Useful when some sort of memorization is required.
38
Bit reverse problem
● Problem definition:
○ Problem 1: Reverse a binary digit.
■ 0 → 1 and 1 → 0
○ Problem 2: Reverse a sequence of binary digits.
■ 0 1 0 1 0 0 1 → 1 0 1 0 1 1 0
■ Sequence: Fixed or Variable length
○ Problem 3: Reverse a sequence of bits over time.
■ 0 1 0 1 0 0 1 → 1 0 1 0 1 1 0
○ Problem 4: Reverse a bit if the current i/p and previous o/p are same.
Input sequence 1 1 0 0 1 0 0 0 1 1
Output sequence 1 0 1 0 1 0 1 0 1 0
39
Data
Let
● Problem 1
○ I/p dimension: 1 bit O/p dimension: 1 bit
● Problem 2
○ Fixed
■ I/p dimension: 10 bit O/p dimension: 10 bit
○ Variable: Pad each sequence up to the max sequence length: 10
■ Padding value: -1
■ I/p dimension: 10 bit O/p dimension: 10 bit
● Problem 3 & 4
○ Dimension of each element of I/p (X) : 1 bit
○ Dimension of each element of O/p (O) : 1 bit
○ Sequence length : 10
40
Network Architecture
No. of I/p neurons = I/p dimension
No. of O/p neurons = O/p dimension
Problem 1:
● I/p neurons = 1
● O/p neurons = 1
Problem 2 (Fixed & Variable):
● I/p neurons = 10
● O/p neurons = 10
Problem 3:
● I/p neurons = 1
● O/p neurons = 1
● Seq len = 10
● Xt = X10, …, X1, X0
● Ot = O10, …, O1, O0
Problem 4:
● I/p neurons = 1
● O/p neurons = 1
● Seq len = 10
(the accompanying figures sketch the corresponding networks; Problems 3 and 4 use the recurrent form unrolled over the 10 time steps)
41
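A sketch (assumed PyTorch code, not the slides' exact setup) of Problem 3: learn to reverse each bit of a length-10 sequence, one bit per time step, with a small RNN. Problem 4 needs only a different target function, since the recurrent connection lets the network remember the previous output.

```python
# Train an RNN to reverse each bit of a 10-step sequence (Problem 3, sketch).
import torch
import torch.nn as nn

class BitReverser(nn.Module):
    def __init__(self, hidden=8):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (batch, 10, 1)
        h, _ = self.rnn(x)                    # hidden state at every time step
        return self.out(h)                    # one logit per time step

model = BitReverser()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(500):
    x = torch.randint(0, 2, (32, 10, 1)).float()   # random bit sequences
    y = 1.0 - x                                    # target: each bit reversed
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```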
Different configurations of RNNs
(figure: example configurations — Image Captioning, Sentiment Analysis, Machine Translation, Language modelling)
42
Problems with RNNs
43
Language modelling: Example - 1
44
Language modelling: Example - 2
45
● Cue word for the prediction
○ Example 1: sky → clouds [3 units apart]
○ Example 2: hindi → India [9 units apart]
● As the sequence length increases, it becomes hard for RNNs to learn
“long-term dependencies.”
○ Vanishing gradients: If weights are small, gradient shrinks exponentially. Network stops
learning.
○ Exploding gradients: If weights are large, gradient grows exponentially. Weights fluctuate
and become unstable.
Vanishing/Exploding gradients
46
RNN extensions
● Bi-directional RNN
● Deep (Bi-directional) RNN
47
Long Short Term Memory (LSTM)
Hochreiter & Schmidhuber (1997)
48
LSTM
● A variant of simple RNN (Vanilla RNN)
● Capable of learning long dependencies.
● Regulates information flow from recurrent units.
49
Vanilla RNN vs LSTM
50
An LSTM cell
51
● Cell state ct (blue arrow), hidden state ht (green arrow) and input xt (red arrow)
● Three gates
○ Forget (Red-dotted box)
○ Input (Green-dotted box)
○ Output (Blue-dotted box)
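The cell diagram itself is an image in the original slide; the standard LSTM cell equations, included here as a sketch with assumed notation (σ sigmoid, ⊙ elementwise product), are:

```latex
% Standard LSTM cell equations (sketch; notation assumed, not taken from the slide's figure)
f_t = \sigma(W_f\,[h_{t-1}, x_t] + b_f)            % forget gate
i_t = \sigma(W_i\,[h_{t-1}, x_t] + b_i)            % input gate
\tilde{c}_t = \tanh(W_c\,[h_{t-1}, x_t] + b_c)     % candidate cell state
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    % cell state update
o_t = \sigma(W_o\,[h_{t-1}, x_t] + b_o)            % output gate
h_t = o_t \odot \tanh(c_t)                         % hidden state
```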
Gated Recurrent Units (GRU)
Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, Yoshua Bengio (2014)
52
Gated Recurrent Unit (GRU) [Cho et al. (2014)]
● A variant of simple RNN (Vanilla RNN)
● Similar to LSTM
○ Whatever LSTM can do GRU can also do.
● Differences
○ Cell state and hidden state are merged together
○ Two gates
■ Reset gate - similar to forget
■ Update gate - similar to input gate
○ No output gate
○ Cell/Hidden state is completely exposed to subsequent units.
● A GRU has fewer parameters to learn and is relatively more efficient
computationally.
53
A GRU cell
54
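As with the LSTM, the cell diagram is an image; the standard GRU equations, given here as a sketch with assumed notation, are:

```latex
% Standard GRU cell equations (sketch; notation assumed)
z_t = \sigma(W_z\,[h_{t-1}, x_t])                       % update gate (plays the role of the input gate)
r_t = \sigma(W_r\,[h_{t-1}, x_t])                       % reset gate (plays the role of the forget gate)
\tilde{h}_t = \tanh(W_h\,[r_t \odot h_{t-1}, x_t])      % candidate state
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t   % single merged cell/hidden state
```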
Application of DL methods for NLP tasks
55
NLP hierarchy
● Like deep learning, NLP happens in layers!
● Each task receives features from the previous (lower-level) task, processes
them to produce its own output, and so on.
56
NLP problems
Problems → Paradigm
● POS Tagging → Sequence Labelling
● Named Entity Recognition → Sequence Labelling
● Sentiment Analysis → Classification
● Machine Translation → Sequence Transformation
● Question Answering → Sequence Transformation
● Summarization → Sequence Transformation
57
Sequence Labelling
58
RNN/LSTM/GRU for Sequence Labelling
Part-of-speech tagging:
● Given a sentence X, tag each word with its corresponding grammatical class.
59
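A sketch (assumed PyTorch code, not the slides' exact model) of an LSTM sequence labeller for POS tagging, where one tag is predicted for every word in the sentence:

```python
# Embedding -> (bidirectional) LSTM -> per-word tag classifier (sketch).
import torch
import torch.nn as nn

class LSTMTagger(nn.Module):
    def __init__(self, vocab_size, tagset_size, emb_dim=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.tag = nn.Linear(2 * hidden, tagset_size)

    def forward(self, word_ids):              # (batch, seq_len)
        h, _ = self.lstm(self.embed(word_ids))
        return self.tag(h)                    # (batch, seq_len, tagset_size) logits

model = LSTMTagger(vocab_size=10000, tagset_size=17)
sentence = torch.randint(0, 10000, (1, 6))    # toy word-id sequence
tags = model(sentence).argmax(dim=-1)         # one predicted tag per word
```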
CNN for Sequence Labelling
● Sentence matrix
● Pad sentence to ensure the sequence length
○ Pad length = filter_size - 1
○ Evenly distribute padding at the start and end of the sequence.
● Apply Convolution filters
● Classification
60
Classification
61
RNN/LSTM/GRU for Sentence Classification
Sentiment Classification:
● Given a sentence X, identify the expressed sentiment.
62
CNN for Sentence Classification
1. Sentence matrix
a. embeddings of words
2. Convolution filters
a. Total 6 filters; two each of sizes 2, 3 & 4
b. 1 feature map for each filter
3. Pooling
a. 1-max pooling
4. Concatenate the max-pooled vectors
5. Classification
a. Softmax
Zhang, Y. and Wallace, B.; A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification; In Proceedings of the 8th International Joint
Conference on Natural Language Processing (IJCNLP 2017); pages 253-263; Taipei, Taiwan; 2017.
63
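A sketch (assumed PyTorch code) of this Zhang & Wallace style classifier following the enumerated steps: two filters each of widths 2, 3 and 4, 1-max pooling, concatenation, then a softmax classifier.

```python
# CNN sentence classifier sketch: sentence matrix -> conv filters -> 1-max pooling
# -> concatenation -> softmax.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSentenceClassifier(nn.Module):
    def __init__(self, vocab_size, n_classes, emb_dim=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)             # 1. sentence matrix
        self.convs = nn.ModuleList([                               # 2. convolution filters
            nn.Conv1d(emb_dim, 2, kernel_size=k) for k in (2, 3, 4)
        ])
        self.fc = nn.Linear(6, n_classes)                          # 5. classification

    def forward(self, word_ids):                                   # (batch, seq_len)
        x = self.embed(word_ids).transpose(1, 2)                   # (batch, emb_dim, seq_len)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]  # 3. 1-max pooling
        features = torch.cat(pooled, dim=1)                        # 4. concatenate -> (batch, 6)
        return F.log_softmax(self.fc(features), dim=1)

model = CNNSentenceClassifier(vocab_size=10000, n_classes=2)
probs = model(torch.randint(0, 10000, (4, 12))).exp()              # sentiment probabilities
```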
Sequence to sequence transformation
with
Attention Mechanism
64
Sequence labeling v/s Sequence transformation
(figure: sequence labelling maps "I love mangoes" directly to "PRP VBZ NNP", word by word; sequence transformation first encodes "I love mangoes" into a sentence embedding, from which a decoder generates "PRP VBZ NNP")
65
Why sequence transformation is required?
● For many applications, the lengths of the I/p and O/p are not necessarily the same, e.g.,
Machine Translation, Summarization, Question Answering etc.
● For many applications, the length of the O/p is not known in advance.
● Non-monotone mapping: Reordering of words.
● Applications for which sequence transformation is not required
○ PoS tagging,
○ Named Entity Recognition
○ ....
66
Encode-Decode paradigm
(figure: the Encoder reads "Ram eats mango <eos>"; the Decoder generates "राम आम खाता है <eos>")
● English-Hindi Machine Translation
○ Source sentence: 3 words
○ Target sentence: 4 words
○ Second word of the source sentence maps to 3rd & 4th words of the target sentence.
○ Third word of the source sentence maps to 2nd word of the target sentence
67
Problems with Encode-Decode paradigm
● Encoding transforms the entire sentence into a single vector.
● Decoding process uses this sentence representation for predicting the output.
○ Quality of prediction depends upon the quality of sentence embeddings.
● After a few time steps, the decoding process may not properly use the sentence
representation due to the long-term dependency problem.
68
Solutions
● To improve the quality of predictions we can
○ Improve the quality of sentence embeddings ‘OR’
○ Present the source sentence representation for prediction at each time step. ‘OR’
○ Present the RELEVANT source sentence representation for prediction at each time step.
■ Encode - Attend - Decode (Attention mechanism)
69
Attention Mechanism
● Represent the source sentence by the set of output vectors from the
encoder.
● Each output vector (OV) at time t is a contextual representation of the input
at time t.
Ram eats mango <eos>
OV1 OV2 OV3 OV4
70
Attention Mechanism
● Each of these output vectors (OVs) may not be equally relevant during
decoding process at time t.
● Weighted average of the output vectors can resolve the relevancy.
○ Assign more weights to an output vector that needs more attention during decoding at time t.
● The weighted average context vector (CV) will be the input to decoder along
with the sentence representation.
○ CVi = ∑j aij · OVj , where aij is the attention weight of the jth OV
71
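A numpy sketch (an assumption about the scoring function, not the slides' exact formulation) of this weighted average: scores between the decoder state and each encoder output vector are softmax-normalised into weights, and the context vector is the weighted average of the OVs.

```python
# Weighted-average attention over the encoder outputs (sketch).
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

encoder_outputs = np.random.randn(4, 8)   # OV1..OV4 for "Ram eats mango <eos>"
decoder_state = np.random.randn(8)        # decoder hidden state at time t

scores = encoder_outputs @ decoder_state  # one relevance score per OV (dot-product scoring assumed)
attn_weights = softmax(scores)            # a_t1 .. a_t4, sum to 1
context = attn_weights @ encoder_outputs  # CV_t = sum_j a_tj * OV_j

# `context` is fed to the decoder, together with its previous state/output,
# when predicting the output word at time t.
```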
Attention Mechanism
(figure: the encoder processes "Ram eats mango <eos>"; attention weights at1 … at4 combine the encoder outputs into a context vector CV for the decoder)
Decoder takes two inputs:
● Sentence vector
● Attention vector
72
Attention Mechanism (t = 1 … 5, slides 73-77)
(figures: at each decoding step t, attention weights at1 … at4 over "Ram eats mango <eos>" produce a context vector CV, and the decoder emits the next target word in turn — राम, आम, खाता, है, <eos>)
Attention Mechanism
1. Bi-RNN Encoder
2. Attention
3. RNN Decoder
4. Output Embeddings
5. Output probabilities
78
[García-Martínez et al., 2016]
Attention - Types
Given an input sequence (x1, x2, …, xN) and an output sequence (y1, y2, …, yM)
● Encoder-Decoder Attention
○ yj | x1, x2, …, xN
● Decoder Attention
○ yj | y1, y2, …, yj-1
● Encoder Attention (Self)
○ xi | x1, x2, …, xN
79
Word Representation
80
● Word2vec [Mikolov et al., 2013]
○ Contextual model
○ Two variants
■ Skip-gram
■ Continuous Bag-of-Words
● GloVe [Pennington et al., 2014]
○ Co-occurrence matrix
○ Matrix Factorization
● FastText [Bojanowski et al., 2016]
○ Similar to word2vec
○ Works on sub-word level
● Bidirectional Encoder Representations from Transformers (BERT) [Devlin et al., 2018]
○ Based on Transformer model
● Embeddings from Language Models (ELMo) [Peters et al., 2018]
○ Contextual
■ The representation for each word depends on the entire context in which it is used.
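A small sketch (an assumption, not from the slides) of training skip-gram word2vec embeddings with gensim (the gensim ≥ 4.0 API is assumed) on a toy corpus:

```python
# Train skip-gram word2vec embeddings on a tiny toy corpus (sketch).
from gensim.models import Word2Vec

corpus = [
    ["ram", "eats", "mango"],
    ["i", "love", "mangoes"],
    ["he", "speaks", "hindi"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # embedding dimension
    window=2,         # context window size
    sg=1,             # 1 = skip-gram, 0 = continuous bag-of-words
    min_count=1,
)

vec = model.wv["mango"]                        # the learned embedding for a word
print(model.wv.most_similar("mango", topn=3))  # nearest neighbours in embedding space
```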
Few good reads..
● Denny Britz; Recurrent Neural Networks Tutorial, Part 1-4
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
● Andrej Karpathy; The Unreasonable Effectiveness of Recurrent Neural Networks
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
● Chris Olah; Understanding LSTM Networks
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
83
Question!
84
Workshop on
AI for Computational Social Systems (ACSS)
Sunday, 9th Feb 2020
(http://lcs2.iiitd.edu.in/acss2020/)
Organizer
Laboratory for Computational Social Systems (LCS2) @ IIIT Delhi.
85
Registration Fee
Rs. 200/-
Thank You!
86