Core slides used to present on LLMs at five events in 2023-24.
Details on how large language models are constructed
1. LLM core slides
Richard Saldanha
richard@oxquant.com
Dec 2023: University of Oxford (abstract);
Apr 2024: Queen Mary University of London (ECOM198 NLP lecture);
Jun: London Mathematical Society Research School, LSE (event listing);
Jun: Brighton College (news); and
Sep: Institute of Science and Technology Conference 2024, Lancaster University.
2. Introduction
• Natural Language Processing (NLP) is the big picture
• Large Language Models (LLMs) are specific NLP tools, a more recent approach
• NLP is an entire discipline of how computers ‘understand’ and work with human language
  ◦ task examples: machine translation, sentiment analysis, text summarization, text generation, speech recognition and much more
• LLMs are trained on massive amounts of data and perform many NLP tasks with high proficiency, especially tasks involving the generation of text
3. • LLMs can handle complex aspects of language such as context, grammar and even sarcasm
This talk focuses specifically on LLMs: how they are put together and their general characteristics.
Intended as a sketch of the mechanics rather than a deep dive into all the intricacies of building LLMs.
4. Talk overview
• Look at toy character-level language model
  ◦ direct modelling of character pairs (bigram language model)
  ◦ maximum likelihood estimation
  ◦ equivalent neural network fit
• Neural network extension
• Generating Shakespearean language
• LLM generalization
• General observations
5. makemore: character-level language model example
Andrej Karpathy https://github.com/karpathy/makemore
names = open("../data/names.txt", "r").read().splitlines()
['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte']
...
['zykeem', 'zylas', 'zyran', 'zyrie', 'zyron', 'zzyzx']
Names in file: 32033
Shortest name: 2 characters
Longest name: 15 characters
All names lower case Latin letters (26) with no special characters
Dataset taken from https://www.ssa.gov/oact/babynames/
6. Examine character pair combinations
• Form simple bigram language model (two-letter combinations)
• Pair combinations making up the first three names in dataset (listed below): emma, olivia, ava
• Aim to predict the next character in sequence given current character
  ◦ immediate structure to these data
• Note use of special start and end character ‘.’
. e
e m
m m
m a
a .
. o
o l
l i
i v
v i
i a
a .
. a
a v
v a
a .
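The pair listing above can be reproduced with a few lines of Python. This is a minimal sketch rather than the makemore code itself; the helper name `bigrams` is an assumption made for illustration.

```python
def bigrams(name, marker="."):
    """Return adjacent character pairs, with the start/end marker added."""
    chars = [marker] + list(name) + [marker]
    return list(zip(chars, chars[1:]))

# Print the pairs for the first three names in the dataset
for name in ["emma", "olivia", "ava"]:
    for first, second in bigrams(name):
        print(first, second)
```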
8. Displaying just the first 9 × 9 block of data (for ease of viewing). We still compute over all 27 × 27 character pairs.
9. Convert counts to probabilities (divide by row sums)
Start with the top row and select the 1st letter (following ‘.’), then choose the 2nd letter from the row corresponding to the 1st letter selected, and continue (by row, based on the preceding letter) until ‘.’ is selected.
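The counting, row normalization and sampling steps can be sketched in plain Python. The three-name list stands in for the full names file, and the function names `bigram_probabilities` and `sample_name` are assumptions for illustration, not taken from makemore.

```python
import random
from collections import Counter

ALPHABET = ".abcdefghijklmnopqrstuvwxyz"  # '.' marks start and end

def bigram_probabilities(names):
    """Count character pairs, then divide each row by its row sum."""
    counts = Counter()
    for name in names:
        chars = ["."] + list(name) + ["."]
        counts.update(zip(chars, chars[1:]))
    probs = {}
    for first in ALPHABET:
        row_sum = sum(counts[(first, second)] for second in ALPHABET)
        if row_sum > 0:  # rows never observed are simply left out
            probs[first] = [counts[(first, second)] / row_sum for second in ALPHABET]
    return probs

def sample_name(probs, rng):
    """Start in the '.' row and follow rows until '.' is drawn again."""
    current, letters = ".", []
    while True:
        current = rng.choices(ALPHABET, weights=probs[current])[0]
        if current == ".":
            return "".join(letters)
        letters.append(current)

names = ["emma", "olivia", "ava"]  # stand-in for the full dataset
probs = bigram_probabilities(names)
rng = random.Random(0)
print([sample_name(probs, rng) for _ in range(5)])
```

With only three names the samples stay close to the training data; the full dataset gives the more varied output shown on the sampling slide.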
10. Rows form multinomial distributions
Random variable $X$ with $k$ integer categories, so that $X \in \{1, 2, \dots, k\}$. Let $P(X = j) = p_j$; then the parameters $p_1, \dots, p_k$ describe the entire distribution of $X$, with the constraint $\sum_{j=1}^{k} p_j = 1$.
Generate $X_1, \dots, X_n$ from the above distribution and let $N_j = \sum_{i=1}^{n} I(X_i = j)$, i.e. the sum of indicators $I(\cdot)$ returns the number of observations in category $j$. Then the random $k \times 1$ vector $\mathbf{N} = (N_1, \dots, N_k)'$ is said to be from a multinomial distribution with parameters $(n, p_1, \dots, p_k)$, i.e.
$$\mathbf{N} \sim M_k(n; p_1, \dots, p_k)$$
The probability mass function is given by
$$P(\mathbf{N} = \mathbf{n}) = P(N_1 = n_1, \dots, N_k = n_k) = \frac{n!}{n_1! \cdots n_k!} \, p_1^{n_1} \cdots p_k^{n_k}$$
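As a quick numerical check of the multinomial formula, the probability can be evaluated directly. The function name `multinomial_pmf` is an assumption for illustration.

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(N_1 = n_1, ..., N_k = n_k) for category probabilities probs."""
    n = sum(counts)
    coefficient = factorial(n)
    for n_j in counts:
        coefficient //= factorial(n_j)  # stays an exact integer at every step
    return coefficient * prod(p ** n_j for p, n_j in zip(probs, counts))

# Two draws from two equally likely categories, one observation in each
print(multinomial_pmf([1, 1], [0.5, 0.5]))  # 0.5
```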
11. Sample from model
linioanahobackrrus.
kkay.
juwerranisirmon.
ca.
a.
antanena.
janberati.
tofana.
sie.
ba.
jobre.
c.
dwavavatonslylee.
lee.
nieleme.
janni.
Uniform sampling (ignoring data)
uhmktrbvyxofhognbqvotyve.
hsav.
vipbsfwswzdhqzvxhcinpgjajzqczkoedkoqugcpkzxwmokvakewhmwknucywjkepmgy.
dlclozznmf.
yhkngonmtyukavgvikyyqdtgdhhdwevbehbebmpgbkdkjkfbzfophfqrjpmsvojfw.
kdgbnefrxvewgcfrmcyepcnomrnzxenginguqucwktqjpscyqwmqjmkikbmzggpzffbpymaehatrwarfyj.
gcejhkpjlylbacjevvqwlmsrxsnllc.
fskkrlhibmpvscbbjvddvznkvygwluzrabykxm.
ocuggumlkpss.
kzwpwftcygtgnizuvraebsbakgnxiyehsbqthttlqe.
netszovnjhzqsjdzovzkpjvinjnmu.
fkyrhdaawwflhtyqeqyhuvrkwrbqzskxxarohlwtcresy.
bixgrecua.
12. Maximum likelihood estimation
The likelihood is the product of the probabilities, which we can use to evaluate the ‘quality’ of the model.
The likelihood of parameter $\theta$ is defined to be a function of the values of random variables $X_1, \dots, X_n$, assumed IID with respect to some density $f$:
$$L(\theta; x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i; \theta)$$
The likelihood function gives the probability of observing the given data as a function of the parameter $\theta$. The maximum likelihood estimate of $\theta$ is the value $\hat{\theta}$ that ‘maximizes’ the likelihood, i.e. makes the observed data most probable or most likely.
• More convenient to (equivalently) minimize the negative log likelihood (minimum zero)
• Average negative log likelihood is a nice quantity to work with; call this ‘loss’
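The ‘loss’ can be computed for any model that assigns a probability to each character pair. The sketch below uses a uniform model over 27 characters as a baseline; the helper name `average_nll` and the uniform model are assumptions for illustration.

```python
from math import log

def average_nll(pairs, prob):
    """Average negative log likelihood of observed (first, second) pairs.

    prob(first, second) must return the model probability that
    `second` follows `first`.
    """
    total = -sum(log(prob(first, second)) for first, second in pairs)
    return total / len(pairs)

pairs = [(".", "e"), ("e", "m"), ("m", "m"), ("m", "a"), ("a", ".")]  # 'emma'

def uniform(first, second):
    return 1 / 27  # ignores the data entirely

print(average_nll(pairs, uniform))  # equals log(27), about 3.296
```

A model fitted to the data should achieve a lower loss than this uniform baseline.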
16. Trivial neural network fit
• Armed with a loss function we can fit a neural network
• One input layer (27 units), no hidden layers, one output layer (27 units), no special activation function
• Number lowercase characters: 0 = ‘.’, 1 = ‘a’, ..., 26 = ‘z’, e.g. ‘.richard.’ bigrams represented by integers:
. r
r i
i c
c h
h a
a r
r d
d .
input = tensor([ 0, 18, 9, 3, 8, 1, 18, 4])
target = tensor([18, 9, 3, 8, 1, 18, 4, 0])
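The integer representation can be reproduced without PyTorch. A minimal sketch, with `stoi` and `encode_bigrams` as illustrative names:

```python
ALPHABET = ".abcdefghijklmnopqrstuvwxyz"
stoi = {ch: i for i, ch in enumerate(ALPHABET)}  # '.' -> 0, 'a' -> 1, ..., 'z' -> 26

def encode_bigrams(name):
    """Return (inputs, targets): each target is the character after its input."""
    ids = [stoi[ch] for ch in "." + name + "."]
    return ids[:-1], ids[1:]

inputs, targets = encode_bigrams("richard")
print(inputs)   # [0, 18, 9, 3, 8, 1, 18, 4]
print(targets)  # [18, 9, 3, 8, 1, 18, 4, 0]
```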
18. Single neuron
• $X$, an 8 × 27 matrix, is the encoding of ‘.richard’
• $W \sim N(0, 1)$, a 27 × 1 vector of weights sampled from the standard normal density (merely a convenient starting point)
• $XW$ is a vector of dimension 8 × 1
• $XW$ is the activation for every input example
• Aim to produce a probability for the next character in the sequence
27 neurons
• Compute over all neurons (units) simultaneously: $W$ is now a matrix of dimension 27 × 27, so $XW$ yields an 8 × 27 matrix
19. Fitting procedure
• Interpret $W$ matrix as weights so that $XW$ are log counts (logits)
• Normalize to turn into probabilities (‘softmax’ procedure)
$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
for $z_i$ logits and $K$ output neurons
• Tune $W$ via NN so that probabilities make sense
• Use loss function to guide the optimization (minimization)
• All steps are differentiable so easy to perform gradient descent minimization
• PyTorch loss.backward() computes gradients of the loss with respect to the model parameters (‘backpropagation’)
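The forward pass (logits, softmax, loss) can be sketched in plain Python. This assumes the 8 × 27 input is a one-hot encoding, in which case the matrix product simply selects one row of the weight matrix per example; the random seed is arbitrary, so the printed loss will not match the slides.

```python
import math
import random

def softmax(logits):
    """Exponentiate and normalize so the outputs are positive and sum to one."""
    shift = max(logits)  # subtracting the max avoids overflow; result unchanged
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
# W[i][j]: weight from input character i to output neuron j, drawn from N(0, 1)
W = [[rng.gauss(0.0, 1.0) for _ in range(27)] for _ in range(27)]

inputs = [0, 18, 9, 3, 8, 1, 18, 4]    # '.richard'
targets = [18, 9, 3, 8, 1, 18, 4, 0]   # 'richard.'

# With a one-hot input (assumed), row i of W gives the logits for example i
probs = [softmax(W[i]) for i in inputs]
loss = -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)
print(f"loss = {loss:.5f}")
```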
20. Backpropagation (optimization) over bigrams in ‘richard.’
#examples: 8
#iterations: 15
step length: -0.1
loss = 3.95747
loss = 3.94469
loss = 3.93191
loss = 3.91915
loss = 3.90640
loss = 3.89366
loss = 3.88094
loss = 3.86822
loss = 3.85552
loss = 3.84283
loss = 3.83015
loss = 3.81748
loss = 3.80482
loss = 3.79218
loss = 3.77955
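A hand-written version of this loop is sketched below. It replaces PyTorch's automatic differentiation with the standard closed-form gradient of the softmax loss, and assumes one-hot inputs and a step of 0.1 taken against the gradient (which appears to be what the slide's ‘-0.1’ denotes). The weights are initialized with an arbitrary seed, so the loss values differ from those listed above.

```python
import math
import random

inputs = [0, 18, 9, 3, 8, 1, 18, 4]    # '.richard'
targets = [18, 9, 3, 8, 1, 18, 4, 0]   # 'richard.'

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(W):
    """Average negative log likelihood and its gradient with respect to W."""
    n = len(inputs)
    grad = [[0.0] * 27 for _ in range(27)]
    loss = 0.0
    for i, t in zip(inputs, targets):
        p = softmax(W[i])
        loss -= math.log(p[t]) / n
        for j in range(27):
            # derivative w.r.t. logit j: p_j - 1 for the target, p_j otherwise
            grad[i][j] += (p[j] - (1.0 if j == t else 0.0)) / n
    return loss, grad

def fit(W, iterations=15, step=0.1):
    """Plain gradient descent; returns the loss before each update."""
    losses = []
    for _ in range(iterations):
        loss, grad = loss_and_grad(W)
        losses.append(loss)
        for i in range(27):
            for j in range(27):
                W[i][j] -= step * grad[i][j]  # move against the gradient
    return losses

rng = random.Random(0)
W = [[rng.gauss(0.0, 1.0) for _ in range(27)] for _ in range(27)]
for loss in fit(W):
    print(f"loss = {loss:.5f}")
```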
21. Optimization over training sample (80% of bigram examples)
#examples: 182473
#iterations: 150
step length: -50
end loss = 2.46398
NN probabilities are close to those obtained by maximizing the likelihood directly.
Regularization: add a penalty $\lambda (\sum_i w_i^2)/n$ ($\lambda > 0$) to the loss; as $w \to 0$ the probabilities get more uniform (analogue of the parameter used previously).
23. Why bother with neural networks?
• Extension of simple bigram model approach is difficult
• NN approach easily allows for more complex architectures, e.g. hidden layers, CNNs, Transformer, RNNs
[Figure: feed-forward network diagram. Input layer $X_1, \dots, X_p$; hidden layer L1 with units $A^{(1)}_1, \dots, A^{(1)}_{K_1}$; hidden layer L2 with units $A^{(2)}_1, \dots, A^{(2)}_{K_2}$; output layer $f_0(X), \dots, f_9(X)$ giving $Y_0, \dots, Y_9$; weight matrices W1, W2 and B.]
Feed-forward NN with two hidden layers and multiple outputs, taken from James et al. (2021), p.209. The input layer has $p = 784$ units, the two hidden layers $K_1 = 256$ and $K_2 = 128$ units, and the output layer 10 units. Along with intercepts (constants referred to as ‘biases’ by NN practitioners), this network has 235,146 parameters or weights. W1 (of dimension 785 × 256) and W2 (257 × 128) are matrices of weights feeding into the first and second hidden layers, L1 and L2, respectively. Finally, B (129 × 10) is another matrix of weights feeding into the output layer. (Note $p + 1$, $K_1 + 1$ and $K_2 + 1$ are the first dimensions of the W1, W2 and B matrices, respectively, to include the biases.)
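The parameter count in the caption follows from the layer sizes: each fully connected layer contributes (inputs + 1) × outputs weights, the extra one being the bias. A small sketch, with `dense_parameters` as an illustrative name:

```python
def dense_parameters(layer_sizes):
    """Weights plus one bias per unit for each fully connected layer."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# (784 + 1) * 256 + (256 + 1) * 128 + (128 + 1) * 10
print(dense_parameters([784, 256, 128, 10]))  # 235146
```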
24. • Flexible NN activation functions suggest improved modelling
Typical NN activation functions: a sigmoid (various forms exist), Rectified Linear Unit (ReLU) scaled by a divisor of five, and tanh
• Basic idea for predicting next character remains the same for sentence construction
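The three activation functions named in the caption can be written down directly. The logistic form of the sigmoid is an assumption (the caption notes that various forms exist), and the division by five is presumably there so that the curves share a plotting scale.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, one of several sigmoid forms; maps to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def scaled_relu(x, divisor=5.0):
    """Rectified Linear Unit divided by five, as described in the caption."""
    return max(0.0, x) / divisor

# Evaluate each function at a few points; tanh comes from the math module
for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), scaled_relu(x), math.tanh(x))
```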
26. nanoGPT: generating Shakespearean prose
Andrej Karpathy https://github.com/karpathy/nanoGPT
with open('../data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
File input.txt contains 40,000 lines of Shakespeare from a variety of plays
First Citizen: Before we proceed any further, hear
me speak. All: Speak, speak. First Citizen: You
are all resolved rather to die than to famish?
All: Resolved. resolved. First Citizen: First,
you know Caius Marcius is chief enemy to the
people. All: We know't, we know't. First
Citizen: Let us kill him, and we'll have corn at
our own price. Is't a verdict? All: No more
talking on't; let it be done: away, away! Second
Citizen: One word, good citizens. First Citizen:
We are accounted poor citizens, the patricians
good. What authority surfeits on would relieve us:
if they would yield us but the superfluity, while
it were wholesome, we might guess they relieved us
humanely; but they think we are too dear: the
leanness that afflicts us, the object of our
misery, is as an inventory to particularise their
abundance; our sufferance is a gain to them Let us
revenge this with our pikes, ere we become rakes:
for the gods know I speak this in hunger for
bread, not in thirst for revenge. Second Citizen:
Would you proceed especially against Caius
Marcius? All: Against him first: he's a very dog
to the commonalty. Second Citizen: Consider you
what services he has done for his country?
Length of dataset in characters: 1115394
27. Vocabulary
A vocabulary is usually a set of words making up the language but
as we are concerned with this example at a character level, the
vocabulary consists of individual Latin letters and symbols.
The Shakespeare dataset consists of a newline character, a space, 11 other characters (ten symbols and the digit 3) plus 52 Latin letters (26 uppercase and 26 lowercase), i.e.
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
size of vocabulary: 65
28. Tokenization
Create a mapping from characters to numbers (integers)
encode: [13, 1, 42, 39, 45, 45, 43, 56, 1, 21, 1, 57, 43, 43]
decode: A dagger I see
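A character-level tokenizer of this kind is just a pair of lookup tables. The sketch below rebuilds the 65-character vocabulary from the description on the vocabulary slide rather than from the text file; `stoi` and `itos` are illustrative names.

```python
import string

# Newline, space, the 11 listed characters, then upper- and lowercase letters
chars = sorted("\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase)
stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}  # integer -> character

def encode(text):
    return [stoi[ch] for ch in text]

def decode(ids):
    return "".join(itos[i] for i in ids)

print(len(chars))                # 65
print(encode("A dagger I see"))  # [13, 1, 42, 39, 45, 45, 43, 56, 1, 21, 1, 57, 43, 43]
print(decode(encode("A dagger I see")))
```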
More sophisticated tokenization methods exist:
• SentencePiece implements subword units (e.g. byte-pair encoding* (BPE) and unigram language model) with the extension of direct training from raw sentences
• tiktoken is a fast BPE tokenizer for use with OpenAI’s models
* Byte Pair Encoding (BPE) (Gage, 2019) is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte.
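One merge step of byte pair encoding, as described in the footnote, might look as follows. This is a toy sketch on characters rather than bytes, and it is not how SentencePiece or tiktoken are implemented; the function names are assumptions.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Return the adjacent pair that occurs most often."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair, new_token):
    """Replace every non-overlapping occurrence of `pair` with `new_token`."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(new_token)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("aaabdaaabac")
pair = most_frequent_pair(tokens)          # ('a', 'a')
print("".join(merge(tokens, pair, "Z")))   # ZabdZabac
```

Repeating the step on the merged sequence, with a fresh symbol each time, builds up a vocabulary of progressively longer units.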
30. Training the model
# `data` is assumed to be a 1-D tensor of encoded characters built earlier
# Split data into train and test sets
n = int(0.9*len(data))  # 90% train: 10% test
train = data[:n]
test = data[n:]
block_size = 8  # maximum context length
train[:block_size+1]  # one block plus the final target
x = train[:block_size]
y = train[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target is: {target}")
31. The target is the next character in the sequence y[t] given the
prior inputs x[:t+1].
when input is tensor([18]) the target is: 47
when input is tensor([18, 47]) the target is: 56
when input is tensor([18, 47, 56]) the target is: 57
when input is tensor([18, 47, 56, 57]) the target is: 58
when input is tensor([18, 47, 56, 57, 58]) the target is: 1
when input is tensor([18, 47, 56, 57, 58, 1]) the target is: 15
when input is tensor([18, 47, 56, 57, 58, 1, 15]) the target is: 47
when input is tensor([18, 47, 56, 57, 58, 1, 15, 47]) the target is: 58
We can process more than one block at a time…
inputs:
torch.Size([5, 8])
tensor([[57, 1, 46, 47, 57, 1, 50, 53],
[ 1, 58, 46, 43, 56, 43, 1, 41],
[17, 26, 15, 17, 10, 0, 32, 53],
[57, 58, 6, 1, 61, 47, 58, 46],
[ 6, 0, 14, 43, 44, 53, 56, 43]])
targets:
torch.Size([5, 8])
tensor([[ 1, 46, 47, 57, 1, 50, 53, 60],
[58, 46, 43, 56, 43, 1, 41, 39],
[26, 15, 17, 10, 0, 32, 53, 1],
[58, 6, 1, 61, 47, 58, 46, 0],
[ 0, 14, 43, 44, 53, 56, 43, 1]])
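Batches like the one above can be drawn by picking random start positions and shifting each window by one character to obtain the targets. A minimal sketch using plain lists instead of tensors; `get_batch` and the stand-in data are assumptions for illustration.

```python
import random

def get_batch(data, batch_size, block_size, rng):
    """Draw random windows; targets are the inputs shifted one position on."""
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    inputs = [data[s:s + block_size] for s in starts]
    targets = [data[s + 1:s + block_size + 1] for s in starts]
    return inputs, targets

data = list(range(100))  # stand-in for the encoded training text
inputs, targets = get_batch(data, batch_size=5, block_size=8, rng=random.Random(0))
for x, y in zip(inputs, targets):
    print(x, "->", y)
```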
32. Self-attention mechanism
The self-attention mechanism is a crucial component in large language models (LLMs)
• Introduced in the Transformer architecture (Vaswani et al., 2017), now widely adopted in LLM builds
• Self-attention refers to the model’s ability to weigh different parts of the input sequence differently when processing each element
• Allows model to focus more on relevant tokens while generating a particular word in the sequence
• Attention weights are derived that represent the contribution of each word in the sequence to the current word’s understanding or generation
33. Contextualized representation
• Attention-weighted sum of the words’ representations is added to the original word’s embedding, producing a contextualized representation
• This contextualized representation captures the influence of other words in the sequence on the current word
• By allowing the model to dynamically attend to different parts of the input sequence, long-range dependencies and relationships between words are captured
• Helps model ‘consider context’ when making predictions or generating coherent responses
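The idea can be sketched with a toy single-head self-attention in plain Python. This uses the scaled dot-product scoring of Vaswani et al. (2017) but, as a simplifying assumption, lets each embedding act as its own query, key and value; a trained model would first apply learned projections, and a text generator would also mask out future tokens.

```python
import math

def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(embeddings):
    """Return (weights, contextualized) for a list of equal-length vectors."""
    d = len(embeddings[0])
    weights, contextualized = [], []
    for query in embeddings:
        # score the current token against every token in the sequence
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in embeddings]
        row = softmax(scores)  # attention weights, summing to one
        mix = [sum(w * value[j] for w, value in zip(row, embeddings))
               for j in range(d)]
        weights.append(row)
        # weighted sum added to the token's own embedding
        contextualized.append([x + m for x, m in zip(query, mix)])
    return weights, contextualized

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy 2-d embeddings
weights, contextualized = self_attention(tokens)
for row in weights:
    print([round(w, 3) for w in row])
```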
34. Training and average loss estimation are accomplished using a feedforward neural network combined with the bigram model.
The model computes attention scores for each input token against all other tokens.
• Attention scores determine the relative importance of each input token
• Mechanism allows each token to gather information from all other tokens in the sequence
Training proceeds iteratively on batches sampled from the underlying dataset, not on all 1.1m characters at once.
35. Transformer architecture
Graphical description of the Transformer taken from Vaswani et al. (2017).
The encoder maps an input sequence of symbol representations $(x_1, \dots, x_n)$ to a sequence of continuous representations $z = (z_1, \dots, z_n)$. Given $z$, the decoder generates an output sequence $(y_1, \dots, y_m)$ of symbols one element at a time. At each step, the model is auto-regressive, consuming the previously generated symbols as additional input when generating the next symbol.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of the diagram, respectively.
36. Fakespeare
As bless your prosent'st subful a intent make fully of his thus;
Comile, and that we! loss that be this poce he Woo, commed that
us how Ty bloody's my despery lorde gromiowerss me askiss!' Whe
compeased heaveds like, Cwerracions commone it unport: In your
strange, Why years should blut repshies, We it speakingd all
well: shall we their our suyal.
BENVOLIO: Are, hi I have confort el it, any I prod.
CORIOLIO: No too, quitaring thee! Are our at your bloody---with
your orearded far oldies, And them he lard many to like, and
witht to brount As no commonne then he dent.
YORK: What your by wellass that winderingues, A moothers of it
you heavet, but but shad, Living thank his serving; I know, Is am
thee in to brest thy heaur it: met and I what, What no brothing
our the right ones?
BENVOLIO: Sh
37. LLM interaction
My input: The best thing about statistics is its ability to
Next word output from GPT-3.5 in sentence:
Next word        Probability
distill          9.0%
provide          5.2%
reveal           6.2%
systematically   2.6%
uncover          2.5%
• Output words are real, based on the input (probabilities are fictional)
• Ranked list of next words that might follow the input
38. • Likely to generate gibberish if only probabilities of words are considered
• As with letter combinations, taking into account probabilities of pairs or longer n-grams of words is sensible
• Sufficiently long n-grams should yield a useful natural language model
• Include temperature parameter so next word isn’t always the one with highest probability
  ◦ varies the output (less boring) given identical input
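The effect of a temperature parameter can be sketched with the (fictional) next-word probabilities from the previous slide. Raising each probability to the power 1/T and renormalizing is one common formulation, equivalent to dividing the logits by T before the softmax; the function name `apply_temperature` is an assumption for illustration.

```python
import random

def apply_temperature(probs, temperature):
    """Rescale probabilities: p_i ** (1 / T), renormalized to sum to one."""
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

# Candidate words with the fictional probabilities from the slide,
# renormalized here over these five words only
words = ["distill", "provide", "reveal", "systematically", "uncover"]
probs = [0.090, 0.052, 0.062, 0.026, 0.025]

rng = random.Random(0)
for temperature in (0.2, 1.0, 2.0):
    adjusted = apply_temperature(probs, temperature)
    picks = rng.choices(words, weights=adjusted, k=5)
    print(temperature, picks)
```

Low temperatures concentrate the choice on the most probable word, while high temperatures flatten the distribution and vary the output more.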
39. LLM sentence generation
The best thing about statistics is its ability to provide
The best thing about statistics is its ability to provide a
The best thing about statistics is its ability to provide a
systematic
The best thing about statistics is its ability to provide a
systematic and
The best thing about statistics is its ability to provide a
systematic and objective
The best thing about statistics is its ability to provide a
systematic and objective method
The best thing about statistics is its ability to provide a
systematic and objective method for
The best thing about statistics is its ability to provide a
systematic and objective method for collecting
...
40. Varying the output: Five GPT-3.5 responses to same input
1. The best thing about statistics is its ability to provide a systematic and
objective method for collecting, analyzing, interpreting, presenting,
and organizing data
2. The best thing about statistics is its ability to distill complex
information into meaningful insights, facilitating evidence-based
decision-making and informed understanding of the world.
3. The best thing about statistics is its ability to reveal meaningful
patterns, trends, and insights hidden within complex data.
4. The best thing about statistics is its ability to systematically analyze
and interpret data, providing objective and evidence-based insights
that inform decision-making, problem-solving, and understanding
complex phenomena.
5. The best thing about statistics is its ability to distill complex information
into meaningful patterns, trends, and insights, providing a systematic
and objective way to understand, interpret, and communicate data.
41. Current LLM ecosystem
Evolution of existing large language models (model size larger than 10bn parameters) in recent years, taken from Zhao et al. (2023). (LLMs with publicly available model checkpoints are highlighted in yellow.)
42. LLM training
Training data: chunk of internet, ≈ 4.7TB pre-training data, 1.4T tokens (image: https://www.opte.org/the-internet)
Compute ability: GPU cluster of NVIDIA A100 80G running 21 days (cost ≈ $20m*)
LLM: LLaMA-65bn or similar LLM
*as generally reported (difficult to estimate precisely)
• Generally talk in terms of dataset size ($D$), model size ($N$), training compute ($C$) and budgets for these three elements
• So-called ‘scaling laws’ attempt to determine compute-optimal training for LLMs in terms of $N$, $D$ and $C$ with respect to model loss
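To give a feel for how the three quantities relate, a widely used rule of thumb puts training compute at roughly six floating-point operations per parameter per token. This approximation is not taken from the slides, and the token count below assumes the slide's figure means 1.4 trillion tokens; treat the result as an order-of-magnitude sketch only.

```python
def training_flops(n_parameters, n_tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * n_parameters * n_tokens

model_size = 65e9       # parameters, LLaMA-65bn
dataset_size = 1.4e12   # tokens (assumed reading of the slide's figure)
compute = training_flops(model_size, dataset_size)
print(f"training compute is roughly {compute:.2e} FLOPs")
```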
43. Observations
• More (parameters) appears superior to fewer in the context of NNs
• Minimization of loss may work better the higher the dimension
  ◦ lots of search directions to look in might help avoid local minima
• Different collections of weights give rise to NNs that have much the same performance on the same problem
• Slightly different fit can give rise to dramatically different results out-of-sample
• The same NN architecture seems to work well for quite different tasks
See Wolfram (2023) for more musings here.
44. Practical use
LLMs are good at providing answers to specific questions:
• producing a skeleton for written works
• giving guidance on legal and compliance matters
• undertaking language translation
• generating computer code
• summarizing text and extracting key information
• creating graphics
LLMs also claim to do a lot more, including:
• data mining and analysis
• building virtual assistants
• text classification and document organization
• speech-to-text and text-to-speech conversion
• language learning and education more generally
45. What’s remarkable
• Wolfram points out that an NN with about as many connections as human brains have neurons (around 86bn) does a surprisingly good job of generating human language (any language it seems)
• Is human language simpler than we might like to believe it is?
• The syntax of human language isn’t a random jumble of words, it has structure
• NNs appear adept at implicit encoding of this structure
• We can’t empirically decode what an LLM has put together (at present) but LLMs may aid self-understanding of human language in the future
46. What LLMs are not
• Absolutely not artificial intelligence [IMHO]
• “Something in silicon that behaves in much the same way as a reasonably intelligent human being in any situation” is my definition of an AGI
• Cannot reason and plan; see Kambhampati (2023). Neither can it be bargained with, reasoned with; it doesn’t feel pity or remorse or fear and it absolutely will not stop … [until one turns the power off]
• Struggles with basic arithmetic (e.g. GPT-4 obtains 59% accuracy with three-digit multiplication) let alone conditional probability
47. • Limitations that stem from the nature of training and current state of the technology
• More data and greater compute ability isn’t the answer; see Villalobos et al. (2024)
• Plausible-sounding but incorrect or nonsensical answers
• Sensitivity to input phrasing
• Inability to ask clarifying questions when faced with ambiguous queries
• Biases in training data often replicated
48. • Outputs based purely on the training text data, but ‘based on’ doesn’t mean an LLM can’t extrapolate/hallucinate/confabulate
• LLMs are ‘hackable’ computer systems and user protections (human-directed filters) can be bypassed via prompt injections both trivial and more complex
• Output very often bland, no match for a true subject matter expert
50. References
Gage, P. (2019) A new algorithm for data compression. The C Users Journal, 12, 23-38. Available at: https://typeset.io/papers/a-new-algorithm-for-data-compression-3htk4tchd5.
James, G., Witten, D., Hastie, T., et al. (2021) An Introduction to Statistical Learning with Applications in R. Second Edition. Springer. Available at: https://www.statlearning.com/.
Kambhampati, S. (2023) Beauty, lies & ChatGPT: Welcome to the post-truth world. The Hill. Available at: https://thehill.com/opinion/technology/3861182-beauty-lies-chatgpt-welcome-to-the-post-truth-world/.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) Attention is all you need. In: Advances in neural information processing systems (eds I Guyon, UV Luxburg, S Bengio, et al.), 2017. Curran Associates, Inc. Available at: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Villalobos, P., Ho, A., Sevilla, J., et al. (2024) Will we run out of data? Limits of LLM scaling based on human-generated data. Available at: https://arxiv.org/abs/2211.04325.
Wolfram, S. (2023) What Is ChatGPT Doing … and Why Does It Work? Wolfram Media. Available at: https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/.
Zhao, W. X., Zhou, K., Li, J., et al. (2023) A survey of large language models. Available at: https://arxiv.org/abs/2303.18223.