LLM core slides
Richard Saldanha
richard@oxquant.com
Dec 2023: University of Oxford – abstract;
Apr 2024: Queen Mary University of London – ECOM198 NLP lecture;
Jun: London Mathematical Society Research School, LSE – event listing;
Jun: Brighton College – news; and
Sep: Institute of Science and Technology Conference 2024, Lancaster University.
Introduction
• Natural Language Processing (NLP) is the big picture
• Large Language Models (LLMs) are specific NLP tools, a more recent approach
• NLP is an entire discipline of how computers 'understand' and work with human language
• task examples: machine translation, sentiment analysis, text summarization, text generation, speech recognition and much more
• LLMs are trained on massive amounts of data and perform many NLP tasks with high proficiency, especially tasks involving the generation of text
1
• LLMs can handle complex aspects of language such as context, grammar and even sarcasm
This talk focuses specifically on LLMs, how they are put together and their general characteristics
Intended as a sketch of the mechanics rather than a deep dive into all the intricacies of building LLMs
2
Talk overview
• Look at toy character level language model
• direct modelling of character pairs (bigram language)
• maximum likelihood estimation
• equivalent neural network fit
• Neural network extension
• Generating Shakespearean language
• LLM generalization
• General observations
3
makemore – character level language model example
Andrej Karpathy https://github.com/karpathy/makemore
names = open("../data/names.txt", "r").read().splitlines()
['emma', 'olivia', 'ava', 'isabella', 'sophia', 'charlotte']
...
['zykeem', 'zylas', 'zyran', 'zyrie', 'zyron', 'zzyzx']
Names in file: 32033
Shortest name: 2 characters
Longest name: 15 characters
All names lower case Latin letters (26) with no special characters
Dataset taken from https://www.ssa.gov/oact/babynames/
4
Examine character pair combinations
• Form simple bigram language model (two-letter combinations)
• Pair combinations making up the first three names in dataset (shown right) – emma, olivia, ava
• Aim to predict the next character in sequence given current character
• immediate structure to these data
• Note use of special start and end character '.'
. e
e m
m m
m a
a .
. o
o l
l i
i v
v i
i a
a .
. a
a v
v a
a .
5
All pairs (counts): examine distribution of letter pairs across entire dataset
6
Displaying just the first 9 × 9 block of data (ease of viewing); we still compute over all 27 × 27 character pairs
7
Convert counts to probabilities (divide by row sums)
Start with the top row and select the 1st letter (following '.'), then choose the 2nd letter from the row corresponding to the 1st letter selected, and continue (by row based on the preceding letter) until '.' is selected
8
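A sketch of the count-and-normalize procedure just described (assuming the names list loaded earlier and PyTorch; the variable names are illustrative):

import torch

# build the 27 x 27 bigram count matrix ('.' plus 26 letters)
chars = sorted(set(''.join(names)))            # 'a' .. 'z'
stoi = {s: i + 1 for i, s in enumerate(chars)} # 'a' -> 1, ..., 'z' -> 26
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

N = torch.zeros((27, 27), dtype=torch.int32)
for name in names:
    tokens = ['.'] + list(name) + ['.']
    for c1, c2 in zip(tokens, tokens[1:]):
        N[stoi[c1], stoi[c2]] += 1

# convert counts to probabilities (divide by row sums)
P = N.float()
P /= P.sum(dim=1, keepdim=True)

# sample one name: start from '.', pick next letters row by row until '.'
g = torch.Generator().manual_seed(2147483647)
ix, out = 0, []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, replacement=True, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
print(''.join(out) + '.')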
Rows form multinomial distributions
Random variable Y with k integer categories, so that Y ∈ {1, 2, …, k}. Let P(Y = i) = p_i; then the parameters p_1, …, p_k describe the entire distribution of Y with the constraint ∑_i p_i = 1.
Generate Y_1, …, Y_n from the above distribution and let N_i = ∑_{j=1}^{n} I(Y_j = i), i.e. the indicator returns the number of observations in category i; then the random k × 1 vector N = (N_1, …, N_k)′ is said to be from a multinomial distribution with parameters (n, p_1, …, p_k), i.e.
\[ N \sim \mathrm{Mu}(n;\, p_1, \dots, p_k) \]
The probability density function is given by
\[ P(N = x) = P(N_1 = x_1, \dots, N_k = x_k) = \frac{n!}{x_1! \cdots x_k!}\, p_1^{x_1} \cdots p_k^{x_k} \]
9
Sample from model
linioanahobackrrus.
kkay.
juwerranisirmon.
ca.
a.
antanena.
janberati.
tofana.
sie.
ba.
jobre.
c.
dwavavatonslylee.
lee.
nieleme.
janni.
Uniform sampling (ignoring data)
uhmktrbvyxofhognbqvotyve.
hsav.
vipbsfwswzdhqzvxhcinpgjajzqczkoedkoqugcpkzxwmok
vakewhmwknucywjkepmgy.
dlclozznmf.
yhkngonmtyukavgvikyyqdtgdhhdwevbehbebmpgbkdkjkf
bzfophfqrjpmsvojfw.
kdgbnefrxvewgcfrmcyepcnomrnzxenginguqucwktqjpsc
yqwmqjmkikbmzggpzffbpymaehatrwarfyj.
gcejhkpjlylbacjevvqwlmsrxsnllc.
fskkrlhibmpvscbbjvddvznkvygwluzrabykxm.
ocuggumlkpss.
kzwpwftcygtgnizuvraebsbakgnxiyehsbqthttlqe.
netszovnjhzqsjdzovzkpjvinjnmu.
fkyrhdaawwflhtyqeqyhuvrkwrbqzskxxarohlwtcresy.
bixgrecua.
10
Maximum likelihood estimation
The likelihood is the product of the probabilities that we can use to evaluate the 'quality' of the model.
The likelihood of parameter θ is defined to be a function of the values of random variables X_1, …, X_n assumed IID with respect to some density f:
\[ L(\theta;\, x_1, \dots, x_n) = \prod_{i=1}^{n} f(x_i; \theta) \]
The likelihood function gives the probability of observing the given data as a function of the parameter θ. The maximum likelihood estimate of θ is the value θ̂ that 'maximizes' the likelihood, i.e. makes the observed data most probable or most likely.
• More convenient to (equivalently) minimize the negative log likelihood (minimum zero)
• Average negative log likelihood is a nice quantity to work with; call this 'loss'
11
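As a sketch, the average negative log likelihood ('loss') of a single name under the bigram model, reusing P and stoi from the earlier sketch:

import torch

def bigram_loss(word, P, stoi):
    """Average negative log likelihood of a word under the bigram model."""
    tokens = ['.'] + list(word) + ['.']
    nll, n = 0.0, 0
    for c1, c2 in zip(tokens, tokens[1:]):
        prob = P[stoi[c1], stoi[c2]]
        nll += -torch.log(prob)
        n += 1
    return (nll / n).item()

print(bigram_loss('rory', P, stoi))   # about 2.41, as on the slide
print(bigram_loss('rorjq', P, stoi))  # inf without smoothing ('jq' never occurs)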
prob log(prob)
.e: 0.048 -3.041
em: 0.038 -3.279
mm: 0.025 -3.677
ma: 0.390 -0.942
a.: 0.196 -1.630
.o: 0.012 -4.398
ol: 0.078 -2.551
li: 0.178 -1.728
iv: 0.015 -4.187
vi: 0.354 -1.038
ia: 0.138 -1.980
a.: 0.196 -1.630
.a: 0.138 -1.983
av: 0.025 -3.704
va: 0.250 -1.388
a.: 0.196 -1.630
-loglikelihood = 38.786
loss = 2.424
(interpret loss as quality of model)
evaluate "rory"
prob log(prob)
.r: 0.051 -2.973
ro: 0.068 -2.682
or: 0.133 -2.014
ry: 0.061 -2.799
y.: 0.205 -1.583
-loglikelihood = 12.051
loss = 2.410
evaluate "rorjq"
prob log(prob)
.r: 0.051 -2.973
ro: 0.068 -2.682
or: 0.133 -2.014
rj: 0.002 -6.230
jq: 0.000 -inf
q.: 0.103 -2.274
-loglikelihood = inf
loss = inf
12
Model smoothing (simple regularization)
Adding one to each of the counts avoids zero probabilities
Generalization: add α > 0 to each count; as α gets larger the probabilities {p_i} get more uniform
13
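In code, the smoothing just described is a one-line change before normalizing (a sketch, reusing the count matrix N from the earlier sketch; alpha is the smoothing parameter):

alpha = 1                      # add-one smoothing; larger alpha pushes rows toward uniform
P = (N + alpha).float()
P /= P.sum(dim=1, keepdim=True)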
prob log(prob)
.e: 0.048 -3.041
em: 0.038 -3.279
mm: 0.025 -3.675
ma: 0.389 -0.945
a.: 0.196 -1.631
.o: 0.012 -4.396
ol: 0.078 -2.553
li: 0.177 -1.729
iv: 0.015 -4.184
vi: 0.351 -1.048
ia: 0.138 -1.981
a.: 0.196 -1.631
.a: 0.138 -1.984
av: 0.025 -3.704
va: 0.247 -1.397
a.: 0.196 -1.631
-loglikelihood = 38.809
loss = 2.426
evaluate "rory"
prob log(prob)
.r: 0.051 -2.973
ro: 0.068 -2.683
or: 0.133 -2.016
ry: 0.061 -2.800
y.: 0.205 -1.586
-loglikelihood = 12.058
loss = 2.412
evaluate "rorjq"
prob log(prob)
.r: 0.0512 -2.973
ro: 0.0684 -2.683
or: 0.1331 -2.016
rj: 0.0020 -6.193
jq: 0.0003 -7.982
q.: 0.0970 -2.333
-loglikelihood = 24.180
loss = 4.030
14
Trivial neural network fit
• Armed with a loss function we can fit a neural network
• One input layer (27 units), no hidden layers, one output layer (27 units), no special activation function
• Number the lowercase characters: 0 = '.', 1 = 'a', ..., 26 = 'z', e.g. '.richard.' bigrams represented by integers:
. r
r i
i c
c h
h a
a r
r d
d .
input = tensor([ 0, 18, 9, 3, 8, 1, 18, 4])
target = tensor([18, 9, 3, 8, 1, 18, 4, 0])
15
One hot encoding
Encoding of '.richard' as the following set of eight binary vectors:
tensor([[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
makes subsequent numerical computations more efficient
16
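A sketch of this encoding step using PyTorch's F.one_hot (the integer tensor is the '.richard' input shown on the previous slide):

import torch
import torch.nn.functional as F

xs = torch.tensor([0, 18, 9, 3, 8, 1, 18, 4])  # '.richard' as integers
xenc = F.one_hot(xs, num_classes=27).float()   # 8 x 27 binary matrix
print(xenc.shape)                              # torch.Size([8, 27])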
Single neuron
• X, an 8 × 27 matrix, is the encoding of .richard
• W ~ N(0, 1), a 27 × 1 vector of weights sampled from the standard normal density (merely a convenient starting point)
• XW is a vector of dimension 8 × 1
• XW is the activation for every input example
• Aim to produce a probability for the next character in sequence
27 neurons
• Compute over all neurons (units) simultaneously – W now a matrix of dimension 27 × 27 – yields an 8 × 27 matrix
17
Fitting procedure
• Interpret the W matrix as weights so that XW are log counts (logits)
• Normalize to turn into probabilities (the 'softmax' procedure)
\[ p_i = \frac{e^{l_i}}{\sum_{j=1}^{k} e^{l_j}} \]
for l_i logits and k output neurons
• Tune W via the NN so that probabilities make sense
• Use the loss function to guide the optimization (minimization)
• All steps are differentiable so it is easy to perform descent minimization
• PyTorch loss.backward() computes gradients of model parameters with respect to the loss – 'backpropagation'
18
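Pulling the last two slides together, a sketch of the fitting loop for this single-layer network (ys are the target integers shown earlier and xenc the one-hot encoding from the sketch above; losses will differ slightly from the slide because W starts at random):

import torch
import torch.nn.functional as F

ys = torch.tensor([18, 9, 3, 8, 1, 18, 4, 0])       # targets for '.richard.'
W = torch.randn((27, 27), requires_grad=True)       # weights, N(0, 1) start

for _ in range(15):
    logits = xenc @ W                                # 8 x 27 log counts
    probs = F.softmax(logits, dim=1)                 # rows sum to 1
    loss = -probs[torch.arange(8), ys].log().mean()  # average negative log likelihood
    W.grad = None
    loss.backward()                                  # backpropagation
    W.data += -0.1 * W.grad                          # step length -0.1
    print(f"loss = {loss.item():.5f}")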
Backpropagation (optimization) over bigrams in 'richard.'
#examples: 8
#iterations: 15
step length: -0.1
loss = 3.95747
loss = 3.94469
loss = 3.93191
loss = 3.91915
loss = 3.90640
loss = 3.89366
loss = 3.88094
loss = 3.86822
loss = 3.85552
loss = 3.84283
loss = 3.83015
loss = 3.81748
loss = 3.80482
loss = 3.79218
loss = 3.77955
19
Optimization over training sample (80% of bigram examples)
#examples: 182473
#iterations: 150
step length: -50
end loss = 2.46398
NN probabilities close to those obtained by maximizing likelihood directly
Regularization: add a likelihood penalty λ(∑ W_i²)/n (λ > 0); as W → 0 the probabilities get more uniform (analogue of the α used previously)
20
Original model
linioanahobackrrus.
kkay.
juwerranisirmon.
ca.
a.
antanena.
janberati.
tofana.
sie.
ba.
jobre.
c.
dwavavatonslylee.
lee.
nieleme.
janni.
Sample from NN model
linioanahobmckarus.
kfbe.
juwerranisirmon.
ca.
a.
antanena.
jqubquati.
tofana.
sie.
ba.
jobre.
c.
dwavavatonslylee.
lee.
nieleme.
jvoni.
(Random seed start value fixed at same number when running each model)
21
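For completeness, a sketch of how names are sampled from the fitted network (assuming a trained weight matrix W and the itos mapping from the earlier sketches; the output only matches the slide if W is trained on the full training sample with the same fixed seed):

import torch
import torch.nn.functional as F

g = torch.Generator().manual_seed(2147483647)      # fixed seed, as noted above
for _ in range(5):
    ix, out = 0, []
    while True:
        xenc = F.one_hot(torch.tensor([ix]), num_classes=27).float()
        probs = F.softmax(xenc @ W, dim=1)          # 1 x 27 next-character distribution
        ix = torch.multinomial(probs, num_samples=1, generator=g).item()
        if ix == 0:
            break
        out.append(itos[ix])
    print(''.join(out) + '.')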
Why bother with neural networks?
• Extension of simple bigram model approach is difficult
• NN approach easily allows for more complex architectures, e.g. hidden layers, CNNs, Transformers, RNNs
(Figure: feed-forward NN diagram with input units X1, …, Xp, two hidden layers of units A(1) and A(2), output functions f0(X), …, f9(X), and weight matrices W1, W2 and B)
Feed-forward NN with two hidden layers and multiple outputs taken from James et al. (2021), p.209. The input layer has p = 784 units, the two hidden layers K1 = 256 and K2 = 128 units, and the output layer 10 units. Along with intercepts, constants referred to as 'biases' by NN practitioners, this network has 235,146 parameters or weights. W1 (of dimension 785 × 256) and W2 (257 × 128) are matrices of weights feeding into the first and second hidden layers – L1 and L2, respectively. Finally, B (129 × 10) is another matrix of weights feeding into the output layer. (Note p + 1, K1 + 1 and K2 + 1 are the first dimensions of the W1, W2 and B matrices, respectively, to include the biases.)
22
• Flexible NN activation functions suggest improved modelling
Typical NN activation functions: a sigmoid (various forms exist), Rectified Linear Unit (ReLU) scaled by a divisor of five, and tanh
• Basic idea for predicting next character remains the same for sentence construction
23
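For reference, a sketch of the three activation functions mentioned (the divisor of five on the ReLU is just the rescaling used in the original plot):

import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

def relu_scaled(x):
    return torch.clamp(x, min=0) / 5   # ReLU scaled by a divisor of five

x = torch.linspace(-4, 4, 9)
print(sigmoid(x), relu_scaled(x), torch.tanh(x), sep='\n')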
Increasing NN complexity and training time for makemore
24
nanoGPT – generating Shakespearean prose
Andrej Karpathy https://github.com/karpathy/nanoGPT
with open('../data/input.txt', 'r', encoding='utf-8') as f:
    text = f.read()
File input.txt contains 40,000 lines of Shakespeare from a variety of plays
First Citizen: Before we proceed any further, hear
me speak. All: Speak, speak. First Citizen: You
are all resolved rather to die than to famish?
All: Resolved. resolved. First Citizen: First,
you know Caius Marcius is chief enemy to the
people. All: We know't, we know't. First
Citizen: Let us kill him, and we'll have corn at
our own price. Is't a verdict? All: No more
talking on't; let it be done: away, away! Second
Citizen: One word, good citizens. First Citizen:
We are accounted poor citizens, the patricians
good. What authority surfeits on would relieve us:
if they would yield us but the superfluity, while
it were wholesome, we might guess they relieved us
humanely; but they think we are too dear: the
leanness that afflicts us, the object of our
misery, is as an inventory to particularise their
abundance; our sufferance is a gain to them Let us
revenge this with our pikes, ere we become rakes:
for the gods know I speak this in hunger for
bread, not in thirst for revenge. Second Citizen:
Would you proceed especially against Caius
Marcius? All: Against him first: he's a very dog
to the commonalty. Second Citizen: Consider you
what services he has done for his country?
Length of dataset in characters: 1115394
25
Vocabulary
A vocabulary is usually a set of words making up the language but, as this example works at the character level, the vocabulary consists of individual Latin letters and symbols.
The Shakespeare dataset consists of a newline character, a space
and 11 special characters plus 52 Latin letters (26 uppercase and
26 lowercase), i.e.
!$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
size of vocabulary: 65
26
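A sketch of how this vocabulary is derived from the raw text (text as read in on the previous slide):

chars = sorted(set(text))                  # unique characters in the Shakespeare file
vocab_size = len(chars)
print(''.join(chars))                      # newline, space, punctuation and letters
print('size of vocabulary:', vocab_size)   # 65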
Tokenization
Create a mapping from characters to numbers (integers)
encode: [13, 1, 42, 39, 45, 45, 43, 56, 1, 21, 1, 57, 43, 43]
decode: A dagger I see
More sophisticated tokenization methods exist:
• SentencePiece implements subword units (e.g., byte-pair-encoding* (BPE) and unigram language model) with the extension of direct training from raw sentences
• tiktoken is a fast BPE tokenizer for use with OpenAI's models
*Byte Pair Encoding (BPE) (Gage (2019)) is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte.
27
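A character-level tokenizer is just a pair of lookup tables; a sketch of the encode/decode functions used on the surrounding slides (chars as in the vocabulary sketch above):

stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> character

def encode(s):
    return [stoi[c] for c in s]

def decode(ids):
    return ''.join(itos[i] for i in ids)

print(encode("A dagger I see"))   # [13, 1, 42, 39, 45, 45, 43, 56, 1, 21, 1, 57, 43, 43]
print(decode(encode("A dagger I see")))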
import torch # we use PyTorch: https://pytorch.org
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:200]) # tokenization of the first 200 characters from
                  # input.txt but we've tokenized all ≈1.1m chars
torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44,
53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63,
1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1,
57, 54, 43, 39, 49, 8, 0, 0, 13, 50, 50, 10, 0, 31, 54, 43, 39, 49,
6, 1, 57, 54, 43, 39, 49, 8, 0, 0, 18, 47, 56, 57, 58, 1, 15, 47,
58, 47, 64, 43, 52, 10, 0, 37, 53, 59, 1, 39, 56, 43, 1, 39, 50, 50,
1, 56, 43, 57, 53, 50, 60, 43, 42, 1, 56, 39, 58, 46, 43, 56, 1, 58,
53, 1, 42, 47, 43, 1, 58, 46, 39, 52, 1, 58, 53, 1, 44, 39, 51, 47,
57, 46, 12, 0, 0, 13, 50, 50, 10, 0, 30, 43, 57, 53, 50, 60, 43, 42,
8, 1, 56, 43, 57, 53, 50, 60, 43, 42, 8, 0, 0, 18, 47, 56, 57, 58,
1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 18, 47, 56, 57, 58, 6, 1, 63,
53, 59])
28
Training the model
# Split data into train and test sets
n = int(0.9*len(data)) # 90% train: 10% test
train = data[:n]
test = data[n:]
block_size = 8
train[:block_size+1]
x = train[:block_size]
y = train[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]
    target = y[t]
    print(f"when input is {context} the target is: {target}")
29
The target is the next character in the sequence y[t] given the
prior inputs x[:t+1].
when input is tensor([18]) the target is: 47
when input is tensor([18, 47]) the target is: 56
when input is tensor([18, 47, 56]) the target is: 57
when input is tensor([18, 47, 56, 57]) the target is: 58
when input is tensor([18, 47, 56, 57, 58]) the target is: 1
when input is tensor([18, 47, 56, 57, 58, 1]) the target is: 15
when input is tensor([18, 47, 56, 57, 58, 1, 15]) the target is: 47
when input is tensor([18, 47, 56, 57, 58, 1, 15, 47]) the target is: 58
We can process more than one block at a time…
inputs:
torch.Size([5, 8])
tensor([[57, 1, 46, 47, 57, 1, 50, 53],
[ 1, 58, 46, 43, 56, 43, 1, 41],
[17, 26, 15, 17, 10, 0, 32, 53],
[57, 58, 6, 1, 61, 47, 58, 46],
[ 6, 0, 14, 43, 44, 53, 56, 43]])
targets:
torch.Size([5, 8])
tensor([[ 1, 46, 47, 57, 1, 50, 53, 60],
[58, 46, 43, 56, 43, 1, 41, 39],
[26, 15, 17, 10, 0, 32, 53, 1],
[58, 6, 1, 61, 47, 58, 46, 0],
[ 0, 14, 43, 44, 53, 56, 43, 1]])
30
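A sketch of how such input/target batches can be drawn at random from the training split (train, test and block_size as defined above; a batch_size of 5 matches the shapes shown, though the exact values depend on the random seed):

import torch

torch.manual_seed(1337)
batch_size = 5   # number of sequences processed in parallel
block_size = 8   # maximum context length

def get_batch(split):
    data_ = train if split == 'train' else test
    ix = torch.randint(len(data_) - block_size, (batch_size,))
    x = torch.stack([data_[i:i + block_size] for i in ix])
    y = torch.stack([data_[i + 1:i + block_size + 1] for i in ix])
    return x, y

xb, yb = get_batch('train')
print(xb.shape, yb.shape)   # torch.Size([5, 8]) torch.Size([5, 8])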
Self-attention mechanism
The self-attention mechanism is a crucial component in large
language models (LLMs)
• Introduced in the Transformer architecture (Vaswani et al. (2017)), now widely adopted in LLM builds
• Self-attention refers to model's ability to weigh different parts of input sequence differently when processing each element
• Allows model to focus more on relevant tokens while generating a particular word in the sequence
• Attention weights are derived that represent contribution of each word in the sequence in relation to current word's understanding or generation
31
Contextualized representation
• Weighted sum added to original word's embedding, producing contextualized representation
• This contextualized representation captures influence of other words in sequence on current word
• By allowing model to dynamically attend to different parts of the input sequence, long-range dependencies and relationships between words are captured
• Helps model 'consider context' when making predictions or generating coherent responses
32
Training and average loss estimation is accomplished using a feedforward neural network combined with the bigram model. The model computes attention scores for each input token against all other tokens.
• Attention scores determine the relative importance of each input token
• Mechanism allows each token to gather information from all other tokens in the sequence
Training proceeds iteratively by sampling batches from the underlying dataset rather than processing all 1.1m characters at once.
33
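A minimal sketch of a single self-attention head of the kind described (scaled dot-product attention with a causal mask; the dimensions and variable names are illustrative, not those of any particular model):

import torch
import torch.nn.functional as F

B, T, C = 4, 8, 32          # batch, time (tokens), channels
head_size = 16
x = torch.randn(B, T, C)    # token embeddings

key   = torch.nn.Linear(C, head_size, bias=False)
query = torch.nn.Linear(C, head_size, bias=False)
value = torch.nn.Linear(C, head_size, bias=False)

k, q, v = key(x), query(x), value(x)                # each B x T x head_size
wei = q @ k.transpose(-2, -1) * head_size ** -0.5   # attention scores, B x T x T
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))     # causal mask: no peeking ahead
wei = F.softmax(wei, dim=-1)                        # attention weights per token
out = wei @ v                                       # weighted sum of values
print(out.shape)                                    # torch.Size([4, 8, 16])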
Transformer architecture
Graphical description of the transformer taken from Vaswani et al. (2017).
The encoder maps an input sequence of symbol representations (x_1, …, x_n) to a sequence of continuous representations z = (z_1, …, z_n). Given z, the decoder generates an output sequence (y_1, …, y_m) of symbols one element at a time. At each step, the model is auto-regressive – consuming the previously generated symbols as additional input when generating the next symbol.
The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of the diagram, respectively.
34
Fakespeare
As bless your prosent'st subful a intent make fully of his thus;
Comile, and that we! loss that be this poce he Woo, commed that
us how Ty bloody's my despery lorde gromiowerss me askiss!' Whe
compeased heaveds like, Cwerracions commone it unport: In your
strange, Why years should blut repshies, We it speakingd all
well: shall we their our suyal.
BENVOLIO: Are, hi I have confort el it, any I prod.
CORIOLIO: No too, quitaring thee! Are our at your bloody---with
your orearded far oldies, And them he lard many to like, and
witht to brount As no commonne then he dent.
YORK: What your by wellass that winderingues, A moothers of it
you heavet, but but shad, Living thank his serving; I know, Is am
thee in to brest thy heaur it: met and I what, What no brothing
our the right ones?
BENVOLIO: Sh
35
LLM interaction
My input: The best thing about statistics is its
ability to
Next word output from GPT-3.5 in sentence:
Next word Probability
distill 9.0%
provide 5.2%
reveal 6.2%
systematically 2.6%
uncover 2.5%
• Output words are real based on input (probabilities fictional)
• Ranked list of next words that might follow input
36
• Likely to generate gibberish if only probabilities of words are considered
• As with letter combinations, taking into account probabilities of pairs or longer n-grams of words is sensible
• Sufficiently long n-grams should yield a useful natural language model
• Include a temperature parameter so the next word isn't always the one with highest probability
• varies the output (less boring) given identical input
37
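Temperature simply rescales the logits before the softmax; a sketch with made-up scores:

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # scores for candidate next words

for T in (0.5, 1.0, 2.0):
    probs = F.softmax(logits / T, dim=0)       # higher T flattens the distribution
    next_id = torch.multinomial(probs, num_samples=1).item()
    print(f"T={T}: probs={probs.tolist()}, sampled index {next_id}")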
LLM sentence generation
The best thing about statistics is its ability to provide
The best thing about statistics is its ability to provide a
The best thing about statistics is its ability to provide a
systematic
The best thing about statistics is its ability to provide a
systematic and
The best thing about statistics is its ability to provide a
systematic and objective
The best thing about statistics is its ability to provide a
systematic and objective method
The best thing about statistics is its ability to provide a
systematic and objective method for
The best thing about statistics is its ability to provide a
systematic and objective method for collecting
...
38
Varying the output: Five GPT-3.5 responses to same input
1. The best thing about statistics is its ability to provide a systematic and
objective method for collecting, analyzing, interpreting, presenting,
and organizing data
2. The best thing about statistics is its ability to distill complex
information into meaningful insights, facilitating evidence-based
decision-making and informed understanding of the world.
3. The best thing about statistics is its ability to reveal meaningful
patterns, trends, and insights hidden within complex data.
4. The best thing about statistics is its ability to systematically analyze
and interpret data, providing objective and evidence-based insights
that inform decision-making, problem-solving, and understanding
complex phenomena.
5. The best thing about statistics is its ability to distill complex information
into meaningful patterns, trends, and insights, providing a systematic
and objective way to understand, interpret, and communicate data.
39
Current LLM ecosystem
Evolution of existing large language models (model size larger than 10bn
parameters) in recent years taken from Zhao et al. (2023). (LLMs with publicly
available model checkpoints are highlighted in yellow.)
40
LLM training
Chunk of internet
≈ 4.7TB pre-training data, ~1.4tn tokens
(image: https://www.opte.org/the-internet)
GPU cluster of NVIDIA A100 80G running 21 days
(Cost ≈ $20m*)
*as generally reported (difficult to estimate precisely)
LLaMA-65bn or similar LLM
(Diagram: Training Data + Compute Ability → LLM)
• Generally talk in terms of dataset size (D), model size (N), training compute (C) and budgets for these three elements
• So-called 'scaling laws' attempt to determine compute-optimal training for LLMs in terms of D, N and C wrt model loss
41
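As an illustration of how these quantities interact, a widely quoted rule of thumb from the scaling-law literature (not given on the slide) approximates training compute as C ≈ 6ND floating-point operations:

N = 65e9        # model size: 65bn parameters (LLaMA-65B scale)
D = 1.4e12      # dataset size: roughly 1.4tn training tokens
C = 6 * N * D   # rule-of-thumb training compute in FLOPs
print(f"C ~ {C:.2e} FLOPs")   # roughly 5.5e23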
Observations
• More (parameters) appears superior to fewer in the context of NNs
• Minimization of the loss might possibly be better the higher the dimension
• lots of search directions to look in might help avoid a local minimum
• Different collections of weights give rise to NNs that have much the same performance on the same problem
• A slightly different fit can give rise to dramatically different results out-of-sample
• The same NN architecture seems to work well for quite different tasks
See Wolfram (2023) for more musings here.
42
Practical use
LLMs are good at providing answers to specific questions:
• producing a skeleton for written works
• giving guidance on legal and compliance matters
• undertaking language translation
• generating computer code
• summarizing text and extracting key information
• creating graphics
LLMs also claim to do a lot more, including:
• data mining and analysis
• building virtual assistants
• text classification and document organization
• speech-to-text and text-to-speech conversion
• language learning and education more generally
43
What's remarkable
• Wolfram points out that a NN with about as many connections as human brains have neurons (around 86bn) does a surprisingly good job of generating human language (any language it seems)
• Is human language simpler than we might like to believe it is?
• The syntax of human language isn't a random jumble of words; it has structure
• NNs appear adept at implicit encoding of this structure
• We can't empirically decode what an LLM has put together (at present) but LLMs may aid self-understanding of human language in the future
44
What LLMs are not
• Absolutely not artificial intelligence [IMHO]
• "Something in silicon that behaves in much the same way as a reasonably intelligent human being in any situation" is my definition of an AGI
• Cannot reason and plan; see Kambhampati (2023). Neither can it be bargained with, reasoned with; it doesn't feel pity or remorse or fear and it absolutely will not stop … [until one turns the power off]
• Struggles with basic arithmetic (e.g. GPT-4 obtains 59% accuracy with three-digit multiplication) let alone conditional probability
45
• Limitations that stem from the nature of training and current state of the technology
• More data and greater compute ability isn't the answer; see Villalobos et al. (2024)
• Plausible-sounding but incorrect or nonsensical answers
• Sensitivity to input phrasing
• Inability to ask clarifying questions when faced with ambiguous queries
• Biases in training data often replicated
46
• Outputs based purely on the training text data but 'based on' doesn't mean an LLM can't extrapolate/hallucinate/confabulate
• LLMs are 'hackable' computer systems and user protections (human-directed filters) can be bypassed via prompt injections both trivial and more complex
• Output very often bland, no match for a true subject matter expert
47
• Questions
My responses are limited. You must ask the right questions.
48
References
Gage, P. (2019) A new algorithm for data compression. The C Users Journal, 12, 23–38. Available at:
https://typeset.io/papers/a-new-algorithm-for-data-compression-3htk4tchd5.
James, G., Witten, D., Hastie, T., et al. (2021) An Introduction to Statistical Learning with Applications in
R. Second Edition. Springer. Available at: https://www.statlearning.com/.
Kambhampati, S. (2023) Beauty, lies & ChatGPT: Welcome to the post-truth world. The Hill. Available
at: https://thehill.com/opinion/technology/3861182-beauty-lies-chatgpt-welcome-to-the-post-truth-
world/.
Vaswani, A., Shazeer, N., Parmar, N., et al. (2017) Attention is all you need. In: Advances in neural
information processing systems (eds I Guyon, UV Luxburg, S Bengio, et al.), 2017. Curran Associates,
Inc. Available at: https:
//proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Villalobos, P., Ho, A., Sevilla, J., et al. (2024) Will we run out of data? Limits of LLM scaling based on
human-generated data. Available at: https://arxiv.org/abs/2211.04325.
Wolfram, S. (2023) What Is ChatGPT Doing … and Why Does It Work? Wolfram Media. Available at:
https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/.
Zhao, W. X., Zhou, K., Li, J., et al. (2023) A survey of large language models. Available at:
https://arxiv.org/abs/2303.18223.
