
An Empirical Study to Use Large Language Models to Extract Named Entities from Repetitive Texts
Angelica Lo Duca
Researcher @ Institute of Informatics and Telematics
National Research Council, Italy

The Problem
Extract Named Entities from a text (e.g., a registry) with a repetitive structure.

The Problem
5510 Sunday first day of Rosh Chodesh Cheshvan (Heb.) which corresponds to October 12, 1749. A daughter was born to the Lord Salamon Sezzi and was named Ribqa bemazal tov (Heb.)
Night of Wednesday 23rd of the month Kislev (Heb.) 5510 which corresponds to December 3, 1749. A son was born to the Lord Samuel Cardoso and on the day of the mila he named him David Haim besiman tov (Heb.)
Tuesday night of the 2nd of the month of Adar Sheni (Heb.) which corresponds to March 10, 1750. A daughter was born to the Lord Angiolo Leucci and she was named Ester bemazal tov (Heb.)

The Problem
Name        Date        Sex  Father
Ribqa       1749/10/12  F    Salamon Sezzi
David Haim  1749/12/03  M    Samuel Cardoso
Ester       1750/03/10  F    Angiolo Leucci

Traditional Methods
Rule-based
Rule-based approaches rely on predefined rules and patterns crafted from the language’s linguistic properties. In a previous paper, we used this approach for the same use case described in this paper.
Learning-based
Learning-based approaches use machine learning algorithms to learn from annotated datasets. They range from traditional machine learning methods to more advanced deep learning techniques.
Hybrid
Hybrid approaches combine the strengths of both rule-based and learning-based methods. They often employ machine learning models to capture complex patterns and use rule-based systems to handle well-defined entities.
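To make the rule-based idea concrete, here is a minimal sketch of a hand-written pattern applied to the English rendering of the registry records shown earlier. The regular expression and the `extract` helper are hypothetical illustrations written for this example; they are not the rules used in the previous paper.

```python
import re
from datetime import datetime

# Hypothetical pattern, written only for the three sample records above.
RECORD = re.compile(
    r"corresponds to (?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})\. "
    r"A (?P<child>son|daughter) was born to the Lord "
    r"(?P<father>[A-Z]\w+ [A-Z]\w+)"
    r".*?named (?:him )?(?P<name>[A-Z]\w+(?: [A-Z]\w+)*)"
)

def extract(record: str):
    """Return (name, ISO date, sex, father), or None if the rule does not match."""
    text = " ".join(record.split())  # collapse line breaks and repeated spaces
    match = RECORD.search(text)
    if match is None:
        return None
    date = datetime.strptime(match.group("date"), "%B %d, %Y").date()
    sex = "M" if match.group("child") == "son" else "F"
    return (match.group("name"), date.isoformat(), sex, match.group("father"))

record = (
    "Night of Wednesday 23rd of the month Kislev (Heb.) 5510 which corresponds "
    "to December 3, 1749. A son was born to the Lord Samuel Cardoso and on the "
    "day of the mila he named him David Haim besiman tov (Heb.)"
)
print(extract(record))
```

A pattern like this works only while every record follows the same wording, which is the main limitation that motivates trying other approaches.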

In this paper
Use Large Language Models (LLMs)
to extract Named Entities

[Diagram: Repetitive Text, Record, LLM-based App, Instructions Template]

Repetitive Text
Birth Registry of the Jewish Community in Pisa.

Birth Registry of Jewish Community in Pisa
● 262 records related to the members of the Pisa Jewish community.
● Each record contains a member’s name, date of birth, sex, and father’s name.
● Date range: 1749-1809.
Name        Date        Sex  Father
Ribqa       1749/10/12  F    Salamon Sezzi
David Haim  1749/12/03  M    Samuel Cardoso
Ester       1750/03/10  F    Angiolo Leucci


LLM-based App
Compare the performance of two LLMs:
● GPT 3.5 Turbo
● GPT 4

Instructions Template
Three levels of instruction templates, distinguished by the level of detail they provide:
● simple
● medium
● detailed

2 models x 3 templates = 6 experiments
● Models: GPT 3.5 Turbo, GPT 4
● Templates: simple, medium, detailed
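The six experiments are the Cartesian product of the two models and the three templates. A minimal sketch of how that grid could be enumerated (the variable names are illustrative, not taken from the study's code):

```python
from itertools import product

MODELS = ["GPT 3.5 Turbo", "GPT 4"]
TEMPLATES = ["simple", "medium", "detailed"]

# Every (model, template) pair is one experiment.
experiments = list(product(MODELS, TEMPLATES))

for model, template in experiments:
    print(f"{model} with the {template} template")
```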

Simple template
Zero-shot prompting: this consists of giving the model a task without providing examples.
For each line extract:
- child name, father name, sex, date of birth and format as CSV
Instructions:
- If you find a son, set sex to M
- If you find a daughter, sex is F
- Do not include besiman tov in the child’s name
Answer by formatting the output in CSV.
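As an illustration of zero-shot prompting, the sketch below joins the simple template with a batch of registry records to form one prompt string. The `build_prompt` helper and the one-record-per-line layout are assumptions made for this example; the slides do not show how the prompt was actually assembled or sent to the model.

```python
SIMPLE_TEMPLATE = """For each line extract:
- child name, father name, sex, date of birth and format as CSV
Instructions:
- If you find a son, set sex to M
- If you find a daughter, sex is F
- Do not include besiman tov in the child's name
Answer by formatting the output in CSV."""

def build_prompt(template: str, records: list[str]) -> str:
    """Append the records to the template, one record per line (assumed layout)."""
    lines = [" ".join(record.split()) for record in records]  # flatten wrapped text
    return template + "\n\n" + "\n".join(lines)

records = [
    "Tuesday night of the 2nd of the month of Adar Sheni (Heb.) which corresponds\n"
    "to March 10, 1750. A daughter was born to the Lord Angiolo Leucci and she\n"
    "was named Ester bemazal tov (Heb.)",
]
print(build_prompt(SIMPLE_TEMPLATE, records))
```

No example output is included in the prompt, which is what makes it zero-shot.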

Medium template
One-shot prompting: provide a specific example to tell the model how to respond.
- extract child name, father name, sex, date of birth and format as CSV
Instructions:
- Do not include besiman tov in the child’s name
Follow this example:
Input:
<5510 Giorno di domenica primo giorno di Rosh Chodesh Cheshvan che corresponde a 12 ottobre 1749. Naque una figlia al Signore Salamon Sezzi e si pose nome Ribqa bemazal tov>
Output:
Ribqa,Salamon Sezzi,F,1749-10-17

Detailed template
One-shot prompting: provide a specific example to tell the model how to respond.
- extract child name, father name, sex, date of birth and format as CSV
Instructions:
- Do not include besiman tov in the child’s name
- If the child’s name is not present, add only a comma
- If the father’s name is not present, add only a comma
- If the date of birth is not present, add only a comma
Follow this example:
Input:
<5510 Giorno di domenica primo giorno di Rosh Chodesh Cheshvan che corresponde a 12 ottobre 1749. Naque una figlia al Signore Salamon Sezzi e si pose nome Ribqa bemazal tov>
Output:
Ribqa,Salamon Sezzi,F,1749-10-17
Answer by formatting the output in CSV.
Metrics
Father Ratio: the ratio between the number of entities recognized correctly as a father and the total number of records.
Child Ratio, Date Ratio, Sex Ratio: defined analogously for the child’s name, the date of birth, and the sex.
Total Ratio: the sum of the number of entities recognized correctly as a father, the number of entities recognized correctly as a child, the number of records where the child’s sex is correctly identified, and the number of records where the child’s date of birth is correctly identified, divided by four times the total number of records.
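The metrics above can be sketched in a few lines. This example assumes that predictions and ground truth are lists of dictionaries aligned record by record, and that the child, date, and sex ratios are computed the same way as the father ratio; both are assumptions, and the sample records are illustrative.

```python
FIELDS = ("child", "father", "sex", "date")

def ratios(predicted: list[dict], gold: list[dict]) -> dict:
    """Per-field ratios plus the total ratio, for record-aligned lists."""
    n = len(gold)
    # Count, for each field, the records where prediction and ground truth agree.
    correct = {
        field: sum(p.get(field) == g[field] for p, g in zip(predicted, gold))
        for field in FIELDS
    }
    result = {field: correct[field] / n for field in FIELDS}
    # Total ratio: all correct fields over four times the number of records.
    result["total"] = sum(correct.values()) / (4 * n)
    return result

gold = [
    {"child": "Ribqa", "father": "Salamon Sezzi", "sex": "F", "date": "1749-10-12"},
    {"child": "Ester", "father": "Angiolo Leucci", "sex": "F", "date": "1750-03-10"},
]
# Illustrative prediction with one wrong father name.
predicted = [dict(gold[0]), {**gold[1], "father": "Angiolo"}]
print(ratios(predicted, gold))
```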

The most difficult to recognize

GPT 4 (medium and detailed templates) and GPT 3.5 Turbo (detailed template) are the best-performing configurations

[Figure: Total Ratio versus Cost, divided into four quadrants: high performance and high cost, low performance and low cost, low performance and high cost, high performance and low cost]


Lessons from the experiments
● GPT 3.5 Turbo: Ideal for cost-sensitive tasks.
● GPT 4: Best for applications requiring high accuracy.
● Clear instructions optimize results.
● Precision requires clarity.

Conclusions and Future Work
● This paper explored the application of LLMs, specifically GPT 3.5
Turbo and GPT 4, for extracting named entities from repetitive
texts.
● This paper has demonstrated that all the tested LLMs reach a
total ratio greater than 0.75.
● In all cases, costs should also be considered while choosing the
best model.
● Future work could include comparing them with models released
by different providers, such as Google and Meta.

Thanks for your attention!
angelica.loduca@iit.cnr.it
https://www.linkedin.com/in/angelicaloduca/
https://alod83.medium.com/
Questions?
