Why the LOB Corpus is important
The Lancaster-Oslo/Bergen (LOB) Corpus is a one-million-word collection of British English texts. It was compiled in the 1970s by the University of Lancaster, the University of Oslo, and the Norwegian Computing Centre for the Humanities, Bergen, as a British counterpart to the Brown Corpus, which Henry Kučera and W. Nelson Francis compiled for American English in the 1960s.
Its composition was designed to match the original Brown Corpus as closely as possible in size and genres, using documents published in the UK in 1961 by British authors.[1] Both corpora consist of 500 samples, each comprising about 2,000 words, in the following genres:
LOB CORPORA: Important aspects a translator needs to know
3. WHAT IS A CORPUS?
IT IS A COLLECTION OF ELECTRONICALLY STORED
SEMIOTIC DATA THAT HAS BEEN DESIGNED ACCORDING
TO SPECIFIC CORPUS DESIGN CRITERIA TO BE
MAXIMALLY REPRESENTATIVE OF (A PARTICULAR VARIETY
OF) LANGUAGE OR OTHER SEMIOTIC SYSTEMS (BUTLER,
2004).
4. FROM THE DEFINITION
It can be processed by software (electronically stored
data).
Meaning making: it can include gestures as well (semiotic).
The corpus is representative of a language. The
researchers carefully decide what to include and exclude,
and in what proportion (designed according to criteria).
It represents a valid sample of a language variety or any
other semiotic system (representative), made up of naturally
occurring examples of language (spoken or written).
From what we find in the corpus, we can draw
conclusions about the language or semiotic system.
5. WHAT IS A CORPUS?
It is a principled and large collection (body)
of authentic texts that are stored in a
computer and analyzed using software
designed for corpus analysis.
Principled: data collection is not done
randomly, but follows a planned procedure.
Authentic: genuine communication
of people going about their normal
business (Sinclair, 1996).
6. Computer Readable Semiotic Data (it makes
the analysis easier, faster and more
accurate).
Authentic material (people have produced
it on particular social occasions, or it has
otherwise been deemed authentic).
Designed to be representative.
What is a corpus?
7. A comparable corpus is one corpus in a set of two or more monolingual corpora,
typically each in a different language, built according to the same principles. The
content is therefore similar, and results can be compared between the corpora
even though they are not translations of each other (and therefore they are not
aligned).
Comparable corpus
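Because comparable corpora are not aligned, comparisons are usually made on normalized counts rather than on matching sentences. The following is a minimal sketch of that idea, assuming two tiny invented samples (one British, one American) and a hypothetical helper `per_million`; it is not taken from the LOB or Brown data and only illustrates how a relative frequency can be compared across corpora of different sizes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic word forms only."""
    return re.findall(r"[a-z]+", text.lower())


def per_million(tokens, word):
    """Relative frequency of `word`, expressed per million tokens."""
    counts = Counter(tokens)
    return counts[word] * 1_000_000 / len(tokens)


# Invented toy samples (assumption: stand-ins for two comparable corpora).
british = tokenize("The colour of the lorry was grey. The lorry stopped.")
american = tokenize("The color of the truck was gray. The truck stopped at the light.")

# Normalizing by corpus size makes the two figures directly comparable.
british_rate = per_million(british, "the")
american_rate = per_million(american, "the")
print(f"the: {british_rate:.0f} vs {american_rate:.0f} per million tokens")
```

With real corpora of about one million words each, the same normalization lets a raw count in one corpus be read against a raw count in the other.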
8. NORMALLY SPECIALIZED COLLECTIONS OF SIMILAR
SOURCE TEXTS IN THE TWO LANGUAGES.
SUCH CORPORA CAN BE "MINED" FOR TERMINOLOGY AND
OTHER EQUIVALENCES.
COMPARABLE BILINGUAL CORPUS
9. THE LOB CORPUS EXISTS IN TWO MAIN VERSIONS:
THE ORIGINAL VERSION AND A POS-TAGGED VERSION.
IN THE TAGGED CORPUS EACH WORD IS
ACCOMPANIED BY A WORD-CLASS TAG, ASSIGNED
THROUGH A COMBINATION OF AUTOMATIC TAGGING
PROGRAMS AND MANUAL PRE- AND POST-EDITING.
12. Tagged versions
Each word is accompanied by a word-class tag.
There is no syntactic bracketing.
The tagged version comes in two formats:
I: a horizontal format, with a running text where each word is immediately
followed by its associated tag;
II: a vertical format, where each word is on a separate line together with its
associated tag, some 'special information' and a reference number.
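The relationship between the two formats can be sketched in a few lines of code. This is an illustration under stated assumptions, not the corpus's own tooling: the underscore separator between word and tag, the sample sentence, the tag names, and the functions `parse_horizontal` and `to_vertical` are all hypothetical, and the 'special information' field of the vertical format is left out.

```python
def parse_horizontal(line, sep="_"):
    """Split running tagged text into (word, tag) pairs.

    Assumes each token looks like word<sep>TAG; a token without
    the separator is kept with a tag of None.
    """
    pairs = []
    for token in line.split():
        if sep in token:
            # Split on the last separator so words containing it survive.
            word, _, tag = token.rpartition(sep)
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs


def to_vertical(pairs, start=1):
    """Render one word per line: reference number, word, tag."""
    return [f"{n}\t{word}\t{tag}" for n, (word, tag) in enumerate(pairs, start)]


# Invented sample with placeholder tags (not the actual LOB tagset).
sample = "the_DET cat_NOUN sat_VERB"
pairs = parse_horizontal(sample)
vertical = to_vertical(pairs)
print("\n".join(vertical))
```

The horizontal form is convenient for reading text in context, while the vertical form suits sorting and counting, since each line is a self-contained record.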