Phylogenetic methods of language diversification Robin J. Ryder CEREMADE – Paris Dauphine and CREST – ENSAE Work done at the Department of Statistics, University of Oxford, under the supervision of Geoff K. Nicholls www.slideshare.net/robinryder
What to expect Past attempts: Swadesh and glottochronology
Background from Evolutionary Biology
Modern methods (a sample) + criticisms
Application to dating of Proto-Indo-European
Before we start... Statistics: additional insight alongside the comparative method
None of these models represent the truth. Nonetheless, they can provide us with information.
Please interrupt me!
What Statistics add Quantitative estimates
Estimation of uncertainty
Model testing
Automatization
Swadesh and glottochronology 200/100 word list
Compares 2 languages (c=fraction of shared cognates)
Assumes that r, the fraction of cognates retained after 1000 years, is constant for all languages (86%)
Infers age t of Most Recent Common Ancestor
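The slide does not spell out how t is obtained from c and r. A minimal sketch of the classical glottochronology formula follows, reading r as the fraction of the list a single language retains per millennium (an assumption about the slide's wording). Under that reading two languages that split t millennia ago are expected to share a fraction c = r^(2t) of cognates, which gives t = ln(c) / (2 ln(r)). The function name is illustrative only.

```python
import math


def glottochronology_age(c, r=0.86):
    """Estimated time since two languages split, in millennia.

    c: observed fraction of shared cognates (0 < c <= 1).
    r: assumed retention rate per millennium for a single language
       (0.86 is the value quoted on the slide).

    Both lineages lose words independently, so the expected shared
    fraction after t millennia is r**(2*t); solving for t gives
    t = ln(c) / (2 * ln(r)).
    """
    if not 0 < c <= 1:
        raise ValueError("c must lie in (0, 1]")
    if not 0 < r < 1:
        raise ValueError("r must lie in (0, 1)")
    return math.log(c) / (2 * math.log(r))


# Sanity check: a pair sharing r**2 of its cognates is dated to 1 millennium.
age = glottochronology_age(0.86 ** 2)
```

The estimate is a single number with no measure of uncertainty, and it is only as good as the assumption that r is the same for every language.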
I you (singular) he we you (plural) they this that here there who what where when how not all many some few other one two three four five big long wide thick heavy small short narrow thin woman man (adult male) man (human being) child wife husband mother father animal fish bird dog louse snake worm tree forest stick fruit seed leaf root bark flower grass rope skin meat blood bone fat (n.) egg horn tail feather hair head ear eye nose mouth tooth tongue fingernail foot leg knee hand wing belly guts neck back breast heart liver drink eat bite suck spit vomit blow breathe laugh see hear know think smell fear sleep live die kill fight hunt hit cut split stab scratch dig swim fly (v.) walk come lie sit stand turn fall give hold squeeze rub wash wipe pull push throw tie sew count say sing play float flow freeze swell sun moon star water rain river lake sea salt stone sand dust earth cloud fog sky wind snow ice smoke fire ashes burn road mountain red green yellow white black night day year warm cold full new old good bad rotten dirty straight round sharp dull smooth wet dry correct near far right left at in with and if because name
Bergsland & Vogt (1962) Found different rates for different pairs of languages: Old Norse and Icelandic, Georgian and Mingrelian, Armenian and Old Armenian
Discredited Glottochronology
Sankoff (1973): sample selection bias, no estimation of uncertainty
Fair criticism
Bad observation protocol from Swadesh
Does not apply so much to modern methods
Genetics 101 Genetic information is stored in DNA
DNA uses 4 letters: A, C, T and G
DNA transmission
Phylogenetics
A: TTGCAATCCG
B: TAGCAATCCG
C: CTGCAATACG
D: CTGCAATAGA
Compare different possible trees
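One simple way to compare candidate trees is maximum parsimony: count the minimum number of changes each tree needs to explain the sequences. The sketch below applies Fitch's algorithm to the four sequences on the slide and to the three possible groupings of four taxa. Parsimony is used here only as an illustration; the slide does not say which criterion is meant, and the function names are assumptions.

```python
def fitch_site_cost(tree, states):
    """Fitch's algorithm for one site.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    states: dict mapping leaf name to the character observed at this site.
    Returns (set of candidate ancestral states, minimum number of changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site_cost(tree[0], states)
    right_set, right_cost = fitch_site_cost(tree[1], states)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # The children disagree: one change is needed on this node's branches.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, sequences):
    """Sum the per-site Fitch costs over all aligned positions."""
    length = len(next(iter(sequences.values())))
    return sum(
        fitch_site_cost(tree, {name: seq[i] for name, seq in sequences.items()})[1]
        for i in range(length)
    )


sequences = {
    "A": "TTGCAATCCG",
    "B": "TAGCAATCCG",
    "C": "CTGCAATACG",
    "D": "CTGCAATAGA",
}

# The three possible ways of pairing four taxa.
trees = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}

scores = {name: parsimony_score(tree, sequences) for name, tree in trees.items()}
# scores == {"((A,B),(C,D))": 5, "((A,C),(B,D))": 7, "((A,D),(B,C))": 7}
```

The tree grouping A with B and C with D needs the fewest changes, because the first and eighth positions both split the sequences as {A, B} against {C, D}.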
Charles Darwin «The formation of different languages and of distinct species, and the proofs that both have developed through a gradual process, are curiously parallel... We find in distinct languages striking homologies due to community of descent, and analogies due to a similar process of formation.»
Similarities between genes and languages As in genetics, a tree model is relevant for certain types of linguistic data.
Characteristic | Genetics | Linguistics
Discrete units | Genes, nucleotides | Lexical, morphosyntactic and/or phonological traits
Transmission | Transcription | Learning, imitation
Horizontal transmission | Viruses, hybridization... | Borrowing, creoles...
Change | Point mutation, indels... | Vowel shift, innovations, word loss
Indo-European languages
Questions Topology
Internal ages
Age of the root: 6000-6500 BP or 8000-9500 BP?
(BP=Before Present)
Core vocabulary 100 or 200 meanings, present in almost all languages: bird, hand, to eat, red...
Borrowing is possible (non-tree-like change), but:
“Easy” to detect
Uncommon
Does not introduce systematic bias
Data coding
Old English: stierfþ
Old High German: stirbit, touwit
Avestan: miriiete
Old Church Slavonic: umĭretŭ
Latin: moritur
Oscan: ?
Cognacy classes:
1. {stierfþ, stirbit}
2. {touwit}
3. {miriiete, umĭretŭ, moritur}
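Cognacy classes like these are commonly turned into a binary matrix, one column per class, with 1 where a language has a word in that class and 0 otherwise. The sketch below does this for the example above. Encoding the unknown Oscan form as "?" in every column is an assumption about how missing data would be handled, and the names are illustrative.

```python
# Attested forms for one meaning; None marks a language with no known form.
forms = {
    "Old English": ["stierfþ"],
    "Old High German": ["stirbit", "touwit"],
    "Avestan": ["miriiete"],
    "Old Church Slavonic": ["umĭretŭ"],
    "Latin": ["moritur"],
    "Oscan": None,
}

# Cognacy classes as judged by specialist linguists (from the slide).
cognacy_classes = [
    {"stierfþ", "stirbit"},
    {"touwit"},
    {"miriiete", "umĭretŭ", "moritur"},
]


def encode(forms, classes):
    """Return one row per language: 1/0 per cognacy class, '?' if unknown."""
    matrix = {}
    for language, words in forms.items():
        if words is None:
            # Missing data: we cannot say which classes are present.
            matrix[language] = ["?"] * len(classes)
        else:
            matrix[language] = [
                int(any(word in cls for word in words)) for cls in classes
            ]
    return matrix


matrix = encode(forms, cognacy_classes)
for language, row in matrix.items():
    print(f"{language:20s} {' '.join(str(x) for x in row)}")
```

Old High German has two forms for this meaning, so its row carries a 1 in two columns; a language can belong to more than one class per meaning.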
Data coding (2) Specialist linguists make cognacy judgments
Eliminate known borrowing
Only do this for languages which are known to be related
Data Indo-European languages
Core vocabulary (Swadesh 100 or 200)
Two data sets
Dyen et al. (1997): 87 languages, mostly modern
Ringe et al. (2002): 24 languages, mostly ancient
Constraints Constraints on parts of the topology
Constraints on some internal ages
We use these constraints to infer rates and other ages
Using models from biology First attempts: Jordan & Gray (2000), Gray & Atkinson (2003)
Biological models make assumptions which do not apply to languages
Gray & Atkinson (2003): tree of 87 Indo-European languages obtained using lexical data and the MrBayes package (Huelsenbeck & Ronquist).
Selection of criticisms Multiple births

Talk at Institut Jean Nicod on 6 October 2010