NLP in the WILD
-or-
Building a System for
Text Language Identification
Vsevolod Dyomkin
12/2016
A Bit about Me
* Lisp programmer
* 5+ years of NLP work at Grammarly
* Occasional lecturer
https://vseloved.github.io
Roles
Langid Problem
* 150+ langs in Wikipedia
* >10 writing systems
(script/alphabet) in active use
* script-lang: 1:1, 1:2, 1:n, n:1 :)
* Latin >50 langs, Cyrillic >20
* Long texts are easy, short ones hmm...
* Internet texts (mixed langs)
* Small task => resource-constrained
Twitter Case Study
https://blog.twitter.com/2015/evaluating-language-identification-performance
Prior Art
* C++: https://github.com/CLD2Owners/cld2
* Python: https://github.com/saffsd/langid.py
* Java:
https://github.com/shuyo/language-detection/
http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
http://lab.hypotheses.org/1083
http://labs.translated.net/language-identifier/
YALI WILD
* All of them use weak models
* Wanted to use Wiktionary: 150+ languages, always evolving
* Wanted to do it in Lisp
Linguistics
(domain knowledge)
* Polyglots?
* ISO 639
* Internet lang bias
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
* Rule-based ideas (see the sketch below):
- 1:1/1:2 scripts
- unique letters
* Per-script/per-lang segmentation
insight → data
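A minimal sketch of the rule-based ideas above (1:1 scripts, unique letters) in plain Common Lisp. The Unicode ranges and the script-to-language table are illustrative stand-ins, not the actual data behind WILD:

(defparameter *script-ranges*
  ;; (script low-codepoint high-codepoint) -- illustrative subset
  '((:greek    #x0370 #x03FF)
    (:hebrew   #x0590 #x05FF)
    (:armenian #x0530 #x058F)
    (:georgian #x10A0 #x10FF)
    (:thai     #x0E00 #x0E7F)))

(defparameter *one-to-one-scripts*
  ;; scripts used (almost) exclusively by a single language
  '((:armenian . :hy) (:georgian . :ka) (:thai . :th)))

(defun char-script (char)
  (loop :for (script lo hi) :in *script-ranges*
        :when (<= lo (char-code char) hi) :return script))

(defun guess-lang-by-script (text)
  "Return a language keyword if TEXT contains a character from a 1:1
script, otherwise NIL (fall through to the statistical model)."
  (loop :for char :across text
        :thereis (cdr (assoc (char-script char) *one-to-one-scripts*))))

;; (guess-lang-by-script "ქართული ენა") => :KA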
Data
* evaluation data:
- smoke test
- in-/out-of-domain data
- precision-/recall-oriented
* training data
- where to get? Wikidata
- how to get? SAX parsing
Wiktionary
* good source for various
dictionaries and word lists (word
forms, definitions, synonyms, ...)
* ~100 langs
Wikipedia
* >150 langs
* size? Wikipedia abstracts (see the sketch below)
* automation?
* filtering?
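The real pipeline gets training text out of the dumps via SAX parsing; below is a deliberately simplified, dependency-free sketch that assumes (as the Wikipedia abstracts dumps usually do) that each <abstract> element sits on a single line:

(defun extract-abstracts (path &key (limit 1000))
  "Collect up to LIMIT abstract strings from a Wikipedia abstracts dump.
A production version should use a streaming XML (SAX) parser instead."
  (with-open-file (in path :external-format :utf-8)
    (let ((rez ()))
      (loop
        (let ((line (read-line in nil)))
          (when (or (null line) (zerop limit))
            (return))
          (let ((start (search "<abstract>" line))
                (end (search "</abstract>" line)))
            (when (and start end (< start end))
              (push (subseq line (+ start (length "<abstract>")) end) rez)
              (decf limit)))))
      (nreverse rez))))

;; (extract-abstracts #p"ukwiki-latest-abstract.xml" :limit 10)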
Alternatives
* API
;; fetch example sentences for WORD from the Wordnik API
;; (RUTILS shorthand: ^ = lambda with % argument, ? = generic access, fmt = format to string)
(defun get-examples (word)
  (remove-if-not
   ^(upper-case-p (char % 0))
   (mapcar ^(substitute #\Space #\_ (? % "text"))
           (? (yason:parse
               (drakma:http-request
                (fmt "http://api.wordnik.com/v4/word.json/~A/examples"
                     (drakma:url-encode word :utf-8))
                :additional-headers *wordnik-auth-headers*))
              "examples"))))
* Web scraping
;; scrape a forum thread: each comment's text plus the author's declared native language
(defmethod scrape ((site (eql :linguaholic)) source)
  (match-html source
    '(>> article
         (aside (>> a ($ user))
                (>> li (strong "Native Tongue:") ($ lang)))
         (div |...| (>> (div :data-role "commentContent")
                        ($ text) (span) |...|))
         !!!)))
Research
(quality)
* Simple task => simple models (NB)
* Challenges
- short texts
- mixed langs
- 90% of data is cryptic
ideas → experiments
Naive Bayes
* Features: 3-/4-char ngrams
* Improvement ideas:
- add words (word unigrams)
- factor in word lengths
- use Internet lang bias
Formula:
(argmax (* (? priors lang)
           (or (? word-probs word)
               (norm (reduce '* (word-3gs word)
                             :key ^(? 3g-probs %)))))
        langs)
http://www.paulgraham.com/spam.html
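The formula uses RUTILS shorthand (^ is a lambda with % as its argument, ? is generic element access). A plain Common Lisp rendering may be easier to read; *priors*, *word-probs* and *3g-probs* stand in for the trained model, the 1e-7 smoothing floor is arbitrary, and the length normalization (norm) is left out:

(defparameter *priors* (make-hash-table))     ; lang -> P(lang)
(defparameter *word-probs* (make-hash-table)) ; lang -> {word -> P(word|lang)}
(defparameter *3g-probs* (make-hash-table))   ; lang -> {3gram -> P(3gram|lang)}

(defun word-3gs (word)
  "Character trigrams of WORD."
  (loop :for i :from 0 :to (- (length word) 3)
        :collect (subseq word i (+ i 3))))

(defun word-score (word lang)
  "P(lang) * P(word|lang), backing off to a product of trigram probs.
The slide version additionally length-normalizes the product (NORM)."
  (let ((word-tbl (gethash lang *word-probs*))
        (3g-tbl (gethash lang *3g-probs*)))
    (* (gethash lang *priors* 0)
       (or (and word-tbl (gethash word word-tbl))
           (reduce #'* (word-3gs word)
                   :initial-value 1
                   :key (lambda (ngram)
                          ;; 1e-7 = arbitrary floor for unseen trigrams
                          (or (and 3g-tbl (gethash ngram 3g-tbl)) 1e-7)))))))

(defun detect-lang (word langs)
  "The ARGMAX from the formula: the language with the highest score."
  (reduce (lambda (a b)
            (if (> (word-score word a) (word-score word b)) a b))
          langs))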
Experiments
* Usual ML setup (70:30) doesn't
work here
* If you torment the data too
much... (~c) Yaser Abu-Mostafa
* Comparison with existing systems
helps
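A sketch of how the per-language rows on the next slide can be computed from (gold . predicted) pairs; the data layout here is an assumption:

(defun confusion-row (gold-lang pairs)
  "Fraction of GOLD-LANG samples predicted as each language, sorted."
  (let ((counts (make-hash-table))
        (total 0))
    (dolist (pair pairs)
      (when (eq (car pair) gold-lang)
        (incf total)
        (incf (gethash (cdr pair) counts 0))))
    (let ((row ()))
      (maphash (lambda (lang n) (push (cons lang (/ n total)) row)) counts)
      (sort row #'> :key #'cdr))))

;; (confusion-row :bg '((:bg . :bg) (:bg . :ru) (:en . :en)))
;; => ((:BG . 1/2) (:RU . 1/2))  ; order of ties may vary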
Confusion Matrix
AB: 0.90 | FR:0.10
AF: 0.80 | EN:0.20
AK: 0.80 | NN:0.10 IT:0.10
AN: 0.90 | ES:0.10
AY: 0.90 | ES:0.10
BG: 0.60 | RU:0.40
BM: 0.80 | FR:0.10 LA:0.10
BS: 0.90 | EN:0.10
CO: 0.90 | IT:0.10
CR: 0.40 | FR:0.30 UND:0.20 MS:0.10
CS: 0.90 | IT:0.10
CU: 0.90 | VI:0.10
CV: 0.80 | RU:0.20
DA: 0.70 | FO:0.10 NO:0.10 NN:0.10
DV: 0.80 | UZ:0.10 EN:0.10
DZ: NIL | BO:0.80 IK:0.10 NE:0.10
EN: 0.90 | NL:0.10
ET: 0.80 | EN:0.20
FF: 0.50 | EN:0.20 FR:0.10 EO:0.10 SV:0.10
FI: 0.80 | FR:0.10 DA:0.10
FJ: 0.90 | OC:0.10
GL: 0.90 | ES:0.10
HA: 0.80 | YO:0.10 EN:0.10
HR: 0.70 | BS:0.10 DE:0.10 GL:0.10
ID: 0.80 | MS:0.20
IE: 0.90 | EN:0.10
IG: 0.60 | EN:0.40
IO: 0.86 | DA:0.14
KG: 0.90 | SW:0.10
KL: 0.90 | EN:0.10
KS: 0.30 | UR:0.60 UND:0.10
KU: 0.90 | EN:0.10
KW: 0.89 | UND:0.11
LA: 0.90 | FR:0.10
LB: 0.90 | EN:0.10
LG: 0.90 | IT:0.10
LI: 0.80 | NL:0.20
MI: 0.90 | ES:0.10
MK: 0.80 | IT:0.10 RU:0.10
MS: 0.80 | ID:0.10 EN:0.10
MT: 0.90 | DE:0.10
NO: 0.90 | DA:0.10
NY: 0.80 | AR:0.10 SW:0.10
OM: 0.90 | EN:0.10
OS: 0.90 | RU:0.10
QU: 0.70 | ES:0.20 EN:0.10
RM: 0.90 | EN:0.10
RN: 0.50 | RW:0.40 YO:0.10
SC: 0.90 | FR:0.10
SG: 0.90 | FR:0.10
SR: 0.80 | HR:0.10 BS:0.10
SS: 0.50 | EN:0.30 DA:0.10 ZU:0.10
ST: 0.90 | PT:0.10
SV: 0.90 | DA:0.10
TI: 0.40 | AM:0.40 LA:0.10 EN:0.10
TK: 0.80 | TR:0.20
TO: 0.50 | EN:0.50
TS: 0.80 | EN:0.10 UZ:0.10
TW: 0.40 | EN:0.40 AK:0.10 YO:0.10
TY: 0.90 | ES:0.10
UG: 0.60 | UZ:0.40
UK: 0.80 | UND:0.10 VI:0.10
VE: 0.90 | EN:0.10
WO: 0.80 | NL:0.10 FR:0.10
XH: 0.80 | UZ:0.10 EN:0.10
YO: 0.80 | EN:0.20
ZU: 0.60 | XH:0.30 PT:0.10
Total quality: 0.90
The Ladder of NLP
Rule-based
Linear ML
Decision Trees & co.
Sequence models
Artificial Neural networks
Better Models
What can be improved?
* Account for word order
* Discriminative models per script
* Deep Learning model
Marginal gain is not huge
Engineer
(efficiency)
* Just a small piece
of the pipeline:
- good-enough speed
- minimize space usage
- minimize external dependencies
* Proper floating-point calculations (see the log-space sketch below)
* Proper processing of big texts?
* Pre-/post-processing
* Clean API
implementation → optimization
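The "proper floating-point calculations" bullet above mostly concerns underflow: multiplying hundreds of small probabilities gives 0.0 in double-floats. One standard remedy, sketched here (not necessarily the exact one used in WILD), is to compare sums of log-probabilities instead of products:

;; (reduce #'* (make-list 1000 :initial-element 1d-4)) => 0.0d0 (underflow)
(defun log-sum-probs (probs)
  "Sum of (log p) over PROBS, clamped to avoid (log 0)."
  (reduce #'+ probs
          :key (lambda (p) (log (max p least-positive-double-float)))))
;; (log-sum-probs (make-list 1000 :initial-element 1d-4)) => ~-9210.34d0
;; The argmax over languages is unchanged, since log is monotonic.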
Model Optimization
Initial model size: ~1G
Target: ~10M :)
How to do it?
- Lossy compression: pruning (see the sketch below)
- Lossless compression:
Huffman coding, efficient data structures
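A sketch of the lossy half (pruning): drop ngram entries whose probability falls below a cutoff. The threshold is illustrative; in practice it is tuned as a size vs. quality trade-off, and lossless compression (Huffman coding, compact data structures) is applied to what remains:

(defun prune-probs (probs &key (threshold 1e-6))
  "Destructively remove entries of PROBS (ngram -> prob) below THRESHOLD."
  (maphash (lambda (ngram prob)
             (when (< prob threshold)
               ;; removing the key currently being traversed is allowed
               (remhash ngram probs)))
           probs)
  probs)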
API
* Levels of granularity:
- text-langs
- word-langs
- window?
* UI: library, REPL & Web APIs
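A hypothetical REPL session for the two levels of granularity named above; the function names come from the slide, but the signatures and return shapes are assumptions for illustration, not the actual API:

(text-langs "Il fait beau. Let's go to the beach!")
;; => ((:FR . 0.55) (:EN . 0.45))        ; ranked langs for the whole text

(word-langs "Il fait beau. Let's go to the beach!")
;; => (("Il" :FR) ("fait" :FR) ("beau" :FR) ("Let's" :EN) ("go" :EN) ...)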
Recap
* Triple view of any
knowledge-related problem
* Ladder of approaches to solving
NLP problems
* Importance of productive env:
general- & special-purpose
REPL, lang, API access to data,
efficient testing
* Main stages of problem solving:
data → experiment → implementation → optimization
