4. Langid Problem
* 150+ langs in Wikipedia
* >10 writing systems
(script/alphabet) in active use
* script-to-lang mapping: 1:1, 1:2, 1:n, n:1 :)
* Latin >50 langs, Cyrillic >20
* Long texts are easy, short ones are hard
* Internet texts (mixed langs)
* Small task => resource-constrained
8. YALI WILD
* Existing langid tools all use weak models
* Wanted to use Wiktionary
150+ languages, always evolving
* Wanted to do it in Lisp
9. Linguistics
(domain knowledge)
* Polyglots?
* ISO 639
* Internet lang bias
https://en.wikipedia.org/wiki/Languages_used_on_the_Internet
* Rule-based ideas (sketch below):
- 1:1/1:2 scripts
- unique letters
* Per-script/per-lang segmentation
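A minimal sketch of the rule-based ideas above (the tables, Unicode ranges,
and function names are illustrative, not YALI's actual code): a unique
letter, or a script used by only one language, decides the answer with no
statistics at all.

;; Scripts that map to exactly one language, and letters unique to one
;; language within a shared script (illustrative entries).
(defparameter *single-lang-scripts*
  '((:georgian . :ka) (:armenian . :hy) (:thai . :th)))

(defparameter *unique-letters*
  '((#\ґ . :uk)    ; Ukrainian-only Cyrillic letter
    (#\ß . :de)))  ; German-only Latin letter

(defun char-script (char)
  "Very rough script detection by Unicode block."
  (let ((code (char-code char)))
    (cond ((<= #x10A0 code #x10FF) :georgian)
          ((<= #x0530 code #x058F) :armenian)
          ((<= #x0E00 code #x0E7F) :thai)
          ((<= #x0400 code #x04FF) :cyrillic)
          ((<= #x0041 code #x024F) :latin))))

(defun rule-based-lang (text)
  "Return a language keyword if the rules decide unambiguously, else NIL."
  (loop for char across text
        for unique = (cdr (assoc char *unique-letters*))
        for script-lang = (cdr (assoc (char-script char) *single-lang-scripts*))
        when unique return unique
        when script-lang return script-lang))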
insight
data
10. Data
* evaluation data:
- smoke test
- in-/out-of-domain data
- precision-/recall-oriented
* training data
- where to get? Wikidata
- how to get? SAX parsing
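A sketch of the "how": stream the XML dump through a SAX-style handler so
the whole file never sits in memory. This assumes the CXML library and the
MediaWiki export schema's <text> element; the file name and callback are
placeholders, and YALI's actual parsing code may differ.

;; Requires (ql:quickload "cxml").
(defclass page-text-handler (sax:default-handler)
  ((in-text :initform nil)
   (callback :initarg :callback)))

(defmethod sax:start-element ((h page-text-handler) ns local-name qname attrs)
  (declare (ignore ns qname attrs))
  (when (string= local-name "text")
    (setf (slot-value h 'in-text) t)))

(defmethod sax:characters ((h page-text-handler) data)
  (when (slot-value h 'in-text)
    (funcall (slot-value h 'callback) data)))

(defmethod sax:end-element ((h page-text-handler) ns local-name qname)
  (declare (ignore ns qname))
  (when (string= local-name "text")
    (setf (slot-value h 'in-text) nil)))

;; Usage: dump every article-text chunk to standard output.
(cxml:parse-file "wiki-dump.xml"
                 (make-instance 'page-text-handler
                                :callback (lambda (chunk)
                                            (write-string chunk *standard-output*))))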
11. Wiktionary
* good source for various
dictionaries and word lists (word
forms, definitions, synonyms, …)
* ~100 langs
15. Research
(quality)
* Simple task => simple models (NB)
* Challenges
- short texts
- mixed langs
- 90% of the data is cryptic
ideas
experiments
16. Naive Bayes
* Features: 3-/4-char ngrams
* Improvement ideas:
- add words (word unigrams)
- factor in word lengths
- use Internet lang bias
Formula:
(argmax (* (? priors lang)
           (or (? word-probs word)
               (norm (reduce '* (word-3gs word)
                             :key ^(? 3g-probs %)))))
        langs)
http://www.paulgraham.com/spam.html
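The same scoring expanded into plain Common Lisp (hash tables instead of the
rutils `?` accessor; the table layout, the fallback smoothing constant, and
the omitted normalization of the 3-gram product are illustrative, not YALI's
exact code):

;; priors: lang -> P(lang); word-probs / 3g-probs: EQUAL hash tables keyed
;; by (lang . word) and (lang . trigram) respectively.
(defun word-3gs (word)
  "All character 3-grams of WORD."
  (loop for i from 0 to (- (length word) 3)
        collect (subseq word i (+ i 3))))

(defun word-score (lang word priors word-probs 3g-probs)
  "Prior * P(word|lang), falling back to the product of 3-gram probabilities."
  (* (gethash lang priors 0)
     (or (gethash (cons lang word) word-probs)
         (reduce #'* (word-3gs word)
                 :key (lambda (tri) (gethash (cons lang tri) 3g-probs 1e-7))
                 :initial-value 1))))

(defun argmax-lang (word langs priors word-probs 3g-probs)
  "The argmax over LANGS from the formula above."
  (loop with best and best-score = -1
        for lang in langs
        for score = (word-score lang word priors word-probs 3g-probs)
        when (> score best-score) do (setf best lang best-score score)
        finally (return best)))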
18. Experiments
* Usual ML setup (70:30) doesn't
work here
* If you torture the data too
much... (~c) Yaser Abu-Mostafa
* Comparison with existing systems
helps
20. The Ladder of NLP
Rule-based
Linear ML
Decision Trees & co.
Sequence models
Artificial Neural networks
21. Better Models
What can be improved?
* Account for word order
* Discriminative models per script
* Deep Learning model
Marginal gain is not huge
22. Engineer
(efficiency)
* Just a small piece
of the pipeline:
- good-enough speed
- minimize space usage
- minimize external dependencies
* Proper floating-point calculations (see sketch below)
* Proper processing of big texts?
* Pre-/post-processing
* Clean API
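"Proper floating-point calculations" is mostly about underflow: a product of
hundreds of small probabilities collapses to 0.0, so scores are better
computed as sums of logs. A minimal illustration (not YALI's exact code):

;; A long product of small probabilities underflows to zero...
(reduce #'* (make-list 200 :initial-element 1d-3))   ; => 0.0d0

;; ...while the equivalent sum of logs stays well-behaved and preserves
;; the ranking between languages.
(defun log-score (probs)
  (reduce #'+ probs :key #'log))

(log-score (make-list 200 :initial-element 1d-3))    ; => approx. -1381.55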
implementation
optimization
23. Model Optimization
Initial model size: ~1G
Target: ~10M :)
How to do it?
- Lossy compression: pruning
- Lossless compression: Huffman
coding, efficient data structures
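A sketch of the lossy step, pruning (the threshold and table layout are
illustrative): drop rare n-grams and renormalize what is left.

(defun prune-probs (probs &key (threshold 1e-6))
  "Copy the ngram -> probability hash table PROBS, dropping entries below
THRESHOLD and renormalizing the remaining probability mass."
  (let ((kept (make-hash-table :test (hash-table-test probs)))
        (total 0))
    (maphash (lambda (ngram prob)
               (when (>= prob threshold)
                 (setf (gethash ngram kept) prob)
                 (incf total prob)))
             probs)
    (unless (zerop total)
      (maphash (lambda (ngram prob)
                 (setf (gethash ngram kept) (/ prob total)))
               kept))
    kept))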
24. API
* Levels of detail (example below):
- text-langs
- word-langs
- window?
* UI: library, REPL & Web APIs
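Hypothetical usage at the two levels of detail (the function names and
return shapes are made up for illustration, not the published API):

;; Whole-text distribution over languages.
(text-langs "Привіт world")
;; => ((:UK . 0.6) (:EN . 0.4))

;; Per-word guesses, which is what mixed-language internet text needs.
(word-langs "Привіт world")
;; => (("Привіт" . :UK) ("world" . :EN))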
25. Recap
* Triple view of any
knowledge-related problem
* Ladder of approaches to solving
NLP problems
* Importance of a productive env:
- general- & special-purpose lang
- REPL
- API access to data
- efficient testing
* Main stages of problem solving:
data → experiment → implementation → optimization