際際滷

際際滷Share a Scribd company logo
Introduction to Japanese
tokenizers
WebHack 2020-11-10
by Wanasit T.
About Me
¢ Github: @wanasit
$ Text / NLP projects
¢ Manager, Software Engineer @ Indeed
$ Search Quality (Metadata) team
$ Work on NLP problems for Jobs / Resumes
Disclaimer
1. This talk NOT related to any of Indeed¨s technology
2. I¨m not a Japanese (or a native-speaker)
$ But I built a Japanese tokenizer on my free time
Today Topics
¢ NLP and Tokenization (for Japanese)
¢ Lattice-based Tokenizers (MeCab -style tokenizers)
¢ How it works
$ Dictionary
$ Tokenization
NLP and Tokenization
NLP and Tokenization
¢ How does computer represent text?
¢ String (or Char[ ] or Byte[ ] )
* "Abc"
* "Hello World"
NLP and Tokenization
"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
Source: NBC News
NLP and Tokenization
"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
¢ What¨s the topic?
¢ Who is winning? where?
Source: NBC News
NLP and Tokenization
"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
¢ What¨s the topic?
¢ Who is winning? where?
Source: NBC News
NLP and Tokenization
¢ Tokenization / Segmentation
¢ The ?rst step to solve NLP problems is usually
identifying words from the string
$ Input: string, char[ ] (or byte[ ])
$ Output: a list of meaningful words (or tokens)
NLP and Tokenization
"Biden is projected winner in Michigan, Wisconsin as
tense nation watch final tally".split(/W+/)
> ["Biden", "is", "projected", "winner", "in", ...]
Japanese Tokenization
"バイデン箆がミシガン巒拈、寄yIにむけ^藍返"
Source: TBS News
Japanese Tokenization
"バイデン箆がミシガン巒拈、寄yIにむけ^藍返"
Source: TBS News
Japanese Tokenization
"バイデン箆がミシガン巒拈、寄yIにむけ^藍返"
¢ No punctuations
¢ Q: How do you split this into words?
Source: TBS News
Japanese Tokenization
¢ Use prior Japanese knowledge (Dictionary)
$ が, に, ´, 箆, 巒, ´, バイデン
¢ Consider the context and combination of characters
¢ Consider the likelihood
$ e.g. |奨脅 => [|奨, 脅], or [|, 奨脅]
Lattice-based Tokenizers
Lattice-based Tokenizers
¢ aka. MeCab -based tokenizer (or Viterbi tokenizer)
¢ How:
$ From a Dictionary (required)
$ Build a Lattice (or a graph) from surface dictionary terms
$ Run Viterbi algorithm to ?nd the best connected path
Lattice-Based Tokenizers
¢ Most tokenizers are MeCab (C/C++)¨s re-implementation on
different platforms:
$ Kuromoji, Sudachi (Java), Kotori (Kotlin)
$ Janome, SudachiPy (Python)
$ Kagome (Go)
$ ...
Non- Lattice-Based Tokenizers
¢ Is Lattice-based the only approach?
¢ Mostly yes, but there are also:
$ Juman++, Nagisa (RNN)
$ SentencePiece (Unsupervised, used in BERT)
¢ Out-of-scope of this presentation
How it works
> Dictionary
Dictionary
¢ Lattice-based tokenizers need dictionary
$ To recognize prede?ned terms and grammar
¢ Dictionaries are often can be downloaded as Plugins e.g.
$ $ brew install mecab
$ $ brew install mecab-ipadic
Dictionary
¢ Recommended beginner dictionary is MeCab¨s IPADIC
¢ Available from this website
Dictionary - Term Table / Lexicon / CSV ?les
Surface Form
Context ID
(left)
Context ID
(right)
Cost Type Form Spelling ...
|奨 1293 1293 3003 兆~ (place) - トウキョウ ...
奨脅 1293 1293 2135 兆~ (place) - キョウト ...
|奨V 1293 1293 8676 兆~ (place) - ヒガシキョウ
ヅカ
...
佩く 992 992 8852 嘖~ (v) 児云侘 イク ...
佩か 1002 1002 7754 嘖~ (v) 隆隼侘 イカ ...
いく 992 992 9672 嘖~ (v) 児云侘 イク ...
Dictionary - Term Table
¢ Surface Form: How the term should appear in the string
¢ Context ID (left/right): ID used for connecting terms
together (see. later)
¢ Cost: How commonly used the term
$ The more the cost, the less common or less likely
Dictionary - Connection Table / Connection Cost
Context ID
(from)
Context ID
(to)
Cost
... ...
992 992 3003
992 993 2135
... ...
992 1293 -1000
992 1294 -1000
... ...
¢ Connection cost between
type of terms.
¢ The lower, the more likely
¢ e.g.
¢ 992 (v-ru) then 992 (v-ru)
$ Cost = 3000 (unlikely)
¢ 992 (v-ru) then 1294 (noun)
$ Cost = -1000 (likely)
Dictionary - Term Table
Term table size:
¢ Kotori (default) ~380,000 terms (3.7 MB)
¢ MeCab-IPADict ~400,000 terms (12.2 MB)
¢ Sudachi - Small ~750,000 terms (39.8 MB)
¢ Sudachi - Full ~2,800,000 terms (121 MB)
Dictionary - Term Table
Term table size:
¢ Kotori (default) ~380,000 terms (3.7 MB)
¢ MeCab-IPADict ~400,000 terms (12.2 MB)
¢ Sudachi - Small ~750,000 terms (39.8 MB)
¢ Sudachi - Full ~2,800,000 terms (121 MB)
$ Include term like: "c(```)ノ"
Dictionary - Term Table
¢ What about words not in the table?
$ e.g. "ワナシット タナキットルンアン"
$ ^Unknown-Term Extraction ̄ Problem
$ Typically, some heuristic rules
* e.g. if there are consecutive katana, it¨s a Noun.
¢ Out-of-scope of this presentation
How it works
> Tokenization
Lattice-Based Tokenization
Given:
¢ The Dictionary
¢ Input:"|奨脅に廖む"
Tokenizer:
1. Find all terms in the input
and build a lattice
2. Find the minimum cost
path through the lattice
Step 1: Finding all terms
Step 1: Finding all terms
¢ For each index i-th
$ ?nd all terms in dictionary starting at i-th location
¢ String / Pattern Matching problem
$ Require e?cient lookup data structure for the dictionary
$ e.g. Trie, Finite-State-Transidual
Step 2: Finding minimum cost
¢ Viterbi Algorithm (Dynamic Programing)
¢ For each node from the left to right
$ Find the minimum cost path leading to that node
$ Reuse the selected path when consider the following
nodes
Summary
Introduction to Japanese Tokenizers
¢ Introduction to NLP and Tokenization
¢ Lattice-based tokenizers (MeCab and others)
$ Dictionary
* Term table, Connection Cost, ...
$ Tokenization Algorithms
* Pattern Matching, Viterbi Algorithm, ...
Learn more:
¢ Kotori (on Github), A Japanese tokenizer written in Kotlin
$ Small and performant (fastest among JVM-based)
$ Support multiple dictionary formats
¢ Article: How Japanese Tokenizers Work (by Wanasit)
¢ Article: 晩云Z侘B殆盾裂のY箸鰔く (by Cookpad Developer)
¢ Book: 徭隼冱ZI尖の児A (by Manabu Okumura)

More Related Content

What's hot (20)

プログラミングコンテストでの岱kアルゴリズム
プログラミングコンテストでの岱kアルゴリズムプログラミングコンテストでの岱kアルゴリズム
プログラミングコンテストでの岱kアルゴリズム
Takuya Akiba
?
寄号庁グラフ盾裂のための岱kスケッチ室隈
寄号庁グラフ盾裂のための岱kスケッチ室隈寄号庁グラフ盾裂のための岱kスケッチ室隈
寄号庁グラフ盾裂のための岱kスケッチ室隈
Takuya Akiba
?
AtCoder Beginner Contest 026 盾h
AtCoder Beginner Contest 026 盾hAtCoder Beginner Contest 026 盾h
AtCoder Beginner Contest 026 盾h
AtCoder Inc.
?
プログラミングコンテストでの強議柴鮫隈
プログラミングコンテストでの強議柴鮫隈プログラミングコンテストでの強議柴鮫隈
プログラミングコンテストでの強議柴鮫隈
Takuya Akiba
?
Amortize analysis of Deque with 2 Stack
Amortize analysis of Deque with 2 StackAmortize analysis of Deque with 2 Stack
Amortize analysis of Deque with 2 Stack
Ken Ogura
?
鳩楕議麼撹蛍蛍裂
鳩楕議麼撹蛍蛍裂鳩楕議麼撹蛍蛍裂
鳩楕議麼撹蛍蛍裂
Mika Yoshimura
?
猟忖双アルゴリズム
猟忖双アルゴリズム猟忖双アルゴリズム
猟忖双アルゴリズム
HCPC: 臼今祇寄僥室プログラミングサ`クル
?
Graphic Notes on Linear Algebra and Data Science
Graphic Notes on Linear Algebra and Data ScienceGraphic Notes on Linear Algebra and Data Science
Graphic Notes on Linear Algebra and Data Science
Kenji Hiranabe
?
C亠僥の容協娼業鯢呂里燭瓩篳恬撹圭隈 ゛AbemaTVのユ`ザ奉來容協゛
C亠僥の容協娼業鯢呂里燭瓩篳恬撹圭隈 ゛AbemaTVのユ`ザ奉來容協゛C亠僥の容協娼業鯢呂里燭瓩篳恬撹圭隈 ゛AbemaTVのユ`ザ奉來容協゛
C亠僥の容協娼業鯢呂里燭瓩篳恬撹圭隈 ゛AbemaTVのユ`ザ奉來容協゛
Masatoshi Abe
?
2何グラフの恷弌泣瓜顕の箔め圭
2何グラフの恷弌泣瓜顕の箔め圭2何グラフの恷弌泣瓜顕の箔め圭
2何グラフの恷弌泣瓜顕の箔め圭
Kensuke Otsuki
?
喪モジュラ恷癖晒と字亠僥楼1嫗
喪モジュラ恷癖晒と字亠僥楼1嫗喪モジュラ恷癖晒と字亠僥楼1嫗
喪モジュラ恷癖晒と字亠僥楼1嫗
Hakky St
?
鴛酷皆を聞ったフラクタルの宙鮫
鴛酷皆を聞ったフラクタルの宙鮫鴛酷皆を聞ったフラクタルの宙鮫
鴛酷皆を聞ったフラクタルの宙鮫
Yu(u)ki IWABUCHI
?
強議柴鮫隈を自める
強議柴鮫隈を自める強議柴鮫隈を自める
強議柴鮫隈を自める
HCPC: 臼今祇寄僥室プログラミングサ`クル
?
方塀を聞わずイメ`ジで尖盾する掘珂アルゴリズム
方塀を聞わずイメ`ジで尖盾する掘珂アルゴリズム方塀を聞わずイメ`ジで尖盾する掘珂アルゴリズム
方塀を聞わずイメ`ジで尖盾する掘珂アルゴリズム
圍 安弥
?
黍議伏撹ネットワ`ク┳甸 ̄沓
黍議伏撹ネットワ`ク┳甸 ̄沓黍議伏撹ネットワ`ク┳甸 ̄沓
黍議伏撹ネットワ`ク┳甸 ̄沓
cvpaper. challenge
?
佛勸仝距睦Q賀デ`タの由柴親僥々及12嫗
佛勸仝距睦Q賀デ`タの由柴親僥々及12嫗佛勸仝距睦Q賀デ`タの由柴親僥々及12嫗
佛勸仝距睦Q賀デ`タの由柴親僥々及12嫗
Shuyo Nakatani
?
ACPC 2018 Day3 G: 指猟何蛍双 (Palindromic Subsequences)
ACPC 2018 Day3 G: 指猟何蛍双 (Palindromic Subsequences)ACPC 2018 Day3 G: 指猟何蛍双 (Palindromic Subsequences)
ACPC 2018 Day3 G: 指猟何蛍双 (Palindromic Subsequences)
HCPC: 臼今祇寄僥室プログラミングサ`クル
?
x科弼
x科弼x科弼
x科弼
Ken Ogura
?
プログラミングコンテストでの岱kアルゴリズム
プログラミングコンテストでの岱kアルゴリズムプログラミングコンテストでの岱kアルゴリズム
プログラミングコンテストでの岱kアルゴリズム
Takuya Akiba
?
寄号庁グラフ盾裂のための岱kスケッチ室隈
寄号庁グラフ盾裂のための岱kスケッチ室隈寄号庁グラフ盾裂のための岱kスケッチ室隈
寄号庁グラフ盾裂のための岱kスケッチ室隈
Takuya Akiba
?
AtCoder Beginner Contest 026 盾h
AtCoder Beginner Contest 026 盾hAtCoder Beginner Contest 026 盾h
AtCoder Beginner Contest 026 盾h
AtCoder Inc.
?
プログラミングコンテストでの強議柴鮫隈
プログラミングコンテストでの強議柴鮫隈プログラミングコンテストでの強議柴鮫隈
プログラミングコンテストでの強議柴鮫隈
Takuya Akiba
?
Amortize analysis of Deque with 2 Stack
Amortize analysis of Deque with 2 StackAmortize analysis of Deque with 2 Stack
Amortize analysis of Deque with 2 Stack
Ken Ogura
?
Graphic Notes on Linear Algebra and Data Science
Graphic Notes on Linear Algebra and Data ScienceGraphic Notes on Linear Algebra and Data Science
Graphic Notes on Linear Algebra and Data Science
Kenji Hiranabe
?
C亠僥の容協娼業鯢呂里燭瓩篳恬撹圭隈 ゛AbemaTVのユ`ザ奉來容協゛
C亠僥の容協娼業鯢呂里燭瓩篳恬撹圭隈 ゛AbemaTVのユ`ザ奉來容協゛C亠僥の容協娼業鯢呂里燭瓩篳恬撹圭隈 ゛AbemaTVのユ`ザ奉來容協゛
C亠僥の容協娼業鯢呂里燭瓩篳恬撹圭隈 ゛AbemaTVのユ`ザ奉來容協゛
Masatoshi Abe
?
2何グラフの恷弌泣瓜顕の箔め圭
2何グラフの恷弌泣瓜顕の箔め圭2何グラフの恷弌泣瓜顕の箔め圭
2何グラフの恷弌泣瓜顕の箔め圭
Kensuke Otsuki
?
喪モジュラ恷癖晒と字亠僥楼1嫗
喪モジュラ恷癖晒と字亠僥楼1嫗喪モジュラ恷癖晒と字亠僥楼1嫗
喪モジュラ恷癖晒と字亠僥楼1嫗
Hakky St
?
鴛酷皆を聞ったフラクタルの宙鮫
鴛酷皆を聞ったフラクタルの宙鮫鴛酷皆を聞ったフラクタルの宙鮫
鴛酷皆を聞ったフラクタルの宙鮫
Yu(u)ki IWABUCHI
?
方塀を聞わずイメ`ジで尖盾する掘珂アルゴリズム
方塀を聞わずイメ`ジで尖盾する掘珂アルゴリズム方塀を聞わずイメ`ジで尖盾する掘珂アルゴリズム
方塀を聞わずイメ`ジで尖盾する掘珂アルゴリズム
圍 安弥
?
黍議伏撹ネットワ`ク┳甸 ̄沓
黍議伏撹ネットワ`ク┳甸 ̄沓黍議伏撹ネットワ`ク┳甸 ̄沓
黍議伏撹ネットワ`ク┳甸 ̄沓
cvpaper. challenge
?
佛勸仝距睦Q賀デ`タの由柴親僥々及12嫗
佛勸仝距睦Q賀デ`タの由柴親僥々及12嫗佛勸仝距睦Q賀デ`タの由柴親僥々及12嫗
佛勸仝距睦Q賀デ`タの由柴親僥々及12嫗
Shuyo Nakatani
?

Similar to Introduction to japanese tokenizer (8)

Algorithms - A Sneak Peek
Algorithms - A Sneak PeekAlgorithms - A Sneak Peek
Algorithms - A Sneak Peek
BADR
?
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classification
shakimov
?
Deep Learning and Text Mining
Deep Learning and Text MiningDeep Learning and Text Mining
Deep Learning and Text Mining
Will Stanton
?
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
Yuki Saito
?
bigD2_relatiobigD3_mapReducenalDatabases.pdf
bigD2_relatiobigD3_mapReducenalDatabases.pdfbigD2_relatiobigD3_mapReducenalDatabases.pdf
bigD2_relatiobigD3_mapReducenalDatabases.pdf
sh5701
?
Shilpa shukla processing_text
Shilpa shukla processing_textShilpa shukla processing_text
Shilpa shukla processing_text
shilpashukla01
?
NLP in the Deep Learning Era: the story so far
NLP in the Deep Learning Era: the story so farNLP in the Deep Learning Era: the story so far
NLP in the Deep Learning Era: the story so far
Ilias Chalkidis
?
Data Structures Algorithm and Career Guidance
Data Structures Algorithm and Career GuidanceData Structures Algorithm and Career Guidance
Data Structures Algorithm and Career Guidance
SwapnilNarayan
?
Algorithms - A Sneak Peek
Algorithms - A Sneak PeekAlgorithms - A Sneak Peek
Algorithms - A Sneak Peek
BADR
?
Applications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and ClassificationApplications of Word Vectors in Text Retrieval and Classification
Applications of Word Vectors in Text Retrieval and Classification
shakimov
?
Deep Learning and Text Mining
Deep Learning and Text MiningDeep Learning and Text Mining
Deep Learning and Text Mining
Will Stanton
?
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
Yuki Saito
?
bigD2_relatiobigD3_mapReducenalDatabases.pdf
bigD2_relatiobigD3_mapReducenalDatabases.pdfbigD2_relatiobigD3_mapReducenalDatabases.pdf
bigD2_relatiobigD3_mapReducenalDatabases.pdf
sh5701
?
Shilpa shukla processing_text
Shilpa shukla processing_textShilpa shukla processing_text
Shilpa shukla processing_text
shilpashukla01
?
NLP in the Deep Learning Era: the story so far
NLP in the Deep Learning Era: the story so farNLP in the Deep Learning Era: the story so far
NLP in the Deep Learning Era: the story so far
Ilias Chalkidis
?
Data Structures Algorithm and Career Guidance
Data Structures Algorithm and Career GuidanceData Structures Algorithm and Career Guidance
Data Structures Algorithm and Career Guidance
SwapnilNarayan
?

More from Fangda Wang (11)

[WWCode] How aware are you of your deciding model?
[WWCode] How aware are you of your deciding model?[WWCode] How aware are you of your deciding model?
[WWCode] How aware are you of your deciding model?
Fangda Wang
?
Under the hood of architecture interviews at indeed
Under the hood of architecture interviews at indeedUnder the hood of architecture interviews at indeed
Under the hood of architecture interviews at indeed
Fangda Wang
?
How Indeed asks coding interview questions
How Indeed asks coding interview questionsHow Indeed asks coding interview questions
How Indeed asks coding interview questions
Fangda Wang
?
Types are eating the world
Types are eating the worldTypes are eating the world
Types are eating the world
Fangda Wang
?
From ic to tech lead
From ic to tech leadFrom ic to tech lead
From ic to tech lead
Fangda Wang
?
Gentle Introduction to Scala
Gentle Introduction to ScalaGentle Introduction to Scala
Gentle Introduction to Scala
Fangda Wang
?
To pair or not to pair
To pair or not to pairTo pair or not to pair
To pair or not to pair
Fangda Wang
?
Balanced Team
Balanced TeamBalanced Team
Balanced Team
Fangda Wang
?
Functional programming and Elm
Functional programming and ElmFunctional programming and Elm
Functional programming and Elm
Fangda Wang
?
Elm at large (companies)
Elm at large (companies)Elm at large (companies)
Elm at large (companies)
Fangda Wang
?
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
Fangda Wang
?
[WWCode] How aware are you of your deciding model?
[WWCode] How aware are you of your deciding model?[WWCode] How aware are you of your deciding model?
[WWCode] How aware are you of your deciding model?
Fangda Wang
?
Under the hood of architecture interviews at indeed
Under the hood of architecture interviews at indeedUnder the hood of architecture interviews at indeed
Under the hood of architecture interviews at indeed
Fangda Wang
?
How Indeed asks coding interview questions
How Indeed asks coding interview questionsHow Indeed asks coding interview questions
How Indeed asks coding interview questions
Fangda Wang
?
Types are eating the world
Types are eating the worldTypes are eating the world
Types are eating the world
Fangda Wang
?
From ic to tech lead
From ic to tech leadFrom ic to tech lead
From ic to tech lead
Fangda Wang
?
Gentle Introduction to Scala
Gentle Introduction to ScalaGentle Introduction to Scala
Gentle Introduction to Scala
Fangda Wang
?
To pair or not to pair
To pair or not to pairTo pair or not to pair
To pair or not to pair
Fangda Wang
?
Functional programming and Elm
Functional programming and ElmFunctional programming and Elm
Functional programming and Elm
Fangda Wang
?
Elm at large (companies)
Elm at large (companies)Elm at large (companies)
Elm at large (companies)
Fangda Wang
?
Data science tools of the trade
Data science tools of the tradeData science tools of the trade
Data science tools of the trade
Fangda Wang
?

Recently uploaded (20)

AI, Tariffs and Supply Chains in Knowledge Graphs
AI, Tariffs and Supply Chains in Knowledge GraphsAI, Tariffs and Supply Chains in Knowledge Graphs
AI, Tariffs and Supply Chains in Knowledge Graphs
Max De Marzi
?
Optimization of Cumulative Energy, Exergy Consumption and Environmental Life ...
Optimization of Cumulative Energy, Exergy Consumption and Environmental Life ...Optimization of Cumulative Energy, Exergy Consumption and Environmental Life ...
Optimization of Cumulative Energy, Exergy Consumption and Environmental Life ...
J. Agricultural Machinery
?
Env and Water Supply Engg._Dr. Hasan.pdf
Env and Water Supply Engg._Dr. Hasan.pdfEnv and Water Supply Engg._Dr. Hasan.pdf
Env and Water Supply Engg._Dr. Hasan.pdf
MahmudHasan747870
?
Air pollution is contamination of the indoor or outdoor environment by any ch...
Air pollution is contamination of the indoor or outdoor environment by any ch...Air pollution is contamination of the indoor or outdoor environment by any ch...
Air pollution is contamination of the indoor or outdoor environment by any ch...
dhanashree78
?
CS3451-OPERATING-SYSTEM NOTES ALL123.pdf
CS3451-OPERATING-SYSTEM NOTES ALL123.pdfCS3451-OPERATING-SYSTEM NOTES ALL123.pdf
CS3451-OPERATING-SYSTEM NOTES ALL123.pdf
PonniS7
?
google_developer_group_ramdeobaba_university_EXPLORE_PPT
google_developer_group_ramdeobaba_university_EXPLORE_PPTgoogle_developer_group_ramdeobaba_university_EXPLORE_PPT
google_developer_group_ramdeobaba_university_EXPLORE_PPT
JayeshShete1
?
How Engineering Model Making Brings Designs to Life.pdf
How Engineering Model Making Brings Designs to Life.pdfHow Engineering Model Making Brings Designs to Life.pdf
How Engineering Model Making Brings Designs to Life.pdf
Maadhu Creatives-Model Making Company
?
decarbonization steel industry rev1.pptx
decarbonization steel industry rev1.pptxdecarbonization steel industry rev1.pptx
decarbonization steel industry rev1.pptx
gonzalezolabarriaped
?
Mathematics_behind_machine_learning_INT255.pptx
Mathematics_behind_machine_learning_INT255.pptxMathematics_behind_machine_learning_INT255.pptx
Mathematics_behind_machine_learning_INT255.pptx
ppkmurthy2006
?
Sachpazis: Foundation Analysis and Design: Single Piles
Sachpazis: Foundation Analysis and Design: Single PilesSachpazis: Foundation Analysis and Design: Single Piles
Sachpazis: Foundation Analysis and Design: Single Piles
Dr.Costas Sachpazis
?
US Patented ReGenX Generator, ReGen-X Quatum Motor EV Regenerative Accelerati...
US Patented ReGenX Generator, ReGen-X Quatum Motor EV Regenerative Accelerati...US Patented ReGenX Generator, ReGen-X Quatum Motor EV Regenerative Accelerati...
US Patented ReGenX Generator, ReGen-X Quatum Motor EV Regenerative Accelerati...
Thane Heins NOBEL PRIZE WINNING ENERGY RESEARCHER
?
Mathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptx
Mathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptxMathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptx
Mathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptx
ppkmurthy2006
?
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
slayshadow705
?
Cloud Computing concepts and technologies
Cloud Computing concepts and technologiesCloud Computing concepts and technologies
Cloud Computing concepts and technologies
ssuser4c9444
?
How to Make an RFID Door Lock System using Arduino
How to Make an RFID Door Lock System using ArduinoHow to Make an RFID Door Lock System using Arduino
How to Make an RFID Door Lock System using Arduino
CircuitDigest
?
Integration of Additive Manufacturing (AM) with IoT : A Smart Manufacturing A...
Integration of Additive Manufacturing (AM) with IoT : A Smart Manufacturing A...Integration of Additive Manufacturing (AM) with IoT : A Smart Manufacturing A...
Integration of Additive Manufacturing (AM) with IoT : A Smart Manufacturing A...
ASHISHDESAI85
?
Piping-and-pipeline-calculations-manual.pdf
Piping-and-pipeline-calculations-manual.pdfPiping-and-pipeline-calculations-manual.pdf
Piping-and-pipeline-calculations-manual.pdf
OMI0721
?
TM-ASP-101-RF_Air Press manual crimping machine.pdf
TM-ASP-101-RF_Air Press manual crimping machine.pdfTM-ASP-101-RF_Air Press manual crimping machine.pdf
TM-ASP-101-RF_Air Press manual crimping machine.pdf
ChungLe60
?
Multi objective genetic approach with Ranking
Multi objective genetic approach with RankingMulti objective genetic approach with Ranking
Multi objective genetic approach with Ranking
namisha18
?
Wireless-Charger presentation for seminar .pdf
Wireless-Charger presentation for seminar .pdfWireless-Charger presentation for seminar .pdf
Wireless-Charger presentation for seminar .pdf
AbhinandanMishra30
?
AI, Tariffs and Supply Chains in Knowledge Graphs
AI, Tariffs and Supply Chains in Knowledge GraphsAI, Tariffs and Supply Chains in Knowledge Graphs
AI, Tariffs and Supply Chains in Knowledge Graphs
Max De Marzi
?
Optimization of Cumulative Energy, Exergy Consumption and Environmental Life ...
Optimization of Cumulative Energy, Exergy Consumption and Environmental Life ...Optimization of Cumulative Energy, Exergy Consumption and Environmental Life ...
Optimization of Cumulative Energy, Exergy Consumption and Environmental Life ...
J. Agricultural Machinery
?
Env and Water Supply Engg._Dr. Hasan.pdf
Env and Water Supply Engg._Dr. Hasan.pdfEnv and Water Supply Engg._Dr. Hasan.pdf
Env and Water Supply Engg._Dr. Hasan.pdf
MahmudHasan747870
?
Air pollution is contamination of the indoor or outdoor environment by any ch...
Air pollution is contamination of the indoor or outdoor environment by any ch...Air pollution is contamination of the indoor or outdoor environment by any ch...
Air pollution is contamination of the indoor or outdoor environment by any ch...
dhanashree78
?
CS3451-OPERATING-SYSTEM NOTES ALL123.pdf
CS3451-OPERATING-SYSTEM NOTES ALL123.pdfCS3451-OPERATING-SYSTEM NOTES ALL123.pdf
CS3451-OPERATING-SYSTEM NOTES ALL123.pdf
PonniS7
?
google_developer_group_ramdeobaba_university_EXPLORE_PPT
google_developer_group_ramdeobaba_university_EXPLORE_PPTgoogle_developer_group_ramdeobaba_university_EXPLORE_PPT
google_developer_group_ramdeobaba_university_EXPLORE_PPT
JayeshShete1
?
decarbonization steel industry rev1.pptx
decarbonization steel industry rev1.pptxdecarbonization steel industry rev1.pptx
decarbonization steel industry rev1.pptx
gonzalezolabarriaped
?
Mathematics_behind_machine_learning_INT255.pptx
Mathematics_behind_machine_learning_INT255.pptxMathematics_behind_machine_learning_INT255.pptx
Mathematics_behind_machine_learning_INT255.pptx
ppkmurthy2006
?
Sachpazis: Foundation Analysis and Design: Single Piles
Sachpazis: Foundation Analysis and Design: Single PilesSachpazis: Foundation Analysis and Design: Single Piles
Sachpazis: Foundation Analysis and Design: Single Piles
Dr.Costas Sachpazis
?
Mathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptx
Mathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptxMathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptx
Mathematics behind machine learning INT255 INT255__Unit 3__PPT-1.pptx
ppkmurthy2006
?
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
Structural QA/QC Inspection in KRP 401600 | Copper Processing Plant-3 (MOF-3)...
slayshadow705
?
Cloud Computing concepts and technologies
Cloud Computing concepts and technologiesCloud Computing concepts and technologies
Cloud Computing concepts and technologies
ssuser4c9444
?
How to Make an RFID Door Lock System using Arduino
How to Make an RFID Door Lock System using ArduinoHow to Make an RFID Door Lock System using Arduino
How to Make an RFID Door Lock System using Arduino
CircuitDigest
?
Integration of Additive Manufacturing (AM) with IoT : A Smart Manufacturing A...
Integration of Additive Manufacturing (AM) with IoT : A Smart Manufacturing A...Integration of Additive Manufacturing (AM) with IoT : A Smart Manufacturing A...
Integration of Additive Manufacturing (AM) with IoT : A Smart Manufacturing A...
ASHISHDESAI85
?
Piping-and-pipeline-calculations-manual.pdf
Piping-and-pipeline-calculations-manual.pdfPiping-and-pipeline-calculations-manual.pdf
Piping-and-pipeline-calculations-manual.pdf
OMI0721
?
TM-ASP-101-RF_Air Press manual crimping machine.pdf
TM-ASP-101-RF_Air Press manual crimping machine.pdfTM-ASP-101-RF_Air Press manual crimping machine.pdf
TM-ASP-101-RF_Air Press manual crimping machine.pdf
ChungLe60
?
Multi objective genetic approach with Ranking
Multi objective genetic approach with RankingMulti objective genetic approach with Ranking
Multi objective genetic approach with Ranking
namisha18
?
Wireless-Charger presentation for seminar .pdf
Wireless-Charger presentation for seminar .pdfWireless-Charger presentation for seminar .pdf
Wireless-Charger presentation for seminar .pdf
AbhinandanMishra30
?

Introduction to japanese tokenizer

  • 2. About Me ¢ Github: @wanasit $ Text / NLP projects ¢ Manager, Software Engineer @ Indeed $ Search Quality (Metadata) team $ Work on NLP problems for Jobs / Resumes
  • 3. Disclaimer 1. This talk NOT related to any of Indeed¨s technology 2. I¨m not a Japanese (or a native-speaker) $ But I built a Japanese tokenizer on my free time
  • 4. Today Topics ¢ NLP and Tokenization (for Japanese) ¢ Lattice-based Tokenizers (MeCab -style tokenizers) ¢ How it works $ Dictionary $ Tokenization
  • 6. NLP and Tokenization ¢ How does computer represent text? ¢ String (or Char[ ] or Byte[ ] ) * "Abc" * "Hello World"
  • 7. NLP and Tokenization "Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally" Source: NBC News
  • 8. NLP and Tokenization "Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally" ¢ What¨s the topic? ¢ Who is winning? where? Source: NBC News
  • 9. NLP and Tokenization "Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally" ¢ What¨s the topic? ¢ Who is winning? where? Source: NBC News
  • 10. NLP and Tokenization ¢ Tokenization / Segmentation ¢ The ?rst step to solve NLP problems is usually identifying words from the string $ Input: string, char[ ] (or byte[ ]) $ Output: a list of meaningful words (or tokens)
  • 11. NLP and Tokenization "Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally".split(/W+/) > ["Biden", "is", "projected", "winner", "in", ...]
  • 14. Japanese Tokenization "バイデン箆がミシガン巒拈、寄yIにむけ^藍返" ¢ No punctuations ¢ Q: How do you split this into words? Source: TBS News
  • 15. Japanese Tokenization ¢ Use prior Japanese knowledge (Dictionary) $ が, に, ´, 箆, 巒, ´, バイデン ¢ Consider the context and combination of characters ¢ Consider the likelihood $ e.g. |奨脅 => [|奨, 脅], or [|, 奨脅]
  • 17. Lattice-based Tokenizers ¢ aka. MeCab -based tokenizer (or Viterbi tokenizer) ¢ How: $ From a Dictionary (required) $ Build a Lattice (or a graph) from surface dictionary terms $ Run Viterbi algorithm to ?nd the best connected path
  • 18. Lattice-Based Tokenizers ¢ Most tokenizers are MeCab (C/C++)¨s re-implementation on different platforms: $ Kuromoji, Sudachi (Java), Kotori (Kotlin) $ Janome, SudachiPy (Python) $ Kagome (Go) $ ...
  • 19. Non- Lattice-Based Tokenizers ¢ Is Lattice-based the only approach? ¢ Mostly yes, but there are also: $ Juman++, Nagisa (RNN) $ SentencePiece (Unsupervised, used in BERT) ¢ Out-of-scope of this presentation
  • 20. How it works > Dictionary
  • 21. Dictionary ¢ Lattice-based tokenizers need dictionary $ To recognize prede?ned terms and grammar ¢ Dictionaries are often can be downloaded as Plugins e.g. $ $ brew install mecab $ $ brew install mecab-ipadic
  • 22. Dictionary ¢ Recommended beginner dictionary is MeCab¨s IPADIC ¢ Available from this website
  • 23. Dictionary - Term Table / Lexicon / CSV ?les Surface Form Context ID (left) Context ID (right) Cost Type Form Spelling ... |奨 1293 1293 3003 兆~ (place) - トウキョウ ... 奨脅 1293 1293 2135 兆~ (place) - キョウト ... |奨V 1293 1293 8676 兆~ (place) - ヒガシキョウ ヅカ ... 佩く 992 992 8852 嘖~ (v) 児云侘 イク ... 佩か 1002 1002 7754 嘖~ (v) 隆隼侘 イカ ... いく 992 992 9672 嘖~ (v) 児云侘 イク ...
  • 24. Dictionary - Term Table ¢ Surface Form: How the term should appear in the string ¢ Context ID (left/right): ID used for connecting terms together (see. later) ¢ Cost: How commonly used the term $ The more the cost, the less common or less likely
  • 25. Dictionary - Connection Table / Connection Cost Context ID (from) Context ID (to) Cost ... ... 992 992 3003 992 993 2135 ... ... 992 1293 -1000 992 1294 -1000 ... ... ¢ Connection cost between type of terms. ¢ The lower, the more likely ¢ e.g. ¢ 992 (v-ru) then 992 (v-ru) $ Cost = 3000 (unlikely) ¢ 992 (v-ru) then 1294 (noun) $ Cost = -1000 (likely)
  • 26. Dictionary - Term Table Term table size: ¢ Kotori (default) ~380,000 terms (3.7 MB) ¢ MeCab-IPADict ~400,000 terms (12.2 MB) ¢ Sudachi - Small ~750,000 terms (39.8 MB) ¢ Sudachi - Full ~2,800,000 terms (121 MB)
  • 27. Dictionary - Term Table Term table size: ¢ Kotori (default) ~380,000 terms (3.7 MB) ¢ MeCab-IPADict ~400,000 terms (12.2 MB) ¢ Sudachi - Small ~750,000 terms (39.8 MB) ¢ Sudachi - Full ~2,800,000 terms (121 MB) $ Include term like: "c(```)ノ"
  • 28. Dictionary - Term Table ¢ What about words not in the table? $ e.g. "ワナシット タナキットルンアン" $ ^Unknown-Term Extraction ̄ Problem $ Typically, some heuristic rules * e.g. if there are consecutive katana, it¨s a Noun. ¢ Out-of-scope of this presentation
  • 29. How it works > Tokenization
  • 30. Lattice-Based Tokenization Given: ¢ The Dictionary ¢ Input:"|奨脅に廖む" Tokenizer: 1. Find all terms in the input and build a lattice 2. Find the minimum cost path through the lattice
  • 31. Step 1: Finding all terms
  • 32. Step 1: Finding all terms ¢ For each index i-th $ ?nd all terms in dictionary starting at i-th location ¢ String / Pattern Matching problem $ Require e?cient lookup data structure for the dictionary $ e.g. Trie, Finite-State-Transidual
  • 33. Step 2: Finding minimum cost ¢ Viterbi Algorithm (Dynamic Programing) ¢ For each node from the left to right $ Find the minimum cost path leading to that node $ Reuse the selected path when consider the following nodes
  • 35. Introduction to Japanese Tokenizers ¢ Introduction to NLP and Tokenization ¢ Lattice-based tokenizers (MeCab and others) $ Dictionary * Term table, Connection Cost, ... $ Tokenization Algorithms * Pattern Matching, Viterbi Algorithm, ...
  • 36. Learn more: ¢ Kotori (on Github), A Japanese tokenizer written in Kotlin $ Small and performant (fastest among JVM-based) $ Support multiple dictionary formats ¢ Article: How Japanese Tokenizers Work (by Wanasit) ¢ Article: 晩云Z侘B殆盾裂のY箸鰔く (by Cookpad Developer) ¢ Book: 徭隼冱ZI尖の児A (by Manabu Okumura)