Institut für Anthropomatik, 24.06.13, Simon Hummel, Lehrstuhl Prof. Waibel
Grammatical Agreement in SMT
Seminar Sprach-zu-Sprach-Übersetzung (Speech-to-Speech Translation)
Summer semester 2013
Inflection and Agreement
Inflection
 Modification of a word
 Signals grammatical variants (tense, gender, case, ...)
 e.g. walk vs. walked
Agreement
 Inflections of related words in a sentence have to agree
 e.g. das Haus (correct) vs. *die Haus (article gender does not match the neuter noun)
Some languages are weakly inflected (e.g. English)
Some are highly inflected (e.g. German, Arabic, ...)
Agreement Errors
Local Agreement Errors
 Ref:  the-car[F] go[F] with-speed
 Hypo: the-car[F] go[M] with-speed
Long-distance Agreement Errors
 Ref:  celle qui parle, c'est ma femme
       one[F] who speak, is my wife[F]
 Hypo: celui qui parle est ma femme
       one[M] who speak is my spouse[F]
Approaches for SMT
Morphological Generation
 Generate bare stems, then apply a predicted inflection to each
Agreement Constraints
 Add agreement constraints to the target side of a synchronous context-free grammar (SCFG)
Class-based Agreement Model
 Use morphological word classes, e.g. Noun+Def+Sg+Fem
Morphological Generation: Idea
Generating Complex Morphology for Machine Translation (Minkov and Toutanova, 2007)
Convert MT output to a stem sequence
Predict an inflection for every stem
Predicted inflections should reflect the meaning and comply with agreement rules
Morphological Generation: Lexicons
Morphology analysis and generation
Operations:
 Stemming
 Inflection
 Morphological analysis
Create manually
Create automatically from data
Here: assumed as given
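The three lexicon operations can be illustrated with a small Python sketch. The `ToyLexicon` class, its method names, and the entries and tag strings are hypothetical and chosen only for illustration; the lexicons used in the cited work are far larger and are not reproduced here.

```python
from collections import defaultdict

# Illustrative entries only: (surface form, stem, morphological tag).
TOY_ENTRIES = [
    ("walk", "walk", "V+Pres"),
    ("walked", "walk", "V+Past"),
    ("walks", "walk", "V+Pres+3Sg"),
    ("Haus", "Haus", "N+Neut+Sg"),
    ("Hauses", "Haus", "N+Neut+Sg+Gen"),
]


class ToyLexicon:
    """A hypothetical lexicon offering the three operations named on the slide."""

    def __init__(self, entries):
        self._analyses = defaultdict(list)  # surface form -> [(stem, tag)]
        self._forms = defaultdict(list)     # stem -> [surface forms]
        for surface, stem, tag in entries:
            self._analyses[surface].append((stem, tag))
            self._forms[stem].append(surface)

    def stems(self, word):
        """Stemming: the set of possible stems of a word."""
        return {stem for stem, _ in self._analyses.get(word, [])}

    def inflections(self, stem):
        """Inflection: all known surface forms of a stem."""
        return list(self._forms.get(stem, []))

    def analyses(self, word):
        """Morphological analysis: all known tags of a word."""
        return [tag for _, tag in self._analyses.get(word, [])]


lexicon = ToyLexicon(TOY_ENTRIES)
```

With this toy data, `lexicon.inflections("walk")` yields the candidate set from which an inflection would later be predicted.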
Morphological Generation: Inflection Prediction
Maximum Entropy Markov model (2nd order)
Features:
 Monolingual
 Bilingual
 Lexical
 Morphological
 Syntactic
$$p(y \mid x) = \prod_{t=1}^{n} p(y_t \mid y_{t-1}, y_{t-2}, x_t), \qquad y_t \in I_t$$
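A minimal sketch of how such a second-order model could pick an inflection sequence, assuming small candidate sets I_t so that exhaustive search is feasible. The scoring function `toy_score` is a hypothetical stand-in for the trained maximum entropy features; it ignores the context x_t and only rewards gender agreement with the two previous labels. The labels reuse the glosses from the agreement-error example.

```python
import itertools
import math


def gender(label):
    """Return the gender suffix of a label like 'go+F', or None if it has none."""
    return label.rsplit("+", 1)[1] if "+" in label else None


def toy_score(y, y_prev, y_prev2):
    """Hypothetical feature score: reward agreement with the two previous labels."""
    score = 0.0
    for weight, earlier in ((1.0, y_prev), (0.5, y_prev2)):
        if earlier is not None and gender(y) is not None and gender(y) == gender(earlier):
            score += weight
    return score


def local_prob(y, y_prev, y_prev2, candidates):
    """p(y_t | y_{t-1}, y_{t-2}), normalised over the candidate set I_t."""
    z = sum(math.exp(toy_score(c, y_prev, y_prev2)) for c in candidates)
    return math.exp(toy_score(y, y_prev, y_prev2)) / z


def best_inflections(candidate_sets):
    """Exhaustively search the product of the candidate sets for the best sequence."""
    best_seq, best_p = None, -1.0
    for seq in itertools.product(*candidate_sets):
        p = 1.0
        for t, y in enumerate(seq):
            y_prev = seq[t - 1] if t >= 1 else None
            y_prev2 = seq[t - 2] if t >= 2 else None
            p *= local_prob(y, y_prev, y_prev2, candidate_sets[t])
        if p > best_p:
            best_seq, best_p = seq, p
    return best_seq, best_p


candidate_sets = [["the-car+F"], ["go+F", "go+M"], ["with-speed"]]
best_seq, best_p = best_inflections(candidate_sets)
```

Exhaustive search is used only to keep the sketch short; a practical decoder would use dynamic programming or beam search over the label history.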
Morphological Generation: Evaluation
English-Russian and English-Arabic
Technical (software manual) domain
Input: aligned sentence pairs of reference translations (not MT system output), to reduce noise
Accuracy (%) results
Morphological Generation: Conclusion
Needed resources:
 Large corpus of aligned sentence pairs
 Lexicons (source and target) with the three operations
+ Better accuracy than simple LM (even with small training data)
+ Easy to add to existing MT system
- Expensive creation of lexicons
Constraints: Idea
Agreement Constraints for Statistical Machine Translation into German (Williams and Koehn, 2011)
String-to-tree model
Synchronous grammar for target language
Adding learned constraints and probabilities
Evaluation of constraints during decoding
Constraints: Feature Structure
Feature structure
Unification
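Unification of feature structures can be sketched in Python as follows. This is a simplified illustration, not the representation used in the cited work: feature structures are assumed to be flat dictionaries that map an attribute to the set of values it may still take, and the entries for "die", "Katze" and "Haus" are hand-written and deliberately incomplete (for example, the plural readings of "die" are left out).

```python
def unify(fs_a, fs_b):
    """Unify two flat feature structures.

    Each structure maps an attribute name to a set of admissible values.
    Returns the unified structure, or None if any shared attribute has no
    value in common (unification failure).
    """
    result = {}
    for attr in fs_a.keys() | fs_b.keys():
        if attr in fs_a and attr in fs_b:
            common = fs_a[attr] & fs_b[attr]
            if not common:
                return None  # conflicting values: agreement is violated
            result[attr] = common
        else:
            # Attribute constrained by only one side: copy it unchanged.
            source = fs_a if attr in fs_a else fs_b
            result[attr] = set(source[attr])
    return result


# Hand-written, simplified entries (assumptions for illustration).
DIE = {"gender": {"F"}, "number": {"sg"}, "case": {"nom", "acc"}}
KATZE = {"gender": {"F"}, "number": {"sg"}, "case": {"nom", "acc", "dat", "gen"}}
HAUS = {"gender": {"N"}, "number": {"sg"}, "case": {"nom", "acc", "dat"}}

agreeing = unify(DIE, KATZE)   # succeeds: the case set is narrowed to nom/acc
clashing = unify(DIE, HAUS)    # fails: feminine article with neuter noun
```

Representing ambiguity as value sets means that unification both checks agreement and narrows the remaining readings in one step.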
Constraints: Grammar
Synchronous grammar learned from parallel corpus
Extended by constraints at target-side
Sample rule/constraint:
NP-SB → the X_1 cat | die AP_1 Katze
Constraints: Training
Propagation rules capture NP/PP agreement
Applied bottom-up
Constraints: Decoding
Model:
Every element of rule/constraint has a feature structure
Constraint evaluation: Each hypothesis stores set of feature structures
corresponding to its root rule element
Recombination of hypotheses is possible
$$\hat{t} = \arg\max_{t} p(t \mid s)$$
$$p(t \mid s) = \frac{1}{Z} \sum_{i=1}^{n} \lambda_i h_i(s, t)$$
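The arg max over the weighted feature sum can be sketched as below. The normaliser Z is the same for every candidate translation of a given source sentence, so the sketch leaves it out when ranking. The feature names (`lm`, `tm`), their values, and the weights are invented for illustration and do not come from the cited system.

```python
def model_score(features, weights):
    """Weighted sum of feature function values: sum_i lambda_i * h_i(s, t)."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(hypotheses, weights):
    """Return the hypothesis with the highest model score (the arg max over t)."""
    return max(hypotheses, key=lambda hyp: model_score(hyp["features"], weights))


# Hypothetical weights and log-domain feature values.
weights = {"lm": 0.5, "tm": 1.0}
hypotheses = [
    {"text": "das Katze", "features": {"lm": -5.0, "tm": -1.0}},
    {"text": "die Katze", "features": {"lm": -2.0, "tm": -1.0}},
]
best = best_translation(hypotheses, weights)
```

In a real decoder the candidates are not enumerated up front; the score is accumulated while hypotheses are built and recombined.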
Constraints: Evaluation
English-German
Europarl and News Commentary
Parsing: BitPar; Alignment: GIZA++; SCFG rules: Moses toolkit
Treebank for target
Grammar: ~140 million rules
BLEU scores and p-values for three test sets
Constraints: Conclusion
Needed resources:
 Parallel corpus
 Heuristics for constraint extraction
+ Improvement in translation accuracy
- Improvement is quite small
Class-Based: Idea
A Class-Based Agreement Model for Generating Accurately Inflected Translations (Green and DeNero, 2012)
During decoding
Target-side
Three steps:
 1. Segmentation
 2. Tagging
 3. Scoring
Class-Based: Segmentation
Train a character-level conditional random field (CRF)
Features:
Centered 5-character window
Applied during decoding, not as a preprocessing step
Labels:
I: Continuation (Inside)
O: Outside (whitespace)
B: Beginning
F: Non-native chars
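How a per-character label sequence turns into segments can be sketched as follows. The exact handling of each label is an assumption made for this illustration: B opens a new segment, I extends the current one, O (whitespace) closes the current segment and is dropped, and a run of F characters is kept together as one unsegmented unit. The English-like input merely mimics splitting a clitic from its host word.

```python
def labels_to_segments(chars, labels):
    """Turn per-character labels (I, O, B, F) into a list of segments."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    segments = []
    current = ""
    prev_label = "O"
    for ch, label in zip(chars, labels):
        if label == "O":
            starts_new, keep = True, False      # whitespace ends a segment
        elif label == "B":
            starts_new, keep = True, True       # first character of a segment
        elif label == "I":
            starts_new, keep = False, True      # continuation of a segment
        elif label == "F":
            starts_new, keep = prev_label != "F", True  # group non-native runs
        else:
            raise ValueError(f"unknown label: {label!r}")
        if starts_new and current:
            segments.append(current)
            current = ""
        if keep:
            current += ch
        prev_label = label
    if current:
        segments.append(current)
    return segments


segments = labels_to_segments("thecar go", "BIIBIIOBI")
foreign = labels_to_segments("ab XY", "BIOFF")
```

Here `segments` is `["the", "car", "go"]` and `foreign` is `["ab", "XY"]`; in the actual system the labels would come from the trained CRF rather than being written by hand.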
Class-Based: Tagging
Train CRF on full sentences with gold classes
Features:
 Current and previous words, affixes, etc.
Labels:
 Morphological classes
 Gender, number, person, definiteness
 e.g. 89 classes for Arabic
Example:
'the car'
Tagged: Noun+Def+Sg+Fem
Class-Based: Scoring
Scores of word sequences are not comparable across hypotheses
 → Score class sequences with a generative model
Simple bigram LM over gold class sequences (add-1 smoothed)
$$\hat{\tau} = \arg\max_{\tau} p(\tau \mid s)$$
$$q(e) = p(\hat{\tau}) = \prod_{i=1}^{I} p(\hat{\tau}_i \mid \hat{\tau}_{i-1})$$
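A minimal sketch of an add-1 smoothed bigram model over class sequences, as described above. The class `BigramClassLM`, the start symbol, and the three "gold" training sequences are assumptions for illustration; only the class name Noun+Def+Sg+Fem appears in the slides, the other class names are made up in the same style.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start symbol


class BigramClassLM:
    """Bigram language model over morphological classes with add-1 smoothing."""

    def __init__(self, class_sequences):
        self.bigrams = Counter()   # (previous class, class) -> count
        self.history = Counter()   # previous class -> count as a history
        self.vocab = set()         # all classes seen in training
        for classes in class_sequences:
            prev = START
            for cls in classes:
                self.bigrams[(prev, cls)] += 1
                self.history[prev] += 1
                self.vocab.add(cls)
                prev = cls

    def prob(self, cls, prev):
        """Add-1 estimate of p(cls | prev)."""
        return (self.bigrams[(prev, cls)] + 1) / (self.history[prev] + len(self.vocab))

    def log_score(self, classes):
        """log q(e): sum of log bigram probabilities along the class sequence."""
        total, prev = 0.0, START
        for cls in classes:
            total += math.log(self.prob(cls, prev))
            prev = cls
        return total


gold = [
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Masc", "Verb+Sg+Masc"],
]
lm = BigramClassLM(gold)
agree = lm.log_score(["Noun+Def+Sg+Fem", "Verb+Sg+Fem"])
disagree = lm.log_score(["Noun+Def+Sg+Fem", "Verb+Sg+Masc"])
```

On this toy data the agreeing class sequence scores higher than the disagreeing one, which is the signal the decoder would use as an additional feature.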
Class-Based: Evaluation
English-Arabic
Training data: variety of sources (e.g. web)
Development and test: NIST sets (newswire and mixed genre [broadcast news, newsgroups, weblog])
Phrase-based decoder
BLEU score for newswire sets
BLEU score for mixed genre sets
Class-Based: Conclusion
Needed resources:
 Treebank for target (existing for many languages)
 Large target corpus
+ Improves translation quality
+ Easy to integrate in existing MT system
- Increases decoding time
- Not very good for mixed genres
References
Green, S. and DeNero, J. (2012). A Class-Based Agreement Model for Generating Accurately Inflected Translations. In: ACL.
Williams, P. and Koehn, P. (2011). Agreement Constraints for Statistical Machine Translation into German. In: Sixth Workshop on Statistical Machine Translation.
Minkov, E. and Toutanova, K. (2007). Generating Complex Morphology for Machine Translation. In: ACL.
