The document discusses three approaches for handling grammatical agreement in statistical machine translation (SMT): morphological generation, agreement constraints, and a class-based agreement model. Morphological generation predicts an inflection for each word stem so that the output complies with agreement rules. Agreement constraints extend the target side of a synchronous context-free grammar (SCFG) with learned constraints. The class-based model segments each hypothesis, tags it with morphological classes, and scores the class sequence to improve inflected translations. The approaches are evaluated on English-Russian, English-German, and English-Arabic data, showing improvements over baselines, though small in some settings.
Grammatical Agreement in SMT
Institut für Anthropomatik, 24.06.13, Simon Hummel, Lehrstuhl Prof. Waibel
Seminar Sprach-zu-Sprach-Übersetzung (Speech-to-Speech Translation), SS 2013
Inflection and Agreement
Inflection
Modification of a word
Signals grammatical variants (tense, gender, case, ...)
e.g. walk vs. walked
Agreement
Inflections of related words in a sentence have to agree
e.g. das Haus (correct) vs. die Haus (article gender does not match the neuter noun)
Some languages are weakly inflected (e.g. English)
Some are highly inflected (e.g. German, Arabic, ...)
Agreement Errors
Local Agreement Errors (word-by-word glosses; F = feminine, M = masculine)
Ref: the-car_F go_F with-speed
Hypo: the-car_F go_M with-speed
Long-distance Agreement Errors
Ref: celle qui parle , c'est ma femme
one_F who speak , is my wife_F
Hypo: celui qui parle est ma femme
one_M who speak is my spouse_F
Approaches for SMT
Morphological Generation
Generate uninflected stems, then apply a predicted inflection to each
Agreement Constraints
Use a synchronous context-free grammar (SCFG) for the target language and add constraints to it
Class-based Agreement Model
Use morphological word classes, e.g. Noun+Def+Sg+Fem
Morphological Generation: Idea
Generating Complex Morphology for Machine Translation (Minkov and Toutanova, 2007)
Convert MT output to stem sequence
Predict an inflection for every stem
Reflect meaning and comply with agreement rules
Morphological Generation: Lexicons
Morphology analysis and generation
Operations:
Stemming
Inflection
Morphological analysis
Create manually
Create automatically from data
Here: assumed as given
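The three lexicon operations can be illustrated with a toy in-memory lexicon. This is only a sketch under assumptions: the class name `ToyLexicon`, the method names, and the analysis tags such as `V+Past` are hypothetical and not taken from the paper, which assumes a full lexicon is given.

```python
from collections import defaultdict


class ToyLexicon:
    """Toy lexicon built from (surface form, stem, analysis) triples."""

    def __init__(self, entries):
        self._by_word = defaultdict(set)   # word -> {(stem, analysis)}
        self._by_stem = defaultdict(set)   # stem -> {word}
        for word, stem, analysis in entries:
            self._by_word[word].add((stem, analysis))
            self._by_stem[stem].add(word)

    def stems(self, word):
        """Stemming: all stems the word may belong to."""
        return {stem for stem, _ in self._by_word.get(word, ())}

    def inflections(self, stem):
        """Inflection: all surface forms sharing the stem."""
        return set(self._by_stem.get(stem, ()))

    def analyses(self, word):
        """Morphological analysis: all analyses recorded for the word."""
        return {analysis for _, analysis in self._by_word.get(word, ())}


# Hypothetical entries based on the walk/walked example.
lexicon = ToyLexicon([
    ("walk", "walk", "V+Pres"),
    ("walks", "walk", "V+Pres+3Sg"),
    ("walked", "walk", "V+Past"),
])
print(lexicon.stems("walked"))
print(sorted(lexicon.inflections("walk")))
```

Returning sets reflects that a word form can be ambiguous: it may have several stems or analyses, and the prediction model has to choose among them.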
Morphological Generation: Inflection Prediction
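Maximum Entropy Markov model (2nd order)
Features:
Monolingual
Bilingual
Lexical
Morphological
Syntactic
$p(\bar{y} \mid \bar{x}) = \prod_{t=1}^{n} p(y_t \mid y_{t-1}, y_{t-2}, x_t), \quad y_t \in I_t$
(here $I_t$ is the set of candidate inflections for the stem at position $t$)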
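Decoding under a second-order model like the one on this slide can be sketched as a Viterbi search whose state is the pair of the two previous predictions. The sketch below is illustrative only: the function name `decode_inflections`, the Spanish toy data, and the scoring function `toy_log_score` (unnormalized, hand-written) are assumptions standing in for a trained maximum entropy model.

```python
import math

START = "<s>"


def decode_inflections(stems, candidates, log_score):
    """Return the inflection sequence with the highest total local score.

    candidates[t] plays the role of the inflection set I_t for stems[t];
    log_score(y, prev1, prev2, t) is the local log-score of choosing y
    given the two previous choices.
    """
    # best[(prev2, prev1)] = (total score, sequence chosen so far)
    best = {(START, START): (0.0, [])}
    for t in range(len(stems)):
        new_best = {}
        for (prev2, prev1), (score, seq) in best.items():
            for y in candidates[t]:
                total = score + log_score(y, prev1, prev2, t)
                key = (prev1, y)
                if key not in new_best or total > new_best[key][0]:
                    new_best[key] = (total, seq + [y])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy data: gender of each candidate form.
GENDER = {"el": "M", "la": "F", "casa": "F", "blanco": "M", "blanca": "F"}


def toy_log_score(y, prev1, prev2, t):
    """Reward gender agreement with each of the two previous words."""
    score = 0.0
    for prev in (prev1, prev2):
        if prev in GENDER:
            score += math.log(0.9 if GENDER[prev] == GENDER[y] else 0.1)
    return score


stems = ["el", "casa", "blanco"]
candidates = [["el", "la"], ["casa"], ["blanco", "blanca"]]
print(decode_inflections(stems, candidates, toy_log_score))
```

Because each local decision conditions on two previous outputs, the article and the adjective are both pulled toward the gender of the noun, even though the article is chosen first.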
Morphological Generation: Evaluation
English-Russian and English-Arabic
Technical (software manual) domain
Input: aligned sentence pairs of reference translations (not the output of an MT system), to reduce noise
Accuracy (%) results
Morphological Generation: Conclusion
Needed resources:
Large corpus of aligned sentence pairs
Lexicons (source and target) with the three operations
+ Better accuracy than simple LM (even with small training data)
+ Easy to add to existing MT system
- Expensive creation of lexicons
Constraints: Idea
Agreement Constraints for Statistical Machine Translation into German (Williams and Koehn, 2011)
String-to-tree model
Synchronous grammar for target language
Adding learned constraints and probabilities
Evaluation of constraints during decoding
Constraints: Feature Structure
Feature structure
Unification
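Unification of feature structures can be sketched with nested dictionaries. This is a simplified illustration, not the representation used by Williams and Koehn: it assumes atomic feature values and no shared (reentrant) substructures, and the feature names `infl`, `gender`, and `number` are hypothetical.

```python
def unify(fs1, fs2):
    """Unify two feature structures given as nested dicts with atomic leaves.

    Returns a new merged structure, or None if the inputs are incompatible.
    """
    result = dict(fs1)
    for key, value in fs2.items():
        if key not in result:
            result[key] = value
        elif isinstance(result[key], dict) and isinstance(value, dict):
            merged = unify(result[key], value)
            if merged is None:
                return None
            result[key] = merged
        elif result[key] != value:
            return None  # conflicting atomic values: unification fails
    return result


# Hypothetical feature structures for one reading of "die" and two nouns.
die = {"infl": {"gender": "f", "number": "sg"}}
katze = {"infl": {"gender": "f"}}
haus = {"infl": {"gender": "n"}}

print(unify(die, katze))  # compatible
print(unify(die, haus))   # incompatible
```

A failed unification is what signals an agreement violation: "die Katze" unifies, while "die Haus" does not under this reading of the article.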
Constraints: Grammar
Synchronous grammar learned from parallel corpus
Extended by constraints at target-side
Sample rule/constraint:
NP-SB → the X_1 cat | die AP_1 Katze
Constraints: Training
Propagation rules to capture NP/PP agreement:
Applied bottom-up
Constraints: Decoding
Model (log-linear):
$\hat{t} = \arg\max_{t} p(t \mid s)$
$p(t \mid s) = \frac{1}{Z} \exp\left( \sum_{i=1}^{n} \lambda_i h_i(s, t) \right)$
Every element of a rule/constraint has a feature structure
Constraint evaluation: each hypothesis stores the set of feature structures corresponding to its root rule element
Recombination of hypotheses is possible
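The decoding objective on this slide, a weighted combination of feature functions $h_i(s, t)$ with weights $\lambda_i$, can be sketched as follows. The feature names `lm` and `tm`, the weights, and the candidate translations are hypothetical values chosen for illustration; a real decoder searches over derivations rather than a fixed candidate list.

```python
def weighted_feature_sum(features, weights):
    """Compute sum_i lambda_i * h_i(s, t) for one candidate translation."""
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, weights):
    """Pick the candidate with the highest weighted feature sum.

    The normalizer Z is the same for every candidate of a fixed source
    sentence, so it can be ignored when taking the arg max.
    """
    return max(candidates, key=lambda c: weighted_feature_sum(c["features"], weights))


# Hypothetical weights and log-domain feature values.
weights = {"lm": 0.5, "tm": 1.0}
candidates = [
    {"text": "die Katze", "features": {"lm": -2.0, "tm": -1.0}},
    {"text": "der Katze", "features": {"lm": -4.0, "tm": -0.5}},
]
print(best_translation(candidates, weights)["text"])
```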
Constraints: Evaluation
English-German
Europarl and News Commentary
Parsing: BitPar; Alignment: GIZA++; SCFG rules: Moses toolkit
Treebank for target
Grammar: ~140 m rules
BLEU scores and p-values for three test sets
Constraints: Conclusion
Needed resources:
Parallel corpus
Heuristics for constraint extraction
+ Improvement in translation accuracy
- Improvement is quite small
Class-Based: Idea
A Class-Based Agreement Model for Generating Accurately Inflected Translations (Green and DeNero, 2012)
During decoding
Target-side
Three steps:
1. Segmentation
2. Tagging
3. Scoring
Class-Based: Segmentation
Train conditional random field
Features:
Centered 5-character window
During decoding
Not as preprocessing step
Labels:
I: Continuation (Inside)
O: Outside (whitespace)
B: Beginning
F: Non-native chars
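The character window feature and the label scheme on this slide can be sketched as below. This is an illustration only: the padding symbol, the Latin-script toy word, and the handling of `F` (a run of consecutive `F` characters becomes one segment) are assumptions, and the labels are supplied by hand instead of being predicted by a trained CRF.

```python
def char_window(text, i, size=5, pad="_"):
    """Return the size-character window centered on text[i], padded at the edges."""
    half = size // 2
    padded = pad * half + text + pad * half
    return padded[i:i + size]


def segments_from_labels(text, labels):
    """Group characters into segments according to per-character labels."""
    if len(text) != len(labels):
        raise ValueError("need exactly one label per character")
    segments, current, prev = [], "", None
    for ch, label in zip(text, labels):
        if label == "O":  # whitespace closes the open segment
            if current:
                segments.append(current)
            current = ""
        elif label == "B" or (label == "F" and prev != "F"):
            if current:  # a new segment starts here
                segments.append(current)
            current = ch
        else:  # "I", or "F" continuing a run of non-native characters
            current += ch
        prev = label
    if current:
        segments.append(current)
    return segments


text = "unkind do"
labels = ["B", "I", "B", "I", "I", "I", "O", "B", "I"]
print(char_window(text, 3))
print(segments_from_labels(text, labels))
```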
Class-Based: Tagging
Train CRF on full sentences with gold classes
Features:
Current and previous words, affixes, etc.
Labels:
Morphological classes
Gender, number, person, definiteness
e.g. 89 classes for Arabic
Example:
'the car'
Tagged: Noun+Def+Sg+Fem
Class-Based: Scoring
Scoring of word sequences not comparable across hypotheses
Scoring class sequences with generative model
Simple bigram LM over gold class sequences (add-1 smoothed)
$\hat{\tau} = \arg\max_{\tau} p(\tau \mid s)$
$q(e) = p(\hat{\tau}) = \prod_{i=1}^{I} p(\hat{\tau}_i \mid \hat{\tau}_{i-1})$
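An add-1 smoothed bigram model over class sequences, as named on this slide, can be sketched as follows. The class `ClassBigramLM`, the start symbol, the verb tags, and the three training sequences are hypothetical; only the tag `Noun+Def+Sg+Fem` appears in the slides.

```python
import math
from collections import Counter


class ClassBigramLM:
    """Bigram language model over morphological classes with add-1 smoothing."""

    START = "<s>"

    def __init__(self, class_sequences):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.vocab = set()
        for seq in class_sequences:
            self.vocab.update(seq)
            padded = [self.START] + list(seq)
            for prev, cur in zip(padded, padded[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def prob(self, cur, prev):
        """Add-1 smoothed estimate of p(cur | prev)."""
        return (self.bigrams[(prev, cur)] + 1) / (self.contexts[prev] + len(self.vocab))

    def log_score(self, seq):
        """Log of the product of bigram probabilities along the class sequence."""
        padded = [self.START] + list(seq)
        return sum(math.log(self.prob(cur, prev)) for prev, cur in zip(padded, padded[1:]))


# Hypothetical gold class sequences.
gold = [
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Fem", "Verb+Sg+Fem"],
    ["Noun+Def+Sg+Masc", "Verb+Sg+Masc"],
]
lm = ClassBigramLM(gold)
agree = lm.log_score(["Noun+Def+Sg+Fem", "Verb+Sg+Fem"])
disagree = lm.log_score(["Noun+Def+Sg+Fem", "Verb+Sg+Masc"])
print(agree > disagree)
```

Scoring classes instead of words keeps the model small: a feminine noun followed by a masculine verb form is penalized regardless of which particular words are involved.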
Class-Based: Evaluation
English-Arabic
Training data: variety of sources (e.g. web)
Development and test: NIST sets (newswire and mixed genre [broadcast news, newsgroups, weblog])
Phrase-based decoder
BLEU score for newswire sets
BLEU score for mixed genre sets
Class-Based: Conclusion
Needed resources:
Treebank for target (existing for many languages)
Large target corpus
+ Improves translation quality
+ Easy to integrate in existing MT system
- Increases decoding time
- Not very good for mixed genres
References
Green, S. and DeNero, J. (2012). A Class-Based Agreement Model for Generating Accurately Inflected Translations. In: ACL.
Williams, P. and Koehn, P. (2011). Agreement Constraints for Statistical Machine Translation into German. In: Sixth Workshop on Statistical Machine Translation.
Minkov, E. and Toutanova, K. (2007). Generating Complex Morphology for Machine Translation. In: ACL.