Using Parallel Propbanks to
 Enhance Word-Alignments
The 3rd Linguistic Annotation Workshop at ACL 09
                  August 7th, 2009


Jinho D. Choi (Univ. of Colorado at Boulder)
Martha Palmer (Univ. of Colorado at Boulder)
Nianwen Xue (Brandeis University)
Parallel Propbanks
   Propbank
    -    Corpus annotated with verbal propositions and their
         arguments (semantic roles)
        [ Gansu Province] also actively [ explored ] [ high risk business]
           Arg0: explorer                            Arg1: things explored


   Parallel Propbanks
    -    Propbanks annotated in parallel corpus
        (Chinese counterpart of the sentence above, annotated with the
         same Arg0 and Arg1; characters lost in extraction)



Word-Alignments
   Given parallel sentences, discover the translation of each
    word
        (Chinese source sentence; characters lost in extraction)
        Construction is a principal economic activity in developing Pudong


   GIZA++: a statistical machine translation toolkit
    -   It is hard to verify if the alignments are correct.

    -   Words with low frequencies may not get aligned.

    -   It does not account for semantics.



Predicate Matching (based on GIZA++)
    English Chinese Parallel Treebank (ECTB)
    -     Xinhua: Chinese newswire + literal translation

    -     Sinorama: Chinese news magazine + non-literal translation

        Xinhua: 12,895                              Sinorama: 40,086


        [Figure: one pie chart per corpus; legend: En.verb, En.be, En.else, En.none]
    -     Xinhua slices: 45%, 32%, 19%, 3%
    -     Sinorama slices: 56%, 22%, 19%, 3%
Top-down Argument Matching
   Verify word-alignments
    -   For each Chinese verb vc aligned to some English verb ve

    -   Verify that the alignment is correct if the arguments of
        vc and ve match

        Chinese (characters lost in extraction), five labeled spans:
         Arg0    ArgM    ArgM    Rel    Arg1

        [Gansu Province] [also] [actively] [explored] [high risk business]
         Arg0             ArgM   ArgM       Rel        Arg1

                                      Bingo!
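The verification step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each argument is taken to be a (label, token-index set) pair, word-alignments are (Chinese index, English index) pairs, and two arguments are assumed to "match" when their labels agree and at least one aligned word pair links them. The names `args_match` and `verify`, the 0.5 cutoff on the matched fraction, and the toy alignments are hypothetical; the Chinese side is represented by token positions only.

```python
from typing import List, Set, Tuple

Arg = Tuple[str, Set[int]]        # (semantic role label, token indices)
Alignment = Set[Tuple[int, int]]  # (Chinese token index, English token index)


def args_match(zh_arg: Arg, en_arg: Arg, alignment: Alignment) -> bool:
    """Assumed criterion: same label and at least one aligned word pair."""
    zh_label, zh_span = zh_arg
    en_label, en_span = en_arg
    return zh_label == en_label and any(
        (c, e) in alignment for c in zh_span for e in en_span
    )


def verify(zh_args: List[Arg], en_args: List[Arg],
           alignment: Alignment, threshold: float = 0.5) -> bool:
    """Accept the verb alignment if enough Chinese arguments find a match."""
    if not zh_args:
        return False
    matched = sum(
        any(args_match(z, e, alignment) for e in en_args) for z in zh_args
    )
    return matched / len(zh_args) >= threshold


# English: Gansu(0) Province(1) also(2) actively(3) explored(4)
#          high(5) risk(6) business(7)
en_args = [("Arg0", {0, 1}), ("ArgM", {2}), ("ArgM", {3}), ("Arg1", {5, 6, 7})]
# Chinese side: hypothetical token positions (the verb sits at position 3).
zh_args = [("Arg0", {0}), ("ArgM", {1}), ("ArgM", {2}), ("Arg1", {4, 5, 6})]
# Hypothetical word-alignments.
good = {(0, 0), (0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7)}
crossed = {(3, 4), (0, 5), (4, 0)}  # argument words aligned across roles

print(verify(zh_args, en_args, good))     # every argument finds a match
print(verify(zh_args, en_args, crossed))  # no argument finds a match
```

With the `good` alignments all four arguments line up, so the verb pair is kept; with `crossed` none do, so it would be rejected.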
Bottom-up Argument Matching
         Expand word-alignments
          -    For each Chinese verb vc aligned to no English word

          -    Align vc to ve such that ve is an English verb that maximizes
               the argument matching with vc

         English, predicate "funded":
         [Foreign] [funded] [enterprises] in Gansu Province no longer worry about investment risk
          ArgM      Rel      Arg1

         Chinese (characters lost in extraction), six labeled spans:
          Arg0    ArgM    ArgM    ArgM    Arg1    Rel

         English, predicate "worry":
         [Foreign funded enterprises in Gansu Province] [no] [longer] [worry] [about investment risk]
          Arg0                                           ArgM ArgM     Rel     Arg1
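The selection step above, choosing the English verb whose arguments best match those of an unaligned Chinese verb, might look like the sketch below. It is an illustration under assumptions, not the published method: the score `overlap` (the share of Chinese argument words aligned into a same-label English argument), the function names, the 0.6 cutoff, and the toy alignments are all hypothetical. The Chinese side is again given by token positions only.

```python
from typing import Dict, List, Optional, Set, Tuple

Arg = Tuple[str, Set[int]]        # (semantic role label, token indices)
Alignment = Set[Tuple[int, int]]  # (Chinese token index, English token index)


def overlap(zh_args: List[Arg], en_args: List[Arg],
            alignment: Alignment) -> float:
    """Share of Chinese argument words aligned into a same-label English argument."""
    total = sum(len(span) for _, span in zh_args)
    if total == 0:
        return 0.0
    hits = 0
    for zh_label, zh_span in zh_args:
        # All English token positions covered by arguments with the same label.
        targets = {e for label, span in en_args if label == zh_label for e in span}
        hits += sum(any((c, e) in alignment for e in targets) for c in zh_span)
    return hits / total


def best_english_verb(zh_args: List[Arg], candidates: Dict[str, List[Arg]],
                      alignment: Alignment, threshold: float) -> Optional[str]:
    """Return the best-scoring English verb, or None if it scores below threshold."""
    scores = {v: overlap(zh_args, args, alignment) for v, args in candidates.items()}
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] < threshold:
        return None
    return best


# English: Foreign(0) funded(1) enterprises(2) in(3) Gansu(4) Province(5)
#          no(6) longer(7) worry(8) about(9) investment(10) risk(11)
candidates = {
    "funded": [("ArgM", {0}), ("Arg1", {2})],
    "worry": [("Arg0", {0, 1, 2, 3, 4, 5}), ("ArgM", {6}), ("ArgM", {7}),
              ("Arg1", {9, 10, 11})],
}
# Chinese side: hypothetical positions; the unaligned verb is at position 10.
zh_args = [("Arg0", {0, 1, 2, 3}), ("ArgM", {4}), ("ArgM", {5}), ("ArgM", {6}),
           ("Arg1", {7, 8, 9})]
# Hypothetical word-alignments; Chinese positions 6, 9 and 10 are unaligned.
alignment = {(0, 4), (1, 5), (2, 0), (3, 2), (4, 6), (5, 7), (7, 10), (8, 11)}

print(best_english_verb(zh_args, candidates, alignment, threshold=0.6))
```

In this toy setup eight of the ten Chinese argument words land in a same-label argument of "worry" and none in an argument of "funded", so "worry" is proposed.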
Argument Matching Score
   Macro argument matching score




   Micro argument matching score




   Thresholds
    -   Top-down: thresholds on macro score

    -   Bottom-up: thresholds on both macro and micro scores
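The score formulas themselves are not preserved in this text, so the following is only one plausible reading of "macro" and "micro", using the usual sense of the two averages: macro averages a per-argument ratio over arguments, micro pools the word counts over all arguments first. The input format (one `(aligned words, total words)` pair per argument), the function names, and the toy counts are assumptions; the default thresholds are the values quoted on the evaluation slides (Mac.th = 0.5 for top-down; Mac.th = 0.8 and Mic.th = 0.6 for bottom-up).

```python
from typing import List, Tuple

# Per argument: (words aligned into the matching argument, total words).
# Every argument is assumed to contain at least one word.
Counts = List[Tuple[int, int]]


def macro_score(counts: Counts) -> float:
    """Average of the per-argument ratios (each argument weighs the same)."""
    return sum(hit / total for hit, total in counts) / len(counts)


def micro_score(counts: Counts) -> float:
    """Ratio of pooled counts (each word weighs the same)."""
    return sum(hit for hit, _ in counts) / sum(total for _, total in counts)


def passes_top_down(counts: Counts, mac_th: float = 0.5) -> bool:
    """Top-down: threshold on the macro score only."""
    return macro_score(counts) >= mac_th


def passes_bottom_up(counts: Counts, mac_th: float = 0.8,
                     mic_th: float = 0.6) -> bool:
    """Bottom-up: thresholds on both the macro and the micro score."""
    return macro_score(counts) >= mac_th and micro_score(counts) >= mic_th


# Hypothetical counts for five arguments of one verb pair.
counts = [(4, 4), (1, 1), (1, 1), (0, 1), (2, 3)]

print(round(macro_score(counts), 4))  # (1 + 1 + 1 + 0 + 2/3) / 5
print(round(micro_score(counts), 4))  # 8 / 10
print(passes_top_down(counts), passes_bottom_up(counts))
```

The toy counts show why the two scores can disagree: a single-word argument with no aligned word pulls the macro score down to about 0.73 while the micro score stays at 0.8, so this pair would clear the top-down threshold but not the stricter bottom-up pair.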



System Overview
   Source Language Corpus + Target Language Corpus -> GIZA++ -> Word Alignments
   Word Alignments + Parallel Propbanks
    -   Verbs aligned to verbs   -> Top-down Matching  -> Verified Alignments
    -   Verbs aligned to no word -> Bottom-up Matching -> Expanded Alignments
   Verified Alignments + Expanded Alignments -> Enhanced Alignments
Evaluations
   Test Corpus
    -   NIST-GALE Web Genre Test Data

    -   100 parallel sentences, 365 verb tokens, 273 verb types

   Measurements
    -   Term Coverage
        : how many Chinese verb-types are covered

    -   Term Expansion
        : how many English verb-types are suggested

    -   Alignment Accuracy
        : how many suggested English verb-types are correct
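As a rough illustration of the three measurements, the sketch below computes them from a table of suggested translations and a gold table. The details are assumptions rather than the evaluation actually used: term expansion is counted as the total number of English verb types suggested across covered Chinese verb types, and accuracy is averaged per covered Chinese verb type. The function name `evaluate` and the romanized Chinese verbs are hypothetical placeholders.

```python
from typing import Dict, Set, Tuple


def evaluate(suggested: Dict[str, Set[str]],
             gold: Dict[str, Set[str]]) -> Tuple[int, int, float]:
    """Return (term coverage, term expansion, average alignment accuracy)."""
    # A Chinese verb type is covered if at least one English verb is suggested.
    covered = {zh: en for zh, en in suggested.items() if en}
    coverage = len(covered)
    expansion = sum(len(en) for en in covered.values())
    if coverage == 0:
        return 0, 0, 0.0
    # Per covered Chinese verb: share of its suggestions found in the gold set.
    accuracy = sum(
        len(en & gold.get(zh, set())) / len(en) for zh, en in covered.items()
    ) / coverage
    return coverage, expansion, accuracy


# Hypothetical data: three Chinese verb types, one with no suggestion.
suggested = {
    "tansuo": {"explore", "probe"},
    "danxin": {"worry"},
    "jianshe": set(),
}
gold = {
    "tansuo": {"explore"},
    "danxin": {"worry", "fear"},
    "jianshe": {"build"},
}

print(evaluate(suggested, gold))
```

Here two verb types are covered, three English types are suggested, and the per-verb accuracies of 1/2 and 1/1 average to 0.75.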



Evaluations: Top-down
    Mac.th = 0.0 (GIZA++) vs. Mac.th = 0.5 (TDAM)

                                       Xinhua              Sinorama
                                   GIZA++     TDAM     GIZA++     TDAM
    Term Coverage                      79       76        129       62
    Average Alignment Accuracy     83.35%   83.71%     57.76%   78.09%
Evaluations: Bottom-up
    Mac.th = 0.8, Mic.th = 0.6

                                   Xinhua    Sinorama
    Term Coverage                      18          27
    Average Alignment Accuracy     63.89%      14.46%

    5.5% error-reduction, 17% abs-improvement
Conclusions & Future Work
   Conclusions
    -   Top-down Argument Matching is most effective for verifying
        word-alignments based on non-literal translations that have
        proven difficult for GIZA++.

    -   Bottom-up Argument Matching shows promise for expanding
        the coverage of GIZA++ alignments based on literal
        translations.

   We will try to enhance word-alignments by using
    -   Automatically labeled Propbanks

    -   Nombanks, Named-entity tags

    -   Parallel Propbanks prior to GIZA++


Acknowledgements
   We gratefully acknowledge the support of the National
    Science Foundation Grants IIS-0325646, Domain
    Independent Semantic Parsing, CISE-CRI-0551615,
    Towards a Comprehensive Linguistic Annotation, and a
    grant from the Defense Advanced Research Projects
    Agency (DARPA/IPTO) under the GALE program,
    DARPA/CMO Contract No. HR0011-06-C-0022,
    subcontract from BBN, Inc.
   Special thanks to Daniel Gildea and Ding Liu (University of
    Rochester), who provided the word-alignments; Wei Wang
    (Information Sciences Institute at University of Southern
    California), who provided the test corpus; and Hua
    Zhong (University of Colorado at Boulder), who
    performed the evaluations.

