The document discusses deep parsing and the parsing process. It describes parsing a sample sentence by highlighting its structural components, part-of-speech tags, and tokens. It also covers extracting segments, creating a semantic chain, normalizing the chain, and adding additional semantic context. The goal is to compare the semantic chain to the parse tree structure.
Deep Parsing (2012)
1. PART 1
Deep Parsing
Craig Trim / craigtrim@gmail.com / CCA 3.0
What you have here are 2 triples connected together; a semantic chain.
Don't look at this diagram with the misconception that an ontology is a taxonomy or directed tree. It's not. It's a cyclic network. We do seem to have Software as a root node, with most relationships flowing up to the parent. However, in real life, the extracted semantic chain would be one small connection in the midst of innumerable nodes, some in clusters, some in sequences, some apparently random, but all connected, sometimes with multiple connections between two nodes, and so on.
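To make the idea of a semantic chain concrete, here is a minimal sketch in Python. The triples are hypothetical stand-ins (the ones on the slide are not reproduced in these notes), and the `Triple`, `is_chain`, and `has_cycle` names are illustrative assumptions, not part of any particular ontology toolkit. It shows two triples linking into a chain, and how a single extra relationship turns a tree-like structure into a cyclic network.

```python
from typing import Dict, List, NamedTuple, Set


class Triple(NamedTuple):
    """A subject-predicate-object statement (hypothetical representation)."""
    subject: str
    predicate: str
    obj: str


def is_chain(first: Triple, second: Triple) -> bool:
    """Two triples form a chain when the first's object is the second's subject."""
    return first.obj == second.subject


def has_cycle(triples: List[Triple]) -> bool:
    """Depth-first search for a directed cycle in the graph the triples define."""
    graph: Dict[str, Set[str]] = {}
    for t in triples:
        graph.setdefault(t.subject, set()).add(t.obj)
        graph.setdefault(t.obj, set())

    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}

    def visit(node: str) -> bool:
        state[node] = in_progress
        for neighbour in graph[node]:
            if state[neighbour] == in_progress:
                return True  # back edge: we returned to a node on the current path
            if state[neighbour] == unvisited and visit(neighbour):
                return True
        state[node] = done
        return False

    return any(state[n] == unvisited and visit(n) for n in graph)


# Hypothetical triples, chosen only for illustration.
chain = [
    Triple("Installer", "installs", "Application"),
    Triple("Application", "isA", "Software"),
]
# One more relationship is enough to make the network cyclic.
back_link = Triple("Software", "hasComponent", "Installer")
```

With these assumed triples, `is_chain(chain[0], chain[1])` holds, `has_cycle(chain)` is false, and `has_cycle(chain + [back_link])` is true, which is why treating the ontology as a tree would be misleading.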
Now, you've been a good audience. Thank you. Let's look at some real code and a real process. < CLICK > (END PRESENTATION AND GO TO PART 2)
< CLICK > The first step is to pre-process the input. Pre-processing means we might add or remove tokens, most often punctuation, but we could make other additions. Some degree of normalization might occur here; for example, an acronym that is spelled I.B.M. might be normalized to IBM, or U.S.A to USA. Pattern reduction is a type of normalization: it provides a higher degree of uniformity on user input and makes the job of parsing and downstream processing easier. There are simply fewer variations to account for. However, we generally want to keep pre-processing short and sweet, depending on the needs of our application. By pre-processing we do have a tendency to lose the user-speak; that is, how a user might choose to refer to an entity or employ nuanced constructions. Also, too much normalization can lead to inaccurate results in the parser. We don't lose anything by changing I.B.M. to IBM, but if we changed the inflected verb "installed" to the infinitive construction (also called canonical form, normal form, or lemma) "install", we lose the fact that the installation occurred in the past tense. < CLICK > Performing lemmatization at this stage may be appropriate for some applications, but in the main, nuanced speech leads to more accurate parsing results, which in turn leads to higher precision in extracting information of interest. Lemmatization is typically performed in the stage that follows parsing, the post-processing stage. < CLICK > Post-processing is really an abstraction of many, many services: services that perform not only lemmatization (which is conceptually trivial), but semantic interpolation, the adding of additional meaning to the parse tree, as we saw on previous slides. < CLICK >
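The acronym normalization described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code from Part 2: the regular expression, the `normalize_acronyms` and `lemmatize` names, and the one-entry `TOY_LEMMAS` table are all hypothetical. The second function illustrates why lemmatizing too early is lossy: the tense has to be carried separately or it disappears.

```python
import re
from typing import Optional, Tuple

# Two or more "capital letter + period" pairs, optionally followed by one
# final capital letter, and not running straight into another letter.
_DOTTED_ACRONYM = re.compile(r"\b(?:[A-Z]\.){2,}[A-Z]?(?![A-Za-z])")


def normalize_acronyms(text: str) -> str:
    """Collapse dotted acronyms such as I.B.M. or U.S.A into IBM and USA.

    Caveat: a dotted acronym at the very end of a sentence also loses the
    sentence-final period, which is one way pre-processing can drop information.
    """
    return _DOTTED_ACRONYM.sub(lambda m: m.group(0).replace(".", ""), text)


# Toy lookup table (an assumption for illustration only): surface form -> (lemma, tense).
TOY_LEMMAS = {"installed": ("install", "past")}


def lemmatize(token: str) -> Tuple[str, Optional[str]]:
    """Return (lemma, tense); unknown tokens come back unchanged with no tense."""
    return TOY_LEMMAS.get(token.lower(), (token, None))


print(normalize_acronyms("I.B.M. installed it in the U.S.A"))
# prints: IBM installed it in the USA
```

Here `lemmatize("installed")` returns `("install", "past")`. If only the lemma were kept, the parser would see "install" and the past tense would be gone, which is the argument for deferring lemmatization to post-processing.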
However, at a high level, this is what happens. The input is pre-processed, parsed, and post-processed. < CLICK >
Let's add a little more context. The user provides input, the input is received, goes through the process we just talked about, and the insight (hopefully there is some) is provided back to the user. The important thing on this diagram is the Intermediate Form. How is the user input represented as it flows through this process? At its simplest, a data transfer object must exist that represents the initial input as a String, converts the String into an array of tokens, parses the tokens and stores the structured parse results, has a mechanism for allowing the structured output to be enhanced (or simplified) through a number of services, and finally allows additional context to be applied and brought to bear upon these results. The design for intermediate representation lies at the heart of every parsing strategy. There are multiple strategies available today. These may vary by architecture, design principle, or needs of the application. A parsing strategy that only leverages part-of-speech tagging is not likely to require a mechanism for storing deep parse results and the additional complexity this incurs. On the other hand, an architecture that can allow a parsing process the simplicity of a few steps, or the complexity of several hundred steps, and be customized without compromise to original design principles is of the most value. Of the many architectures that exist, there are yet many that are this well designed. Ultimately the strategy you choose will be based on a variety of factors. I do identify this choice as being one of the most important considerations in the parsing process.
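One way to picture the Intermediate Form is as a small data transfer object that every step receives and returns. The sketch below is an assumption about what such an object could look like, not the design used in this presentation: `IntermediateForm`, `run_pipeline`, and the three toy steps are hypothetical names, and `toy_parse` is a placeholder rather than a real parser.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class IntermediateForm:
    """Carries the user input through every stage of the process."""
    raw: str                                   # the initial input as a string
    tokens: List[str] = field(default_factory=list)
    parse: Optional[Any] = None                # structured parse results
    annotations: Dict[str, Any] = field(default_factory=dict)  # added context


# A step is any function that takes the form and hands it back, enriched.
Step = Callable[[IntermediateForm], IntermediateForm]


def run_pipeline(text: str, steps: List[Step]) -> IntermediateForm:
    """Apply each step in order; works the same for three steps or several hundred."""
    form = IntermediateForm(raw=text)
    for step in steps:
        form = step(form)
    return form


def tokenize(form: IntermediateForm) -> IntermediateForm:
    form.tokens = form.raw.split()             # naive whitespace tokenizer
    return form


def toy_parse(form: IntermediateForm) -> IntermediateForm:
    form.parse = ("S", list(form.tokens))      # placeholder flat "tree"
    return form


def add_context(form: IntermediateForm) -> IntermediateForm:
    form.annotations["token_count"] = len(form.tokens)
    return form


result = run_pipeline("IBM installed the software", [tokenize, toy_parse, add_context])
```

Because each step shares one signature, steps can be added, removed, or reordered without touching the object's design. A strategy that only needs part-of-speech tags could simply leave `parse` empty.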