Data and Models for Statistical Parsing with Combinatory Categorial Grammar
Author: Hockenmaier, Julia
Date: 07/2003
Abstract
This dissertation is concerned with the creation of training data and the development of probability
models for statistical parsing of English with Combinatory Categorial Grammar (CCG).
Parsing, or syntactic analysis, is a prerequisite for semantic interpretation, and therefore forms
an integral part of any system that requires natural language understanding. Since almost
all naturally occurring sentences are ambiguous, it is not sufficient (and often impossible) to
generate all possible syntactic analyses. Instead, the parser needs to rank competing analyses
and select only the most likely ones. A statistical parser uses a probability model to perform
this task. I propose a number of ways in which such probability models can be defined for
CCG. The kinds of models developed in this dissertation, generative models over normal-form
derivation trees, are particularly simple, and have the further property of restricting the set of
syntactic analyses to those corresponding to a canonical derivation structure. This restriction
is important because it guarantees that parsing can be done efficiently.
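To give a rough sense of the simplest such model: the probability of a normal-form derivation tree τ can be decomposed, as in a probabilistic context-free grammar, into a product of conditional probabilities of each node's expansion given its category. This is only a simplified sketch of the general shape; the models developed in the dissertation condition on richer information:

    P(\tau) = \prod_{n \in \tau} P(\mathrm{exp}(n) \mid \mathrm{cat}(n))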
In order to achieve high parsing accuracy, a large corpus of annotated data is required to estimate
the parameters of the probability models. Most existing wide-coverage statistical parsers
use models of phrase-structure trees estimated from the Penn Treebank, a 1-million-word corpus
of manually annotated sentences from the Wall Street Journal. This dissertation presents an
algorithm which translates the phrase-structure analyses of the Penn Treebank to CCG derivations.
The resulting corpus, CCGbank, is used to train and test the models proposed in this
dissertation. Experimental results indicate that parsing accuracy, when evaluated according to
a comparable metric (the recovery of unlabelled word-word dependency relations), is as high
as that of standard Penn Treebank parsers that use similar modelling techniques.
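To illustrate one ingredient of such a translation: CCG derivations are binary-branching, so the flat constituents of the Penn Treebank must first be binarised around their head children before categories can be assigned. The following Python fragment is a hypothetical toy sketch of that single step; head finding, category assignment, coordination, and the treatment of traces are all omitted:

    # Toy sketch of binarising a flat Treebank constituent around its head.
    # The (label, left, right) tuples stand in for binary tree nodes;
    # this is an illustration, not the dissertation's actual algorithm.
    def binarise(label, children, head_index):
        """Turn a flat constituent into a spine of binary nodes
        with the head child at the bottom of the spine."""
        spine = children[head_index]
        # Attach right siblings first (innermost), then left siblings.
        for right in children[head_index + 1:]:
            spine = (label, spine, right)
        for left in reversed(children[:head_index]):
            spine = (label, left, spine)
        return spine

    # Example: a flat VP whose head is the verb.
    vp = binarise('VP', ['quickly', 'ate', 'the-apple'], head_index=1)
    print(vp)  # ('VP', 'quickly', ('VP', 'ate', 'the-apple'))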
Most existing wide-coverage statistical parsers use simple phrase-structure grammars whose
syntactic analyses fail to capture long-range dependencies, and therefore do not correspond to
directly interpretable semantic representations. By contrast, CCG is a grammar formalism
in which semantic representations that include long-range dependencies can be built directly
during the derivation of syntactic structure. These dependencies define the predicate-argument
structure of a sentence, and serve two purposes in this dissertation. First, the performance
of the parser can be evaluated according to how well it recovers these dependencies. In contrast
to purely syntactic evaluations, this yields a direct measure of the accuracy of the semantic
interpretations returned by the parser. Second, I propose a generative model that captures
the local and non-local dependencies in the predicate-argument structure, and investigate the
impact of modelling non-local dependencies in addition to local ones.
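As an illustration of the evaluation just described, recovery of word-word dependencies can be scored by precision, recall, and F-measure over the sets of gold-standard and parser-produced dependencies. The sketch below assumes a hypothetical representation of each dependency as a (head index, dependent index) pair; an undirected evaluation would additionally normalise the pair order:

    # Minimal sketch of unlabelled dependency evaluation over
    # (head index, dependent index) pairs; illustrative only.
    def dependency_f1(gold, predicted):
        gold, predicted = set(gold), set(predicted)
        correct = len(gold & predicted)
        precision = correct / len(predicted) if predicted else 0.0
        recall = correct / len(gold) if gold else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    # Example: the parser recovers two of three gold dependencies
    # and also predicts one incorrect dependency.
    gold = [(2, 1), (2, 3), (3, 4)]
    predicted = [(2, 1), (2, 3), (4, 3)]
    print(dependency_f1(gold, predicted))  # 0.666...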