Bayesian models of syntactic category acquisition
Date: 02/07/2013
Author: Frank, Stella Christina
Abstract
Discovering a word’s part of speech is an essential step in acquiring the grammar of
a language. In this thesis we examine a variety of computational Bayesian models
that use linguistic input available to children, in the form of transcribed child-directed
speech, to learn part of speech categories. Part of speech categories are characterised
by contextual (distributional/syntactic) and word-internal (morphological) similarity.
We assume that language learners are aware of these types of cues, and
investigate exactly how they can make use of them.
Firstly, we enrich the context of a standard model (the Bayesian Hidden Markov
Model) by adding sentence type to the wider distributional context. We show that children
are exposed to a much more diverse set of sentence types than is evident in standard
corpora used for NLP tasks, and previous work suggests that they are aware of the differences
between sentence types as signalled by prosody and pragmatics. Sentence type
affects local context distributions, and as such can be informative when relying on local
context for categorisation. Adding sentence type to the model improves performance,
depending on how it is integrated into our models. We discuss how to incorporate
novel features into the model structure we use in a flexible manner, and present a second
model type that learns to use sentence type as a distinguishing cue only when it is
informative.
Secondly, we add a model of morphological segmentation to the part of speech categorisation
model, in order to model joint learning of syntactic categories and morphology.
These two tasks are closely linked: categorising words into syntactic categories
is aided by morphological information, and finding morphological patterns in words is
aided by knowing the syntactic categories of those words. In our joint model, we find
improved performance relative to single-task baselines, but the nature of the improvement
depends on the morphological typology of the language being modelled. This
is the first token-based joint model of unsupervised morphology and part of speech
category learning of which we are aware.