Colson, Jean-Pierre
[UCL]
There has already been some interaction between computational linguistics and construction grammar. With Fluid Construction Grammar (FCG), Steels (1998, 2013) has developed a sophisticated computational model that implements a fully constructionist approach to parsing and language production. In FCG, constructions are modelled in sets and networks, which is compatible with the view taken by Cognitive Construction Grammar (Boas 2003, 2006, 2009, 2013; Ziem 2008) that language involves a complex network of constructions, ranging from abstract to partly schematic and specific constructions. As several authors have pointed out (Croft 2013, Stefanowitsch 2013), the constructicon may be conceived of as a probabilistic network of nodes and inheritance links, which also raises the question of the statistical associations underlying construction grammar. In particular, collostructional analysis (Gries 2007, 2013; Gries & Stefanowitsch 2003, 2004a, 2004b, 2010; Stefanowitsch 2006, 2013) has indicated that the attraction between various elements of constructions may be of a probabilistic nature. If this assumption is correct, it should be possible to use NLP for the automatic extraction of constructions, or at least for measuring the attraction between their component parts, be they of an abstract, schematic or specific nature. In this respect, edition 1.1 of the Parseme shared task, organized at Coling 2018 (Santa Fe, USA), made it possible to bridge the gap between research on the extraction of multiword expressions (MWEs), phraseology in the broad sense, and constructions. By bringing together techniques borrowed from Information Retrieval and Deep Learning, a new picture of the statistical nature of notions such as entrenchment, fixedness, association and idiomaticity is emerging.
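As an illustration of what "measuring the attraction between component parts" can look like in practice, the following is a minimal sketch using pointwise mutual information (PMI) computed from raw corpus counts. PMI is chosen here only because it is simple; it is not claimed to be the measure used in collostructional analysis or in the work presented. The function name `pmi` and all counts are hypothetical.

```python
import math


def pmi(pair_count, left_count, right_count, total):
    """Pointwise mutual information (base 2) between two elements.

    pair_count  : how often the two elements occur together
    left_count  : how often the first element occurs overall
    right_count : how often the second element occurs overall
    total       : corpus size (number of observation slots)
    """
    if pair_count == 0:
        return float("-inf")  # never observed together
    p_pair = pair_count / total
    p_left = left_count / total
    p_right = right_count / total
    return math.log2(p_pair / (p_left * p_right))


# Toy counts, invented for illustration only (not corpus results).
attracted = pmi(pair_count=50, left_count=200, right_count=100, total=1_000_000)
independent = pmi(pair_count=1, left_count=10, right_count=10, total=100)
```

A strongly positive score suggests that the two elements co-occur more often than chance would predict, while a score near zero suggests statistical independence.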
In particular, recent research on compositionality prediction (Cordeiro, Villavicencio, Idiart and Ramisch 2019) sheds new light on the notion of idiomaticity by means of a Distributional Semantic Model (DSM) using cosine similarity between vectors. While this approach is very promising for determining the idiomatic meaning of nominal compounds, its extension to longer n-grams, let alone to constructions, raises the practical question of computational cost and the theoretical issue of figurative language as a culture-specific or language-specific notion. In this paper we will build upon another model often used in Information Retrieval, metric clusters, and show how the low computational cost of this approach makes it possible to gather information on the statistical features of partly schematic constructions (constructional idioms) such as let alone / geschweige denn. As proposed in Colson (2017, 2018), we use a query likelihood model in order to implement databases with corpora of about 1 billion tokens in several languages. A new clustering algorithm is used to achieve phraseology extraction (idiomsearch.lsti.ucl.ac.be) as well as Chinese word segmentation (Colson 2018). We will argue that Chinese word segmentation offers very valuable feedback on the efficiency of construction extraction, because inter-individual agreement in establishing a gold set is much higher than for phraseology extraction. While it is still being supervised and tested on the basis of new data, our methodology already makes it possible to check for correlations between grammatical categories and constructions (e.g. negation in combination with let alone / geschweige denn, within a window of x tokens). For this purpose, we use (as in the case of Cordeiro et al. 2019) automatically tagged corpora in preference to parsed corpora, because of the smaller number of errors and the much lower computational cost.
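The window-based check mentioned above (negation within x tokens of let alone) can be sketched as follows. This is only an illustrative approximation under stated assumptions: the abstract does not describe the actual clustering algorithm, so the inverse-distance weighting is loosely inspired by the way metric clusters reward proximity, and the names `NEGATIONS`, `find_phrase` and `negation_attraction` are hypothetical. A tagged corpus would normally identify negation by its tag; a small word list stands in for that here.

```python
# Assumed stand-in for a "negation" tag in a tagged corpus.
NEGATIONS = {"not", "n't", "never", "no", "nobody", "nothing"}


def find_phrase(tokens, phrase):
    """Return the start index of every occurrence of phrase in tokens."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase]


def negation_attraction(tokens, phrase, window):
    """Count occurrences of phrase preceded by a negation within `window` tokens.

    Returns (hits, occurrences, score), where score adds 1/distance for the
    closest preceding negation, so nearer negations weigh more.
    """
    hits = 0
    score = 0.0
    positions = find_phrase(tokens, phrase)
    for start in positions:
        lo = max(0, start - window)
        distances = [start - j for j in range(lo, start) if tokens[j] in NEGATIONS]
        if distances:
            hits += 1
            score += 1.0 / min(distances)
    return hits, len(positions), score


# Invented example sentence, not corpus data.
tokens = "he can not walk , let alone run . she sings , let alone dances".split()
hits, occurrences, score = negation_attraction(tokens, ["let", "alone"], window=5)
```

In this toy input, only the first of the two occurrences has a negation in its left window, three tokens away. Varying `window` corresponds to the "window of x tokens" in the text.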
Bibliographic reference
Colson, Jean-Pierre. Automatic extraction of constructions and statistical associations in the constructicon: an exploratory methodology. Towards a Multilingual Constructicon: Issues, Approaches, Perspectives (University of Düsseldorf, 04/12/2019 to 06/12/2019).
Permanent URL
http://hdl.handle.net/2078.1/224142