Diccionari electrònic bilingüe català>anglés de locucions referencials idiomàtiques de somatismes

Please use this identifier to cite or link to this item: http://hdl.handle.net/10045/122159
Item information
Title: Diccionari electrònic bilingüe català>anglés de locucions referencials idiomàtiques de somatismes
Author(s): Escolano Marín, Xènia
Research supervisor(s): Sánchez-López, Elena | Martines, Vicent
Centre, Department or Service: Universidad de Alicante. Departamento de Filología Catalana
Keywords: Electronic Dictionary | Phraseology | Idioms | Somatisms | Lexicogrammar | Syntactic-Semantic Tagging | Corpus-Based Methodology | Interlinguistic Equivalents
Knowledge area(s): Catalan Philology
Date created: 2021
Date published: 2021
Defence date: 22 Jul 2021
Publisher: Universidad de Alicante
Abstract: The ultimate aim of this research is to design a Catalan>English bilingual electronic dictionary of somatic idioms. To do so, we undertake a semasiological characterisation of somatic idioms to determine their semantic value and morphosyntactic combinatorial possibilities, going from language to abstraction and identifying the differences in conceptualisation between Catalan and English and their equivalence relationship on the basis of their occurrences in corpora. The methodology used to describe these somatisms is based on lexicogrammar, applied in a bottom-up manner. The theory of lexicogrammar (M. Gross 1975, 1981) holds that every elementary phrase is constituted by at least one first-order predicate that introduces its arguments, represented by nouns or phrases. For example, in the phrase 'Luc admire le courage de Léa', we find the verbal predicate 'admirer' with its arguments 'Luc' and 'Léa'. This phrase is morphologically equivalent to 'Luc est admiratif (pour + devant) le courage de Léa', with an adjectival predicate, and to 'Luc a de l'admiration pour le courage de Léa', with a nominal predicate, since there is no change in the argument structure (Le Pesant & Mathieu-Colas 1998: 8). This entails a systematic description of the syntactic and semantic properties of verbs, predicate nouns and adjectives in the form of tables that represent a class of lexical elements, characterised in terms of syntactic-semantic features (such as human, animal, plant, concrete, abstract, locative and temporal), which correspond to a certain syntactic category with a series of common distributional and transformational properties. However, this description is insufficient for the complete automatic treatment of language; this is why, in 1992, G. Gross added to lexicogrammar the notion of object classes: semantic groups defined by the syntactic relations they maintain with one or more classes of verbs, called appropriate predicates (Gross 2012: 101).
This exhaustive description of the language has made it possible to consider certain linguistic phenomena, such as fixation, for the automatic treatment of language. Considering that computerised corpora play a very prominent role in usage-based linguistics, a trend that considers grammar and use to be closely interrelated, lexicogrammar can serve as a theoretical model particularly suitable for detecting and analysing phraseological units (PhUs) in corpora. In particular, we focus on the description of the 50 most frequent idioms in Catalan containing one of the five most common anthropomorphic somatic lexemes in all languages: hand, head, heart, eye and ear (Mellado 2004). To identify these 50 idioms (10 per lexeme), we retrieve all the occurrences of each of the five lexemes from the Corpus Textual Informatitzat de la Llengua Catalana (CTILC) (~52M words, with texts from 1833-1988). We then carry out a semi-automatic extraction of the different combinations of the most frequent bigrams (candidate idioms for the dictionary) for each somatic lexeme using the software Metaconcor. For example, for mà (hand), we obtain: a mà (at hand), a mà (by hand), entre mans (on [one’s] hands), de mà en mà (from hand to hand), mà d’obra (labour force), mà dura (firm hand), picar de mans (to clap [one’s hands]), lligar de peus i mans (to tie [sb’s] hands and feet), a mans plenes (liberally) and rentar-se’n les mans (to wash [one’s] hands [of]). We also consider the most frequent verbal and nominal co-occurrences of each bigram. Once we have translated the Catalan idioms into English, we apply the same step to the English equivalents, based on their occurrences in the British National Corpus (BNC) (~100M words, with texts from 1985-1994).
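The bigram-extraction step described above can be illustrated with a minimal sketch. It does not reproduce Metaconcor or the CTILC: the function name `frequent_bigrams`, the simple `\w+` tokenisation and the sample sentence are all assumptions made for illustration only.

```python
import re
from collections import Counter


def frequent_bigrams(text, lexeme_forms, top_n=10):
    """Return the top_n most frequent bigrams containing a given lexeme form.

    Hypothetical helper: a plain word tokeniser stands in for the
    semi-automatic extraction performed with Metaconcor.
    """
    tokens = re.findall(r"\w+", text.lower())
    forms = {form.lower() for form in lexeme_forms}
    counts = Counter(
        (left, right)
        for left, right in zip(tokens, tokens[1:])
        if left in forms or right in forms
    )
    return counts.most_common(top_n)


# Invented sample text (not taken from the CTILC).
sample = "Ho tenia a mà. Ho va fer a mà, i el llibre passava de mà en mà."
result = frequent_bigrams(sample, ["mà", "mans"])
print(result[0])
```

In a real workflow the candidate list would still need manual filtering, since a raw bigram such as a mà covers two distinct idioms (at hand, by hand) that only context can separate.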
In order to design the dictionary, we undertake a syntactic-semantic description of the Catalan idioms and their equivalents in English, indicating their argument structures and semantic values according to the object class to which we ascribe them (the one referring to the values of <somatisms>), based on the occurrences of these units offered by the CTILC for Catalan and the BNC for English. This analysis is expressed through a tagging recognised by automatic language processing systems, based on the one used by the Laboratoire de Linguistique Informatique (LLI), now LDI (Lexiques, Dictionnaires, Informatique), at Paris 13 University, whose dictionaries follow the system of the Laboratoire d’Automatique Documentaire et Linguistique (LADL) (Paris 7), founded by M. Gross in 1968, and incorporate G. Gross’ notion of object classes (1992). Having identified the most frequent idioms and their English equivalents, as well as their distributional (syntagmatic relations) and transformational possibilities (paradigmatic relations), we account for their semantic relations (for both languages): conceptual variations (polysemy), intersynonymic variants (relations of synonymy), their possible relations of antonymy and hyperonymy, and their paradigmatic variants (according to their occurrences in the corpora used). All this information is registered in the dictionary in the form of four coordinated files with fields of different kinds (cf. 2): two files containing the arguments of the idioms (one for Catalan and one for English) and two files containing the predicates, i.e. the idioms as entries (one for Catalan and one for English).
Considering the relevance object classes may have in fixed expressions, our dictionary mostly contains idioms (predicates) that combine invariable (argument human nouns) and variable (anthropomorphic somatic lexemes) elements (arguments): for example, to wash <one’s> hands (of)/C:<so-ma:elre>/G:v/N0:<hum>/N1:<so-ma:elre>/N3:<ina>/Ca:rentar-se les mans (d’) <alguna cosa>. Here N0 refers to the subject (a human) and N1 to the first complement (direct object) in the argument structure of this idiom with the somatic lexeme (so) hand, which in this case has a semantic value of refusal of responsibility (elre). In other cases this lexeme appears in idioms that may evoke, among others, proximity or facility [at hand], manual labour [by hand], activity [on (one’s) hands], severity [firm hand], itinerancy [from hand to hand], human resources [labour force], approval, enthusiasm or attention [to clap (one’s hands)], immobility or repression [to tie (sb’s) hands and feet] and abundance [liberally], all of which will be reflected in the argument file. N3 is the third complement, which here corresponds to a concrete inanimate object (<ina>) (e.g. “I’m entirely opposed to the er (pause) the idea that they should wash their hands of their (pause) er, obligations er, the nineteen sixty eight act (pause) as” [BNC]). In this case there is no N2 complement, which usually refers to an indirect object. Ca indicates the Catalan equivalent of the idiom. Opting for an electronic phraseological dictionary with a semasiological approach implies, on the one hand, that the PhUs it contains can be found more easily, since instead of starting from specific concepts, it starts from the semantic values of the units (idioms) grouped in a linguistically well-defined object class.
On the other hand, it offers the advantage of being suitable for Natural Language Processing, as it has a coding recognised by automatic language processing systems and incorporates numerous fields with morphological, syntactic-semantic (the most common distributional and transformational properties) and, where applicable, diasystematic information. It also contains a specific field for translations into other languages (bilingual or multilingual). Therefore, these repertoires have a dual function: decoding and encoding information. In fact, one of the main advantages of semasiological electronic dictionaries over more traditional ones is that they present each of the argument structures of a predicate in a separate entry, which allows the predicate to be monosemised. Thus, each usage of a predicate is conceived as a lexical unit to which a description is assigned, a sine qua non for machine translation. This is particularly relevant to hybrid systems, which combine statistics with syntax and semantics (e.g. Systran and Sadaw). The eventual product derived from this thesis, as well as the linguistic data (methodology, examples, tagging and files), can be considered unprecedented. Until now, lexicogrammar had been implemented on the basis of linguistic intuition, which could leave the creation of object classes somewhat biased by the linguist’s idiolect. Using a syntactic-semantic tagging based on the above-mentioned methodology enables precision and objectivity, since it starts from real language instances (occurrences of these idioms in corpora). All in all, the usability and replicability of the linguistic data provided in the research may offer a wide range of possibilities, since this tagging could be implemented in linguistic analysis tools (e.g. in the form of tags in the “part of speech” label if idioms are used) and the model could be extended to expressions from other lexico-phraseological fields, different types of PhUs and other languages.
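As an illustration of how the tagged entry quoted in the abstract for to wash <one’s> hands (of) might be read by a program, here is a minimal sketch. It assumes, from that single example only, that fields are separated by "/" and that each field has the form key:value; the helper names `parse_entry` and `split_class` are hypothetical and do not describe the actual file format of the thesis or of the LLI/LADL dictionaries.

```python
def parse_entry(line):
    """Split one tagged entry into a dict of fields.

    Assumption: "/" separates fields and the first ":" in a field
    separates its key from its value (inferred from one example).
    """
    lemma, *fields = line.split("/")
    parsed = {"lemma": lemma.strip()}
    for field in fields:
        key, _, value = field.partition(":")
        parsed[key.strip()] = value.strip()
    return parsed


def split_class(tag):
    """Split a tag like '<so-ma:elre>' into (lexeme code, semantic value)."""
    code, _, value = tag.strip("<>").partition(":")
    return code, value


entry = (
    "to wash <one’s> hands (of)/C:<so-ma:elre>/G:v/N0:<hum>"
    "/N1:<so-ma:elre>/N3:<ina>/Ca:rentar-se les mans (d’) <alguna cosa>"
)
parsed = parse_entry(entry)
print(parsed["N0"], split_class(parsed["C"]))
```

Because each argument structure is stored as its own entry, a lookup keyed on the lemma together with its class tag would return a single, monosemised reading rather than a list of senses.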
Sponsor(s): This doctoral thesis was funded by the predoctoral research contract Ayudas para la Formación de Profesorado Universitario (Ref. FPU17/0032), awarded by the Spanish Ministerio de Educación, Cultura y Deporte (now the Ministerio de Ciencia, Innovación y Universidades). The thesis project was registered within the Institut Superior d’Investigació Cooperativa IVITRA [ISIC-IVITRA] (Programa per a la Constitució i Acreditació d’Instituts Superiors d’Investigació Cooperativa d’Excel·lència de la Generalitat Valenciana, Ref. ISIC/012/042) and was carried out within the framework of the following projects, networks and research groups: «Variación y cambio lingüístico en catalán. Una aproximación diacrónica según la Lingüística de Corpus» (MICINUN, Ref. PGC2018-099399-B-100371); (IEC, Ref. PRO2018-S04-MARTINES); Grup d’Investigació VIGROB-125 of the UA; the research network on innovation in university teaching «Lingüística de Corpus i Mediterrània intercultural: investigació educativa per a l’aplicació de la Lingüística de Corpus en entorns multilingües diacrònics. Aplicacions del Metacorpus CIMTAC» (Institut de Ciències de l’Educació of the UA, Ref. 4581-2018); and the Grup d’Investigació en Tecnologia Educativa en Història de la Cultura, Diacronia lingüística i Traducció (Universitat d’Alacant, Ref. GITE-09009-UA).
URI: http://hdl.handle.net/10045/122159
Language: cat
Type: info:eu-repo/semantics/doctoralThesis
Rights: Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License
Appears in collections: Doctoral theses

Files in this item:
File: tesi_xenia_escolano_marin.pdf | Size: 13.83 MB | Format: Adobe PDF


This item is licensed under a Creative Commons License