Title
The International Comparable Corpus: Challenges in building multilingual spoken and written comparable corpora
Author
Anna Čermáková
Charles University
Author
Jarmo Jantunen
University of Jyväskylä
Author
Tommi Jauhiainen
University of Helsinki
Abstract
This paper reports on the efforts of twelve national teams to build the International Comparable Corpus (ICC; https://korpus.cz/icc), which will contain highly comparable datasets of spoken, written and electronic registers. The languages currently covered are Czech, Finnish, French, German, Irish, Italian, Norwegian, Polish, Slovak, Swedish and, more recently, Chinese, as well as English, which serves as the pivot language. The goal of the project is to provide much-needed data for contrastive corpus-based linguistics. The ICC is committed to re-using existing multilingual resources as much as possible, and its design is modelled, with various adjustments, on the International Corpus of English (ICE). As such, ICC will contain approximately the same balance as ICE: 40 percent written language and 60 percent spoken language, distributed across 27 different text types and contexts. A number of issues encountered by the project teams are discussed, ranging from copyright and data sustainability to technical advances in data distribution.
Keywords
ICC corpus; contrastive linguistics; comparable corpus; ICE corpus; data sustainability; copyright
Object type
Language
English [eng]
Persistent identifier
https://phaidra.univie.ac.at/o:1602684
Published in
Title
Research in Corpus Linguistics
Volume
9
Issue
1
ISSN
2243-4712
Publication date
2021
Start page
89
End page
103
Publisher
Research in Corpus Linguistics
Accessibility
Rights statement
(c) 2021 Research in Corpus Linguistics
