NoSta-D -- Korpus von Nicht-Standardvarietäten des Deutschen (Corpus of Non-Standard Varieties of German)

Creators: Dipper, Stefanie¹; Lüdeling, Anke¹; Reznicek, Marc¹

1. Humboldt-Universität zu Berlin

Contact person: Lüdeling, Anke¹

Description

Corpus of different varieties of German. Each subcorpus is a subset of another corpus, named in parentheses: (1) historical data (Anselm Corpus), chat data (Dortmund Chat Corpus), learner data (Falko), spoken data (BeMaTaC), and literary prose (Kafka); (2) newspaper texts (TüBa-D/Z). The chat, spoken, prose, and newspaper subcorpora contain approximately 5,000 tokens each, the historical subcorpus 1,000 tokens, and the learner subcorpus 2,900 tokens.

Each subcorpus is annotated with the following information: token and sentence boundaries; normalization; POS tags and dependency relations; named entities; coreference.

Other (German)

Legal ownership: Seminar für Sprachwissenschaft, Universität Tübingen (TüBa-D/Z); Public domain (Kafka); Simone Schultz-Balluff & Klaus-Peter Wegera (Anselm Corpus); Michael Beißwenger & Angelika Storrer (Dortmund Chat Corpus); Institut für deutsche Sprache und Linguistik, HU Berlin (Falko; BeMaTaC); and Anke Lüdeling, Stefanie Dipper, Marc Reznicek, Burkhard Dietterle (Annotations).

Files (18.1 MiB)

Name	MD5	Size
DipperEtAltoappearNOSDAC.pdf	7fcadcb833efff6999dee4afcce53324	111.3 KiB
CMDI.xml	819283cbca1f3b8cfc8e038458a876cd	19.1 KiB
NoSta-D.tar.gz	462c83bce9c94b8a1f9c0dcb511683da	18.0 MiB
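The MD5 checksums listed above can be used to check that a download is complete and unmodified. The following is a minimal sketch, assuming the three files have been saved under their listed names in a single local directory. The function names md5_of_file and verify, and the status strings they return, are illustrative choices and not part of the record.

```python
import hashlib
from pathlib import Path

# File names and MD5 values copied from the file listing of this record.
EXPECTED_MD5 = {
    "DipperEtAltoappearNOSDAC.pdf": "7fcadcb833efff6999dee4afcce53324",
    "CMDI.xml": "819283cbca1f3b8cfc8e038458a876cd",
    "NoSta-D.tar.gz": "462c83bce9c94b8a1f9c0dcb511683da",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory):
    """Map each listed file name to 'ok', 'mismatch', or 'missing'.

    The directory layout is an assumption: all files are expected
    directly inside `directory` under the names used in the listing.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if not path.is_file():
            results[name] = "missing"
        elif md5_of_file(path) == expected:
            results[name] = "ok"
        else:
            results[name] = "mismatch"
    return results
```

Calling verify(".") in the download directory returns one status per file. A "mismatch" most likely indicates an incomplete or corrupted download, in which case the file should be fetched again.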

Additional details