Published February 25, 2014 | Version v1
Dataset Open

NoSta-D -- Korpus von Nicht-Standardvarietäten des Deutschen

  • 1. Humboldt-Universität zu Berlin
Contact person:
Lüdeling, Anke (1)
  • 1. Humboldt-Universität zu Berlin

Description

Corpus of different varieties of German. Each subcorpus is a subset of another corpus, named in parentheses: (1) historical data (Anselm Corpus), chat data (Dortmund Chat Corpus), learner data (Falko), spoken data (BeMaTaC), and literary prose (Kafka); (2) newspaper texts (TüBa-D/Z). The chat, spoken, prose, and newspaper subcorpora each contain approximately 5,000 tokens; the historical subcorpus contains 1,000 tokens and the learner subcorpus 2,900 tokens.

Each subcorpus is annotated with the following information: token and sentence boundaries; normalization; POS tags and dependency relations; named entities; coreference.

Other (German)

Legal ownership: Seminar für Sprachwissenschaft, Universität Tübingen (TüBa-D/Z); Public domain (Kafka); Simone Schultz-Balluff & Klaus-Peter Wegera (Anselm Corpus); Michael Beißwenger & Angelika Storrer (Dortmund Chat Corpus); Institut für deutsche Sprache und Linguistik, HU Berlin (Falko; BeMaTaC); and Anke Lüdeling, Stefanie Dipper, Marc Reznicek, Burkhard Dietterle (Annotations).

Files

DipperEtAltoappearNOSDAC.pdf
Files (18.1 MiB)
Checksum Size
md5:7fcadcb833efff6999dee4afcce53324 111.3 KiB
md5:819283cbca1f3b8cfc8e038458a876cd 19.1 KiB
md5:462c83bce9c94b8a1f9c0dcb511683da 18.0 MiB
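
The MD5 checksums listed for these files can be used to check that a download arrived intact. The following is a minimal sketch in Python, not part of the dataset itself: the function names md5_of_file and matches_listing are illustrative assumptions, and the local file path is whatever name the file was saved under, since the listing gives checksums rather than file names.

```python
import hashlib

# Checksums copied from the file listing, without the "md5:" prefix.
LISTED_MD5 = {
    "7fcadcb833efff6999dee4afcce53324",
    "819283cbca1f3b8cfc8e038458a876cd",
    "462c83bce9c94b8a1f9c0dcb511683da",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of the file at `path`.

    The file is read in chunks so that the 18.0 MiB archive does not
    have to be held in memory all at once.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path):
    """Return True if the file's MD5 digest is one of the listed checksums."""
    return md5_of_file(path) in LISTED_MD5
```

Calling matches_listing on a downloaded file returns True only if its digest equals one of the three listed values. A False result suggests an incomplete or altered download, or a file that is not part of this record.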

Additional details

Created:
November 10, 2023
Modified:
November 13, 2023