NoSta-D -- Korpus von Nicht-Standardvarietäten des Deutschen
Description
Corpus of different varieties of German. The subcorpora are subsets of other corpora, named in parentheses: (1) historical data (Anselm Corpus), chat data (Dortmund Chat Corpus), learner data (Falko), spoken data (BeMaTaC), and literary prose (Kafka); (2) newspaper texts (TüBa-D/Z). The chat, spoken, literary prose, and newspaper subcorpora each contain approximately 5,000 tokens; the historical subcorpus contains 1,000 tokens and the learner subcorpus 2,900 tokens.
Each subcorpus is annotated with the following information: token and sentence boundaries; normalization; POS tags and dependency relations; named entities; coreference.
Other (German)
Legal ownership: Seminar für Sprachwissenschaft, Universität Tübingen (TüBa-D/Z); Public domain (Kafka); Simone Schultz-Balluff & Klaus-Peter Wegera (Anselm Corpus); Michael Beißwenger & Angelika Storrer (Dortmund Chat Corpus); Institut für deutsche Sprache und Linguistik, HU Berlin (Falko; BeMaTaC); and Anke Lüdeling, Stefanie Dipper, Marc Reznicek, Burkhard Dietterle (Annotations).
Additional details
- Alternative title (English)
- NoSta-D -- A corpus of non-standard varieties of German
- Accuracy
Not specified.
- Completeness
Not specified.
- Conformity
Not specified.
- Consistency
Not specified.
- Credibility
Not specified.
- Processability
Not specified.
- Relevance
Not specified.
- Timeliness
Not specified.
- Understandability
Not specified.