Published November 2009 | Version v1
Dataset Restricted

Tübinger Baumbank des Deutschen / Zeitungskorpus (TüBa-D/Z)


The Tübingen Treebank of German / Written Language (TüBa-D/Z) is a syntactically annotated corpus based on the newspaper "die tageszeitung" (taz). The annotation is (largely) theory-independent. The annotation scheme distinguishes four levels of syntactic constituency: the lexical level, the phrasal level, the level of topological fields, and the sentence level.

Other (English)

The Tübingen Treebank of Written German / Newspaper Corpus (TüBa-D/Z) is a syntactically annotated corpus based on the newspaper "die tageszeitung" (taz). The annotation is as theory-independent as possible. The schema distinguishes four levels of syntactic constituency: the lexical level, the phrasal level, topological fields, and sentences.

Technical info (English)

Stuttgart-Tübingen TagSet (STTS): All tokens are POS-annotated with the Stuttgart-Tübingen TagSet (STTS). The STTS tag set consists of 54 tags. They are used according to their definitions, except that the tag "PAV" was replaced with "PROP".

Annotation Level Type: Morphological information is added for each token, using a tag set developed specifically for TüBa-D/Z. It consists of 435 tags, each encoding one combination of morpho-syntactic feature values that can be assigned to a token, e.g. apn2 = accusative plural neuter 2nd person.

Simple named entities, e.g. "Berlin", are marked on the POS level with the tag "NE". Complex named entities are marked on the syntactic level with the node label "EN-ADD" or with the secondary edge label "EN". Their categorization into classes of persons, places and organizations is in progress. The annotation on the POS level is performed semi-automatically. On the syntactic level, the annotation is performed manually.

Referential relations: 7 kinds of referential relations are annotated: coreferential, anaphoric, cataphoric, bound, split antecedent, instance of, and expletive. Additionally, inherent reflexives can be distinguished, as they do not have any reference. Coreference relations are annotated for each newspaper article in the treebank.

The corpus is also annotated for grammatical functions.

Used annotation tool: @nnotate. Annotate supports convenient and efficient semi-automatic annotation of corpus data.

Used annotation tool: PALinkA. PALinkA was used for the annotation of coreference relations.
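
The morphological tag example given above (apn2 = accusative plural neuter 2nd person) suggests a positional encoding of case, number, gender and person. The following is a minimal sketch of how such a tag might be expanded into a readable description. The letter codes and the function name `decode_morph_tag` are assumptions inferred from that single example, not the documented TüBa-D/Z tag inventory, and only four-character tags of this shape are handled.

```python
# Assumed letter codes, inferred from the single example "apn2".
# The actual TüBa-D/Z morphological tag set (435 tags) may use other codes.
CASE = {"n": "nominative", "g": "genitive", "d": "dative", "a": "accusative"}
NUMBER = {"s": "singular", "p": "plural"}
GENDER = {"m": "masculine", "f": "feminine", "n": "neuter"}
PERSON = {"1": "1st person", "2": "2nd person", "3": "3rd person"}

# Position i of the tag is looked up in FIELDS[i].
FIELDS = (CASE, NUMBER, GENDER, PERSON)


def decode_morph_tag(tag):
    """Expand a four-character tag such as "apn2" into a description.

    Raises ValueError if the tag has a different length or contains a
    code that is not in the assumed tables above.
    """
    if len(tag) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} characters, got {tag!r}")
    parts = []
    for char, table in zip(tag, FIELDS):
        if char not in table:
            raise ValueError(f"unknown code {char!r} in tag {tag!r}")
        parts.append(table[char])
    return " ".join(parts)


if __name__ == "__main__":
    print(decode_morph_tag("apn2"))
```

Raising an error on unknown codes, rather than guessing, keeps the sketch from silently mislabeling tags that fall outside the assumed pattern.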


November 9, 2023
