SciTex Corpora

Description

The English Scientific Text Corpus (SciTex) is composed of English scientific research articles and covers nine scientific domains drawn from 38 sources. SciTex contains full journal articles from two time periods, the 1970s/80s (SaSciTex) and the early 2000s (DaSciTex), and has a three-way partition: (1) A-subcorpus: computer science (A); (2) B-subcorpus: computational linguistics (B1), bioinformatics (B2), digital construction (B3) and microelectronics (B4); and (3) C-subcorpus: linguistics (C1), biology (C2), mechanical engineering (C3) and electrical engineering (C4). The A- and C-subcorpora consist of what we call seed disciplines, while the B-subcorpus consists of what we call contact disciplines.

Logo of the SciTex corpus, resembling a 4-leafed clover

The sources for the corpus are full journal articles; for each discipline at least two different journals were selected. If journals did not reach back to the 1970s/1980s, other journal sources that matched the academic discipline were used. The sources were collected as PDF files and converted to plain text. Altogether, SciTex contains around 35 million tokens (i.e., approx. 17.5 million per time slice). A smaller cross-sectional subcorpus of around two million tokens, i.e., one million per time slice, was created and cleaned of erroneous data produced by the OCR conversion. Additionally, formulas (mostly from computer science) and examples (as in linguistics) were annotated with particular tags so that they can be excluded from linguistic searches. This extensive procedure was important for obtaining high-quality text data for at least a portion of the corpus, which can then be used for detailed analyses that may require further (manual) annotation.

The corpus is annotated with word, lemma and part-of-speech attributes at the token level. Each text is annotated with meta-information including title, author, year and journal of publication, and scientific register. Further, the corpus is annotated with information about document structure (e.g., section, section type and title, pages). Linguistic annotation includes, e.g., formulaic expressions, evaluative patterns, passives and modals.

We used a dedicated pipeline for (1) conversion of the corpus from text files to XML files while maintaining information about document structure (e.g., sections, paragraphs, etc.), (2) tokenization, lemmatization and part-of-speech tagging of the corpus files using the TreeTagger (Schmid 1994), (3) transformation of the corpus files into a verticalized text format, and (4) encoding of the corpus for query by the Corpus Query Processor (CQP; Evert 2005).
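As an illustration of step (3), the following is a minimal sketch of what a verticalized text format typically looks like: one token per line with tab-separated attributes, and structural tags on lines of their own. It is not the pipeline used for SciTex. The function name, the `text` and `section` tag names, the attribute order (word, part of speech, lemma) and the sample tokens are all assumptions made for this example.

```python
from xml.sax.saxutils import quoteattr


def verticalize(doc_id, sections):
    """Return the lines of a verticalized document (hypothetical helper).

    sections is a list of (section_type, tokens) pairs; each token is a
    (word, pos, lemma) tuple, e.g. as produced by a part-of-speech tagger.
    """
    # Structural tags go on their own lines; the tag names are assumptions.
    lines = [f"<text id={quoteattr(doc_id)}>"]
    for section_type, tokens in sections:
        lines.append(f"<section type={quoteattr(section_type)}>")
        for word, pos, lemma in tokens:
            # One token per line, attributes separated by tabs.
            lines.append("\t".join((word, pos, lemma)))
        lines.append("</section>")
    lines.append("</text>")
    return lines


# Illustrative input only; the tags follow the style of the English
# TreeTagger tagset but were written by hand for this sketch.
sample = [
    ("abstract", [
        ("Results", "NNS", "result"),
        ("are", "VBP", "be"),
        ("shown", "VVN", "show"),
        (".", "SENT", "."),
    ]),
]

vrt = verticalize("doc001", sample)
print("\n".join(vrt))
```

Keeping structural tags on separate lines lets document structure such as sections survive alongside the token-level attributes, so both can be encoded for querying in step (4).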

DaSciTex

Persistent identifier: http://hdl.handle.net/11858/00-246C-0000-0005-BD0F-D

References

Teich, E. and Fankhauser, P. (2009). Exploring a corpus of scientific text using data mining. Language and Computers 71(1), 233–247.

Teich, E. and Holtz, M. (2009). Scientific registers in contact. An exploration of the lexicogrammatical properties of interdisciplinary discourses. International Journal of Corpus Linguistics, 14(4), 524–548.

SaSciTex

Persistent identifier: http://hdl.handle.net/11858/00-246C-0000-0023-8CF9-6

References

Degaetano-Ortlieb, S., Kermes, H., Lapshinova-Koltunski, E., and Teich, E. (2013). SciTex - A Diachronic Corpus for Analyzing the Development of Scientific Registers. In P. Bennett, M. Durrell, S. Scheible, and R. J. Whitt (eds.), New Methods in Historical Corpus Linguistics (Vol. 3, pp. 93–104). Narr.

Teich, E., Degaetano-Ortlieb, S., Fankhauser, P., Kermes, H. and Lapshinova-Koltunski, E. (2016). The Linguistic Construal of Disciplinarity: A Data Mining Approach Using Register Features. Journal of the Association for Information Science and Technology (JASIST) 67 (7), 1668–1678.

Availability

The corpus consists of copyrighted material and is not publicly available. For (restricted) access, contact Prof. Dr. Elke Teich, Universität des Saarlandes, Saarbrücken.

Funding

Supported by Deutsche Forschungsgemeinschaft (DFG), grant TE 198/1–1 LingPro – Linguistische Profile interdisziplinärer Register (Linguistic Profiles of Interdisciplinary Registers) (2006–2010)

Supported by Deutsche Forschungsgemeinschaft (DFG), grant TE 198/1–2 Regico – Register in Kontakt (Registers in Contact) (2011–2014)
