The Stockholm University Strindberg Corpus (SUSC) consists of seven novels by August Strindberg, annotated with parts of speech, morphological analysis, and lemmas. The corpus is freely available.
SUSC consists of approximately 400,000 tokens annotated with parts of speech, morphological analysis, and lemmas, using the Stockholm-Umeå Corpus tag set in PAROLE format. The annotated texts have been converted to XML, which makes the corpus searchable with corpus analysis tools such as Xaira. This allows, for example, concordance searches by word form, part of speech, and/or lemma, as well as pattern matching and collocation extraction.
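Because the corpus is distributed as XML, the same kind of concordance search can also be done with a general-purpose XML parser. The following is a minimal sketch in Python, not a description of the actual SUSC files: the element names (`s`, `w`), the attribute names (`pos`, `lemma`), the simplified tag values, and the `concordance` helper are all assumptions made for illustration, and would need to be adapted to the real markup and to the PAROLE-format tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The element and attribute names (<s>, <w>, pos, lemma)
# and the simplified tag values are assumptions, not the documented SUSC schema.
SAMPLE = """<text>
  <s>
    <w pos="PN" lemma="jag">Jag</w>
    <w pos="VB" lemma="vara">var</w>
    <w pos="JJ" lemma="ensam">ensam</w>
  </s>
  <s>
    <w pos="PN" lemma="han">Han</w>
    <w pos="VB" lemma="vara">är</w>
    <w pos="JJ" lemma="ensam">ensam</w>
  </s>
</text>"""


def concordance(root, lemma=None, pos=None, wordform=None, width=2):
    """Return (left context, keyword, right context) for each matching token.

    A token matches when it satisfies every criterion that is not None.
    Context is limited to `width` tokens on each side within the sentence.
    """
    hits = []
    for sentence in root.iter("s"):
        tokens = list(sentence.iter("w"))
        for i, tok in enumerate(tokens):
            if lemma is not None and tok.get("lemma") != lemma:
                continue
            if pos is not None and tok.get("pos") != pos:
                continue
            if wordform is not None and (tok.text or "").lower() != wordform.lower():
                continue
            left = " ".join(t.text or "" for t in tokens[max(0, i - width):i])
            right = " ".join(t.text or "" for t in tokens[i + 1:i + 1 + width])
            hits.append((left, tok.text, right))
    return hits


root = ET.fromstring(SAMPLE)
# Keyword-in-context lines for every verb form of the lemma "vara".
for left, key, right in concordance(root, lemma="vara", pos="VB"):
    print(f"{left:>12} [{key}] {right}")
```

Combining the three criteria in one function mirrors the "word form, part of speech, and/or lemma" search described above; for the real corpus, `ET.parse` on a downloaded file would replace the inline sample.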
The current version of the corpus includes seven works that can be classified as autobiographical:
- Tjänstekvinnans son (The son of a servant, 1886-87)
- Han och hon (He and she, 1919)
- Inferno (Inferno, 1897)
- Legender and Jakob brottas (Legends and Jacob wrestles, 1898)
- Fagervik och Skamsund (Fair haven and Foulstrand, 1902)
- Ensam (Alone, 1903)
We are aware of three other electronic collections of Strindberg’s works: Projekt Runeberg, Litteraturbanken, and Språkbanken. While these are valuable resources, SUSC is an important addition because, unlike the first two, it is linguistically annotated, and unlike the third, its data can be downloaded and thus fully inspected and processed with the researcher’s software of choice. Even more importantly, researchers can add their own analyses to the corpus as new layers of annotation.
The 18th International Strindberg Conference, Stockholm University, May 31–June 3, 2012.