Finland Swedish Text Corpus (FISC)
Department of Scandinavian Languages and Literature
FIN-00014 University of Helsinki
- Mirja Saari (Project Director)
- Jan Lindstrom (Principal Researcher)
- Juhani Birn (Researcher 1994)
- Hannu Aarnitukia (Researcher 1991-1993)
- Eva Hannus (Research Assistant 1991-1992)
- Maria Hagglund (Research Assistant 1995)
- Annika Salervo (Research Assistant 1993-1994)
- The corpus contains about 2.5 million running words of modern written
Swedish texts published in Finland (1990s). The kernel corpus is
accompanied by a minor section on spoken language.
- The following text types are included and structured as separate
sections of the corpus:
- literature (fiction)
- non-literary prose (non-fiction)
- official documents (legal texts, administration)
- casual conversations
- The texts in the corpus contain embedded tags indicating bibliographic
details, headings, captions, paragraphs and other significant textual
features. The tags conform to the TEI recommendations.
- The corpus is located at the Department of Linguistics at the University
of Helsinki on a language corpus server (the UHLCS server) running a
UNIX operating system.
- The language corpus server offers tools for processing the data and
search facilities, including concordance listings (e.g. KWIC) and a
morphological analyzer for Swedish texts (SWETWOL). Conventional UNIX
commands are, of course, available as well.
- If you have access to the Internet, you are in practice able to contact
the language corpus server, whether your personal terminal is a PC, a
Macintosh, or a UNIX machine. The preferred protocol is Telnet.
- Anyone who wishes to access the corpus is requested to contact the
project personnel. Users need a personal password to enter the University
of Helsinki Language Corpus Server.
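
To illustrate the kind of concordance listing (KWIC, keyword in context) mentioned above, here is a minimal sketch in Python. It is not the tool installed on the language corpus server; the function name kwic, the width parameter and the sample sentence are all assumptions made for this illustration, and the tokenization is deliberately naive.

```python
import re


def kwic(text, keyword, width=3):
    """Return keyword-in-context lines for every hit of `keyword`.

    Hypothetical helper for illustration only: tokens are plain word
    sequences, matching is case-insensitive, and `width` is the number
    of tokens shown on each side of the hit.
    """
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Mark the hit with brackets so it stands out in the listing.
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Invented sample text, not taken from the corpus.
sample = "Det var en gång en liten flicka. En dag gick flickan ut."
for line in kwic(sample, "en", width=2):
    print(line)
```

A real concordance tool would also align the hits in a column and respect the embedded tags; this sketch only shows the basic idea of listing each occurrence together with its neighbouring words.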
Uses of the corpus
- The corpus will provide a database for a systematic mapping of modern
- The corpus will offer a body of natural linguistic data for advisory
purposes concerning grammar and usage.
- The corpus can be used as empirical reference material in lexicographic
- The corpus may offer information on language contact phenomena between
Finnish and Swedish in Finland.
- The corpus has been used as a basis for the development of software for
linguistic analysis (e.g. the SWETWOL-analyzer).
The project has been conducted by a consortium comprising:
- The Department of Scandinavian Languages (University of Helsinki)
- The Department of General Linguistics (University of Helsinki)
- The Finnish Research Centre for Domestic Languages
The Department of Scandinavian Languages (Nordica) is the Lead
Participant of the consortium.
The project started officially at Nordica in August 1991. The kernel
corpus of about 2.5 million running words was established by the end of
1995.
- The kernel corpus of 2.5 million words may be complemented with a
monitor corpus of no finite size.
- There is a good possibility of extending the section on spoken language.
At present, there is material of about 80,000 running words consisting
of transcripts of speech. The material was originally collected and
transcribed in the project Swedish Conversations in Helsinki (SAM).
- A historical reference corpus (1600s, 1700s, 1800s) is also being
planned. There is material of about 200,000 words of 18th century
Finland Swedish, but the processing of these texts is still at a preliminary