Learner Corpora
The New England Corpus of Heritage and Second Language Speakers (NECHSLS) is an online repository of oral and written production of heritage and L2 speakers of Spanish and Portuguese in New England, with a special focus on communities from Massachusetts, Rhode Island, and Connecticut.
The Corpus of Czech as a Second Language, available in raw and tagged versions.
The Russian Learner Corpus (RLC) comprises texts produced by two categories of non-standard speakers of Russian: learners of Russian as a Foreign language and speakers of Heritage Russian.
The Russian Learner Translator Corpus contains multiple translator trainees’ target texts aligned with their source texts at the sentence level. The translations are either from English into Russian or vice versa.
The Arabic Learner Corpus (ALC) project aims to provide a collection of written and spoken materials produced by learners of Arabic in Saudi Arabia. Requires registration.
The Estonian Interlanguage Corpus (EIC) of Tallinn University is a collection of written texts produced by learners of Estonian as a second and foreign language (L2).
The Advanced Finnish Learners’ Corpus is a longitudinal essay corpus with texts written by students learning Finnish in MA courses.
This corpus has been compiled from the data of the Finnish National Foreign Language Certificate examinations. Three levels: Basic, Intermediate and Advanced.
A French cross-sectional corpus containing oral tasks performed by learners in years 9, 10 and 11 (ages 13, 14 and 15 respectively, after 2, 3 and 4 years of learning French) on a one-to-one basis with a researcher.
The MERLIN corpus contains 2,286 texts written by learners of Italian, German and Czech, taken from written examinations of acknowledged test institutions. The exams aim to test knowledge across the levels A1-C1 of the Common European Framework of Reference (CEFR).
VALICO is an Italian international learner corpus, freely available and searchable online. Most of the corpus instructions are provided in Italian.
CEDEL2 is an L1 English - L2 Spanish learner corpus; it investigates how English-speaking learners acquire Spanish grammar (morphology and syntax).
The Corpus of Taiwanese Learners of Spanish (Corpus de Aprendices Taiwaneses de Español, CATE).
Learner corpus of Spanish as a foreign language (L1 English); includes multiple levels from beginner to advanced, as well as a native speaker control group.
Spanish learner oral corpus. Includes narratives, task completion, and interviews.
TS Wikipedia is a collection of approximately 1.6 million processed Turkish Wikipedia pages. The data is tokenized and includes part-of-speech tags, morphological analysis, lemmas, bi-grams and tri-grams.
Two Arabic corpora: one contains 5,000 articles with nearly 3 million words from an online newspaper; the other contains 20,000 articles on the topics of culture, religion, economy, local news, international news and sports.
The International Corpus Network of Asian Learners of English (ICNALE) includes 1.8 million words of controlled L2 English speeches and essays by more than 3,500 college students in ten countries and areas in Asia, as well as L1 productions by 350 native speakers of English.
Native Language Corpora
English: The British National Corpus contains texts collected from the 1980s through 1993; 100 million words.
The Corpus of Contemporary American English consists of texts of various genres produced between 1990 and 2012; 450 million words.
Michigan Corpus of Academic Spoken English (MICASE); freely available, with an online search function.
Chinese: The Lancaster Los Angeles Spoken Chinese Corpus (LLSCC) is a corpus of spoken Mandarin Chinese. The corpus is composed of 1,002,151 words of dialogues and monologues, both spontaneous and scripted, in 73,976 sentences and 49,670 utterance units (paragraphs).
German: Corpus of the Berlin-Brandenburgischen Akademie der Wissenschaften (DWDS Core Corpus), on which the Digitale Wörterbuch der deutschen Sprache des 20. Jahrhunderts (DWDS) is based.
Deutscher Wortschatz Online contains 500 million words.
The Hamburg Dependency Treebank is the largest dependency treebank available; it consists of dependency annotations of sentences sourced from the German news site heise.de, from articles published between 1996 and 2001.
Corpora of the Institut für Deutsche Sprache. The world's largest collection of German-language corpora for empirical linguistic research.
LIMAS-Korpus is a representative corpus of contemporary written German of the 1970s: 500 texts or fragments from various text genres, with a total of 1 million word forms. Searchable online.
Korpus Südtirol contains texts representing South Tyrolean German.
Russian: Russian National Corpus
The Helsinki Annotated Corpus of Russian Texts (HANCO). 100,000 running words, extracted from a modern Russian magazine; the corpus will include morphological, syntactic, and functional information.
Open Corpus of the Russian language; under construction and open to contributors.
Stories about dreams and other corpora of spoken language; contains only spontaneous informal spoken discourse, with a small number of texts produced by a small number of authors; the four subcorpora range from 5,000 to 14,000.
Corpus of Russian fiction.
Computer corpus of texts retrieved from newspapers of the late 20th century.
French: Corpus de Référence du Français parlé; 440,000 words, 134 recordings, over 36 hours of spoken language.
Un corpus d’entretiens spontanés contains 95 conversations/speakers.
Croatian: Croatian National Corpus
Hungarian: The largest Hungarian language corpus, available in its entirety under a permissive Open Content license.
Icelandic: Covers the Icelandic language from the 12th century to modern times.
Persian: Persian corpus in the public domain.
Italian: Link to a list of Italian corpora compiled by the Institute of Cognitive Sciences and Technologies.
Corpus di Italiano Scritto (CORIS); 100 million words.
Banca dati dell'italiano parlato (BADIP); contains various corpora of spoken Italian.
Tatar: Corpus of written Tatar with over 116 million word occurrences.
Basque: Twentieth-century Basque language corpora.
Other: A sizable collection of links to non-English language corpora. List of corpora through Stanford University.
The Romance Phonetics Database (RPD) is an on-line research and teaching tool containing tagged sound samples (both individual words and passages) illustrative of various segmental and prosodic aspects of Romance phonetics and phonology.
Lexical database for English, German and Dutch; applications include speech synthesis, pronunciation modeling, and parsing.
Catalan: Corpus del català contemporani is a corpus of contemporary colloquial Catalan.
Swedish: The Bank of Swedish is a linguistic reference databank at the University of Gothenburg.
Czech: Český národní korpus (ČNK) is the Czech national corpus.
Turkish: Turkish National Corpus
Portuguese: O Corpus do Português is a corpus of 45 million words in 50,000 texts published between the 14th and 20th centuries. Lemmas and parts of speech are annotated. A powerful web interface allows searching by text, register, dialect, and time period, and supports statistical calculations based on the search results.
Tycho Brahe Parsed Corpus of Historical Portuguese; syntactically annotated; downloadable.
Spanish: Corpus del español (RAE) is a diachronic corpus of Spanish; it consists of texts produced from the 1200s to 2000.
Multiple Languages: Parallel corpus consisting of European Parliament proceedings in multiple European languages.
Search in 233 Corpus-Based Monolingual Dictionaries for 219 Languages.
RuN-Euro Corpus is a parallel corpus of European languages. The texts are aligned at the sentence level and have been tagged for grammatical information at the word level.