Corpora in Languages Other Than English
Arabic
- Corpus of Contemporary Arabic - under construction.
Chinese
- Callhome Corpus contains transcriptions from 120 unscripted telephone conversations between native speakers of Mandarin Chinese. Available through the Linguistic Data Consortium.
- PolyU Language Bank, developed in the Department of English at Hong Kong PolyU, is a large archive of language corpora made up of a wide range of written and spoken texts totalling over 12 million words. Corpus searches can be performed using the Bank's built-in Web-based concordancer, enabling the easy use of corpus resources for language teaching and research. Different disciplines and text types are represented, including Academic, Business, Journalistic and Legal texts, and Literature. Both native speaker and learner data are available in the Bank but native data predominate.
French
- Minnesota Corpus of Spoken French. Compiled by Betsy Kerr. The corpus is available as a Word document. Write to Betsy Kerr, Department of French and Italian, University of Minnesota, at bjkerr@umn.edu. Consult her website "Bibliography and Useful Links for Data-Driven Language Learning" for other details. Site last updated: 2003.
- Online Text and Concordancer Sites for French. Compiled by Betsy Kerr. Site last updated: March 2009.
- PolyU Language Bank (see the description under Chinese above).
German
- NEGRA Corpus Version 2. 355,096 tokens (20,602 sentences) of newspaper text from the Frankfurter Rundschau, part-of-speech tagged (with some other annotations).
- COSMAS/Mannheimer Corpus Collection
- The DWDS corpus of the German language of the 20th century contains a variety of texts representing different text types from the whole of the 20th century. The corpus is accessible on-line, and consists of 102 million words of running text.
- PolyU Language Bank (see the description under Chinese above).
Italian
- PolyU Language Bank (see the description under Chinese above).
Japanese
- Japanese speech and text corpora gives information about available corpora. The EDR Corpus has been obtained by collecting a large number of example Japanese and English sentences and analyzing them on morphological, syntactic, and semantic levels. The Japanese Corpus contains approximately 200,000 sentences, and the English Corpus contains approximately 120,000 sentences. Unfortunately, quite a few of the items listed don't have web links.
- PolyU Language Bank (see the description under Chinese above).
Russian
- Russian Corpora in Tübingen. Online access to several corpora including "The Corpus of Interviews".
- Uppsala Russian Corpus was put together by Uppsala University, Sweden. It consists of some 600 Russian texts with a total of one million running words (word tokens), equally divided between informative and literary prose. The files are downloadable. Online access is now possible through the University of Tübingen site.
Spanish
- Base de Datos Sintácticos del español actual (Syntactic Database for modern Spanish). A syntactically annotated corpus, including mostly written plus some oral texts.
- The COLA Corpus (Corpus Oral de Lenguaje Adolescente) is a corpus of informal Spanish youth language from Madrid and other capitals of Spanish-speaking countries. The project was started in 2002 on the initiative of Annette Myre Jørgensen at the Department of Romance Languages at the University of Bergen and Anna-Brita Stenström at the Department of English, and it is funded by the Faculty of Arts at the University of Bergen and the Meltzer fund.
- COREC (Corpus de Referencia del español contemporáneo). Collected between 1991 and 1992 at the Universidad Autónoma de Madrid, it contains 1,100,000 words.
- Real Academia Española - Corpus de Referencia del Español Actual (CREA) contains both texts and transcribed oral data, but the concordancer only allows you to search for exact words and phrases.
- Real Academia Española - Corpus Diacrónico del Español (CORDE). Similar to CREA but containing only written texts, up to 1975.
- Corpus del Español Online. 100+ million words. Free access. Texts from the 1200s to the 1900s. Created by Mark Davies (Brigham Young).
- European Corpus Initiative Multilingual Corpus I (ECI/MCI): Dutch, French, Spanish, German and English parallel texts. Purchase.
- PolyU Language Bank (see the description under Chinese above).
- Spanish Learner Language Oral Corpus (SPLLOC). Samples of spoken Spanish by 60 instructed learners (beginning, intermediate, advanced) with L1 English. Project directed by Rosamond Mitchell and Laura Dominguez, University of Southampton, UK.
C-ORAL-ROM aims to provide the linguistic community and the HLT community with a comparable set of spoken language corpora for the main Romance languages, namely French, Italian, Portuguese and Spanish. Available from John Benjamins Publishing Company.
TRACTOR has an archive of corpus resources. Recent acquisitions include a corpus of Italian newspapers and Slovene fiction. In order to access these resources you need to be a member of the TRACTOR User Community, but if you deposit some resources you can have unlimited access to all their resources. Email queries to helpdesk@tractor.de.
BASQUE spoken corpus is a collection of forty-two narratives in the Basque language (Euskara) by native speakers. It includes sound files (MP3 format) and full detailed transcripts.
Portuguese contains several links to ongoing projects related to both oral and written data.
COMET Project: a corpus of Learner English and Learner German (as well as Italian and Spanish), collected at the University of São Paulo, Brazil.
See also David Lee's Bookmarks.




