Non-English Corpora


Arabic

Corpus of Contemporary Arabic
under construction

Chinese

Callhome contains transcriptions from 120 unscripted telephone conversations between native speakers of Mandarin Chinese.
through Linguistic Data Consortium
Lancaster Corpus of Mandarin Chinese (LCMC) The corpus is designed as a Chinese match of the Freiburg-LOB Corpus of British English (FLOB). It provides a valuable resource for contrastive studies of English and Chinese, as well as a sound basis for monolingual investigations of Chinese.
free download
Sinica Corpus Academia Sinica Balanced Corpus of Modern Chinese. 5 million words with part-of-speech tagging.
UCLA Corpus of Written Chinese designed as a Chinese counterpart to the Freiburg-LOB Corpus of British English (FLOB) and the Frown corpus. It is also a recent update of the Lancaster Corpus of Mandarin Chinese.
free for non-profit making research

French

More information and additional resources from Betsy Kerr

German

COSMAS/Mannheimer Corpus Collection
DWDS Corpus A corpus of 20th-century German containing a variety of texts that represent different text types from across the whole century. The corpus is accessible online and consists of 102 million words of running text.
NEGRA Corpus Version 355,096 tokens (20,602 sentences) of newspaper text from the Frankfurter Rundschau. Part-of-speech tagged (with some other annotations).

Italian

Japanese

Russian

Russian Corpora in Tübingen Online access to several corpora including "The Corpus of Interviews"
Uppsala Russian Corpus Compiled by Uppsala University, Sweden. Consists of some 600 Russian texts with a total of 1 million running words (word tokens), equally divided between informative and literary prose. The files are downloadable.
Online access through the University of Tübingen site

Portuguese

Linguateca contains several links to ongoing projects related to both oral and written data.

Spanish

Base de Datos Sintácticos del español actual (BDS) A corpus annotated for syntax. Includes mostly written and some oral texts.
The COLA Corpus (Corpus Oral de Lenguaje Adolescente) A corpus of informal Spanish youth language from Madrid and other capitals of Spanish-speaking countries. The project was started in 2002 on the initiative of Annette Myre Jørgensen at the Department of Romance Languages at the University of Bergen and Anna-Brita Stenström at the Department of English. It is funded by the Faculty of Arts at the University of Bergen and the Meltzer fund.
CREA Oral El Corpus de referencia del español actual (CREA)
COREC (Corpus de Referencia del español contemporáneo) Collected between 1991 and 1992 at the Universidad Autónoma de Madrid. Contains 1,100,000 words.
Real Academia Española - Corpus de Referencia del Español Actual CREA contains both texts and transcribed oral data, but the concordancer only allows you to search for exact words and phrases.
Real Academia Española - Corpus Diacrónico del Español (CORDE) Similar to CREA but containing only written texts, up to 1975.
Corpus del Español Texts from the 1200s to the 1900s. Created by Mark Davies, Brigham Young University. 100+ million words.
Online. Free access.

Learner Corpora:

Spanish Learner Language Oral Corpus (SPLLOC) Samples of spoken Spanish by 60 instructed learners (beginning, intermediate, advanced) with L1 English. Project directed by Rosamond Mitchell and Laura Dominguez, University of Southampton, UK.
C-ORAL-ROM aims to provide the linguistic community and the HLT community with a comparable set of Spoken Language Corpora for the main Romance Languages, namely French, Italian, Portuguese and Spanish. Available from John Benjamins Publishing Company.
COMET Project Corpus of Learner English and Learner German (as well as Italian and Spanish), collected at the University of São Paulo, Brazil.


See David Lee's Bookmarks for additional corpora.


© 2002-2012 CALPER and The Pennsylvania State University. All Rights Reserved.