Non-English Corpora
Arabic | |
| Corpus of Contemporary Arabic | under construction |
Chinese | |
| Callhome | contains transcriptions of 120 unscripted telephone conversations between native speakers of Mandarin Chinese. Available through the Linguistic Data Consortium. |
| Lancaster Corpus of Mandarin Chinese (LCMC) | The corpus is designed as a Chinese match of the Freiburg-LOB Corpus of British English (FLOB) and provides a valuable resource for contrastive studies between English and Chinese as well as a sound basis for monolingual investigations of Chinese. Free download. |
| Sinica Corpus | Academia Sinica Balanced Corpus of Modern Chinese. 5 million words with part-of-speech tagging. |
| UCLA Corpus of Written Chinese | designed as a Chinese counterpart to the Freiburg-LOB Corpus of British English (FLOB) and the Frown corpus. It is also a recent update of the Lancaster Corpus of Mandarin Chinese. Free for non-profit research. |
French | |
| More information and additional resources from Betsy Kerr | |
German | |
| COSMAS/Mannheimer Corpus Collection | |
| DWDS Corpus | corpus of 20th-century German containing a variety of texts that represent different text types from across the whole century. The corpus is accessible online and consists of 102 million words of running text. |
| NEGRA Corpus Version | 355,096 tokens (20,602 sentences) of newspaper text from the Frankfurter Rundschau. Part-of-speech tagged (with some other annotations). |
Italian | |
Japanese | |
Russian | |
| Russian Corpora in Tübingen | Online access to several corpora including "The Corpus of Interviews" |
| Uppsala Russian Corpus | compiled by Uppsala University, Sweden. Consists of some 600 Russian texts with a total of 1 million running words (word tokens), equally divided between informative and literary prose. The files are downloadable. Online access through the University of Tübingen site. |
Portuguese | |
| Linguateca | contains several links to ongoing projects related to both oral and written data. |
Spanish | |
| Base de Datos Sintácticos del español actual (BDS) | A corpus annotated for syntax. Includes mostly written and some oral texts. |
| The COLA Corpus (Corpus Oral de Lenguaje Adolescente) | a corpus of informal Spanish youth language from Madrid and other capitals of Spanish-speaking countries. The project was started in 2002 on the initiative of Annette Myre Jørgensen at the Department of Romance Languages at the University of Bergen and Anna-Brita Stenström at the Department of English. It is funded by the Faculty of Arts at the University of Bergen and the Meltzer fund. |
| CREA Oral | El Corpus de referencia del español actual (CREA) |
| COREC (Corpus de Referencia del español contemporáneo) | Collected between 1991 and 1992 at the Universidad Autónoma de Madrid. Contains 1,100,000 words. |
| Real Academia Española - Corpus de Referencia del Español Actual | CREA contains both texts and transcribed oral data, but the concordancer only allows you to search for exact words and phrases. |
| Real Academia Española - Corpus Diacrónico del Español (CORDE) | Similar to CREA but containing only written texts, up to 1975. |
| Corpus del Español | Texts from the 1200s to the 1900s. Created by Mark Davies, Brigham Young University. 100+ million words. Online. Free access. |
Learner Corpora: | |
| Spanish Learner Language Oral Corpus (SPLLOC) | Samples of spoken Spanish by 60 instructed learners (beginning, intermediate, advanced) with L1 English. Project directed by Rosamond Mitchell and Laura Dominguez, University of Southampton, UK. |
| C-ORAL-ROM | aims to provide the linguistic community and the HLT community with a comparable set of Spoken Language Corpora for the main Romance Languages, namely French, Italian, Portuguese and Spanish. Available from John Benjamins (John Benjamins Publishing Company) |
| COMET Project | corpus of learner English and learner German (as well as Italian and Spanish), collected at the University of São Paulo, Brazil. |
See David Lee's Bookmarks for additional corpora.
