General corpora

A monolingual corpus: British English

The British National Corpus (BNC) was compiled by a consortium of British publishers and academic institutions, including Oxford University Computing Services, Lancaster University's Centre for Computer Research on the English Language, and the British Library. Compiled in the late 1980s and early 1990s, it is a 100-million-word corpus of modern British English, consisting of 90% written material (informative prose and 'imaginative' texts) and 10% spoken material (speeches, meetings, lectures, etc., and some casual conversation). The complete corpus is available on CD-ROM for research purposes. A smaller subset, the BNC Sampler, is also available on CD-ROM, consisting of one million words of both spoken and written texts. The Sampler can be purchased over the internet and comes complete with various software packages.

A monolingual corpus: American English

The BNC influenced the creation of other large monolingual corpora, such as the American National Corpus (ANC), which covers American English. The ANC will contain a core corpus of at least 100 million words, comparable across genres to the BNC. A first instalment of 10 million words is available for research and education for a nominal licensing fee.

A monolingual international corpus: varieties of English

The International Corpus of English (ICE) will ultimately be a collection of one-million-word corpora from countries or regions where English is spoken as a first language. Each corpus consists of a written and a spoken component. Some components of ICE, such as the subcorpora of Philippine and Singapore English, are available free of charge, either by download or on CD-ROM.

A monolingual corpus: Spanish

To see an example of a general corpus not in English, go to the website for the Corpus del Español. This is an excellent corpus of Spanish, created by Prof. Mark Davies of Brigham Young University. You can search the corpus online.

An international multilingual corpus

European Corpus Initiative Multilingual Corpus I (ECI/MCI). The European Corpus Initiative (ECI) was founded to oversee the acquisition and preparation of a large multilingual corpus (ECI/MCI), to be made available in digital form for scientific research at as low a cost as possible. The corpus has been available on CD-ROM since 1994 and is distributed by ELSNET. It contains written texts in languages such as Spanish, German, French, Chinese and Albanian. A complete list of contents is available through the website.


© 2002-2012 CALPER and The Pennsylvania State University. All Rights Reserved.