Corpus criteria
  • How many words does a corpus need to contain?
  • Where should the texts to be included come from?
  • In what ways should large and small corpora be treated differently?
Making a corpus

Because your corpus will be small, you need a principle to guide your selection of texts. For example:

  • News stories about a particular topic
  • Front page news stories for a particular month
  • University Student Union pages
  • Charity website homepages

In the following example, we use history articles taken from university websites. The advantage of using the Internet as a source of data is that the text is already in electronic form. If you have software on your computer that can read scanned text, you could scan newspapers or magazines instead, but this is more time-consuming.

In order to create a corpus you should:

1. Create a folder on your C-drive called CALPER corpus and, within it, a Word document with an appropriate name, e.g. university history articles.

2. Now collect your data. For example, go to the Penn State website and then to the history page. From the drop-down 'edit' menu, click 'select all' and then click 'copy', or press 'control' and 'c' on your keyboard simultaneously.

3. Open the Word document and choose 'paste special' and then 'unformatted text' from the drop-down 'edit' menu. By choosing this option, you will avoid copying image files.

4. Repeat this operation on a number of other websites. Just keep pasting into the same document; your cursor will automatically be at the end of the document each time.

5. To check that you have enough words, choose 'word count' from the drop-down 'tools' menu.

6. When you are ready, save your document as a plain text file. Do this by choosing the 'plain text' option after clicking 'save as'. You don't need to change the document's name.

7. Now drag and drop the Word file to your desktop, so that all that remains in your corpus folder is your text file. You do not need Word files for your corpus analysis.

You have now made a very basic corpus. It contains only raw data as you have not annotated or tagged it in any way. However, you will be able to use it when you have been introduced to the necessary corpus-handling software.
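
If you prefer to script the steps above rather than work in a word processor, the following is a minimal Python sketch of the same idea: it joins a list of copied texts into one plain text file inside a corpus folder and counts the words. The function names build_corpus and word_count and the default file name are hypothetical choices made for illustration; they are not part of the original procedure, and the sketch assumes you have already copied the text of each web page into a Python string.

```python
from pathlib import Path


def build_corpus(texts, corpus_dir, name="university_history_articles.txt"):
    """Join the collected texts and save them as one plain text file.

    `texts` is a list of strings, one per copied web page (an assumption
    of this sketch; the original procedure pastes them by hand).
    """
    folder = Path(corpus_dir)
    folder.mkdir(parents=True, exist_ok=True)  # step 1: create the corpus folder
    corpus_path = folder / name
    # Steps 2-4: each page is added to the end of the same document.
    # Empty or whitespace-only entries are skipped.
    cleaned = [t.strip() for t in texts if t.strip()]
    # Step 6: save as plain text, with a blank line between pages.
    corpus_path.write_text("\n\n".join(cleaned) + "\n", encoding="utf-8")
    return corpus_path


def word_count(corpus_path):
    """Step 5: count whitespace-separated words in the corpus file."""
    return len(Path(corpus_path).read_text(encoding="utf-8").split())
```

For example, build_corpus(pages, "CALPER corpus") would write the file into a folder called CALPER corpus in the current directory, and word_count on the returned path would report its size. Counting by whitespace is only an approximation and may differ slightly from the figure a word processor reports.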


© 2002-2012 CALPER and The Pennsylvania State University. All Rights Reserved.