- How many words does a corpus need to contain?
- Where should the texts to be included come from?
- In what ways should large and small corpora be treated differently?
As your corpus will only be small, you should have a principle to guide your selection of texts. For example:
- News stories about a particular topic
- Front page news stories for a particular month
- University Student Union pages
- Charity website homepages
In the following example, we use history articles taken from university websites. The advantage of using the Internet as a source of data is that the text is already in electronic form. If you have software on your computer that can read scanned text, you could scan newspapers or magazines instead, but this is more time-consuming.
In order to create a corpus you should:
1. Create a folder on your C: drive called CALPER corpus and, within it, a Word document with an appropriate name, e.g. university history articles.
2. Now collect your data. For example, go to the Penn State website and then to the history page. On the drop-down 'edit' menu, click 'select all' and then click 'copy' (or, instead of clicking 'copy', press 'control' and 'c' on your keyboard simultaneously).
3. Open the Word document and choose 'paste special' and then 'unformatted text' from the drop-down 'edit' menu. By choosing this option, you will avoid copying image files.
4. Repeat this operation on a number of other websites. Just keep pasting into the same document; your cursor will automatically be at the end of the document each time.
5. To check that you have enough words, choose 'word count' from the drop-down 'tools' menu.
6. When you are ready, save your document as a plain text file. Do this by choosing the 'plain text' option after clicking 'save as'. You don't need to change the document's name.
7. Now drag and drop the Word file to your desktop, so that all that remains in your corpus folder is your text file. You do not need Word files for your corpus analysis.
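If you are comfortable with a little programming, the same steps can be sketched in Python rather than carried out by hand in Word. This is only an illustration, not part of the procedure above: the function names add_to_corpus and word_count are made up for this example, and it assumes the text you pass in has already been copied from the web page as plain, unformatted text.

```python
from pathlib import Path


def add_to_corpus(corpus_file: Path, text: str) -> None:
    """Append one copied passage to the end of the plain text corpus file."""
    # Step 1: make sure the corpus folder exists.
    corpus_file.parent.mkdir(parents=True, exist_ok=True)
    # Steps 3, 4 and 6: opening in append mode always writes at the end of
    # the file, and the result is plain text from the start.
    with corpus_file.open("a", encoding="utf-8") as handle:
        handle.write(text.strip() + "\n\n")


def word_count(corpus_file: Path) -> int:
    """Step 5: count whitespace-separated words in the corpus file."""
    return len(corpus_file.read_text(encoding="utf-8").split())
```

Calling add_to_corpus once for each web page, with a path such as Path("CALPER corpus") / "university history articles.txt", builds up the file, and word_count on the same path tells you whether you have collected enough text. The count is a simple split on spaces and line breaks, so it may differ slightly from the figure Word reports.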
You have now made a very basic corpus. It contains only raw data, as you have not annotated or tagged it in any way. However, you will be able to use it once you have been introduced to the necessary corpus-handling software.