Corpus tools

The software you will be using will allow you to generate:

Frequency lists (either in rank order, or sorted alphabetically)

Keyword lists (lists of words which occur unusually frequently in a text when compared with a reference corpus)

Concordances (examples of particular words or phrases in context)

We will consider frequency in this unit, and keywords and concordances in the next.

There are a number of freeware programs available, but for this illustration we will use WordSmith Tools Version 4. The full version of this software is not free. For the purposes of this tutorial, however, you can download a free demo version. It does everything that the full version does, but it gives you only a sample of the output. If you decide to get the full version, all you have to do is pay and register, and you will be able to upgrade the demo.

To download the demo, go to Mike Scott's website and follow the instructions.

You now have the tools that you require to continue with this tutorial. The next section will introduce you to the theory behind frequency lists.

Frequency lists

Before you do any analysis on your own corpus, it is useful to understand the theory behind the various operations that can be performed. The first one that we will consider is simply counting the number of times particular words appear in a corpus. It is virtually impossible to judge accurately, by intuition alone, how frequently words or phrases occur in a corpus. You may, of course, feel sure that 'walk' occurs more frequently than 'perambulate'. But what about 'sprint' and 'jog', or 'tennis' and 'baseball'? A computer count can answer these sorts of questions very quickly indeed. The frequency lists that are generated show how often each word occurs in the selected texts, and studying them allows us to see ways in which one set of texts differs from another. Frequency lists can be presented in rank order (most frequent word first) or sorted alphabetically.
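The counting described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how WordSmith Tools works internally. The function names (tokenize, frequency_list, by_rank, alphabetical) and the sample sentence are invented for this example, and the tokenizer is a deliberately simple assumption: it lowercases the text and treats any run of letters or apostrophes as a word.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (runs of letters/apostrophes).

    This is a simplifying assumption; real corpus tools let you configure
    what counts as a word.
    """
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(text):
    """Return a Counter mapping each word to the number of times it occurs."""
    return Counter(tokenize(text))


def by_rank(freqs):
    """Sort (word, count) pairs by count, highest first; ties alphabetically."""
    return sorted(freqs.items(), key=lambda item: (-item[1], item[0]))


def alphabetical(freqs):
    """Sort (word, count) pairs alphabetically by word."""
    return sorted(freqs.items())


# Hypothetical sample text, used only to demonstrate the two orderings.
sample = "I walk to work. You walk to school. We jog, and they sprint."
freqs = frequency_list(sample)

print("Rank order:")
for word, count in by_rank(freqs):
    print(f"{word}\t{count}")

print("Alphabetical order:")
for word, count in alphabetical(freqs):
    print(f"{word}\t{count}")
```

Both orderings come from the same counts. Rank order puts the most frequent words at the top, while alphabetical order makes it easy to look up a particular word.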

Frequency lists

Take some time to look through the original Word document that you made into your corpus. We have already said that it is very difficult to make judgements about frequency without the help of a computer, but let's try it with this small corpus.

Choose twenty words at random and try to put them in rank order of frequency.

Now you can generate a frequency list for your corpus and see how accurate you were.

The instructions that come with the WordSmith Tools program are very straightforward to follow. To make a frequency list, first launch WordSmith Tools, and then click the WordList button, as illustrated in the following screenshot.

Now click Choose Text Now to open the text selection window (step 2).

This is where we load our corpus files. Locate your CALPER corpus folder, move its files to the right side of the window, and click the green check mark (steps 3-5).

Finally, click the Make a word list now button, as shown in the screenshot below.

Voila! Here's your automatically computed frequency list (it can also be sorted alphabetically):

Now compare your manually generated frequency list with the computer-generated one. How similar are they? Does anything surprise you?
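If you would like to make this comparison more systematic, the following Python sketch lines up a guessed ranking against the ranking implied by the actual counts. It is an optional illustration and is not part of WordSmith Tools. The function name compare_ranking is invented here, and the words and counts in the example are made up for demonstration, so replace them with values from your own frequency list.

```python
def compare_ranking(guess, freqs):
    """Pair each guessed position with the position implied by actual counts.

    guess: words in the order you guessed, most frequent first.
    freqs: mapping of word -> count; a word missing from it counts as 0.
    Returns (word, guessed_rank, actual_rank, count) tuples in guessed order.
    Words with equal counts are ranked alphabetically, so tied words still
    receive different rank numbers.
    """
    actual_order = sorted(guess, key=lambda w: (-freqs.get(w, 0), w))
    actual_rank = {w: i + 1 for i, w in enumerate(actual_order)}
    return [
        (w, i + 1, actual_rank[w], freqs.get(w, 0))
        for i, w in enumerate(guess)
    ]


# Hypothetical counts and guess, for illustration only.
counts = {"the": 120, "language": 45, "corpus": 30, "walk": 2}
my_guess = ["the", "corpus", "language", "walk"]

for word, guessed, actual, count in compare_ranking(my_guess, counts):
    marker = "" if guessed == actual else "  <- differs"
    print(f"{word}\tguessed {guessed}\tactual {actual}\tcount {count}{marker}")
```

In this made-up example the guess places 'corpus' above 'language', while the counts put them the other way round. Mismatches of this kind are the ones worth thinking about.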


© 2002-2012 CALPER and The Pennsylvania State University. All Rights Reserved.