In the previous unit we looked at how frequency lists are created and analyzed. Those lists covered individual words only. However, it is also very useful to generate lists of clusters, that is, recurring sequences of two or more words. We can ask a program like WordSmith Tools to generate frequency lists for recurring clusters of up to eight words, although the longest cluster that is likely to occur with any frequency in a general corpus is the six-word, much-maligned 'at the end of the day' (see number two in the list below).

We can use cluster analysis to find out what the most frequent phrases are in a corpus. Here is a list of the most frequently occurring five-word clusters in a corpus of spoken English. You will notice that not all these clusters are actually well-formed phrases.

Some of these clusters (e.g. 'you know what I mean' and 'all the rest of it') are meaningful, and it is formulaic sequences such as these that contribute to language that is judged to be fluent and well-formed. The insight that such frequency lists give is therefore important, because it can tell teachers which phrases are likely to be worth teaching to learners.
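For readers who want to see what a cluster count involves, here is a minimal Python sketch. It is not how WordSmith Tools works internally; the function name `count_clusters`, the simple tokenizing rule, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

def count_clusters(text, size):
    """Count every contiguous run of `size` words in `text` (hypothetical helper)."""
    # Assumption: lowercase everything and keep apostrophes so "don't" stays one word.
    words = re.findall(r"[a-z']+", text.lower())
    # Slide a window of `size` words across the whole text.
    windows = zip(*(words[i:] for i in range(size)))
    return Counter(" ".join(window) for window in windows)

# Invented sample text, standing in for a corpus of spoken English.
sample = (
    "you know what i mean it was fine "
    "and you know what i mean about the rest"
)
top = count_clusters(sample, 5).most_common(1)
print(top)  # [('you know what i mean', 2)]
```

Because the window slides across the text without regard to phrase or sentence boundaries, it also counts fragments such as 'know what i mean it', which is one reason why a cluster list contains sequences that are not well-formed phrases.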


Please follow these steps to create a cluster list with WordSmith Tools:

STEP 1: In order to create a cluster list, we first need to load our corpus data into
WordSmith Tools. This step is the same as for generating a frequency list (see screenshots below).

STEP 2: One important step in creating clusters in WordSmith Tools is to create and save
an index file for the corpus first. This is illustrated in the following two screenshots.

STEP 3: Now click compute >> clusters, then choose the size of the clusters you want and the minimum frequency
a cluster must reach, which filters out the less interesting ones. Click OK, and voilà! You'll have your first cluster list!
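The two settings in STEP 3, cluster size and minimum frequency, can be illustrated with a short Python sketch. This is only an approximation under stated assumptions, not WordSmith's own procedure: `frequent_clusters` is a hypothetical helper, and the token list is an invented example rather than real corpus data.

```python
from collections import Counter

def frequent_clusters(tokens, size, min_freq):
    """Return (cluster, frequency) pairs that occur at least `min_freq` times."""
    # Count every window of `size` consecutive tokens.
    counts = Counter(
        tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
    )
    # Drop clusters below the minimum frequency.
    kept = [(" ".join(c), n) for c, n in counts.items() if n >= min_freq]
    # Most frequent first; ties in alphabetical order.
    return sorted(kept, key=lambda item: (-item[1], item[0]))

# Invented example tokens (an assumption, not taken from the corpus).
tokens = "at the end of the day at the end of the week".split()
for size in (2, 3):
    print(size, frequent_clusters(tokens, size, min_freq=2))
```

Raising the minimum frequency shortens the list, and increasing the cluster size usually does the same, since longer sequences recur less often.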

Now create some cluster lists for your corpus. What sorts of clusters are most frequent?

© 2002-2012 CALPER and The Pennsylvania State University. All Rights Reserved.