In the previous unit we looked at how frequency lists are created and analyzed. The lists that we looked at focused on individual words only. However, it is also very useful to generate lists of clusters consisting of two or more words. We can ask a program like WordSmith Tools to generate frequency lists for recurring clusters of up to eight words, although the longest cluster that is likely to occur with any frequency in a general corpus is the six-word, much-maligned 'at the end of the day' (see number two in the list below).
We can use cluster analysis to find out what the most frequent phrases are in a corpus. Here is a list of the most frequently occurring five-word clusters in a corpus of spoken English. You will notice that not all these clusters are actually well-formed phrases.
Some of these clusters (e.g. 'you know what I mean' and 'all the rest of it') are meaningful, and it is formulaic sequences such as these that contribute to language that is judged to be fluent and well-formed. The insight that such frequency lists give is therefore important, because it can tell teachers which phrases it would be sensible to teach learners.
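To make the idea of a cluster list concrete, here is a minimal sketch in Python of how five-word clusters can be counted. It is not how WordSmith Tools works internally; the function names `tokenize` and `clusters` and the sample text are invented for illustration, and the tokenizer is deliberately crude.

```python
from collections import Counter
import re

def tokenize(text):
    # Crude tokenizer (an assumption for this sketch): lowercase the text
    # and keep runs of letters and apostrophes as words.
    return re.findall(r"[a-z']+", text.lower())

def clusters(tokens, size):
    # Slide a window of `size` words across the text, one word at a time.
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]

# Invented sample text, standing in for a real spoken corpus.
sample = ("you know what i mean it was fine at the end of the day "
          "and you know what i mean at the end of the day")

counts = Counter(clusters(tokenize(sample), 5))

# Print the three most frequent five-word clusters with their frequencies.
for cluster, freq in counts.most_common(3):
    print(freq, cluster)
```

Because the window simply moves along one word at a time, it picks up fragments such as 'at the end of the' alongside complete phrases such as 'you know what i mean', which is why a cluster list always contains some sequences that are not well-formed phrases.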
Please follow these steps to create a cluster list with WordSmith Tools:
STEP 1: In order to create a cluster list, we first need to load our corpus data into WordSmith Tools. This step is the same as for generating a frequency list (see screenshots below).
STEP 2: One important step in creating clusters in WordSmith Tools is to create and save an index file for the corpus first. This is illustrated in the following two screenshots.
STEP 3: Now click compute >> clusters, choose the size of the clusters you want, and set the minimum frequency so that less interesting clusters are filtered out. Click OK, and voilà! You'll have your first cluster list!
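The two settings in STEP 3, cluster size and minimum frequency, can be illustrated with a short Python sketch. This only mimics the effect of the settings and makes no claim about the program's own method; `cluster_list`, its parameters `size` and `min_freq`, and the sample sentence are hypothetical names and data chosen for this example.

```python
from collections import Counter

def cluster_list(tokens, size, min_freq):
    # Count every run of `size` consecutive words in the token list.
    counts = Counter(
        " ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)
    )
    # Keep only clusters that occur at least `min_freq` times.
    kept = [(cluster, freq) for cluster, freq in counts.items() if freq >= min_freq]
    # Most frequent first; ties are ordered alphabetically.
    return sorted(kept, key=lambda item: (-item[1], item[0]))

# Invented sample text, split on spaces for simplicity.
tokens = "at the end of the day we got to the end of the road".split()

# Four-word clusters that occur at least twice.
result = cluster_list(tokens, size=4, min_freq=2)
print(result)
```

Raising `min_freq` shortens the list to the clusters that genuinely recur, while setting it to 1 keeps every cluster, including the many that occur only once.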
Now create some cluster lists for your corpus. What sorts of clusters are most frequent?