Chinese Corpus Resource Guide
Hongyin Tao, UCLA

A corpus (plural: corpora) is a principled collection of samples of natural language use, either written or spoken, usually stored as computer files. A written corpus can be gathered from sources such as news media, literary works, or personal writings. A spoken corpus can be assembled from tape- or video-recorded narratives, interviews, conversations and the like, which are then transcribed into written text. The size of a corpus can range from a few thousand words to tens of millions of words. Larger corpora are usually required for large research projects such as compiling dictionaries and major grammars, but so-called “mini corpora” of several thousand words can be extremely useful for language teachers. Once a corpus is built, software tools can be used to analyze it and produce word frequency lists, concordances (listings of each occurrence of a word together with its surrounding context), and other useful types of output.
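To make the two kinds of output mentioned above concrete, here is a minimal Python sketch of a word frequency list and a simple keyword-in-context concordance. It is an illustration only, not a tool from this guide: the function names (`tokenize`, `word_frequencies`, `concordance`) and the sample sentence are hypothetical, and the sketch assumes the Chinese text has already been segmented into words separated by spaces. Raw Chinese text has no spaces between words and would first need a word segmenter.

```python
from collections import Counter


def tokenize(text):
    # Assumption: the corpus is already word-segmented, with one space
    # between words. Unsegmented Chinese text needs a segmenter first.
    return text.split()


def word_frequencies(tokens):
    # Return (word, count) pairs, most frequent first.
    return Counter(tokens).most_common()


def concordance(tokens, keyword, window=2):
    # Collect every occurrence of `keyword` together with up to
    # `window` words of context on each side (keyword in context).
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


# Hypothetical two-sentence mini corpus, pre-segmented by hand.
sample = "我 喜欢 看 书 。 他 也 喜欢 看 电影 。"
tokens = tokenize(sample)

for word, count in word_frequencies(tokens):
    print(word, count)

for left, key, right in concordance(tokens, "喜欢"):
    print(f"{left} [{key}] {right}")
```

Even on a mini corpus of a few thousand words, output of this kind lets a teacher see which words are most common and how a given word is actually used in context.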

