What is a corpus?

A corpus is a collection of texts. More specifically, in the words of Sinclair, it is "a collection of naturally-occurring language text, chosen to characterize a state or variety of a language" (1991, p. 171). Beyond this definition, there is today a growing consensus that a corpus is a collection of machine-readable, authentic texts sampled to be representative.

Firstly, to say that the texts are machine-readable means that they can be searched and manipulated with the help of a computer, using some kind of specialised interface. Secondly, to say that the texts are authentic means that they have been taken from original sources of written and spoken language, such as published books, periodicals, reports, lectures, talks, meetings, speeches, sermons, and sports commentaries. Finally, to say that they are representative means that the collected texts should ideally represent a particular language variety.

The texts of a corpus are thus normally assembled with particular purposes in mind. For example, the British National Corpus (BNC) is a multi-purpose corpus consisting of approximately 100 million words. One of the main aims in constructing the corpus was to create a body of material that would reflect contemporary British English in its various social and generic uses (Kennedy 1998; Meyer 2002).

The majority of the BNC consists of written British English material (about 90 per cent), but there is also a smaller part made up of spoken British English material (about 10 per cent). The material is divided into 4,124 so-called documents, each of which contains a sample of either written text or transcribed spoken discourse, and a variety of different genres are represented.

Most samples contain between 40,000 and 50,000 words (Aston & Burnard 1998, p. 28). The written material was collected between 1960 and 1993, but no data are given as to when the spoken material was recorded.

Concordances

When using a corpus, it is very common to retrieve concordances. Concordances, also called concordance lines, are a compilation of examples that the computer presents as the result of a search we have specified. Their main advantage is that they show the context in which a word or phrase occurs.

The size of the context can vary from a number of words on each side of the node word (the word that is searched for) to a whole sentence or even several sentences. This makes it possible to see how a word is used in an authentic context.

For example, a writer of a text may want to know how a particular word is used in English. Let us assume that it is not clear which preposition should be used with the word interested. From a Swedish contrastive perspective, the Swedish construction would be intresserad av. Does this mean that the phrase interested of should be used in English?

We can use a concordancer – a programme with which we search the corpus – to find this out by entering our word in the dialogue box of the concordancing software. As a response to our search, the programme will display a set of concordance lines which, according to our settings, will show the word we were looking for together with a specified amount of context.
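What a concordancer does can be illustrated with a minimal sketch in Python. This is not how any particular concordancing programme is implemented; the function name concordance, the width parameter, and the short sample text (loosely based on the COCA lines in this section) are assumptions made for illustration only.

```python
import re

def concordance(text, node, width=30):
    """Return keyword-in-context lines for every occurrence of `node` in `text`."""
    lines = []
    # Match the node word as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that the node words line up in a column.
        lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines

# Invented sample text, standing in for a real corpus.
sample = (
    "Some of them were mostly interested in the new forms, "
    "but others were more interested in housing for workers."
)

for line in concordance(sample, "interested", width=25):
    print(line)
```

The width setting corresponds to the "specified amount of context": here it is counted in characters on each side of the node word, although real concordancers often count words or whole sentences instead.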

The example below shows ten concordance lines from a search for the sequence interested, followed by a preposition. The example has been taken from the Corpus of Contemporary American English (COCA) (Davies 2008).

Example: Concordance lines

escapism are at work, we are no longer interested in the source of our guilt and,
    movement. Some of them were mostly interested in the new, stripped down,
 aesthetic forms, but others were more interested in housing for workers. Catherine
ing classes. It was a Mecca for anyone interested in good planning and in sensible
women; but, as the story begins, he is interested in an adolescent, Olivier, whom
 sort " of workers, " steady, thrifty, interested in the improvement of their order "
me as well as their ideas. He was also interested in " science, ideas, and news
   students that the new dean is truly interested in actively soliciting their opinions.
   to study for the doctorate, and was interested in a change in academic scenery.
   This learned person is particularly interested in the scholarly contemplation of

(Corpus of Contemporary American English) (Davies 2008)


The ten concordance lines shown in the example above all contain the preposition in, which suggests that it is commonly used with the adjective interested. Furthermore, a feature of the particular corpus used in the example (COCA) also allows us to retrieve frequency values for the searches we make.

For example, the programme can tell us how many instances of interested in there are in the corpus, compared to instances of the word interested followed by any other English preposition. The example below supplies the figures for a search for the sequence interested + any word classified as a preposition in the corpus texts.

Example: Frequency values retrieved in a corpus search

INTERESTED IN     22733
INTERESTED AT     27
INTERESTED BY     25
INTERESTED FOR     20
INTERESTED ABOUT     16
INTERESTED ON     11
INTERESTED FROM     10
INTERESTED TO     10
INTERESTED BECAUSE     10
INTERESTED AS     10
INTERESTED AFTER     7
INTERESTED WITH     5
INTERESTED THROUGH     4
INTERESTED DURING     4
INTERESTED DESPITE     3
INTERESTED BUT     2
INTERESTED BEFORE     2
INTERESTED ABOVE     2
INTERESTED WITHOUT     2
INTERESTED THROUGHOUT     2
INTERESTED UP     2
INTERESTED TOWARD     1
INTERESTED OUT     1
INTERESTED OF     1
INTERESTED INTO     1

(Corpus of Contemporary American English) (Davies 2008)


As can be seen in the above example, the preposition in overwhelmingly dominates the scene with 22,733 instances in which it follows the word interested. Far behind comes the second most frequently used preposition at, with 27 instances. We can safely conclude that the most frequent preposition following the word interested is in.

However, we also see that other prepositions have been used. These occur only in very small numbers, and some of them might even be incorrect uses, or cases in which the preposition is not connected to interested but to the following phrase.
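The arithmetic behind this comparison can be made explicit. The sketch below copies the counts from the COCA frequency list above and computes a preposition's share of all hits. The cut-off of two hits for a "rare" combination is an arbitrary assumption chosen for illustration, not a standard value, and the names share and RARE_THRESHOLD are likewise hypothetical.

```python
# Counts copied from the COCA frequency list above (interested + preposition).
counts = {
    "in": 22733, "at": 27, "by": 25, "for": 20, "about": 16, "on": 11,
    "from": 10, "to": 10, "because": 10, "as": 10, "after": 7, "with": 5,
    "through": 4, "during": 4, "despite": 3, "but": 2, "before": 2,
    "above": 2, "without": 2, "throughout": 2, "up": 2, "toward": 1,
    "out": 1, "of": 1, "into": 1,
}

total = sum(counts.values())

def share(word):
    """Percentage of all 'interested + preposition' hits taken by `word`."""
    return 100 * counts[word] / total

# Assumed cut-off: hits at or below this count should be read in context.
RARE_THRESHOLD = 2
rare = sorted(word for word, n in counts.items() if n <= RARE_THRESHOLD)

print(f"total hits: {total}")
print(f"interested in: {share('in'):.2f}% of hits")
print(f"interested of: {share('of'):.4f}% of hits")
print("check manually:", ", ".join(rare))
```

With these figures, interested in accounts for a little over 99 per cent of the 22,911 hits, while the ten prepositions at or below the cut-off (including of) are the ones most worth checking in their concordance lines.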

Collocations

The term collocation is typically used to describe the frequent co-occurrence of words in a text. Through hundreds of years of language use, certain combinations of words become conventionalised. This means that native speakers of a language tend to use a limited number of preferred ways in which a certain situation, event or phenomenon is described. Thus, even though a language in theory offers a large number of possible word combinations, it seems that only a small number of these are actually used. 

Collocation is important since it tells us what word combinations are frequently used in a language. For example, if we happen to have a headache and want to tell other people about this, we may choose to say that we simply have a headache. However, headaches can differ in intensity so we might want to modify the word headache with an adjective. The question, then, is how do native speakers of English typically describe headaches? A search in a corpus can tell us what the most common adjective collocates of the word headache are.

major headache        15
bad headache          11
severe headache       11
throbbing headache     8


The above word combinations (adjective + noun) are examples retrieved from a search in the British National Corpus. We learn that a headache can be major, bad, severe or throbbing. The most frequent collocation seems to be major headache. If we look more closely at the list of possible collocates, we also find examples like:

splitting headache     6
slight headache        4
terrible headache      3

When investigating which word combinations (collocations) are frequently used, it is common to inspect the concordance lines as well, since they show the exact contexts in which the collocation occurs in the corpus.
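A collocate count of this kind can be sketched in a few lines of Python. The sketch is a simplification: it counts whichever word stands immediately before the node word and does not check that the word is an adjective, which a real corpus search would do with the help of part-of-speech tags. The function name left_collocates and the sample sentences are invented for illustration and are not taken from the BNC.

```python
import re
from collections import Counter

def left_collocates(text, node):
    """Count the words that occur immediately before `node` in `text`."""
    # Crude tokenisation: lower-case runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(
        previous for previous, current in zip(tokens, tokens[1:]) if current == node
    )

# Invented sample sentences, standing in for a real corpus.
sample = (
    "She woke up with a splitting headache. The delays were a major headache "
    "for the organisers. He complained of a severe headache, and the noise "
    "gave everyone a major headache."
)

for word, count in left_collocates(sample, "headache").most_common():
    print(f"{word} headache  {count}")
```

Because the tokens are read as one continuous sequence, a node word at the start of a sentence would be paired with the last word of the previous sentence; this is one reason why the counts should be checked against the concordance lines.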

Caveat: The limitations to what a corpus search can tell us

Clearly, using a corpus is a very useful method when looking for answers to questions about English words and grammar. However, like most methods, it has its flaws and disadvantages. Lindquist (2009, p. 10) raises a number of caveats:

  • Since the number of possible sentences in a language is infinite, corpora will never be big enough to contain everything that is known by a speaker of a language.
  • Some of the findings may indeed be trivial.
  • The intuition of a native speaker will always be needed to identify what is grammatical and what is not.
  • Corpora [may] contain all kinds of mistakes, speech errors etc. which may have to be disregarded.

On the whole, then, it is important to remember that just because something you search for does not occur in a specific corpus, it does not mean that the word or phrase does not exist at all in the language at hand. Conversely, just because you find an example of a phrase in a corpus, it does not mean that it is in general use among native speakers of that language. This is especially so when the phrase occurs only once or a couple of times, as with the single hit for interested of in the COCA list above. Such a hit could simply be a mistake or error.