Data-driven learning

Corpus linguistics has traditionally been concerned with the use of corpora for linguistic research rather than directly with language teaching and learning. However, as recent research has shown (see recommended bibliography below), corpora can also be exploited for pedagogical purposes (Partington 1998), to the benefit of both teachers and students. Teachers can use corpus output (for example, concordance lines) to check how native speakers and other users actually use the language, and to create examples and exercises based on authentic texts. Students can also be taught to use corpora as a language learning resource by working with concordance printouts or other materials based on corpus evidence.

This type of hands-on language analysis, in which students are encouraged to use corpora directly, is known as data-driven learning (DDL). It is an inductive approach to language learning based on the student’s ability to infer rules or generalizations about language form and use (see Johns 1991, Schmidt 1990, 1993). Within the DDL approach, students are often presented with printouts of concordance lines. Analyzing these lines allows them to identify patterns, work out their meanings and form their own hypotheses about language use, based on the evidence that they find in the printout.
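For readers who have not seen how a concordance printout is produced, the following is a minimal sketch of a keyword-in-context (KWIC) display in Python. It is not the tool used to build the printouts in this section; the function name `kwic`, the `width` parameter and the sample text are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines with the keyword aligned in one column."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-justify the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}}{match.group()}{right}")
    return lines

# Hypothetical sample text, loosely modelled on spoken corpus data.
sample = (
    "I could not get this out of my head. She is the head of the unit. "
    "Off the top of my head, I would say no."
)
for line in kwic(sample, "head", width=25):
    print(line)
```

Because the left context is padded to a fixed width, the key word always lands in the same column, which is what lets a reader scan vertically down the center of the printout.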

This student-centered model has been described extensively by Tim Johns of the University of Birmingham, UK, a leading figure who was among the first to advocate and explore the use of corpora and concordancing in teaching.

If successfully employed, DDL can encourage independent learning as learners build strategies for identifying patterns and their meanings and come up with their own conclusions based on the evidence provided by the corpus.

An example of a DDL task:

1) Read the concordance lines below and try to guess what the missing word is.

 at to try and sort of figure it all out in my _______ and she started trying to teach me 
    not quite clear hadn't got that clear in my _______. Three o'clock and seven o'clock 
how can I Now this may just be me going off my _______ sitting in front of Newsnight 
  he became ultimately the erm [tc text=pause] _______ of the careers service in London University.   
 at one stage and took on MX who is now I think _______of political science erm to do [tc text=pause] 
   cost yeah. [M01] Yes. Yeah. [F01] The deputy _______. What kind [M01] Mm. [F01] of role does she    
  it and FX had this terrific weight behind her _______and she I think she did twenty er sit ups with 
   I was a child I could not get this out of my _______. I cried at night with it going over and over  
 up smashing the salad-cream bottle er over her _______ and [tc text=pause] I didn't want to hurt her  
     them down my neck. I was smashed out of my _______. When I did eventually get back to the prison  
a lot of people er never it never entered their _______to think about these things. [M01] Really?     
[F01] Mm. [F02] able to perhaps with a clearer _______ approach [F01] Mm. [F02] the more sensitive    
 to say erm [tc text=pause] It's gone out of my _______. [F01] Right. Erm [F02] [ZF1] I th [ZF0] I        
       [M01] MX [M02] there was MX who was then _______ of the Woodlands Unit [M01] Oh was he. [M02]   
     Yeah. [M01] that they've g Well the deputy _______ [M02] [tc text=coughs] [M01] of curriculum erm 
[M01] Erm [tc text=pause] not off the top of my _______. No [F01] Okay. Well just to conclude may I    
  [tc text=pause] I'm not putting ideas in your _______ but smoking seems to be associated with these     
    you've got to have eyes in the back of your _______ I mean [F01] Mm [F02] having kids yourself you 
s [ZF1] the [ZF0] the head is he [M01] He's the _______ of the Coordination Centre but [F01] [ZGY]     
    it on.  [F01] I know. [F02] It just does my _______ in summat chronic. [F03] Yeah.  [ZGY] more     
             [F02] Eh if I do I'll be out of my _______ [M01] I'll get some more [F02]   


By now you have probably figured out that the answer is HEAD. What helped you to figure this out? Are there any idiomatic expressions containing the missing word in the concordance lines above? What do they mean?
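A gapped printout like the one in this task can be generated automatically from ordinary concordance lines. The sketch below is one possible way to do it, not the method used to prepare the task above. The function name `make_gap_fill` is hypothetical, and the three sample lines are shortened versions of lines from the task with the key word restored.

```python
import re

def make_gap_fill(lines, keyword, blank="_______"):
    """Replace every whole-word occurrence of the keyword with a blank."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [pattern.sub(blank, line) for line in lines]

# Shortened lines from the task above, with the key word put back in.
concordance = [
    "I could not get this out of my head. I cried at night",
    "not off the top of my head. No",
    "He's the head of the Coordination Centre but",
]
for line in make_gap_fill(concordance, "head"):
    print(line)
```

Matching on whole words only means that related forms such as "heads" or "headed" would be left visible. A teacher may or may not want that, depending on how much help the task should give.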

Supporters of DDL have pointed out that one of the advantages of this approach is that it enhances and even accelerates vocabulary acquisition. The main focus of DDL is at the word level, which means that it allows for very focused vocabulary work (see Cobb 1997, 1999).

A concordance printout like the one above would allow English teachers to introduce idiomatic expressions such as:

To be out of one's head
To do one's head in
To go out of one's head
To have a weight behind one's head
To have eyes in the back of one's head
To know something off the top of one's head

Learners would also be able to single out some metaphoric uses such as 'to be the head of an organization', 'to get something into one's head' and 'to enter one's head', of which the latter two conceptualize the head as a container.

Preparing students

DDL could put learners off if they do not get some guidance first. Most of us are used to reading in a single direction, whether left to right, right to left, or vertically, so strategic reading of a concordance printout will not come naturally to learners. They will need to be oriented to look up and down the central column containing the key word or phrase instead of trying to read each line as a sentence, and sometimes to read outward from the center of the screen, either rightward or leftward.

Learners should also be told that the purpose of working with concordance lines is to work out how patterns behave. The words immediately to the left and right of the key word, which sits in the center of the printout or screen, are what will allow them to figure out the meaning of that word or structure and to discover how collocations work. A very useful tool for corpus-based vocabulary work is provided by Tom Cobb on his website, Lextutor.
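The idea of reading the words immediately to the left and right of the key word can also be illustrated with a short script that counts those neighbours. This is a rough sketch under simple assumptions (whitespace and punctuation tokenization, one-word window) and is not how Lextutor or any other tool works. The function name `neighbours` is hypothetical, and the sample lines are shortened from the HEAD task earlier in this section.

```python
import re
from collections import Counter

def neighbours(lines, keyword):
    """Count the words immediately to the left and right of the keyword."""
    left, right = Counter(), Counter()
    for line in lines:
        # Lowercase and split into word tokens, keeping apostrophes inside words.
        tokens = re.findall(r"[\w']+", line.lower())
        for i, token in enumerate(tokens):
            if token == keyword:
                if i > 0:
                    left[tokens[i - 1]] += 1
                if i + 1 < len(tokens):
                    right[tokens[i + 1]] += 1
    return left, right

lines = [
    "I could not get this out of my head. I cried at night",
    "not off the top of my head. No",
    "eyes in the back of your head I mean",
    "He's the head of the Coordination Centre but",
    "who was then head of the Woodlands Unit",
]
left, right = neighbours(lines, "head")
print(left.most_common(3))
print(right.most_common(3))
```

Even on five lines, the counts hint at two patterns: a possessive before the key word (my head, your head) in the idiomatic uses, and "of" after it in the "person in charge" sense.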

Example task for Spanish

Let's imagine that you are teaching an advanced Spanish course and want to make learners aware of the use of discourse markers. The discourse marker oye is one that lends itself to avoidance, especially among English-speaking learners. Although they will have been introduced to it at an earlier stage (i.e. when learning how to ask for directions), together with its formal equivalent oiga, English-speaking learners tend to produce the forms disculpe/a and perdone/a in these situations. This is probably because they see a functional equivalence between these forms and excuse me in English, whereas the direct translation of oye and oiga (lit. 'listen') would sound impolite in the same situation. It may be, for instance, that oye suggests a summons as informal and colloquial as Hey! in English, and is thus avoided as being too familiar for addressing a stranger.

To make students aware of how common oye is in Spanish, you could have them search a corpus for the word. In a corpus such as the conversation sub-corpus of COREC (Corpus oral de referencia del español contemporáneo), which contains 211,632 running words in total, we can find 276 examples of the word oye. Most of these function as a discourse marker in conversation, either to attract the attention of the interlocutor, to mark intimacy or build closeness, or simply as a colloquial form in an informal interaction between speakers. Some of the examples are simply the conjugated form of the verb oír (as in concordance lines 1, 6, 12 and 14 below).

N Concordance
1        , esto es un grupo que son todo tíos y una tía que canta.  [H3] Pues sólo se oye [ininteligible]  [H1] Sí. Sí, encima resultará...   [H3] Si la tía está bue
2       silencio]  [H2] Jo, ya es primavera. Cantan los pajarillos.  [silencio]  [H1] Oye, Simón. ¿Tú crees que los gráficos que estoy yo grabando  el [extranjer
3         no tiene a  nadie.  [H1] ¡Claro!, como casi todas [/simultáneo] las madres, ¡oye!  [H2] Pero bueno, es que... lo de ella es demasia[(d)]o porque  hay q
4          rdad es que, como te comenté, tengo otro compromiso.  Lo siento de verdad, oye. Muchísimas felicidades, que tengas un  viaje... estupendo. Y bueno, ya 
5           hacerlo, ¿eh?  [H3] Oye ¿ Y co[palabra cortada] coloca bien esto?  [H2] Oye que perdona que hemos llega[(d)]o tarde, ¿eh? pero es  que...  [H4] [
6        os con lo de... pues eso,  ver accesorios, cortinas, muebles. [ruido] ¿Se me oye? [risas]  [H1] [risas] Oye una cosa, y... el... ¿qué te iba a preguntar 
7        y la     vas pasando por el cuerpo, ¿no?     [H5]Hola.     [H4]Hola.     [H1]Oye, papa, ¿cuánto... costaron los acumuladores... de nuestras     habitacio
8         se equivoquen; que yo no  me muevo en... la ilegalidad; porque lógicamente, oye, yo sé que  todos éstos me dicen: "¿Y por qué los coches de Madrid... ?
9        asa que se habían comido un plato... que no lo encontrábamos  lo que... y yo "oye, que yo el menú lo tengo aquí". No lo  encontrábamos y resulta que se 
10       atan ¿eh?     [H2] Digo no, no, me enteraré bien, desde luego. [silencio]     Oye, tendremos, tendremos que esperar porque se anula[(d)]o la     reunión
11    rande del tío y  to[palabra cortada]... una entrevista y tal.  [silencio]   [H2] Oye, vamos a... a arrebufarnos pa[(r)][(a)] [(a)]cá. Que nos quiten el frío.  
12          megafonía, es que la sonoridad es mala.     [H2] Es mala. [silencio] No se oye bien, no se oye claro. Y la     información pésima, la información pésima.
13          5] Salía la gente [/simultáneo] espanta[(d)]a, es que es verda[(d)].  [H8] Oye, pues un gustito. [risas] No había... [simultáneo] no había  niños. No 
14         ue la sonoridad es mala.     [H2] Es mala. [silencio] No se oye bien, no se oye claro. Y la     información pésima, la información pésima. Lo mismo te di
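Raw counts are hard to compare across corpora of different sizes, so it is common to convert them to a rate per thousand (or per million) running words. The sketch below applies this to the two figures reported above for oye in the COREC conversation sub-corpus. The function name `per_thousand` is hypothetical, and the count includes the verb uses of oye as well as the discourse marker.

```python
def per_thousand(hits, corpus_size):
    """Normalize a raw frequency to occurrences per 1,000 running words."""
    return hits / corpus_size * 1000

# Figures reported above for oye in the COREC conversation sub-corpus.
oye_hits = 276
corpus_words = 211_632

rate = per_thousand(oye_hits, corpus_words)
print(f"oye: {rate:.2f} per 1,000 words")  # prints "oye: 1.30 per 1,000 words"
```

In other words, oye turns up a little more than once in every thousand words of conversation, which gives learners a concrete sense of how frequent the form is.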

Chambers and O’Sullivan (2004) and Chambers and Kelly (2004) provide a number of illustrations of how DDL can be used in the context of teaching French. In an article written in German, Braun and Chambers (2006) use concordance printouts in French and English to illustrate how corpora can be used in the foreign language classroom. Chambers and Wynne (2008) explore the issues arising from the use of a French journalistic corpus with advanced learners. The following concordance is used by Chambers and Wynne to illustrate the collocation of the words question and POSER in the journalistic corpus:

Comme personne ne lui posait de question à ce sujet, il s’est chargé lui-même de
au moment même où se pose la délicate question du nucléaire
Je me suis posé la question de savoir ce que j’avais mal fait
la qualité d’Arsenal. Se pose-t-on la question de savoir si le Bayern est un grand
Posons ainsi la question : le climat change-t-il ?
un bain de sang. Pouvait-il faire autrement ? La question reste posée
Un autre monde est-il possible ? la question se pose depuis la nuit des siècles
Ne serait-ce pas l’occasion de poser la question ?
Le maintien en détention provisoire pose également question.
Tout cela pose aujourd’hui la question : que reste-t-il de nos amours 
Ce bâtiment gigantesque posait évidemment la question de la consommation en énergie
Metaleurop est devenue par la force des choses un drame national, car il pose la question d'une réglementation pour éviter une telle catastrophe
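The capitals in POSER indicate a lemma: the concordance above contains many different forms of the verb (pose, posait, posé, posée, posons, poser). A search of this kind therefore has to match a set of word forms rather than a single string. The sketch below shows one crude way to do this with a hand-listed set of endings. It is an illustration only, not the procedure used by Chambers and Wynne; the pattern is incomplete and the function name `poser_with_question` is hypothetical.

```python
import re

# Hand-listed endings for a few forms of POSER; an illustrative, incomplete set.
POSER_FORM = re.compile(
    r"\bpos(?:er|es|ent|ons|ez|ait|ais|aient|ées|ée|és|é|e)\b",
    re.IGNORECASE,
)

def poser_with_question(lines):
    """Return (matched form, line) pairs where a POSER form co-occurs with 'question'."""
    results = []
    for line in lines:
        match = POSER_FORM.search(line)
        if match and re.search(r"\bquestion\b", line, re.IGNORECASE):
            results.append((match.group().lower(), line))
    return results

# Four lines taken from the concordance above.
lines = [
    "Je me suis posé la question de savoir ce que j’avais mal fait",
    "Posons ainsi la question : le climat change-t-il ?",
    "La question reste posée",
    "Ne serait-ce pas l’occasion de poser la question ?",
]
for form, line in poser_with_question(lines):
    print(form, "|", line)
```

A corpus that has been lemmatized in advance makes this kind of pattern unnecessary, since the search can then be run on the lemma directly.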

Dodd (1997) explores the use of a corpus of written German for advanced language learning. Teachers interested in Italian can find a variety of example exercises in the section on corpora and concordancing on the website of the LINGUA ICT4LT project.

Tim Johns, Joseph Rézeau and others use parallel corpora to create exercises which can be used in more than one language. For an example of English/Chinese concordancing, see

Using the DDL approach in teaching means providing the learner with examples of language as it is actually used in authentic contexts (see O'Keeffe, McCarthy and Carter 2007: 25-27). Teachers want to present their learners with authentic examples of the language, and in that sense it is not surprising that native speaker corpora are mostly used in DDL for teaching both grammar and vocabulary, as was pointed out above. However, Granger and Tribble (1998) suggest that combining native and learner corpus data can be a very useful way of focusing on learners’ errors. Allan (2008) suggests that a corpus of L2 simplified readers may be a good starting point for DDL with lower-level learners, since texts can then be chosen which are appropriate to the learners' level and which do not contain too much distracting lexically difficult material.

Finally, we must return to the difficulty and unfamiliarity, for most learners, of reading and working with concordances. A good deal of preparation and support from the teacher, together with regular, supervised classroom practice, may be necessary before learners can feel comfortable with DDL activities and begin to see them as a natural and useful part of their language-learning experience.


© 2002-2012 CALPER and The Pennsylvania State University. All Rights Reserved.