
English speakers who are 18 or under use the word ‘like’ in conversation over five times as often as speakers who are over 70; ‘because’ is the most misspelled English word globally; the word ‘love’ is said and written over six times more frequently than the word ‘hate’. We know all of this because of a multibillion-word database called the Cambridge English Corpus.
For learners of English to become proficient, subtle differences can be extremely important.
Claire Dembry
If the Cambridge English Corpus, created by Cambridge University Press, were to be printed on single-sided A4 paper and stacked into a tower, it would stand 600 m high, almost twice the height of the tallest building in the UK. If it were read aloud at an average reading speed, it would take 88,766 hours to read; working 7 hours a day, 5 days a week, that’s 49 years.
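The 49-year figure can be checked with a few lines of arithmetic. The sketch below uses the hours and working pattern quoted above; the 52-week working year with no holidays is an assumption made here, not something the article states.

```python
# Figures quoted in the article.
TOTAL_READING_HOURS = 88_766
HOURS_PER_DAY = 7
DAYS_PER_WEEK = 5

# Assumption (not from the article): 52 working weeks a year, no holidays.
WEEKS_PER_YEAR = 52


def reading_years(total_hours, hours_per_day, days_per_week, weeks_per_year):
    """Return how many working years it takes to read for total_hours."""
    hours_per_year = hours_per_day * days_per_week * weeks_per_year
    return total_hours / hours_per_year


years = reading_years(
    TOTAL_READING_HOURS, HOURS_PER_DAY, DAYS_PER_WEEK, WEEKS_PER_YEAR
)
print(f"{years:.1f} working years")
```

Under that assumption the result rounds to the 49 years given above; allowing for holidays would push the figure higher.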
The multibillion-word Cambridge English Corpus is a constantly updated record of how English is being used today in all its forms: spoken, written, business, academic, learner and e-language. Amassed over two decades, the electronic database draws on sources that range from the more expected (books, newspapers, journals, radio, television) to the more surprising (song lyrics, junk mail, voicemail messages and recordings from flight control).
Cambridge University Press researchers use the Corpus to investigate the most common words, phrases and grammatical patterns in English, and then use the results to improve English language teaching books.
“Context in English is important,” explained Dr Claire Dembry, Language Research Manager, “we analyse patterns in language and how English changes depending on context and circumstances. For learners of English to become proficient, these sorts of subtle differences can be extremely important, and it is only by amassing a vast number of examples that our writers, lexicographers and researchers can determine how best to describe the patterns of English in our learning materials.”
It all began in the 1990s, when a few CDs of American newspapers in electronic form were loaded into a database that both stored the data and ‘queried’ it, working out the relationships between words. Gradually, the embryo corpus was extended with further material and, today, almost any conceivable form of English can be found in the database.
At an early stage, Cambridge University Press realised that knowing which features of English learners find difficult is just as important as knowing how English is being used. “This decision, which led to the Cambridge Learner Corpus, had far-reaching effects and has become probably the single most important unique selling point for the Press’s English Language Teaching publishing,” said Ann Fiddes, Global Language Research Manager.
It turns out that because (misspelled as becouse), which (wich), accommodation (accomodation), advertisement (advertisment) and beautiful (beatiful) are the five words most commonly misspelled by learners globally.
Arriving at conclusions like this has taken years of painstakingly identifying, and tagging with computer-readable codes, the misspellings and grammatical errors made in Cambridge English Language Assessment examinations and held in the Cambridge Learner Corpus.
Comprehensive information about the learners who originally wrote the exam scripts (first language, nationality, age, gender, scores, and so on) is stored. These data, along with the ‘error tagging’, have enabled Cambridge University Press to publish materials that directly address the different types of errors of individual markets and individual language groups.
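To illustrate how error tags combined with learner information could be turned into per-language findings, here is a minimal sketch. The record layout and the sample rows are invented for illustration; they are not taken from the Learner Corpus, and the real tagging scheme is not described in this article.

```python
from collections import Counter, defaultdict

# Hypothetical error-tagged records: (first_language, intended_word, written_form).
# Both the tuple layout and the rows are invented, not real Corpus data.
TAGGED_ERRORS = [
    ("Spanish", "because", "becouse"),
    ("Spanish", "which", "wich"),
    ("Italian", "because", "becouse"),
    ("Spanish", "because", "becouse"),
    ("Italian", "accommodation", "accomodation"),
]


def misspellings_by_language(records):
    """Tally misspelled target words separately for each first language."""
    tallies = defaultdict(Counter)
    for first_language, intended, _written in records:
        tallies[first_language][intended] += 1
    return tallies


tallies = misspellings_by_language(TAGGED_ERRORS)
top_spanish = tallies["Spanish"].most_common(1)
print(top_spanish)
```

Grouping by first language is only one possible cut; the same tally could be keyed on nationality, age or exam score, all of which the article says are stored.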
“This is hugely important for the Press and has meant that we have, for example, been able to publish the successful English for Spanish Speakers editions of global products, and become the market leader in Corpus-based publishing,” explained Fiddes.
Now, Cambridge University Press and Cambridge English Language Assessment have joined forces and set their sights on academic English.
The Cambridge English Corpus already contains over 400 million words of academic English, the largest and most extensive collection of its kind. It takes as its source written and spoken academic language at undergraduate, postgraduate and professional level from a range of academic disciplines and worldwide institutions. New research is pulling in data from sixth-form students as well as other academic levels, covering a much wider range of disciplines, genres and language backgrounds.
“Some interesting patterns have already emerged,” said Fiddes. “In our collection of academic English samples, the size adjectives significant, considerable, substantial and serious are much more frequent than big, massive, enormous and tremendous. In spoken English, however, big tops the list. We also found that in academic English, verbs such as solve, pose, face, resolve, tackle and circumvent frequently occur with the noun problem. These kinds of insights help us to develop a better understanding of the language skills needed by students at English-speaking universities.”
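A finding such as which verbs occur with the noun problem comes from counting collocates, that is, the words that appear near a chosen word. The sketch below shows the general idea on four toy sentences written for this example; it is not the Press's method, and the window size and stop list are assumptions.

```python
import re
from collections import Counter

# Toy sentences invented for illustration; they are not from the Corpus.
SAMPLE_SENTENCES = [
    "The committee must tackle the problem quickly.",
    "This approach does not solve the problem.",
    "Rising costs pose a problem for smaller institutions.",
    "We could not solve the problem with existing methods.",
]

# Assumption: a tiny hand-made stop list. A real study would more likely
# rely on part-of-speech tags to pick out the verbs.
STOP_WORDS = {"the", "a", "an"}


def collocates_before(sentences, node, window=2):
    """Count non-stop words found up to `window` tokens before `node`."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, token in enumerate(tokens):
            if token == node:
                for left in tokens[max(0, i - window):i]:
                    if left not in STOP_WORDS:
                        counts[left] += 1
    return counts


counts = collocates_before(SAMPLE_SENTENCES, "problem")
print(counts.most_common())
```

On these four sentences the tally picks out solve, tackle and pose. A two-word window keeps the toy example clean; on real text a wider window and proper tagging would be needed to avoid counting unrelated neighbours.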
As part of their current research, the team welcomes contributions of academic English to the corpus, and invites anyone interested in participating to contact them for more information.
“Corpus work is very closely linked with advances in technology and we are investigating automating many of our manual systems, such as error tagging and speech transcription,” added Fiddes. “Our research has already allowed us to partially automate the mark-up of errors in learner writing.
“These technologies will increase the speed at which we can maintain our grasp on what English is now, and what it might be in the future.”
For more information about the Cambridge English Corpus, please visit