Assessing vocabulary used in course wikipages in terms of Common European Framework of Reference for Languages (CEFR) level


As a former professor, I was inspired by the following article to look at the vocabulary used in some Canvas course rooms in terms of the Common European Framework of Reference for Languages (CEFR) levels:

Studie: Vissa universitetslärare inte bättre än gymnasieelever på engelska [Study: Some university teachers no better than upper-secondary students at English] - Universitetsläraren (un...

Teachers’ receptive and productive vocabulary sizes in English-medium instruction (

There are two programs: one extracts the vocabulary used in a course room, and the other prunes the resulting list of "words" and adds information about CEFR levels. Note that this is primarily directed at courses in American English (support for words in Swedish is very limited at present).

The vocabulary is based on a third-year course in Internetwork, a 4th-year course in Voice over IP, a 4th-year course in research methodologies and scientific writing, a course in accelerated computing, and a course in data science. [The last two courses are based on material provided by Nvidia's Deep Learning Institute under a CC-BY license that has been converted into Canvas course rooms.]

The courses have accompanying videos. These have been captioned, and a wikipage has been created for each PowerPoint slide, with the corresponding transcript from the video added to the wikipage. As a result, the input material includes a very large portion of the course content. It does not include content in files, quizzes, etc.
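The extraction step can be illustrated with a minimal sketch. It assumes the HTML bodies of the wikipages have already been fetched from Canvas; the names `words_in_page`, `vocabulary`, and `WORD_RE` are hypothetical and are not taken from the actual programs, whose tokenization rules may well differ.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, skipping script/style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


# Assumption: a "word" is a run of letters, optionally joined by
# internal hyphens or apostrophes (e.g. "third-year", "Nvidia's").
WORD_RE = re.compile(r"[^\W\d_]+(?:[-'’][^\W\d_]+)*")


def words_in_page(html_body):
    """Return the list of words found in one wikipage body."""
    parser = _TextExtractor()
    parser.feed(html_body)
    parser.close()
    return WORD_RE.findall(" ".join(parser.chunks))


def vocabulary(pages):
    """Count every word over an iterable of wikipage HTML bodies."""
    counts = Counter()
    for body in pages:
        counts.update(words_in_page(body))
    return counts
```

Case is deliberately preserved at this stage, since the later pruning distinguishes capitalized words, acronyms, and lower-case words.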

Hopefully, with some automated feedback to the teachers, the accessibility of the course material can be increased. It might even be useful for students to know the distribution of CEFR levels for the vocabulary in a course they are considering taking.

Each of the two programs takes a course_id as its only argument.
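As a rough illustration of that command-line interface, here is a sketch using Python's standard `argparse` module. The function names are hypothetical, and treating the course_id as a plain string is an assumption; the actual programs may parse their argument differently.

```python
import argparse


def build_parser():
    """Build a parser that accepts a single positional course_id."""
    parser = argparse.ArgumentParser(
        description="Process the wikipages of one Canvas course room."
    )
    # Assumption: the identifier is kept as a string and passed on as-is.
    parser.add_argument("course_id", help="identifier of the Canvas course")
    return parser


def main(argv=None):
    """Parse the command line and return the course_id that was given."""
    args = build_parser().parse_args(argv)
    print(f"Processing course {args.course_id}")
    return args.course_id
```

Calling `main(["12345"])` from a test, or `main()` from a script entry point, handles the single argument in the same way.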

The code can be found at

In addition to outputting a number of files, the prune program gives a summary in the form shown below (for a course that has 187,072 words in it):

Loading some directories
2999 entries in American3000
2003 entries in American5000
7459 words in common_English_words
376 words in common Swedish_words
Pruning the input
10540 unique words - initially 
10383 words left, 157 place names removed
10326 words left, 57 misc_words_to_ignore removed
10229 words left, 97 company_and_product_names removed
10208 words left, 21 abbreviations_ending_in_period removed
10206 words left, 2 common_programming_languages removed
10079 words left, 127  domainnames removed
9738 words left, 341  improbable words removed
1688 likely acronyms
7937 unique words after filtering acronyms and single letters
7936 unique words after filtering if there is a capitalized and lower case version of the word or title case turn to lower case
7838 words left, 98 top_100_English_words removed
7197 words left, 641 thousand_most_common_word_in_English removed
5744 words left, 1453 Oxford American 3000 words removed
5061 words left, 683 Oxford American 5000 words removed
2435 words left, 2626 common English words removed
2205 words left, 230 common_swedish_words removed
2220 words left, 15 words added after processing words that appear in title case
1565 starting with a capital letter (70.50%)
638 starting with a lower case letter (28.74%)
17 starting with other letter (0.77%)
Some statistics about the CEFR levels of the words as determined by the four main data sources
The totals are the total numbers of the input words in this source.
The percentage shown following the totals indicates what portion of the words from this source used in the course pages.
The American 3000 and 5000 sources have an explicit column of plurals; the rest are considered "singular".
The level xx indicates that the word does not have a known CEFR level.
American 3000: total: 2012 (67.09%), singular: 1554, plural: 460
singular: {'A1': 543, 'A2': 443, 'B1': 305, 'B2': 263}
  plural: {'A1': 520, 'A2': 442, 'B1': 303, 'B2': 261}
American 5000: total: 600 (29.96%), singular: 482, plural: 119
singular: {'B2': 188, 'C1': 294}
  plural: {'B2': 188, 'C1': 294}
common English words: total: 2784 (37.32%)
{'A1': 163, 'A2': 166, 'B1': 295, 'B2': 311, 'B2x': 1, 'C1': 167, 'C2': 98, 'xx': 1582}
common Swedish words: total: common_swedish_words_count=188  (50.00%)
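The shape of this summary can be reproduced with a short sketch: a sequence of pruning stages, each removing the words found in one ignore list, followed by a tally of CEFR levels for one data source. The function names and the word-to-level mapping format are assumptions made for illustration and are not taken from the actual prune program. The percentage is computed as the share of a source's entries that occur in the course, which is consistent with the figures above (for example, 2012 of 2999 entries is 67.09%).

```python
from collections import Counter


def prune(words, stages):
    """Apply each (label, ignore_set) stage in order, reporting removals."""
    remaining = set(words)
    for label, ignore in stages:
        removed = remaining & ignore
        remaining -= removed
        print(f"{len(remaining)} words left, {len(removed)} {label} removed")
    return remaining


def cefr_summary(course_words, source_levels):
    """Tally CEFR levels of a source's words that occur in the course.

    source_levels maps word -> CEFR level, with 'xx' meaning that the
    word has no known level (hypothetical format).
    Returns (number used, percentage of the source used, level counts).
    """
    used = [w for w in source_levels if w in course_words]
    pct = 100.0 * len(used) / len(source_levels) if source_levels else 0.0
    levels = Counter(source_levels[w] for w in used)
    return len(used), pct, levels
```

For example, calling `prune(words, [("place names", place_names)])` prints one "words left" line per stage, and `cefr_summary(course_words, levels_by_word)` returns the total, the percentage, and a per-level count that can be printed in the same style as the dictionaries above.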

