Creating an Index

maguire
Community Champion

To follow up on my earlier question in "Generating an index and permitted attributes for <span>", this blog post contains some more information about generating an index from the pages in a Canvas course. A full description, script, and source code can be found under "Making an index" at GitHub - gqmaguirejr/Canvas-tools: Some tools for use with the Canvas LMS.

Basically, the process is based on creating, in a local directory, a copy of all of the HTML pages in a Canvas course, along with some metadata about the module items in the course. Once you have the files, you can find keywords and phrases in the HTML and then construct the index - or, in my case, a number of different indexes. I have split the process of finding keywords and phrases into two parts: the first part works on the HTML files to find the strings in the various tags and stores them in a JSON-formatted file; the second part is a program that computes the indexes. In this second program I started by splitting the text into words with a simple regular expression and then switched to the Python NLTK package - specifically, the functions nltk.sent_tokenize(string1) and nltk.word_tokenize(string2).
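The first stage above can be sketched roughly as follows. This is a minimal, hypothetical illustration (the class and function names are mine, not from the actual script): it pulls the text out of selected tags with the standard-library HTML parser, stores the result as JSON, and then splits the text into words with a simple regular expression, as in my first attempt (the real script uses NLTK's tokenizers instead):

```python
import json
import re
from html.parser import HTMLParser

class TagTextExtractor(HTMLParser):
    """Collect the text contents of selected tags from one HTML page."""
    def __init__(self, tags=("h1", "h2", "h3", "p", "li", "span")):
        super().__init__()
        self.tags = set(tags)
        self.stack = []                      # currently open tags of interest
        self.texts = {t: [] for t in self.tags}

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack:                       # attribute text to the innermost open tag
            text = data.strip()
            if text:
                self.texts[self.stack[-1]].append(text)

def extract_page_strings(html):
    parser = TagTextExtractor()
    parser.feed(html)
    return parser.texts

# Stage 1: per-page strings by tag, saved as JSON
page = "<p>An <span>Autonomous system number</span> identifies a network.</p>"
strings = extract_page_strings(page)
json_blob = json.dumps(strings)

# Stage 2: split the collected text into words; a simple regular expression
# here, where the real script uses nltk.sent_tokenize/nltk.word_tokenize
words = re.findall(r"[A-Za-z']+", " ".join(strings["p"] + strings["span"]))
```

The two stages stay decoupled: the JSON file from stage 1 can be recomputed only when pages change, while the index-building stage can be rerun cheaply as the tokenization improves.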

The resulting page (computed from ~850 HTML files) can be seen at Test page 3: Chip sandbox 

With regard to <span>, I found it useful in three ways:

1. To keep a set of words together as a logical "word":

<span>Adam Smith</span>
<span>Autonomous system number</span>

2. To mark text that I did not want to index:

<span class="dont-index">20% error rate</span>

3. To mark text as a reference (that I do not want to index):

<span class="inline-ref">(see Smith, Figure 10, on page 99.)</span>
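The second and third uses can be honored during extraction by skipping any span whose class marks it as not-to-be-indexed. A minimal sketch (my own illustration, not the actual script; it assumes the marked spans do not contain nested spans):

```python
from html.parser import HTMLParser

SKIP_CLASSES = {"dont-index", "inline-ref"}  # the classes from this post

class IndexableText(HTMLParser):
    """Collect page text, skipping spans marked as not-to-be-indexed."""
    def __init__(self):
        super().__init__()
        self.skipping = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            if any(c in SKIP_CLASSES for c in classes):
                self.skipping = True   # assumes marked spans are not nested

    def handle_endtag(self, tag):
        if tag == "span" and self.skipping:
            self.skipping = False

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())

page = ('<p>The <span class="dont-index">20% error rate</span> was reduced '
        'as described in <span class="inline-ref">(see Smith, Figure 10)</span> '
        'by <span>Adam Smith</span>.</p>')
parser = IndexableText()
parser.feed(page)
chunks = parser.chunks   # marked spans are gone; plain spans survive
```

Plain spans (use 1) pass through untouched, so "Adam Smith" is still available to be treated as a single logical word when building the index.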

Overall, the process of generating an index was useful: I found misspellings, inconsistent use of various terms and capitalization, and random characters that seemed to be typos or poor alternative img descriptions. It also was a nice forcing function to rework some of the content.

However, it remains a work in progress. I know that there are a number of weaknesses, such as not carefully language-tagging entries in the final index, and the need to remove some additional words that probably should not be in the index. Also, this is not a general-purpose natural language processing program: it could make better use of the NLTK package, and it is very English-centric (it assumes the default language of the course is English, it does not pass the actual language information to the tokenization function, and it contains stop words only for English).
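One way the English-centrism could be addressed is to select the stop-word list from the language tag of each entry rather than hard-coding English. A hypothetical sketch with tiny placeholder word lists (NLTK ships fuller ones via nltk.corpus.stopwords.words("english"), etc.):

```python
# Placeholder stop-word lists keyed by language tag; a real version would
# load them from NLTK's stopwords corpus for each language in the course.
STOP_WORDS = {
    "en": {"the", "a", "an", "and", "of", "to"},
    "sv": {"och", "att", "det", "en", "ett"},
}

def index_terms(words, lang="en"):
    """Drop stop words for the given language before indexing."""
    stops = STOP_WORDS.get(lang, STOP_WORDS["en"])
    return [w for w in words if w.lower() not in stops]
```

The same language tag could also be passed to nltk.sent_tokenize, which accepts a language argument for its sentence tokenizer.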