Generating an index and permitted attributes for <span>

maguire
Community Champion

I am transforming my course material from slides to web pages in the hope of making it more accessible. One aspect of this has been adding "lang" attributes to mark the language of the material, so that screen readers can use the appropriate speech synthesis for each language. Since the material is now in a format that is easier to process, I have also decided to generate various forms of indexes. However, source material that contains in-line references clutters up the indexes. So I tried the following construct to "hide" such material from the indexer:

<p>This is an in-line reference to hide from indexing, see <span class="inline-ref">Foo and Fee 'Elements of List Processing', Figure 0, on page 0</span> for an example.</p>

As far as I can tell, the Rich Content Editor (RCE) leaves the class attribute alone, even though it is not one of the "permitted" attributes (although <span> itself is an allowed HTML tag) according to https://s3.amazonaws.com/tr-learncanvas/docs/Canvas_HTML_Whitelist.pdf

However, I am unsure whether this will bite me later! [This is my question for this posting.]

In addition to the 'inline-ref' class, I have also defined another class called 'dont-index', so that I can explicitly mark text that I do not want to be indexed but that is not an inline reference.

I can also exclude from indexing any text that is below a <hr> or <hr />, as this is where I put notes and references on each page. Exclusion of this material and of the two classes above is controlled by an option to the program that finds keywords and phrases.
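As an illustration of the exclusions described above, here is a minimal sketch using only Python's standard html.parser module. It is not the code from my tools: the class name IndexableText, the function indexable_text, and the stop_at_hr parameter are hypothetical names chosen for this example, and the sketch assumes reasonably well-formed markup (every non-void tag inside an excluded element is closed).

```python
from html.parser import HTMLParser

# The two class names come from the post; everything else is a hypothetical sketch.
EXCLUDED_CLASSES = {"inline-ref", "dont-index"}
# Void elements have no closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class IndexableText(HTMLParser):
    """Collect text, skipping excluded elements and anything after the first <hr>."""

    def __init__(self, stop_at_hr=True):
        super().__init__(convert_charrefs=True)
        self.stop_at_hr = stop_at_hr
        self.parts = []
        self.skip_depth = 0      # > 0 while inside an excluded element
        self.after_hr = False    # True once an <hr> has been seen

    def handle_starttag(self, tag, attrs):
        if tag == "hr":
            if self.stop_at_hr:
                self.after_hr = True
            return
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            # Nested element inside an excluded one: track depth only.
            self.skip_depth += 1
            return
        classes = set((dict(attrs).get("class") or "").split())
        if classes & EXCLUDED_CLASSES:
            self.skip_depth = 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <hr />; it never changes nesting depth.
        if tag == "hr" and self.stop_at_hr:
            self.after_hr = True

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and not self.after_hr:
            self.parts.append(data)


def indexable_text(html, stop_at_hr=True):
    """Return the text of html that should be passed on to the indexer."""
    parser = IndexableText(stop_at_hr)
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind where excluded text was removed.
    return " ".join("".join(parser.parts).split())
```

Calling indexable_text() on a page body returns only the text outside the marked spans and above the first horizontal rule; passing stop_at_hr=False keeps the notes and references below the rule.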

An example of the output of the index for language tagged material along with figcaption and caption material can be found at https://kth.instructure.com/courses/11/pages/test-page-2?module_item_id=211651 

Note that the links to the actual material will not work unless you are within the institution, but it shows a general idea of how the indexing works.

When indexing groups of words split by stop words, one problem that I currently have is that the resulting file is a bit more than 2 MB in size, which is too large for a Canvas page. This is because maximum_long_text_length is 500 kilobytes - 1, as pointed out by @James in "Wiki Page: size limit". Now I have to figure out how to (automatically) divide the index logically into chunks no bigger than this.
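One possible approach to the chunking, shown here only as a sketch, is to pack whole index entries greedily into pages so that no entry is split across a page boundary. The function name split_into_pages and its parameters are hypothetical. The default limit assumes that "500 kilobytes - 1" means 500 * 1024 - 1 and that the limit applies to the UTF-8 encoded size of the page body; both are assumptions that should be checked against the Canvas instance.

```python
def split_into_pages(entries, limit=500 * 1024 - 1, overhead=0):
    """Greedily pack HTML fragments into pages of at most `limit` UTF-8 bytes.

    entries  -- list of HTML strings, one per index entry, already in index order
    limit    -- assumed maximum size of a page body in bytes
    overhead -- bytes reserved per page for a wrapper (heading, list tags, ...)
    """
    pages, current, size = [], [], overhead
    for entry in entries:
        n = len(entry.encode("utf-8"))
        if overhead + n > limit:
            raise ValueError("a single entry exceeds the page limit")
        if current and size + n > limit:
            # The current page is full: close it and start a new one.
            pages.append("".join(current))
            current, size = [], overhead
        current.append(entry)
        size += n
    if current:
        pages.append("".join(current))
    return pages
```

Because the entries stay in order, each resulting page covers a contiguous range of the index and could be titled with its first and last entries. A more "logical" division, for example one page per initial letter, would need a further split for any letter whose entries exceed the limit.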

The programs are:

cgetall.py - to get all the pages for the course to a local directory
find_keyords_phrase_in_files.py - to collect the keywords and phrases
create_page_from_json.py - to create a page with the index material
ccreate.py (not yet available in the public GitHub repository) - to insert the resulting page

All of the programs are on GitHub in gqmaguirejr/Canvas-tools ("Some tools for use with the Canvas LMS").
