I'm trying to post HTML content pages (e.g. a syllabus) that contain HTML entities such as the HTML5 checkmark (&check;). The content posts successfully, but I think the leading "&" is expanded server-side to "&amp;". I've looked at the JSON I'm posting, and the entities are unexpanded in that file. Is there a standard trick for avoiding this? Otherwise I guess I'll just have to remove these characters from my syllabus, is that right?
I have this string saved in ~/syl-pp.json:
"syllabus_body": "<table><tr><td class=\"org-left\">&check;</td>\n<td class=\"org-left\">&check;</td>\n<td class=\"org-left\">&check;</td>\n<td class=\"org-left\">&check;</td>\n</tr></table>\n",
I post with this command:
curl -X PUT -d @/home/matt/syl-pp.json -H 'Authorization: Bearer API-SECRET' "https://uni-baseurl/api/v1/courses/64706" --header "Content-Type: application/json"
The final result is wrapped in a bunch of other HTML of course, but that snippet exports to this:
If I go into devtools and change each &amp;check; to &check;, they display as checkmarks, so I don't think the issue is in the browser. I imagine this is a standard escaping issue that I ought to be able to figure out on my own, but I can't...
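For anyone who wants to check this without clicking through devtools, here is a minimal sketch that scans returned HTML for entities whose ampersand came back escaped (for example `&amp;check;` in the page source). The function name and the sample string are made up for illustration; nothing here is part of the Canvas API.

```python
import re

# Matches an entity whose ampersand was itself escaped, e.g. "&amp;check;".
DOUBLE_ESCAPED = re.compile(r"&amp;(#?[A-Za-z0-9]+);")

def double_escaped_entities(rendered_html):
    """Return the distinct entity names that came back as literal text."""
    seen = []
    for name in DOUBLE_ESCAPED.findall(rendered_html):
        if name not in seen:
            seen.append(name)
    return seen

# Hypothetical page source: &check; was double-escaped, &copy; was not.
sample = '<td class="org-left">&amp;check;</td><td>&copy;</td><td>&amp;check;</td>'
print(double_escaped_entities(sample))  # ['check']
```

This is only a heuristic: a legitimately escaped ampersand followed by a word and a semicolon would also match.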
I think it's because Canvas doesn't like &check;. If you try &copy;, it works fine.
I get the same if I try via the browser.
Maybe there is a list of allowed entities somewhere.
Ah yes, thank you again Peter! I grabbed the whole W3C list from https://raw.githubusercontent.com/w3c/html/master/entities.json and pasted the entities into a request. It looks like only a small number of them are supported (I noticed arrows, a few math/logic symbols, and I think some playing card icons), but mostly they just rendered as literal text such as "&Aacute;". I haven't figured out yet how to unlock a page so it's truly public, but I'll maybe paste in a reference here when I figure that out.
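In case it helps someone repeat the experiment, here is a rough sketch of building a probe table with one row per entity, so it is easy to see after posting which ones survive. It uses Python's built-in `html.entities.html5` table as a stand-in for the W3C entities.json file (an assumption on my part that the two cover the same names), and `build_probe_table` is a hypothetical helper.

```python
import html.entities

def build_probe_table(names):
    """Build an HTML table with the entity name as text next to the entity itself."""
    rows = []
    for name in names:
        # First cell: the bare name; second cell: the actual entity reference.
        rows.append(f"<tr><td>{name}</td><td>&{name};</td></tr>")
    return "<table>" + "".join(rows) + "</table>"

# Keys in html.entities.html5 look like "check;"; a few legacy names also
# appear without the trailing ";" and are skipped here.
ALL_NAMES = sorted(k[:-1] for k in html.entities.html5 if k.endswith(";"))

probe = build_probe_table(["check", "copy", "Aacute"])
print(probe)
```

Passing `ALL_NAMES` instead of the short list gives the full table; the resulting string would then go into the syllabus body of the request.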
I set emacs to export checkmarks to the unicode versions, and will do the same thing with other problem cases I run into. It's a bit of a bummer b/c I use my source code as a resource for my students & it would be nice for them to be able to read the symbol expressions... but it's a very small cost.
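For anyone not exporting from Emacs, the same workaround can be sketched as a small pre-processing step: replace named entities with their Unicode characters before posting, but leave the ones that protect the markup itself. This is an illustration, not what my export actually runs; `entities_to_unicode` and the `KEEP` set are names I made up.

```python
import re
from html.entities import html5

# Entities that must stay escaped so the markup keeps its meaning.
KEEP = {"amp", "AMP", "lt", "LT", "gt", "GT", "quot", "QUOT", "apos"}

# A named entity reference such as "&check;".
ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

def entities_to_unicode(body):
    """Replace known named entities with their Unicode characters."""
    def replace(match):
        name = match.group(1)
        if name in KEEP:
            return match.group(0)
        # Unknown names are left exactly as they were written.
        return html5.get(name + ";", match.group(0))
    return ENTITY.sub(replace, body)

print(entities_to_unicode('<td class="org-left">&check;</td>'))
```

Looking names up directly in the `html5` table, rather than calling `html.unescape`, avoids that function's prefix matching of legacy entities written without a semicolon.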
carroll-ccsd, in an earlier post, pointed to an IRC response saying that Nokogiri is used to sanitize the HTML text; see "HTML sanitation rules applied to HTML in submission body".
The courses controller (courses_controller.rb) has:
params_for_create[:syllabus_body] = process_incoming_html_content(params_for_create[:syllabus_body])
... eventually this ends up calling nokogiri to parse the HTML, then somewhere it is sanitizing the parsed HTML.
I considered reposting that, but it doesn't define which characters will be scrubbed.
The HTML whitelist only shows the tags that are allowed, not the characters.
I'd like to see if there's a difference between what the RCE will accept vs the API.
I'll look and ask around next week.