matt_price
Community Participant

Protect HTML entities from server-side escaping?

I'm trying to post HTML content pages (e.g. a syllabus) that contain HTML entities such as the HTML5 checkmark (✓). The content posts successfully, but I think the leading "&" is escaped server-side to "&amp;". I've looked at the JSON I'm posting, and the entities are not escaped in that file. Is there a standard trick for avoiding this? Otherwise I guess I will have to remove these characters from my syllabus, is that right?

10 Replies
robotcars
Community Champion

Are you sending the data as a payload or appended to the query string?

Can you share the code or payload you're trying?

matt_price
Community Participant

I have this string saved in ~/syl-pp.json:

-------------------

{
    "course": {
        "syllabus_body": "<table><tr><td class=\"org-left\">&check;</td>\n<td class=\"org-left\">&check;</td>\n<td class=\"org-left\">&check;</td>\n<td class=\"org-left\">&check;</td>\n</tr></table>n",
        "is_public": "nil",
        "grading_standard_id": 15,
        "license": "cc_by_nc_sa",
        "default_view": "syllabus",
        "license": "cc_by_nc_sa"
    }
}

-------------------

I post with this command:

-------------

curl -X PUT  -d @/home/matt/syl-pp.json  -H 'Authorization: Bearer API-SECRET'  "https://uni-baseurl/api/v1/courses/64706"  --header "Content-Type: application/json"

-------------

The final result is wrapped in a bunch of other HTML of course, but that snippet exports to this:

--------------

<table><tbody><tr>
<td class="org-left">&amp;check;</td>
<td class="org-left">&amp;check;</td>
<td class="org-left">&amp;check;</td>
<td class="org-left">&amp;check;</td>
</tr></tbody></table>

---------------

If I go into devtools and change each &amp;check; to &check;, they display as checkmarks, so I don't think the issue is in the browser. I imagine this is a standard escaping issue that I ought to be able to figure out on my own, but I can't...

I don't cURL very often... maybe pklove knows?

It's also been a while since I sent HTML to the API, but in Python, I UTF-8 encode the HTML string before sending.

Also check these flags, curl - How To Use 
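For anyone who wants to try the Python route mentioned above, here is a minimal sketch of building the same PUT request with a UTF-8 encoded JSON body, using only the standard library. The function name `build_syllabus_request` is made up for illustration, and the base URL, course ID, and token are the placeholders from the curl command. The later replies suggest the entity handling happens on the server, so this may well not change how `&check;` is treated; it only shows the encoding step.

```python
import json
import urllib.request


def build_syllabus_request(base_url, course_id, token, syllabus_html):
    """Build (but do not send) a PUT request that updates a course syllabus."""
    payload = {"course": {"syllabus_body": syllabus_html}}
    # ensure_ascii=False keeps characters such as U+2713 as real characters;
    # .encode("utf-8") then turns the JSON text into UTF-8 bytes.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/courses/{course_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
    )


# Placeholder values taken from the curl example in this thread.
req = build_syllabus_request(
    "https://uni-baseurl", 64706, "API-SECRET", "<td>\u2713</td>"
)
# urllib.request.urlopen(req) would actually send it; left out on purpose.
```

Building the request separately from sending it makes it easy to inspect `req.data` and confirm what bytes would go over the wire.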

pklove
Community Champion

I think it's because Canvas doesn't like &check;. If you try &copy;, then it will work fine.

I get the same if I try via the browser.

Maybe there is a list of allowed entities somewhere.

pklove
Community Champion

It looks like you can use the Unicode code point as a numeric character reference.

These work: &#x2713; &#x2714;

pklove
Community Champion

And more fun with &#x2611; &#x2705;

matt_price
Community Participant

Ah yes, thank you again Peter! I grabbed the whole W3C list from https://raw.githubusercontent.com/w3c/html/master/entities.json and pasted the entities into a request. It looks like only a small number of them are supported (I noticed arrows, a few math/logic symbols, and I think some playing card icons); most of them just rendered as literal text such as "&Aacute". I haven't figured out yet how to unlock a page so it's truly public, but I will maybe paste in a reference here when I figure that out.
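A rough sketch of how a test page like that could be generated, in case someone wants to repeat the experiment. It uses Python's bundled `html.entities.html5` table rather than downloading entities.json, and `entity_test_table` is a made-up helper name. Each row shows the entity name as literal text next to the entity itself, so unsupported ones are easy to spot after posting.

```python
from html.entities import html5


def entity_test_table(names):
    """Return an HTML table with one row per entity name."""
    rows = []
    for name in names:
        # Left cell: the name as literal text (the "&" is escaped).
        # Right cell: the entity under test, left unescaped.
        rows.append(f"<tr><td>&amp;{name};</td><td>&{name};</td></tr>")
    return "<table>" + "".join(rows) + "</table>"


# Names from Python's table that end in ";", with the semicolon stripped.
all_names = sorted(key[:-1] for key in html5 if key.endswith(";"))

# A two-entity sample; pass all_names instead to test the full list.
sample = entity_test_table(["check", "copy"])
```

The resulting string would go into `syllabus_body` (or a page body) in the same way as the JSON shown earlier in the thread.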

I set Emacs to export checkmarks to the Unicode versions, and will do the same thing with other problem cases I run into. It's a bit of a bummer b/c I use my source code as a resource for my students & it would be nice for them to be able to read the symbol expressions... but it's a very small cost.
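For anyone not exporting from Emacs, the same workaround can be sketched as a post-processing step in Python: rewrite named entities as hexadecimal numeric references before posting. The helper name `named_to_numeric` and the `KEEP` set are illustrative choices, and the lookup relies on Python's `html.entities.html5` table. Whether Canvas accepts every numeric reference is an assumption; only the checkmark and checkbox examples were reported to work above.

```python
import re
from html.entities import html5

# Entities left untouched so the markup itself is not altered.
KEEP = {"amp", "lt", "gt", "quot", "apos"}

# Matches a named entity that is terminated by a semicolon, e.g. &check;
ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(markup):
    """Replace named entities such as &check; with hex references such as &#x2713;."""

    def repl(match):
        name = match.group(1)
        chars = html5.get(name + ";")
        if name in KEEP or chars is None:
            # Markup-significant or unknown names are passed through unchanged.
            return match.group(0)
        # Some entities expand to more than one code point, hence the join.
        return "".join(f"&#x{ord(ch):X};" for ch in chars)

    return ENTITY_RE.sub(repl, markup)


converted = named_to_numeric('<td class="org-left">&check;</td>')
```

Running this over the export, rather than editing the source, keeps the readable `&check;` names in the original files.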

maguire
Community Champion

In an earlier post, carroll-ccsd pointed to an IRC response saying that Nokogiri is used to sanitize the HTML text; see HTML sanitation rules applied to HTML in submission body 

The courses controller (courses_controller.rb) has:

if params_for_create.has_key?(:syllabus_body)
  params_for_create[:syllabus_body] = process_incoming_html_content(params_for_create[:syllabus_body])
end

... eventually this ends up calling Nokogiri to parse the HTML, and somewhere after that the parsed HTML is sanitized.

robotcars
Community Champion

I considered reposting that, but there's no definition of what will be scrubbed as far as characters.

The HTML whitelist only shows the tags that are allowed, not characters.

https://s3.amazonaws.com/tr-learncanvas/docs/Canvas_HTML_Whitelist.pdf 

I'd like to see if there's a difference between what the RCE will accept vs the API.

I'll look and ask around next week.