Error in link validator not handling relative links

maguire · ‎04-14-2021

Some time ago I wrote a program to compute an index for a course by walking the course pages, identifying key terms in each page, and collecting all of the figure and table captions, all of the text that has been tagged as being in a language other than English, etc.

find_keyords_phrase_in_files.py and create_page_from_json.py

See the heading "Making an index" at https://github.com/gqmaguirejr/Canvas-tools; details of using it can be found at https://canvas.kth.se/courses/11/pages/indexing-a-course?module_item_id=232285

The result is a wikipage with entries of the form:

<ul>
<li>sockets API
<ul>
<li><a href="../modules/items/316319">Socket API</a></li>
</ul>
</li>
</ul>

Note that the anchor HREF is a relative location: the "../" gets replaced by the browser with the prefix for the page (https://canvas.kth.se/courses/21521), yielding the full URL https://canvas.kth.se/courses/21521/modules/items/316319
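
To illustrate the resolution step, here is a minimal sketch using Ruby's standard URI library. The page name "index-page-1" is made up for the example (any page under /courses/21521/pages/ would resolve the same way); only the HREF and the course prefix come from the index described above.

```ruby
require "uri"

# Hypothetical URL of one of the index wikipages (the page name is an assumption).
page_url = "https://canvas.kth.se/courses/21521/pages/index-page-1"

# The relative HREF as it appears in the generated index.
href = "../modules/items/316319"

# URI.join performs standard relative-reference resolution:
# "../" steps up from /courses/21521/pages/ to /courses/21521/.
resolved = URI.join(page_url, href).to_s
puts resolved
# https://canvas.kth.se/courses/21521/modules/items/316319
```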

I had to use these relative HREFs to reduce the index for the course down to 3 pages due to the limited size of wikipages. This works perfectly well for the students in the course.

If I run the link validator from the page that says:

Course Link Validator

The course link validator searches course content for invalid or unreachable links and images.

The link validator reports: Found 14,610 broken links

Nearly all of the claimed broken links do in fact work.

When my institution reported this problem to Instructure, their support responded to the first message with:

>
> Thank you for contacting Canvas support! I'm sorry to hear you are
> experiencing some issues in Canvas. I would be happy to help you with
> that.
> Our Link validator often fails on valid links when content is copied.
> There is not a workaround for this. Canvas is aware of this issue and
> we have a team investigating this. I have attached your case to their
> projects so you will get updates on this issue. Please let us know if
> you have any additional questions.

Their second response was:

> So it does appear that the instructor is right and there is an issue with the href and the link validator. Unfortunately our link validator is extremely literal in that it needs the full link and for it to go directly to the page or it flags the link as broken. In this case if we look at the HTML we can see the href reads ../modules/items/316861 for one of the links and this is valid this link works, the browser resolves the ../ or relative path by just adding it to the end of the current URL. This is also why in the link validator its getting flagged is because the link validator can't find the course for the relative since it doesn't work the same as the browser.
>
> A link will also get flagged if there is a re-direct. If you have a working link for an article that used to live at "workinglink.com" will say, but they were bought out by "welikebrokenlinks.com" and all articles moved there, even though the browser knows to resolve this with a redirect the link validator doesn't and will flag it. The Link Validator is good for getting a general idea of which links may be broken but its not 100% accurate.
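
As an aside on the redirect case in that reply, the following is a rough sketch of what "following a redirect" amounts to for a checker. It is not Canvas code: the helper name, the hop limit, and the stubbed responses are all assumptions, and the HTTP fetch is passed in as a callable so the example runs without network access.

```ruby
require "uri"

# Hypothetical helper, not part of Canvas. `fetch` is any callable that takes
# a URL and returns [status_code, location_header_or_nil].
def reachable_following_redirects?(url, fetch, max_hops = 5)
  (max_hops + 1).times do
    status, location = fetch.call(url)
    return true if (200..299).cover?(status)
    return false unless (300..399).cover?(status) && location
    # A Location header may itself be relative, so resolve it against the current URL.
    url = URI.join(url, location).to_s
  end
  false # gave up: too many redirects
end

# Stubbed responses modelled on the example domains in the support reply.
responses = {
  "http://workinglink.com/article" => [301, "http://welikebrokenlinks.com/article"],
  "http://welikebrokenlinks.com/article" => [200, nil]
}
stub_fetch = ->(url) { responses.fetch(url, [404, nil]) }

puts reachable_following_redirects?("http://workinglink.com/article", stub_fetch)
puts reachable_following_redirects?("http://workinglink.com/missing", stub_fetch)
```

With the stub above, the moved article counts as reachable while the unknown URL does not.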

Looking at the code, the ERB template for the link validator is ./app/views/courses/link_validator.html.erb:

<%
content_for :page_title, join_title(t(:page_title, "Course Link Validator"), @context.name)
js_env :validation_api_url => api_v1_course_link_validation_url(@context)
js_bundle :course_link_validator
css_bundle :course_link_validator
%>

So in fact it knows the context (the course) in which it is validating the links.

Eventually the Ruby code in ./lib/course_link_validator.rb gets invoked, and for wikipages it does:

    # Wiki pages
    self.course.wiki_pages.not_deleted.each do |page|
      find_invalid_links(page.body) do |links|
        self.issues << {:name => page.title, :type => :wiki_page,
                   :content_url => "/courses/#{self.course.id}/pages/#{page.url}"}.merge(:invalid_links => links)
      end
    end

Thus it invokes the following two functions:

  # pretty much copied from ImportedHtmlConverter
  def find_invalid_links(html)
    links = []
    doc = Nokogiri::HTML(html || "")
    attrs = ['href', 'src', 'data', 'value']

    doc.search("*").each do |node|
      attrs.each do |attr|
        url = node[attr]
        next unless url.present?
        if attr == 'value'
          next unless node['name'] && node['name'] == 'src'
        end

        find_invalid_link(url) do |invalid_link|
          link_text = node.text.presence
          invalid_link[:link_text] = link_text if link_text
          invalid_link[:image] = true if node.name == 'img'
          links << invalid_link
        end
      end
    end

    yield links if links.any?
  end

  # yields a hash containing the url and an error type if the url is invalid
  def find_invalid_link(url)
    return if url.start_with?('mailto:')
    unless result = self.visited_urls[url]
      begin
        if ImportedHtmlConverter.relative_url?(url) || (self.domain_regex && url.match(self.domain_regex))
          if valid_route?(url)
            if url.match(/\/courses\/(\d+)/) && self.course.id.to_s != $1
              result = :course_mismatch
            else
              result = check_object_status(url)
            end
          else
            result = :unreachable
          end
        else
          unless reachable_url?(url)
            result = :unreachable
          end
        end
      rescue URI::Error
        result = :unparsable
      end
      result ||= :success
      self.visited_urls[url] = result
    end

    unless result == :success
      invalid_link = {:url => url, :reason => result}
      yield invalid_link
    end
  end

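So we see that they did think about relative links (hence the call to ImportedHtmlConverter.relative_url?), but not well enough: find_invalid_links is given only the page body, so a page-relative link such as ../modules/items/316319 is never resolved against the page's own URL before it is checked.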
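
One possible direction, sketched under assumptions rather than taken from Canvas: resolve each link against the page's content_url (the same "/courses/<id>/pages/<page>" path built in the wiki-page loop above) before handing it to the route check. The helper name resolve_against_page, the placeholder host, and the page name are all hypothetical.

```ruby
require "uri"

# Hypothetical helper (not part of Canvas): turn a link found in a page body
# into a site-absolute path by resolving it against the page's own path.
def resolve_against_page(content_url, link)
  # A throwaway host is needed only so that URI can do the resolution.
  base = URI.parse("http://placeholder.invalid" + content_url)
  resolved = base.merge(link)
  # Links that point at another host (or scheme) are left untouched.
  return link unless resolved.host == base.host
  path = resolved.path
  path += "?#{resolved.query}" if resolved.query
  path
end

content_url = "/courses/21521/pages/index-page-1" # assumed page name
puts resolve_against_page(content_url, "../modules/items/316319")
# /courses/21521/modules/items/316319
```

The resulting path has the /courses/<id>/... shape that the course_mismatch regular expression in find_invalid_link matches, so a check along these lines could in principle run on the resolved path instead of the raw "../" form.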