Relevant Elephants: Fixing Canvas Commons Search

This idea has been developed and deployed to Canvas

I'm not really sure how to write this as a feature idea because, as far as I am concerned, it is a bug that needs to be fixed. But my dialogue with the Helpdesk went nowhere (Case #03332722), so I am submitting it as a feature request per their advice. I am not proposing how to fix this problem; I am just going to document it. People who actually know something about repository search would need to be the ones to propose the right set of search features; I am not an expert in repository search or in fixing a defective search algorithm. But as a user, I can declare that this is a serious problem.

If you search Canvas Commons for elephants, you get 1084 results. Here are the "most relevant" results:

https://lor.instructure.com/search?sortBy=relevance&q=elephant 

There are some elephants, which is good... but a lot of other things that start with ele- ... which is not good. Apparently there is some component in the search algorithm which returns anything (ANYTHING) that matches the first three characters of the search string.

Election Unit Test.

Static Electricity Virtual Lab.

And so on. And so on. Over 1000 false positives. ele-NOT-elephants.

[Screenshot: most relevant elephant search]

As near as I can tell, there might be a dozen elephants in Canvas Commons. I've found six for sure; there could be more... it's impossible to find them. Impossible because of the way that search works (or doesn't work). The vast -- VAST -- majority of results are for electricity, elections, elearning, elements, elementary education, electrons, and anything else that starts with ele.

You might hope that all the elephants are there at the start of the "most relevant" search results... but you would be wrong. There are 5 elephants up at the top, but then "Static Electricity Virtual Lab" and "Valence Electrons and Isotopes" etc. etc. are considered more relevant than Orwell's essay "Shooting an Elephant" (there's a quiz for that). I have yet to figure out why Static Electricity Virtual Lab is considered a more relevant search result for "elephant" than materials for George Orwell's elephant essay, which actually involves an elephant.

I found out about Orwell's Elephant this way: when I sort the results by "Highest Rated," the top-rated elephant is Orwell's elephant. There are lots of other highest-rated items at the top, though, which have nothing to do with elephants, and that is why you cannot see Orwell's elephant in my screenshot. It's below all those other items. But if you scroll on down, you will find Orwell's elephant essay. Eventually.

I found it using Control-F in my browser.

Here is the search URL:

https://lor.instructure.com/search?sortBy=rating&q=elephant 

[Screenshot: highest rated elephant results (with no elephants)]


Switch the view to "Latest" and all the elephants are missing here too. Really missing. Well, you'll get to them eventually, I guess, if you keep loading more and more... and more and more... and more. But no one is going to scroll and load and scroll and load to find the elephants, right?

Here's the search term: elephant. But the search results are for ele-, like elementary mathematics, elementary algebra, "Abraham Lincoln Elementary's 5th Grade beginning of the year prompt," "the elements involved in warehouse management," and so on.

https://lor.instructure.com/search?sortBy=date&q=elephant 

[Screenshot: latest elephants... but there are no elephants]

I hopefully tried putting quotation marks around the word "elephant" to see if that would help. It did not. 

The Helpdesk tells me that this is all on purpose in order to help people with spelling errors:

That is how our search engine was set up. To search as specific as it can get and then to gradually filter in less specific content. This is done so that if a word is misspelled the search is still able to locate it.

If I type "elphant," then Google search shows me results for "elephant." That sounds good. It corrected my typo. But Canvas Commons gives me no elephants if I type "elphant." Instead it gives me two things: an item submitted by someone named Elpidio, and something called "Tech PD and Educational Technology Standards," which involves the acronym ELP. So much for helping people with spelling errors.

Electricity, elections, elements, elearning: these do not sound good. Those results are obstructing the search; they are not helping. There is nothing "gradual" about the filtering. Static electricity shows up as more relevant than George Orwell's elephant. Some kind of three-character search string is driving the algorithm to the exclusion of actual elephant matches.

If you assume that someone who typed ELEPHANT really meant to type ELECTRICITY or perhaps ELEARNING, well, that is worse than any autocorrect I have ever seen. And I have seen some really bad autocorrect.

This happens over and over again; it affects every search.

Want to search for microscopes? Get ready for a lot of Microsoft. These are supposedly the most relevant microscope search results, but the second item is Microsoft... even though, from what I can tell, it doesn't seem to have anything to do with microscopes at all.

https://lor.instructure.com/search?sortBy=relevance&q=microscope 

Still, we're doing better than with the elephants here. There are a lot of microscopes in addition to Microsoft:

[Screenshot: microscope search; most relevant]

But look what happens if you want highest-rated microscopes. See the screenshot; there are no microscopes. It's Microsoft Microsoft Microsoft. But hey, there is also Of Mice and Men!

https://lor.instructure.com/search?sortBy=rating&q=microscope 

So, the search algorithm assumes that, while I typed "microscope" as my search term, I might really have meant to type "Of Mice and Men." Or Microsoft. Or the name Michael (a lot of content contributors are named Michael... or Michelle).

[Screenshot: highest rated microscopes]

I could go on. But I hope everybody gets the idea. If this is really a feature of Canvas Commons search and not a bug (???), I hope this three-character search string "feature" can be replaced with a better set of search features.

Although I would still call this three-character approach to search a bug, not a feature. Which is to say: I hope we don't have to wait a couple of years for the (slow and uncertain) feature request process before this gets reexamined.

Comments from Instructure

For more information, please read through the Canvas release notes for 2018-10-27: https://community.canvaslms.com/docs/DOC-15588-canvas-release-notes-2018-10-27

51 Comments
laurakgibbs
Community Champion
Author

FWIW, a friend of mine at Twitter described this problem as "overstemming." It is not a search feature. It is a search error. (Although I'm not sure that defaulting to the first three characters of the search term even counts as stemming of any kind, really.)

James
Community Champion

On its face, this sounds stupid.

Once you start to dig in deeper, it starts to make sense, although some tweaks to the algorithm may be beneficial.

It makes more sense to me now than I expected it would when I mentioned it in another thread.

Orwell's elephant doesn't contain "elephant" in the title; it is in the description. Presumably a word that appears in the title would make an item more relevant than one where the word appears only in the description. The problem that jumps out at me is that they prioritize spelling mistakes over words in the description.

Canvas doesn't know from picture analysis that there is an elephant in the picture, so they are limited to what people type in the description or tag it with. Maybe they could work with a major cloud provider to analyze the picture, return relevant keywords, and factor those in, but then someone could throw in a completely unrelated picture and skew the results.

The "users can't spell" argument is valid. Even if the person searching can spell, there's no guarantee that person who shared the content can. I'm reminded of a time in calculus class where were doing one of our daily quizzes, which always had a super-secret-code word that the students had to guess before they could get in. It was something related to the topic, but it might be a six degrees of separation kind of relationship. Anyway, I misspelled the super secret word, but so did every other student in the class. Luckily, they spelled it the same way I did and so it worked out.

There's also no guarantee that the person who shared the content, even one who speaks the same language, uses the same spelling. In Canvas, you don't get to pick English as your language; you have to choose Australia, Canada, United Kingdom, or United States. An American looking for something on color would miss out on the wonderful colour module uploaded by someone in the UK. But if you looked at just the first three letters and returned everything that matched col, you would get both. I looked at the american-english and british-english dictionaries on my server. The American version had 99,171 words; the British version had 99,156 words. There was an overlap of 97,423 words. That is 1,748 words (1.8%) in the American dictionary that were either not in, or spelled differently in, the British dictionary.
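If anyone wants to reproduce that count, the comparison takes only a few lines (this assumes the standard Debian/Ubuntu wordlist files from the wamerican and wbritish packages; adjust the paths for your system):

```typescript
import { readFileSync } from "node:fs";

// Load a wordlist (one word per line) into a Set.
const load = (path: string): Set<string> =>
  new Set(readFileSync(path, "utf8").split("\n").filter(Boolean));

// Standard Debian/Ubuntu wordlist locations.
const american = load("/usr/share/dict/american-english");
const british = load("/usr/share/dict/british-english");

// Count the words the two dictionaries share.
let overlap = 0;
for (const word of american) {
  if (british.has(word)) overlap++;
}

console.log(`American: ${american.size}, British: ${british.size}`);
console.log(`Overlap: ${overlap}, American-only: ${american.size - overlap}`);
```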

On the other hand, while Google may know that when you type elphant you probably meant elephant, it rarely suggests that when you type color you meant colour -- at least not based on your language settings. When I search for color blindness, I get different results than if I search for colour blindness.

Another issue is that someone from Brazil may want to search for color, except in Portuguese [at least according to Google Translate] they would be looking for cor. A three-letter prefix would turn up correlation just fine, but not anything color-related. And if you only look at the first two letters, you're going to get so many results that the search really is useless.

Searching for color won't catch the cases where someone looks for hue instead of color. An even better solution would be to look for synonyms and antonyms of the word and include them in the search.

There are lots of optimizations and tweaks that could be made. Is the solution to just let Google handle the searching?

I don't particularly take issue with the three letter search, but I think it could be handled better.

Here's kind of my starting algorithm for determining a relevance rating:

  1. The title contains the search term
  2. The description contains the search term
  3. The title contains synonyms of the search term
  4. The description contains synonyms of the search term
  5. The title contains antonyms of the search term
  6. The description contains antonyms of the search term
  7. The title contains a word starting with the first three letters
  8. The description contains a word starting with the first three letters
  9. The title does not contain anything related to the search term
  10. The title contains all uppercase or more than one exclamation point

That would be a place to start. Then some kind of machine learning algorithm would be applied to that to see what people actually click on and learn the scoring to apply to each of those criteria.
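To make that concrete, here's a minimal sketch of what that tiered scoring could look like. To be clear, this is my own invention, not Commons code: the weights are arbitrary placeholders, and the synonym/antonym lookups are stubbed out.

```typescript
// Hypothetical resource shape; the real Commons fields may differ.
interface Resource {
  title: string;
  description: string;
}

// Stub thesaurus lookups; a real version would query a thesaurus service.
const SYNONYMS: Record<string, string[]> = { elephant: ["pachyderm"] };
const synonymsOf = (t: string): string[] => SYNONYMS[t.toLowerCase()] ?? [];
const antonymsOf = (_t: string): string[] => [];

// Whole-word, case-insensitive match.
const hasWord = (text: string, word: string): boolean =>
  new RegExp(`\\b${word}\\b`, "i").test(text);

// Score a resource against a search term using the tiers above; higher
// score = more relevant. The weights are placeholders that the machine
// learning pass would eventually tune. Tier 9 is the implicit zero score.
function relevance(r: Resource, term: string): number {
  const prefix = new RegExp(`\\b${term.slice(0, 3)}\\w*`, "i");
  let score = 0;
  if (hasWord(r.title, term)) score += 100;                               // 1
  if (hasWord(r.description, term)) score += 80;                          // 2
  if (synonymsOf(term).some(s => hasWord(r.title, s))) score += 60;       // 3
  if (synonymsOf(term).some(s => hasWord(r.description, s))) score += 50; // 4
  if (antonymsOf(term).some(a => hasWord(r.title, a))) score += 40;       // 5
  if (antonymsOf(term).some(a => hasWord(r.description, a))) score += 30; // 6
  if (prefix.test(r.title)) score += 20;                                  // 7
  if (prefix.test(r.description)) score += 10;                            // 8
  if (r.title === r.title.toUpperCase() ||
      (r.title.match(/!/g) ?? []).length > 1) score -= 50;                // 10
  return score;
}

// An elephant in the description should beat a mere ele- prefix in the title.
const orwell = { title: "Final Essay Materials", description: "Quiz on Orwell's 'Shooting an Elephant'" };
const lab = { title: "Static Electricity Virtual Lab", description: "A physics lab" };
console.log(relevance(orwell, "elephant") > relevance(lab, "elephant")); // true
```

Under a scoring like this, a prefix-only match can never outrank a genuine hit, which is exactly the inversion the current search gets wrong.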

laurakgibbs
Community Champion
Author

@James, do you really think it is helpful to return "electricity" search results in a search for "elephant"? Your list of 10 items fails already at item 1: if my search term is "elephant," then I do NOT want to see titles that contain the term "electricity" or "element." I am searching on ELEPHANT. I am NOT searching on ele*... but that is what the Commons search is doing to my search term, mixing elephant in with ele*, and there is no way I can stop it.

One easy way (I suppose) to fix this problem would be to at least let us put quotation marks around our search term/phrase to STOP this ridiculous wildcard search. But if I put quotation marks around my search term, they are stripped out.

Some more examples, since you do not seem to appreciate how poorly users are being served here:

Cherokee: highest rated.
https://lor.instructure.com/search?sortBy=rating&q=cherokee
Look: there is nothing Cherokee on the first page of results. NOTHING. It's chemistry, chemistry, and more chemistry. There is a Cherokee Nation quiz in Canvas Commons; given how poor the search is, I have no idea whether there are other Cherokee materials or not. Which means the search fails: my search term is Cherokee, but all I see here is che*. Highly rated che*... but that is NOT what I am searching for. I am searching for Cherokee, not che*.

[Screenshot: Cherokee, highest rated]

New York: highest rated.
https://lor.instructure.com/search?sortBy=rating&q=New%20York
Since putting quotation marks around the phrase doesn't help me delimit the search, I am getting the highest rated items that contain "new". Is that helpful? It is not. There's one New York Times item way down at the bottom of the first page (I can use Control-F to find it). Most relevant does return some New York items... but not a lot, and I despair of how to do a better search for a compound phrase like "New York." I'm a good searcher, but Canvas Commons prevents me from finding NEW YORK here, forcing me to wade through results full of "new" but no "york" of any kind (not apparently, anyway). No search skills will do you any good when the algorithm insists on interpreting every search as a wildcard search based on the first three letters of your search term.

[Screenshot: New York, highest rated]

But look: I see Mike Caulfield there, with his highly rated Canvas Commons course on fake news. And that's interesting because Mike sharing his Canvas Commons course at Twitter is what prompted me to go on this doomed quest of searching the Commons and documenting my problems. 

Unfortunately, though, his last name shares its first three characters with the word "cause"... so he is also doomed. "Caulfield" is almost at the bottom of the first page of "most relevant" results; I have to use Control-F to find him... because he is not the most relevant. CAU* items (like cause and causes) are more relevant than CAULFIELD... even though my search term is "Caulfield."

https://lor.instructure.com/search?sortBy=relevance&q=caulfield ... that's why he is nowhere to be seen in these supposedly "most relevant" search results. Yes, Caulfield in the title should trump Caulfield the author. But Causes in the title should NOT trump Caulfield the author.

[Screenshot: Caulfield, most relevant]

As I see it, these searches are failing. The wildcard search default needs to be fixed. I hope the information I've provided will be useful in doing that, either as a bug report (which the Helpdesk refused to take) or as a feature request ... which I guess will take a year or two, assuming people vote this up. At least I have done my digital duty and documented the problem as I see it.

I would expect better from Canvas, and by saying that, I am expressing my positive opinion of Canvas, and my expectation that, yes, eventually they will fix this. But I had no idea I would find a situation this grim when I decided to look for Mike's stuff in the Canvas Commons. And his stuff, as I said before, is AWESOME.

At least the URL works:
https://lor.instructure.com/resources/502b782a3d094ed68fd53e237a9cf05c

[Screenshot: Caulfield]

James
Community Champion

Some more examples, since you do not seem to appreciate how poorly users are being served here:

Funny how one person can write something and someone else can read it in a completely different way than intended. We're actually advocating the same thing; I'm just saying it in a more technical way that explains how it could be done, rather than just saying what's wrong with the current situation. You made the use case with the first post; now it's one of those situations where piling on additional evidence and witnesses ends up muddying the waters about what the real issue is. I would challenge you to move past proving it sucks (you've won that argument) and on to what needs to happen to make it right.

Yes, I do think that in some cases it may be useful to deliver electricity for elephant. Maybe not in this particular example, but for the reason that example works at all. That's part of the beauty of the Community: what one person thinks is irrelevant and useless, another person can come along and say "I'm so grateful it does that."

A better approach than taking the first three letters might be to apply a similarity index to words. The Soundex code is one way to do that (elephant is E415 while electricity is E423), but I don't know how well that translates to other languages. There are probably newer, better ways to find similar words.
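For anyone curious, Soundex is compact enough to sketch here. This is the standard American Soundex algorithm, just for illustration -- nothing says Commons should use this exact scheme:

```typescript
// Standard American Soundex: first letter kept, remaining consonants
// mapped to digits, vowels dropped, adjacent duplicates collapsed
// (h and w do not break a run), padded/truncated to four characters.
function soundex(word: string): string {
  const codes: Record<string, string> = {
    b: "1", f: "1", p: "1", v: "1",
    c: "2", g: "2", j: "2", k: "2", q: "2", s: "2", x: "2", z: "2",
    d: "3", t: "3",
    l: "4",
    m: "5", n: "5",
    r: "6",
  };
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (!letters) return "";
  let result = letters[0].toUpperCase();
  let prev = codes[letters[0]] ?? "";
  for (const ch of letters.slice(1)) {
    const code = codes[ch] ?? "";
    if (code && code !== prev) result += code; // skip vowels and repeats
    if (ch !== "h" && ch !== "w") prev = code; // h/w don't reset the run
    if (result.length === 4) break;
  }
  return result.padEnd(4, "0");
}

console.log(soundex("elephant"));    // E415
console.log(soundex("electricity")); // E423
```

A search could treat equal (or near-equal) codes as "similar spelling" and rank those hits below direct matches but above raw prefix matches.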

In the bigger picture, though -- and this is where I think we agree -- I don't see adjusting for spelling mistakes as the real issue. Maybe you disagreed until you were able to move past the whole elephant/electricity thing, but once you did, you saw that the issue is the scoring they're giving to each of those matches, which is what I was suggesting be altered. There is nothing wrong with thinking that electricity might somehow be related to elephant; there is something wrong with prioritizing electricity over elephant when elephant is clearly a word in the system.

What I suggested was a better algorithm, and in that algorithm, words that attempt to fix spelling mistakes are way down the list of what should be delivered first. What I suggested was that looking for related words through synonyms or antonyms should rank higher than spelling-error matches.

The author's name could certainly be part of what is included in the search fields. I had no experience with Commons before this post, and I don't know all of the details about what is included in a search and what isn't. It wasn't an intentional slight on Caulfield's work. Any direct appearance of the search word in any of the fields should take precedence over an attempt to fix spelling... unless the number of direct hits is extremely limited or zero. I couch that statement because real search algorithms are more complex than a list of rules, which means that writing them is more complex. That's where machine learning can come in, to help figure out the best way to rank results.

Sorting by rating or date could factor into that algorithm, rather than working the way Canvas does it now. What seems to be happening is that you get one list of items, and for each item there are three scores: relevancy, rating, and date. Sorting just reorganizes the entire list rather than computing a combined score for each item. That is typical for a computer science person, but not what a human wants. In Google, the date is a way to filter the content, not a way to sort it. In Amazon, you can choose the number of stars to filter the content down to items that meet that criterion.

Canvas has made them sort options rather than filters, which is what they should be.
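A toy example of the difference, with made-up items and scores:

```typescript
// Two invented results with pre-computed scores, purely for illustration.
interface Scored { title: string; relevance: number; rating: number; }

const results: Scored[] = [
  { title: "Shooting an Elephant quiz", relevance: 90, rating: 4.5 },
  { title: "Static Electricity Virtual Lab", relevance: 20, rating: 5.0 },
];

// What Commons appears to do: re-sort the whole list by rating alone,
// so a highly rated but irrelevant item floats to the top.
const sorted = [...results].sort((a, b) => b.rating - a.rating);
console.log(sorted[0].title); // Static Electricity Virtual Lab

// Rating as a filter instead: keep only sufficiently relevant items,
// then order those by rating.
const filtered = results
  .filter(r => r.relevance >= 50)
  .sort((a, b) => b.rating - a.rating);
console.log(filtered[0].title); // Shooting an Elephant quiz
```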

The use of quotes for verbatim text or a + in front of a word to say it must be there would be a welcome addition and should be included. Some search engines even allow the use of - to exclude a certain word. You might even be able to specify where the content should occur, like author:caulfield. All of those would help the search process.

Here's what I would expect out of a search in Commons.

  1. Items that directly contain my word come first. Preference given to the title, description, and author in some order. If there are other fields that are searchable, I'm not purposefully excluding them.
  2. I would expect a search to include some kind of relevant but not direct hits as well. This can be determined by synonyms, antonyms, homonyms, homophones, xylophones, soundex, or whatever else seems appropriate, including looking at the first three letters of the word. Similar results would not be ranked as high as direct hits and would come lower in the list.
  3. The ability to put modifiers into the search: quotes for verbatim words and phrases, + for must include, - for must not include. You could also have field-specific searches like title:elephants, description:orwell, or author:caulfield (a minimal parser sketch follows this list).
  4. Ratings and Date are filters rather than just alternative ways of sorting the same dataset.
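Here's a minimal sketch of what parsing those modifiers might look like. The grammar is my own strawman, not an existing Commons feature:

```typescript
// One parsed unit of a query like: author:caulfield "new york" -election
interface QueryTerm {
  field?: string;                // e.g. "title", "author"
  text: string;
  require?: "must" | "mustNot";  // from a leading + or -
  verbatim?: boolean;            // quoted: match the exact word/phrase
}

function parseQuery(q: string): QueryTerm[] {
  const terms: QueryTerm[] = [];
  // Optional +/-, optional field:, then a "quoted phrase" or a bare word.
  const token = /([+-])?(?:(\w+):)?(?:"([^"]+)"|(\S+))/g;
  for (const m of q.matchAll(token)) {
    const [, sign, field, quoted, bare] = m;
    terms.push({
      field,
      text: quoted ?? bare ?? "",
      verbatim: quoted !== undefined,
      require: sign === "+" ? "must" : sign === "-" ? "mustNot" : undefined,
    });
  }
  return terms;
}

console.log(parseQuery('author:caulfield "new york" -election'));
// (abridged) [ { field: "author", text: "caulfield" },
//              { text: "new york", verbatim: true },
//              { text: "election", require: "mustNot" } ]
```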

How does that line up with what you want to see in a search?

laurakgibbs
Community Champion
Author

Find me a person who's grateful for electricity when they search for elephants. I will give you ALL my points as a gesture of amazed wonder. 🙂

Your ideas for better searches are good, but complicated. Very much worth discussing. And I would prefer for that discussion to be led by someone with real search experience. An OER strategist would be such a person, so I'll keep lobbying for that. But a search engineer would be fine too. The serious flaws with this current search make me wonder just what kind of search expertise the Instructure team can call on. 

If I can persuade them to call this a bug, it needs a quick fix. I would say that fix would be allowing us to put quotation marks around the search term/phrase to shut off the (wretched) 3-character wildcard results.

I really want this to be a bug fix, not a complete redesign of the whole search which would take a couple of years. Or maybe never happen at all.

ProfessorBeyrer
Community Coach

Thank you laurakgibbs for your well-documented exploration of the problem. I had wondered why I get so many strange results when I do a search. The other filters there are helpful, but the keyword matching does not quite work. Here's what I would add to your request:

  • Add the browser's spell-checker to the Canvas Commons search results page to offer alternate search terms, à la Google:
    [Screenshot: search results]

Canvas already relies on the browser's spell-checker so extending this reliance to Canvas Commons should be possible.

laurakgibbs
Community Champion
Author

I'm so glad to hear it's not just me who found it strange, @ProfessorBeyrer! To be honest, I had not been very interested in Canvas Commons, since I work on the open Internet instead of using content inside the LMS, but on the few occasions I had searched for something at the Commons, I was just baffled. It wasn't that important, though, so I always just moved on and didn't pay much attention. Then, when I did stop to pay attention (because I was trying to figure out how global Canvas search might work if/when we ever get search), it was pretty easy to figure out that there was this weird 3-character search going on.

I am definitely spoiled by the way Google notices typos and spelling errors in searches (although Google thinks "elefant" is just fine because there are so many good search results for that spelling! ha!); it's very handy. I would find it a bit tedious to search for elephant and then separately for elephants, which is what would be needed if Canvas adopted a strict "elephant" option (searching for elephant exactly, with no stemming). But honestly, I would like to see that quick fix applied (elephant returns electricity as before, but "elephant" returns only strict matches on elephant) so that we can get something in place now: let users turn off the wildcard search by putting quotation marks around the search word/phrase.

Coming up with an optimized algorithm, which is the direction James is going in, and looking for other kinds of UI support (like spelling correction) would definitely be good... but those are the kinds of things that would take a long time to get into place. I'm still hoping there might be some kind of quick fix that can take place soon, separately from a bigger project of re-examining Canvas Commons search in more complex ways.

So, until somebody just tells me NO (someone other than the Helpdesk), I am going to hope that maybe maybe maybe they will be willing to consider a bug fix here ("elephant" as opposed to elephant). Otherwise, we are looking at a months- or years-long feature development process.

And I will repeat: this is the kind of thing that makes me think Instructure really needs an OER strategist...

https://community.canvaslms.com/ideas/12042-instructure-needs-an-oer-strategist 

🙂

James
Community Champion

Could you -- until Canvas makes a permanent fix -- deal with exact matches in the title, description, and author, but not tags?

I did some more looking and when sorted by relevance, it seems that Canvas looks at title, tags, and author -- but not description -- and puts them first. However, when they display the preview cards, the tags are not included, but the description is.

That "Final Essay" with the picture of the cow on it only contains the word "Elephant" as a tag after you go into the card to look at the whole thing. If I was to write a script that would allow exact matching, for expediency purposes it would only look at what was available on the search page rather than refetching the information for every item to check the tags. That means that the Final Essay thing would disappear if you did an exact search for elephant.

To get a good idea of things that are tagged without having it in the title, search for rubric.

On the other hand, an exact search might be able to pull up things with Elephant in the description. The issue there is going to be the "Load more data" prompt. It doesn't load all of the data, only the items that match the title, author, and tags. Looking for an exact match on "elephant" would eventually pull up an "elephant" in a description and get you Orwell, but it might not be in the first 48 or so results that load. You might get 5 hits on the first page, one of which (Final Essay) would disappear, and then you would have to click Load More and hope for more items with elephant in the title.

Sound confusing? Like I said, things are rarely simple when you go to do something that isn't already built in.

The best solution here is for Canvas to do something about it; I'm just trying to help out in the meantime.
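For what it's worth, the core of such a script is short. The fragile part is finding the result cards in Commons' React-generated markup, so the selector below is purely a placeholder:

```typescript
// Hide search-result cards whose visible text lacks an exact word match.
// The selector is a placeholder; the real markup would need inspecting,
// and this would need re-running every time "Load More" adds cards.
function hideInexactCards(term: string): void {
  const exact = new RegExp(`\\b${term}\\b`, "i");
  const cards = document.querySelectorAll<HTMLElement>("[class*='ResultCard']");
  cards.forEach(card => {
    // Only the text rendered on the card (title, author, description) is
    // checked; tags aren't rendered there, so tag-only matches disappear.
    if (!exact.test(card.textContent ?? "")) {
      card.style.display = "none";
    }
  });
}

hideInexactCards("elephant");
```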

laurakgibbs
Community Champion
Author

Until all the clutter is removed (all the ele-NOT-elephant), it's really hard to get a sense of what is going on. But yes, when Mike Caulfield's course came up in the New York search, I knew it had to be that he had news in there somewhere, and interestingly it was not in the title or in the description; instead, it was one of the tags: news literacy. (But of course it had no business being in the search results for New York to begin with...).

And, again, this is why I think we need someone who has experience with content repositories to weigh in here. I'm not confident about generalizing from my own search habits or search needs; I've never designed a search system, and I'm not qualified to do so. I don't even know the terminology to describe the different processes (although I enjoyed learning about stemming; I read the Wikipedia article that Dan shared about that at Twitter).

I just know that the clutter in the search now is so bad that I would call it unusable. My only concern at this point is to get rid of the wildcard search results that are just not helping. The opposite: they are preventing real search from happening.

James
Community Champion

@ProfessorBeyrer, I'm not sure I follow the request, so I want to make sure we're on the same page.

The search is done on the Canvas Commons back end, not in the browser. That means that the browser's spell checker isn't available to the search engine and any accessing of the browser's spell check would have to happen before the content was sent to Canvas Commons for searching. 

Searches show that at one time Google had an undocumented spell-checking API, but it has been discontinued. If it's out there now, it's not showing up in any of the searches I can think of. That means we may not be able to benefit from Google's knowledge of typos unless you make a request to Google and then search the response for the phrase "Did you mean xxx?" or whatever the current language is.

What is possible is to add the spellcheck="true" attribute to the input field. It's an HTML5 attribute and has major browser support except for Microsoft Edge (I don't think many of us in the Community lose sleep over that).
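Turning that attribute on from a script is a one-liner. The selector here is a guess at how to find the Commons search box:

```typescript
// Force spell checking on for the search input (selector is a guess).
const box = document.querySelector<HTMLInputElement>("input[type='search']");
box?.setAttribute("spellcheck", "true");
```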

That would give me something like this

[Screenshot]

I can then right click and see this

[Screenshot]

If you click on the "Ask Google for suggestions", you get this:

[Screenshot]

In this case, that wasn't necessary because Chrome already suggested elephant for elphant. 

However, if I type elfhent, then Chrome suggests hellbent. With Google suggestions turned on, it comes back as elephant. But I didn't see any network traffic in the developer tools when I toggled that option. It obviously happened; they just didn't show it. I suppose someone could set up a network proxy and see what traffic got sent to Google and what the response was. Still, it's probably not worth it, and it would be a violation of Google's terms of service, so Canvas wouldn't do it. Given the way it works -- it starts fetching results as soon as you type letters -- that could get expensive if it were available.

That option to ask Google is also presumably only available with Chrome and not Firefox.

It turns out that I didn't even have to add spellcheck="true" to the field for Chrome, because I had automatic checking set under the Check Spelling menu.

[Screenshot]

I do, however, have to hit space or something so it knows I'm done with the word.

On the other hand, Firefox will let you know that you've misspelled the word without the space at the end as long as the input loses focus.

[Screenshot]

Since spell checking is already available and a user option, are you saying that we should force spellcheck on for that field and override any user preferences (most likely they didn't know they could turn on spell checking)?  Or do you have something else in mind completely?

Along with that comes another question.

I'm trying to think of the best way to implement this as a Canvancement. I'm thinking of some kind of toggle, similar to the "Show Public Resources" button they have -- of course, that's done with React, without standard classes, and isn't exposed to the user, which means it may end up being a simple checkbox.

One of the things I was thinking about was that if the user put a space at the end, it would look for the entire word, not just words that start with those letters. But if putting a space at the end is necessary to get the spell check to work, then that wouldn't be a good option.

How should I determine when a user wants the full word versus just a fragment? If someone wants electricity, electrician, and electronics, should they just type electr (to avoid election)? On the other hand, if they want electric and type electric, how do I make sure they don't get electricity and electrician as well? How should that search be specified?

Is it a series of filters specified after the search?

* Hide alternative spellings

* Match entire word

Or is it a search feature, like surrounding the word in quotes for an exact word match: "electric" does not match electricity?

I was originally going to use a space at the end for an exact word match, but the spelling thing throws that out.

I don't think I'll be able to do a search on fields like title:elephant, description:orwell, or author:caulfield. The search returns hits containing the words title, description, or caulfield. I am not interested in completely rewriting the search and recreating all of the content, just cleaning up what they have already provided.

Finally, there is a Canvas Commons API available with a List Resources endpoint. The q= parameter tells it "The text to search for in the title, description, tags, author name, and account name". The JSON returned by the request does include the tags, which are not shown on the page. But I don't know that I want to refetch all of the information just to figure out what to display for the tags. One of those requests returned 80k (15.4k compressed) for 48 items, and it had the no-cache property set, so it would have to be fetched again rather than loaded from the browser's cache. We'll have to see how things pan out.
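If it does pan out, the fetch itself is the easy part. Fair warning: the URL path below is my guess from the endpoint's name -- only the q= parameter's behavior is documented -- and it assumes an authenticated Commons session in the browser:

```typescript
// Query the (assumed) List Resources endpoint for a search term.
async function listResources(term: string): Promise<unknown> {
  const url = `https://lor.instructure.com/api/resources?q=${encodeURIComponent(term)}`;
  // credentials: "include" reuses the browser's Commons session cookies.
  const resp = await fetch(url, { credentials: "include" });
  if (!resp.ok) throw new Error(`Commons API returned ${resp.status}`);
  return resp.json(); // JSON includes the tags the search page never renders
}

listResources("elephant").then(items => console.log(items));
```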