Relevant Elephants: Fixing Canvas Commons Search

This idea has been developed and deployed to Canvas

I'm not really sure how to write this as a feature idea because, as far as I am concerned, it is a bug that needs to be fixed. But my dialogue with the Helpdesk went nowhere (Case #03332722), so I am submitting it as a feature request per their advice. I am not proposing how to fix the problem; I am just going to document it. People who actually know something about repository search would need to propose the right set of search features; I am not an expert in repository search or in fixing a defective search algorithm. But as a user, I can say that this is a serious problem.

If you search Canvas Commons for elephants, you get 1084 results. Here are the "most relevant" results:

https://lor.instructure.com/search?sortBy=relevance&q=elephant 

There are some elephants, which is good... but a lot of other things that start with ele- ... which is not good. Apparently there is some component in the search algorithm which returns anything (ANYTHING) that matches the first three characters of the search string.

Election Unit Test.

Static Electricity Virtual Lab.

And so on. And so on. Over 1000 false positives. ele-NOT-elephants.
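To make the pattern concrete, here is a minimal sketch (in Python, with made-up titles, not Instructure's actual code) of what matching on just the first three characters does to this query:

```python
# Hypothetical titles; not pulled from Canvas Commons.
titles = [
    "Shooting an Elephant (Orwell) Quiz",
    "Election Unit Test",
    "Static Electricity Virtual Lab",
    "Elementary Algebra Review",
]

query = "elephant"
prefix = query[:3]  # "ele"

# Matching on the first three characters sweeps in everything.
prefix_hits = [t for t in titles if prefix in t.lower()]
# Matching on the exact term finds only the actual elephant.
exact_hits = [t for t in titles if query in t.lower()]

print(prefix_hits)  # all four titles contain "ele"
print(exact_hits)   # ['Shooting an Elephant (Orwell) Quiz']
```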

[Screenshot: most relevant elephant search]

As near as I can tell, there might be a dozen elephants in Canvas Commons. I've found six for sure; there could be more, but it's effectively impossible to find them because of the way the search works (or doesn't work). The vast -- VAST -- majority of results are for electricity, elections, elearning, elements, elementary education, electrons, and anything else that starts with ele-.

You might hope that all the elephants are there at the start of the "most relevant" search results... but you would be wrong. There are five elephants up at the top, but then "Static Electricity Virtual Lab," "Valence Electrons and Isotopes," etc. are considered more relevant than Orwell's essay "Shooting an Elephant" (there's a quiz for that). I have yet to figure out why a static electricity lab is a more relevant search result for "elephant" than materials for George Orwell's essay, which actually involves an elephant.

I found out about Orwell's elephant this way: when I sort by "Highest Rated," the top-rated elephant is Orwell's. There are lots of other highest-rated items at the top, though, which have nothing to do with elephants, and that is why you cannot see Orwell's elephant in my screenshot; it's below all those other items. But if you scroll on down, you will find Orwell's elephant essay. Eventually.

I found it using Control-F in my browser.

Here is the search URL:

https://lor.instructure.com/search?sortBy=rating&q=elephant 

[Screenshot: highest rated elephant results (with no elephants)]


Switch the view to "Latest" and all the elephants are missing here too. Really missing. Well, you'll get to them eventually, I guess, if you keep loading more and more... and more and more... and more. But no one is going to scroll and load and scroll and load to find the elephants, right?

Here's the search term: elephant. But the search results are for ele- items like elementary mathematics, elementary algebra, "Abraham Lincoln Elementary's 5th Grade beginning of the year prompt," "the elements involved in warehouse management," and so on.

https://lor.instructure.com/search?sortBy=date&q=elephant 

[Screenshot: latest elephants... but there are no elephants]

Hoping it would help, I tried putting quotation marks around the word "elephant." It did not help.

The Helpdesk tells me that this is all on purpose in order to help people with spelling errors:

That is how our search engine was set up. To search as specific as it can get and then to gradually filter in less specific content. This is done so that if a word is misspelled the search is still able to locate it.

If I type "elphant," Google shows me results for "elephant." That sounds good: it corrected my typo. But Canvas Commons gives me no elephants if I type "elphant." Instead it gives me two things: an item submitted by someone named Elpidio, and something called "Tech PD and Educational Technology Standards," which involves the acronym ELP. So much for helping people with spelling errors.

Electricity, elections, elements, elearning: these do not sound good. Those results are obstructing the search; they are not helping. There is nothing "gradual" about the filtering. Static electricity shows up as more relevant than George Orwell's elephant. Some kind of three-character search string is driving the algorithm to the exclusion of actual elephant matches.
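If the filtering really were "gradual," an exact match would always outrank a three-character match. Here is a hedged sketch of what that tiering could look like (hypothetical scoring in Python, not Instructure's implementation):

```python
def score(query: str, title: str) -> int:
    """Rank exact-term matches strictly above three-character matches."""
    title_l, query_l = title.lower(), query.lower()
    if query_l in title_l:
        return 100   # the whole search term is present: top tier
    if query_l[:3] in title_l:
        return 10    # shares only "ele": much lower tier
    return 0

items = ["Static Electricity Virtual Lab", "Shooting an Elephant (Orwell) Quiz"]
ranked = sorted(items, key=lambda t: score("elephant", t), reverse=True)
print(ranked)  # the Orwell quiz comes first, the electricity lab second
```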

If you assume that someone who typed ELEPHANT really meant to type ELECTRICITY or perhaps ELEARNING, well, that is worse than any autocorrect I have ever seen. And I have seen some really bad autocorrect.

This happens over and over again; it affects every search.

Want to search for microscopes? Get ready for a lot of Microsoft. These are supposedly the most relevant microscope search results, but the second item is Microsoft... even though, as far as I can tell, it doesn't have anything to do with microscopes at all.

https://lor.instructure.com/search?sortBy=relevance&q=microscope 

Still, we're doing better than with the elephants here. There are a lot of microscopes in addition to Microsoft:

[Screenshot: microscope search, most relevant]

But look what happens if you want highest-rated microscopes. See the screenshot; there are no microscopes. It's Microsoft Microsoft Microsoft. But hey, there is also Of Mice and Men!

https://lor.instructure.com/search?sortBy=rating&q=microscope 

So, the search algorithm assumes that, while I typed "microscope" as my search term, I might really have meant to type "Of Mice and Men." Or Microsoft. Or the name Michael (a lot of content contributors are named Michael... or Michelle).

[Screenshot: highest rated microscopes]

I could go on. But I hope everybody gets the idea. If this is really a feature of Canvas Commons search and not a bug (???), I hope this three-character search string "feature" can be replaced with a better set of search features.

Although I still would call this three-character approach to search a bug, not a feature. Which is to say: I hope we don't have to wait a couple of years for the (slow and uncertain) feature request process before this gets reexamined.

Comments from Instructure

For more information, please read through the Canvas release notes for 2018-10-27: https://community.canvaslms.com/docs/DOC-15588-canvas-release-notes-2018-10-27.

51 Comments
laurakgibbs
Community Champion
Author

@James

definition: "A software bug is an error, flaw, failure or fault in a computer program or system that causes it to produce an incorrect or unexpected result, or to behave in unintended ways."

If someone wants to tell me that highest rated search results for thermometer were INTENDED to include items with the word "the" to the exclusion of "thermometer" results, then I'll agree. But surely this is UN-INTENDED, which means it is a bug.

Search for thermometers... but there are no thermometers on the first page of search results. Just lots of things with "the" ... The Great Gatsby! That is not useful. And I cannot believe it is intended.

Oh, I noticed: also the name Theresa. 🙂

[Screenshot: thermometer search results]

James
Community Champion

Stop sorting by highest rated! When you sort by relevance, Thermometer does show up. You're distorting the situation to make your case seem more outrageous, but it hurts your case when you just keep saying the same thing over and over.

This is not one of those cases where there is no sort applied at all. There is a deliberate sort involved and someone decided, yes intended, that they should try to add in relevant terms. When you sort by rating, which is what you're doing to show how screwed up things are, it gives you the best rated out of all of the items. The more you beat this up, the more logic I see in Canvas' implementation, but I still think it should be a filter or that you could sort by just title or that you could require that the search word be in there and ignore suggestions. That's what advanced search is. You don't have to stop adding in items that match the first three letters, which some people can benefit from, because you would have a way to exclude those -- that's what's missing. That's what your entire argument boils down to. If you could exclude those extra matches, you would be happy (at least happier).

The existing idea at https://community.canvaslms.com/ideas/10762-advanced-search-needed-for-canvas-commons would fix your issue, and that idea was put out there first. The Community tries to keep a single idea out there, giving preference to the one that was made first. Just because you have more written under your idea than they do doesn't mean that your idea is better or should go forward when someone else had it first. Pull together one consistent argument that clearly lays out the problem and put it in that discussion so that people will see it. Once an idea is archived, like this one is (or will be), its visibility is greatly reduced, which means that you're now advocating in a location where it won't do any good. Provide a link back to this one if you want for context, but realize that some people won't be able to access it because it's archived.

laurakgibbs
Community Champion
Author

@James, the fact that you feel like you have to tell me to "stop sorting by highest rated" proves, again, that this is a bug with unintended consequences.

The intention with "highest rated" is that I should be able to sort the matching materials by highest rating. That's a great feature... but the bug in the search is preventing that feature from working. It is NOT working as intended because if I search for highest-rated thermometer materials (a perfectly reasonable search), I see no thermometers on the first page of search results. If someone from Instructure wants to tell me that they really intended for the three-character-wildcard search to do this to the search results, I'll agree that this is an intended feature.

But until someone is willing to say that, I'm going to insist it is a bug.

So far, the stated intention is, per the Helpdesk, as follows: "This is done so that if a word is misspelled the search is still able to locate it."

In addition, the Helpdesk claims: "That is how our search engine was set up. To search as specific as it can get and then to gradually filter in less specific content."

On both accounts, the search is not performing as intended:

The three-character-wildcard search does NOT "locate" a misspelled word (it just searches on the first three characters of the misspelled word, which is not the same thing).

And the search does not "gradually filter in less specific content." The less specific content comes BEFORE the more specific content, and this problem is acute for the highest rated and most recent searches, although it also interferes with the most relevant search, in which wildcard matches can come before exact search term matches depending on the weighting of title, tag, author, description, etc.

If the intention (per the Helpdesk) is to help with misspelling and to gradually display less specific content after more specific content, the results are not working as intended, which is to say: a bug.

A bug fix might be putting in the option to add quotation marks around the search term/phrase to disable the wildcard matches. Not a great fix, but bug fixes are usually not great. They are just bug fixes. Other problems with search remain because this was poorly designed. But that's separate: what I am talking about here is a bug fix, i.e. getting back to the intended purpose of seeing more specific results before garbage (three-character wildcard) results.
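Roughly, the idea is something like this (a hypothetical helper sketched in Python, not the actual Commons API): if the query is wrapped in quotation marks, skip the three-character fallback entirely and require the exact term.

```python
import re

def matches(query: str, text: str) -> bool:
    """Quoted query -> exact term only; unquoted -> today's broad behavior."""
    quoted = re.fullmatch(r'"(.+)"', query.strip())
    text_l = text.lower()
    if quoted:
        return quoted.group(1).lower() in text_l   # exact term required
    q = query.strip().lower()
    return q in text_l or q[:3] in text_l          # three-character fallback

print(matches('"elephant"', "Static Electricity Virtual Lab"))  # False
print(matches('elephant', "Static Electricity Virtual Lab"))    # True (wildcard noise)
```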

James
Community Champion

This is the same Helpdesk that you said originally blamed the user for the problem, yet now you want to hold them up as speaking for all of Instructure?

laurakgibbs
Community Champion
Author

I'm going with what I've got: nobody at the Community responded to the elephant search problem, so I submitted a ticket to the Helpdesk, and they did respond. That's all I've got to go on re: intended performance of the search.

I would be VERY GLAD to hear from someone else about what is going on here, and whether the three-character wildcard search is having the intended results. It is very hard for me to believe that these are the intended results.

I don't see it as a matter of "blame" (???), but there is an issue of responsibility here: I think it is irresponsible to host a content repository where the search is buggy. I hope they will fix the bug sooner rather than later. And later, yes, I hope they will improve the search based on the other search idea proposals people have made.

karen_bowden
Community Contributor

I am frustrated that on many of today's new websites a simple Boolean search does not work. If we could use operators to help the search find what we need, everyone might be happier.

James
Community Champion

 @karen_bowden , if you haven't caught it yet, the Community moderators are redirecting people to an existing feature idea and will be archiving this one. You should add your comments there so that people will be able to see them: https://community.canvaslms.com/ideas/10762-advanced-search-needed-for-canvas-commons . 

Boolean search isn't anything that I will be able to write a patch for, but those operators certainly fit in with the bigger picture if Canvas is going to redo the search.

karen_bowden
Community Contributor

Thanks, James. 🙂

robotcars
Community Champion

A better approach than taking the first three letters might be to apply a similarity index to words. The Soundex code is one way to do that (elephant is E415 while electricity is E423), but I don't know how well that translates to other languages. There are probably newer, better ways to find similar words.
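For anyone curious, a tiny Soundex sketch (my own simplified implementation in Python, not a library call) reproduces those two codes:

```python
def soundex(word: str) -> str:
    """Classic 4-character Soundex code (simplified)."""
    codes = {
        **dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
        **dict.fromkeys("dt", "3"), "l": "4",
        **dict.fromkeys("mn", "5"), "r": "6",
    }
    word = word.lower()
    digits = []
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":              # h and w do not break up duplicate codes
            continue
        code = codes.get(ch, "")    # vowels map to "" and reset prev
        if code and code != prev:
            digits.append(code)
        prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

print(soundex("elephant"))     # E415
print(soundex("electricity"))  # E423
```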

 

In the bigger picture, though, and this is where I think we agree, I don't see trying to adjust for spelling mistakes as the real issue. Maybe you disagreed until you were able to move past the whole elephant / electricity issue, but once you did, you saw that the issue is the scoring they're giving to each of those items, which is what I was suggesting be altered. There is nothing wrong with thinking that electricity might somehow be related to elephant; there is something wrong with prioritizing electricity over elephant when elephant is clearly a word in the system.

 

What I suggested was a better algorithm, and in that algorithm, words added to compensate for spelling mistakes are way down on the list of what should be delivered first. Looking for related words through synonyms or antonyms should rank higher than spelling-error guesses.

Apologies... going to nerd out here for a moment. My wife and I are expecting soon, and we downloaded the Social Security "Popular Baby Names from 1880 to 2017" data set and loaded it into MySQL on Amazon RDS. Trying to compare all of the names and variations of name spellings led me to the Levenshtein distance algorithm (see Wikipedia).

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

elephant vs. electricity = 7

elephant vs. elphant = 1

Seems like a pretty basic feature for any search algorithm, and implementations can be found via Google for just about any language.
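A minimal dynamic-programming version (my own sketch, plain Python) reproduces the two numbers above:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("elephant", "electricity"))  # 7
print(levenshtein("elephant", "elphant"))      # 1
```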

/nerding out