Relevant Elephants: Fixing Canvas Commons Search

This idea has been developed and deployed to Canvas

I'm not really sure how to write this as a feature idea because, as far as I'm concerned, it is a bug that needs to be fixed. But my dialogue with the Helpdesk went nowhere (Case #03332722), so I am submitting it as a feature request per their advice. I am not proposing how to fix this problem; I am just going to document it. People who actually know something about repository search would need to propose the best set of search features; I am not an expert in repository search or in fixing a defective search algorithm. But as a user, I can declare that this is a serious problem.

If you search Canvas Commons for elephants, you get 1084 results. Here are the "most relevant" results:

https://lor.instructure.com/search?sortBy=relevance&q=elephant 

There are some elephants, which is good... but a lot of other things that start with ele- ... which is not good. Apparently there is some component in the search algorithm which returns anything (ANYTHING) that matches the first three characters of the search string.

Election Unit Test.

Static Electricity Virtual Lab.

And so on. And so on. Over 1000 false positives. ele-NOT-elephants.
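To make the suspected behavior concrete, here is a minimal sketch in Python. This is only my guess at what the algorithm is doing, NOT Instructure's actual code, and the catalog titles are just the examples from above:

```python
# A minimal sketch of the suspected three-character fallback (a guess at the
# behavior, NOT Instructure's actual code; the titles are example data).

def prefix_fallback_search(query, titles, prefix_len=3):
    """Return exact substring matches first, then anything containing a word
    that shares the query's first `prefix_len` characters."""
    q = query.lower()
    prefix = q[:prefix_len]
    exact = [t for t in titles if q in t.lower()]
    fallback = [t for t in titles if t not in exact
                and any(w.lower().startswith(prefix) for w in t.split())]
    return exact + fallback

catalog = [
    "Shooting an Elephant (quiz)",
    "Election Unit Test",
    "Static Electricity Virtual Lab",
    "Elementary Algebra",
]

# One real elephant, then three ele- false positives ride along:
print(prefix_fallback_search("elephant", catalog))
```

With any fallback like this, the false positives grow with the catalog: every item containing an ele- word counts as a "hit," which matches the 1000+ results described above.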

most relevant elephant search

As near as I can tell, there might be a dozen elephants in Canvas Commons. I've found six for sure; there could be more... it's impossible to find them. Impossible because of the way that search works (or doesn't work). The vast -- VAST -- majority of results are for electricity, elections, elearning, elements, elementary education, electrons, and anything else that starts with ele.

You might hope that all the elephants are there at the start of the "most relevant" search results... but you would be wrong. There are 5 elephants up at the top, but then "Static Electricity Virtual Lab" and "Valence Electrons and Isotopes" etc. etc. are considered more relevant than Orwell's essay "Shooting an Elephant" (there's a quiz for that). I have yet to figure out why Static Electricity Virtual Lab is considered a more relevant search result for "elephant" than materials for George Orwell's Elephant essay which actually involves an elephant.

I found out about Orwell's Elephant this way: when I sort the search results by "Highest Rated," the top-rated elephant is Orwell's elephant. There are lots of other highest-rated items at the top, though, which have nothing to do with elephants, and that is why you cannot see Orwell's elephant in my screenshot; it's below all those other items. But if you scroll on down, you will find Orwell's elephant essay. Eventually.

I found it using Control-F in my browser.

Here is the search URL:

https://lor.instructure.com/search?sortBy=rating&q=elephant 

highest rated elephant results (with no elephants)


Switch the view to "Latest" and all the elephants are missing here too. Really missing. Well, you'll get to them eventually, I guess, if you keep loading more and more... and more and more... and more. But no one is going to scroll and load and scroll and load to find the elephants, right?

Here's the search term: elephant. But the search results are for ele-, like elementary mathematics, elementary algebra, "Abraham Lincoln Elementary's 5th Grade beginning of the year prompt," "the elements involved in warehouse management," and so on.

https://lor.instructure.com/search?sortBy=date&q=elephant 

latest elephants... but there are no elephants

I hopefully tried putting quotation marks around the word "elephant" to see if that would help. It did not. 

The Helpdesk tells me that this is all on purpose in order to help people with spelling errors:

That is how our search engine was set up. To search as specific as it can get and then to gradually filter in less specific content. This is done so that if a word is misspelled the search is still able to locate it.

If I type "elphant," Google shows me results for "elephant." That sounds good: it corrected my typo. But Canvas Commons gives me no elephants if I type "elphant." Instead it gives me two things: an item submitted by someone named Elpidio, and something called "Tech PD and Educational Technology Standards," which involves the acronym ELP. So much for helping people with spelling errors.

Electricity, elections, elements, elearning: these do not sound good. Those results are obstructing the search; they are not helping. There is nothing "gradual" about the filtering. Static electricity shows up as more relevant than George Orwell's elephant. Some kind of three-character search string is driving the algorithm to the exclusion of actual elephant matches.

If you assume that someone who typed ELEPHANT really meant to type ELECTRICITY or perhaps ELEARNING, well, that is worse than any autocorrect I have ever seen. And I have seen some really bad autocorrect.

This happens over and over again; it affects every search.

Want to search for microscopes? Get ready for a lot of Microsoft. These are supposedly the most relevant microscope search results, but the second item is Microsoft... even though it doesn't seem to have anything to do with microscopes at all, from what I can tell.

https://lor.instructure.com/search?sortBy=relevance&q=microscope 

Still, we're doing better than with the elephants here. There are a lot of microscopes in addition to Microsoft:

microscope search; most relevant

But look what happens if you want highest-rated microscopes. See the screenshot; there are no microscopes. It's Microsoft Microsoft Microsoft. But hey, there is also Of Mice and Men!

https://lor.instructure.com/search?sortBy=rating&q=microscope 

So, the search algorithm assumes that, while I typed "microscope" as my search term, I might really have meant to type "Of Mice and Men." Or Microsoft. Or the name Michael (a lot of content contributors are named Michael, or Michelle).

highest rated microscopes

I could go on. But I hope everybody gets the idea. If this is really a feature of Canvas Commons search and not a bug (???), I hope this three-character search string "feature" can be replaced with a better set of search features.

Although I would still call this three-character approach a bug, not a feature. Which is to say: I hope we don't have to wait a couple of years for the (slow and uncertain) feature request process before it gets reexamined.

Comments from Instructure

For more information, please read through the Canvas Release Notes (2018-10-27): https://community.canvaslms.com/docs/DOC-15588-canvas-release-notes-2018-10-27

51 Comments
laurakgibbs
Community Champion
Author

carroll-ccsd I love learning new things, so Levenshtein distance is great! It reminds me of that chain-letter-game that they used to give us in school, how you can get from one word to another step by step: very fun game for kids to play with spelling and vocabulary.

from CAT to DOG in three steps:

CAT is a word

COT is a word

COG is a word

DOG is a word

But here's the thing: the three-character wildcard seems built into the system in a more fundamental way; only the engineers can tell us. It's not doing a letter-by-letter decrement of the search term (as near as I can tell), and it is not doing stemming or working on the lemma: it really looks like it is just adding a three-character-wildcard search into the mix. Some "relevance" weighting means that the "most relevant" searches are not totally hosed, but they are still pretty bad: search on AIDS and the first results are NOT for AIDS but for aid, even among the most relevant results; you have to scroll down to find AIDS. The "latest" and "highly rated" searches, though, are often totally hosed if the search term shares its first three characters with other words, like poor Andrew Jackson and his jacuzzi.
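For contrast, here is a quick sketch (ordinary dynamic programming, nothing to do with how Commons actually works) showing why an edit-distance measure like Levenshtein would genuinely help with typos in a way a three-character wildcard does not: "elphant" is one edit away from "elephant," while "electricity" is seven edits away despite sharing the ele- prefix.

```python
# Classic Levenshtein edit distance via dynamic programming (illustration only;
# nothing here reflects Instructure's actual implementation).

def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute
        prev = cur
    return prev[-1]

print(levenshtein("elphant", "elephant"))      # 1: an obvious typo
print(levenshtein("elephant", "electricity"))  # 7: not a typo at all
```

It even matches the CAT-to-DOG word ladder above: levenshtein("cat", "dog") is 3, one substitution per step.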

So, I'm waiting to find out more about what's really going on with the three-character-wildcard and how it fits into the algorithm. It's the only thing that seems to explain the results I see.

Here's what I mean about how "first aid" is more relevant than "AIDS" when you search for AIDS: first aid and financial aid are considered "more relevant" than an item that has AIDS in the title (that first Africa item does have AIDS as a tag). But you have to scroll on down to find a resource with AIDS in the title. It's search result #27 in the list. (I thought at first Poverty INC might have something to do with AIDS, but it's just aid).

search for AIDS

robotcars
Community Champion

I agree, and you've made the point: the algorithm depends too much on falling back to the wildcard as 'success'. I also agree with James that this doesn't constitute a 'bug'; it's more like an incomplete feature/implementation.

Having written a dozen or so search engines over the years, including an LTI with search for resources in our own Canvas using MSSQL Full Text Search, I've had a habit of not executing a search request until the user has entered 3 characters, which seems to be different from this issue of treating the fallback as success.

I can't seem to search lor.instructure.com itself, probably because of scope/local settings.

I can use Commons in our instance, and here's what I get searching for content we have available.

To test this, I searched for finance, expecting zero results.

Here are the top results.

commons_finance = find

Finance is not found. Find is found, in the description.

This module is from the District-developed English 8 S1 course created by CCSD BlendED digital learning initiative. Please join the PLC for that course to receive updated information. You can find all the PLCs in the “CCSD Hub” (link on the lower left in the Canvas global navigation)...

If I search for elephants, I get a Leonard Nimoy quote...

0 results.

laurakgibbs
Community Champion
Author

I don't know: if people start with a search that they want to filter for "highest rated" or filter for "latest," then the wildcard search hoses those results pretty thoroughly, in a way that I would say was surely not intended when users were offered those filters. The Jackson jacuzzi is my favorite so far in terms of a hosed highly-rated filter.

This all makes me wonder just who manages the Commons. I mean, engineers work on it (and only they can tell us what's going on inside the guts of the algorithm)... but who "owns" the process, so to speak? I have no idea. I hope we hear something from that person, whoever they are.

Same Nimoy quote shows up at the LOR for null results also. 🙂

Now if only we could ask Mister Spock what to do. 

Spock and Cat; Live long and prospurr

kona
Community Coach

carroll-ccsd, first congratulations on your wife's pregnancy. Very fun and exciting time. Second, what you did totally made me laugh because it sounds like something James did with our first baby. He probably remembers the details, but it involved a lot of downloading data and looking at all sorts of crazy things - even number of syllables. Ultimately we were in the recovery room with said firstborn - without a name - and I looked at James and told him we weren't leaving the recovery room without naming the kiddo. I then asked what sounded better? James, Kona, and Reagan or James, Kona, and X (other name we were considering). And, after all of James' research and the countless conversations and overly deep discussions (which is even worse when you've been a teacher and had students with a lot of the names you're considering), that's pretty much what sealed the deal. 

Good luck with not totally screwing up your kid and picking the wrong name. Just kidding... well maybe... 😉

Kona

laurakgibbs
Community Champion
Author

Just don't use the Troll Name Generator.

http://www.fantasynamegenerators.com/troll_wow_names.php 

You are so lucky with your very Googleable name, Kona! And "Kona Jones" just sounds good too. 

robotcars
Community Champion

Thank you @kona!

Good call! Sometimes we need a reminder to stop analyzing. 

I hope we don't get that far without one either. We actually have some boys' names we like, but picking girls' names seems to slow us down. We tinkered with this data because my wife (Nora, whose name is actually French) moved here from Eastern Europe when she was a teen, and with 99% of her family there, we are trying to figure out names that translate easily, or sound good in Cyrillic, but can be pronounced by Americans. At least a nickname that works both ways. We eventually got overwhelmed and decided to table it, figuring some spark will catch. My dad, who was a teacher for 34 years, has said the same thing about students: some names just get ruined. 😁

It's also kind of fun to learn analysis on that much data consisting of only 4 columns. Good practice.

Might have to tinker with this one too, PHP: Gender\Gender::country - Manual 

Not that I care about the gender of a name, more the locale/origin, like Irish/Slavic equivalents.

Besides, the data shows me gender doesn't really apply to names; most names appear in both 'sets'. Fun fact.

Finally, I'm comforted knowing that personality ends up describing a Person more than the name we stick them with.

laurakgibbs
Community Champion
Author

Yay for Cyrillic. 🙂

maguire
Community Champion

laurakgibbs I could not resist your offer 🙂. An example is someone looking for the 1903 film described in this wikipedia page: Electrocuting an Elephant - Wikipedia

James
Community Champion

 @maguire 

My boring paper on the death penalty and how the electric chair played a predominant role in the United States just turned into a paper on animal cruelty. My professor is going to love it! Thanks.

laurakgibbs
Community Champion
Author

ouch... I actually know of the existence of that, although I have not watched it (I don't think I could bear it...).

although it is proof of the power of boolean search, if not wildcard search: you don't just want elephant, you don't just want electricity, you don't want ele* (which also includes elections and elementary schools): you want electricity and elephant fatally combined with AND. 
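A tiny sketch of that boolean idea (toy titles, not real Commons data): require every term to match, instead of letting any shared prefix count as a hit.

```python
# Boolean AND over search terms (illustration only): a result must match every
# term, so "Electrocuting an Elephant" matches but mere ele- lookalikes do not.

def matches_all(title, terms):
    words = title.lower().split()
    return all(any(term in word for word in words) for term in terms)

titles = [
    "Electrocuting an Elephant (1903)",
    "Static Electricity Virtual Lab",
    "Shooting an Elephant",
]

hits = [t for t in titles if matches_all(t, ["electrocuting", "elephant"])]
print(hits)  # only the 1903 film survives the AND
```

The AND is what does the work: electricity alone matches the lab, elephant alone matches Orwell, and only the conjunction pins down the film.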

I always share this infographic with my students; Commons is no place to practice search skills, but Google is. 🙂

Google like a boss infographic