Kanopy Brings Streaming Video to Libraries
Rosette enables better search of streaming content
Introduction
A few short years ago, checking out popular DVDs from the library was a new and radical idea. Today library patrons want to “borrow” films via streaming video, and Kanopy fills that need. Libraries sign up with Kanopy on a pay-per-view basis. Freed from the cost and labor of acquiring, storing, and upgrading physical video media, libraries and universities gain access to Kanopy’s collection of 30,000+ independent films, documentaries, classics, and foreign movies.
The Need
Kanopy loves rare films, but not a search engine that makes them hard to find. When patrons can’t locate the films they want to watch, Kanopy’s revenue suffers.
In the early stages of Kanopy, which was founded in 2008, search was less than ideal. Patrons had to type in an exact title to find their film. Thus, typing “Wait for Harry” would not find “Waiting for Harry.”
“We were getting okay results, but not great with the default [Apache] Solr search, [and] we had all sorts of issues,” Simon Deconde, Kanopy’s principal architect, said. “Search keywords needed to have no spelling mistakes. When a user got only 10 results, they could have gotten 50.”
Users also wanted to see their queries expanded to related searches, especially if they were looking for films around a concept. If, for example, a patron searched for “ecology” films, but the Kanopy database tagged that as “ecological” or didn’t tag that idea at all, many relevant films wouldn’t be found.
“When we identified that search was not great, some of our customers had the same opinion,” Deconde said. “It was just too hard to find titles.”
The Solution
Deconde and his team tried different tactics to improve search, such as a stemming dictionary to relate “waiting” with “wait.” Stemming is based on language-specific rules for removing characters from the ends of words. It is relatively easy to implement, but different configurations tend to overstem (i.e., link two unrelated words, such as “coding” and “cod,” which are both stemmed to “cod”) or understem (i.e., fail to link two related words, such as “coding” and “code,” when coding ⇒ cod but code ⇒ code).
Problems with Stemming
PROBLEM | RESULT | EXAMPLE |
---|---|---|
Overstemming (high recall, low precision) | Some unrelated words are linked in searches by a common stem, resulting in lower precision. | coding ⇒ cod; cod ⇒ cod |
Understemming (medium recall, low precision) | Some related words are not linked in searches, resulting in lower recall. | coding ⇒ cod; code ⇒ code |
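The two failure modes can be reproduced with a toy suffix-stripping stemmer. This is a minimal sketch for illustration only: the rule list and the names `SUFFIXES`, `naive_stem`, and `linked` are assumptions made for this example, not the stemmer Kanopy configured in Solr.

```python
# Toy suffix-stripping stemmer. Illustrative only: the rules below are
# hypothetical and far cruder than any production stemmer.
SUFFIXES = ("ing", "ed", "s")  # checked in order; first match wins


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def linked(a: str, b: str) -> bool:
    """Two words are linked in search when they share a stem."""
    return naive_stem(a) == naive_stem(b)


if __name__ == "__main__":
    for pair in [("waiting", "wait"), ("coding", "cod"), ("coding", "code")]:
        print(pair, "linked" if linked(*pair) else "not linked")
```

With these rules, “waiting” and “wait” are linked as intended, but “coding” is also linked to the unrelated “cod” (overstemming) and is not linked to the related “code” (understemming).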
Although stemming might increase recall (the percentage of all possible right answers that a search finds), it would often decrease precision (the percentage of returned answers that are right).
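A small worked example may make the trade-off concrete. The film identifiers and result sets below are invented for illustration and are not Kanopy data; the function simply applies the standard definitions of precision and recall.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant results returned / all results returned
    recall    = relevant results returned / all relevant results
    An empty denominator is reported as 0.0 (a convention chosen for this sketch).
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query with four relevant films in the catalog.
relevant = {"f1", "f2", "f3", "f4"}

# Exact-title matching: one correct hit, three relevant films missed.
exact_match = {"f1"}

# Aggressive stemming: more relevant films found, plus three unrelated ones.
overstemmed = {"f1", "f2", "f3", "x1", "x2", "x3"}

if __name__ == "__main__":
    print("exact match:", precision_recall(exact_match, relevant))
    print("overstemmed:", precision_recall(overstemmed, relevant))
```

In this made-up case, exact matching scores a precision of 1.0 but a recall of only 0.25, while the overstemmed search raises recall to 0.75 at the cost of dropping precision to 0.5.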
“[The option] might improve one part of search, but reduce quality in another, or would not support as many use cases as we needed,” Deconde said. “Recall was key, but precision matters, too. It was a balance [we sought].”
Deconde considered different commercial and open source solutions, “but those solutions were either impractical or had to be maintained,” he said.
Kanopy eventually chose Rosette text analytics for its lemmatization function, which increases recall and precision. Furthermore, the Rosette plugin for Solr made integration quick and easy.
Lemmatization finds the dictionary form (lemma) of a word based on an understanding of the language and the context in which the word appears. It is more sophisticated than stemming, as it requires dictionary data and morphological data, such as parts of speech. In “Jack spoke to me,” the lemma of the verb “spoke” is “speak,” but given “the wheel spoke broke,” the lemma of the noun “spoke” is “spoke.” Words with the same lemma (tells/telling/told ⇒ tell; mouse/mice ⇒ mouse) can be related, thus enabling a search query to be expanded without losing precision.
Lemmatization Examples
WORD | LEMMA |
---|---|
Jack spoke to me. | speak |
The wheel spoke broke. | spoke |
tells / telling / told | tell |
mouse / mice | mouse |
lucky / luckier / luckiest | lucky |
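The table above can be read as a lookup keyed on both the word and its part of speech. The sketch below assumes a tiny hand-built dictionary and hand-supplied part-of-speech tags; a real lemmatizer infers the part of speech from context and uses far larger dictionary data. The names `LEMMAS`, `lemmatize`, and `forms_for` are hypothetical and do not represent Rosette's API.

```python
# Toy dictionary-based lemmatizer. The entries mirror the examples in the
# table; the part-of-speech tags are supplied by hand (an assumption of this
# sketch, since a real system would derive them from the sentence).
LEMMAS = {
    ("spoke", "VERB"): "speak",
    ("spoke", "NOUN"): "spoke",
    ("tells", "VERB"): "tell",
    ("telling", "VERB"): "tell",
    ("told", "VERB"): "tell",
    ("mouse", "NOUN"): "mouse",
    ("mice", "NOUN"): "mouse",
    ("lucky", "ADJ"): "lucky",
    ("luckier", "ADJ"): "lucky",
    ("luckiest", "ADJ"): "lucky",
}


def lemmatize(word: str, pos: str) -> str:
    """Look up the lemma; fall back to the lowercased word if unknown."""
    return LEMMAS.get((word.lower(), pos), word.lower())


def forms_for(lemma: str) -> list:
    """List every known surface form that shares the given lemma."""
    return sorted({word for (word, _), value in LEMMAS.items() if value == lemma})


if __name__ == "__main__":
    print(lemmatize("spoke", "VERB"))  # "Jack spoke to me."
    print(lemmatize("spoke", "NOUN"))  # "The wheel spoke broke."
    print(forms_for("tell"))
```

Because the key includes the part of speech, the verb “spoke” maps to “speak” while the noun “spoke” stays “spoke,” which is the distinction a suffix-stripping stemmer cannot make.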
When searching a database of 30,000+ movies, finding just anything isn’t helpful. It’s about locating the right thing or, as Deconde said, finding results closely related to what you were looking for.
“Using Rosette increased the chance of finding relevant titles on Kanopy. It made sense for us, even as a small company, to invest in [Rosette],” he said.
“We really didn’t have to think too much about it. We installed [Rosette], it worked, and we moved on to other areas, which was exactly what we needed. When we can find shortcuts like that, it’s worth it.”
The Results
After the introduction of Rosette, Deconde reported “a great improvement” in search results, and added, “We stopped hearing as many complaints.”
In January 2019, Entertainment magazine reported that Kanopy partners with 4,000 public libraries and academic institutions.1