Semantic Search Is a Remedy for Keyword Inaccuracy in E-discovery
Semantic search is the aspirin to the headaches of keyword search in e-discovery. It is based on text embeddings, a natural language processing technology that has been around since the 1950s, but only became viable for business use in the last few years. Semantic search looks for meaning, not exact words, and enables teams to use English to search in languages they don’t know.
Why change the decades-long keyword search method? Keyword search catches too much irrelevant data, and too little of what’s really important. Under-generation of search results means important evidence is being missed, possibly exposing a client to litigation risk. Over-generation of search results obscures important documents with too many false positive hits and squanders review time by human attorneys, significantly increasing the financial cost of legal services.
The problems with keyword search
- Keywords catch too little due to misspellings, alternative word forms, and the many ways of saying the same thing. A 1985 study by David C. Blair and M. E. Maron evaluated the effectiveness of keyword search in a case concerning a subway accident. The event was called “the unfortunate incident” by one side and a “disaster” by the other side. Documents called it an “event,” “incident,” “situation,” “problem,” or “difficulty.” Malfunctioning mechanisms were called “sick,” “dead,” or “fried,” and a critical issue was deemed the “smoking gun.”
- Keywords return too much irrelevant information because they lack context. “Confidential” can be an important keyword that signals sensitive information, but because it appears frequently in email footers, it can’t be used in keyword search. The word “interest” could mean fascination, a repayment premium for a loan, or having property rights. However, searching for a phrase or sentence that provides context will rarely find a match in keyword search.
- Not knowing the “good keywords.” At the start of a case, whether for early case assessment or discovery, all the important keywords can’t be known until the attorney begins to manually review documents.
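The first two problems can be reproduced in a few lines. The following is a minimal sketch, assuming a tiny made-up corpus whose wording echoes the examples above; the `documents` dictionary and the `keyword_search` helper are hypothetical and not part of any e-discovery product.

```python
import re

# Hypothetical mini-corpus, invented for illustration only.
documents = {
    "doc1": "The unfortunate incident on the line delayed service.",
    "doc2": "After the disaster, the relay was found to be fried.",
    "doc3": "Lunch is at noon. CONFIDENTIAL: this email is intended only for the addressee.",
    "doc4": "The accident report is confidential and must not be shared.",
}

def keyword_search(keyword, docs):
    """Return ids of documents containing the exact keyword as a whole word (case-insensitive)."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [doc_id for doc_id, text in docs.items() if pattern.search(text)]

# Under-generation: doc1 and doc2 describe the same event but never use the word "accident".
accident_hits = keyword_search("accident", documents)

# Over-generation: doc3 matches only because of its email footer.
confidential_hits = keyword_search("confidential", documents)

print(accident_hits)
print(confidential_hits)
```

In this toy corpus, the search for “accident” finds only doc4 and misses the two documents that call the event an “incident” or a “disaster,” while the search for “confidential” also returns the lunch email because of its footer.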
What is semantic search?
Semantic search is an intelligent fuzzy keyword search that finds matches based on the meaning of words. It addresses issues of context and variety that plague standard keyword search, which is locked into finding a particular set of words spelled a particular way. In contrast, a semantic search for “alcoholic drink” could yield “cocktail,” but also “whisky on the rocks,” “margarita,” or “pina colada,” and so on.
The technology that enables semantic search is called word or text embeddings. An embedding model is trained on a body of text and assigns a vector to each word, phrase, sentence, or entire document based on the context in which it appears. Text embeddings make it possible to mathematically calculate the “distance” in meaning between texts, which also makes them effective at finding near-duplicate documents. Once vectors have been calculated for words in one language, the same can be done in other languages. Then, the various languages can be aligned so that words in one language are close in value to words with similar meanings in another language.
Consequently, semantic search is possible across languages, so that “unmanned aerial vehicle” can find “drone,” “UAV,” “Predator XP,” “無人航空機,” or “Unbemanntes Luftfahrzeug.”
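The idea of “distance” in meaning can be sketched with cosine similarity, one common way to compare embedding vectors. This is an illustration only: the three-dimensional vectors below are invented by hand, and the `toy_embeddings` table, the `semantic_search` helper, and the 0.95 threshold are all assumptions. A real system would obtain its vectors from a trained, cross-lingually aligned embedding model with hundreds of dimensions.

```python
import math

# Toy vectors invented for illustration; they do not come from any real model.
toy_embeddings = {
    "drone": [0.90, 0.10, 0.05],
    "UAV": [0.88, 0.12, 0.07],
    "無人航空機": [0.85, 0.15, 0.10],
    "Unbemanntes Luftfahrzeug": [0.87, 0.11, 0.09],
    "margarita": [0.05, 0.90, 0.20],
    "interest rate": [0.10, 0.15, 0.95],
}

# Hand-made stand-in for the embedding of the query "unmanned aerial vehicle".
query_vector = [0.89, 0.11, 0.06]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query, embeddings, threshold=0.95):
    """Return (term, score) pairs at or above the threshold, best match first."""
    scored = [(term, cosine_similarity(query, vec)) for term, vec in embeddings.items()]
    hits = [(term, score) for term, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

results = semantic_search(query_vector, toy_embeddings)
matched_terms = {term for term, _ in results}
for term, score in results:
    print(f"{term}: {score:.3f}")
```

With these made-up vectors, the English, Japanese, and German terms for the aircraft all clear the threshold, while “margarita” and “interest rate” do not. The threshold is a tuning choice: lowering it trades precision for recall.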
Benefits of semantic search to e-discovery
Attorneys review the documents returned by keyword search and set aside any that are not relevant. John H. Beisner estimated this human review is 75-90% of the cost of producing the documents. Furthermore, the earlier referenced Blair and Maron study examined the case’s discovery database of 40,000 documents and 350,000 pages. Although the legal team felt confident they had found 75% of relevant documents through keyword search, Blair and Maron discovered it was only 20%.
How semantic search solves keyword search issues
|Keyword search problem|How semantic search helps|
|---|---|
|Under-generation of results (low recall)|Because semantic search is looking at meaning, it doesn’t matter how a concept is described. A semantic search for “the unfortunate incident” could potentially find “disaster,” “event,” “incident,” “situation,” “problem,” or “difficulty.”|
|Over-generation of results (low precision) due to lack of context|Semantic search focuses on an idea or meaning and isn’t misled trying to exactly match a static word that may have multiple meanings. A 2002 study by Blair and Zipf found that in a group of 1,000 documents, there were 100 matches on “computing” with 10 different usages of the word. But in a group of 100,000 documents, there were 7,100 “matches” with 84 unique uses of the term “computing.” Semantic search avoids the many false positives that a single keyword will return.|
|Unable to know the “good” keywords at the start of discovery|Especially at the start of a case, semantic search can uncover the different ways that ideas might be expressed, and help the legal team quickly discover case-relevant phrases and key concepts. For a subway accident, a semantic search for “broken equipment” might find “fried circuit.”|
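Recall and precision, the two measures named in the table, have simple definitions: recall is the share of all relevant documents that a search retrieves, and precision is the share of retrieved documents that are actually relevant. The sketch below computes both; the document ids and counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision_and_recall(retrieved, relevant):
    """Compute (precision, recall) for a set of retrieved ids against the truly relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = retrieved & relevant
    precision = len(true_positives) / len(retrieved) if retrieved else 0.0
    recall = len(true_positives) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: five documents are truly relevant,
# and a keyword search returns four documents, only one of them relevant.
relevant_docs = {"d1", "d2", "d3", "d4", "d5"}
keyword_hits = {"d1", "d7", "d8", "d9"}

precision, recall = precision_and_recall(keyword_hits, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here one of four retrieved documents is relevant (precision 0.25) and one of five relevant documents is found (recall 0.20). In practice the full set of relevant documents is unknown and has to be estimated, for example by reviewing a sample of the documents the search did not return.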
More NLP to help e-discovery
Semantic search is just one of many technologies in the arsenal of natural language processing tools that can reduce risk and manual labor for e-discovery. Other technologies include:
- Language identification — Reveals ahead of time what languages need to be handled in the discovery.
- Entity extraction — Can be custom trained to find particular entities. In the Blair and Maron study, the legal team sought documents about “steel quantity.” Efforts were stymied by relevant documents that only mentioned the number of steel things, such as “girders,” “beams,” “frames,” and “bracings.” Entity extraction could find documents that mentioned quantities and “steel things.”
- Event extraction — Can be quickly custom trained to find events specific to a use case, and requires a small amount of training data to reach useful accuracy.
Semantic search reduces labor and costs for multilingual e-discovery
In today’s global economy, it’s not unusual for discovery to include documents in multiple languages. Traditionally, attorneys would turn to cross-lingual keyword search, which meant either machine translating the keywords into another language, or machine translating the documents being searched. Translating keywords suffers from the same weakness as regular keyword search: lack of context. Should the word “interest” be translated to Japanese as 利子 (as in interest on a loan), 興味 (attraction), or one of the four other definitions of “interest”?
Machine translation of the documents to be searched is the other approach, but errors mean potentially relevant results may be “lost in translation.” For example, the Chinese phrase 吃醋 means “to be jealous,” but literally it’s “to eat vinegar.” Google Translate correctly translates 吃醋 to “jealous,” but the phrase used in a sentence 你还会吃前女友的醋吗？(=Are you still jealous of your ex-girlfriend?) becomes “Would you still eat your ex-girlfriend’s vinegar?”
Semantic search across languages minimizes errors from translation by searching the text as written, and while there may be some fuzziness, the meaning will be true. More significantly, because semantic search can help discover key concepts and case-relevant phrases in the different languages, they can be bootstrapped to create a starter glossary for machine translation to minimize labor by the human contract attorney. They also enable software to automatically categorize files based on key phrases expressing similar ideas.
Learn more about this technique from our Oct. 12, 2022 webinar, “Extracting Essential Meaning in Multi-Language e-Discovery,” or at the Text Analytics Forum 2022, where Eugene Reyes of Basis Technology and Jason Boro Esq. of Linguistic Systems, Inc. will present “Augmenting Translation & Search In Ediscovery With Semantic Phrase Detection.” They will discuss and demonstrate a prototype tool that uses cross-lingual semantics to overcome the challenge of finding relevant documents in multiple languages, while also bootstrapping more accurate machine translation. Their presentation takes place on Thursday, Nov. 10 from 10:15 to 11 a.m. at the J.W. Marriott in Washington, D.C.
To learn more about how AI-powered natural language processing software can be integrated into your e-discovery workflow to intelligently increase efficiency and reduce risk, contact email@example.com.