25 Oct 2017

A Document’s Vital Stats: Keyphrases and Concepts


New Rosette Cloud topics endpoint enables summarization, content organization and trend analysis

We are creating new content online at an unprecedented rate. Globally, we compose 3.6 trillion words every day on email and social media, the equivalent of 36 million books.* Managing that volume of text and deriving value from it is only feasible through automation.

The Rosette 1.8 release last week included a new topic extraction endpoint that can help users do just that. For a given input, /topics extracts “keyphrases” and corresponding “concepts.” Keyphrases are significant phrases or words taken directly from the text that Rosette deems to be representative of the content. Concepts are themes detected within the text that may not be explicitly mentioned in the input.

As a new endpoint, /topics still has “labs” status, so feedback and questions are very welcome. Keep an eye out for updates and improvements in the 1.9 release!
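
A request to the endpoint might be assembled along the lines of the minimal Python sketch below. The base URL, the `X-RosetteAPI-Key` header name, and the `content` body field are assumptions that this post does not spell out, so confirm them against the API documentation at developer.rosette.com. The helper `build_topics_request` is a hypothetical name, and it only builds the request; sending it requires a valid key.

```python
import json
import urllib.request

# Assumed endpoint URL: confirm against the Rosette Cloud API documentation.
ROSETTE_TOPICS_URL = "https://api.rosette.com/rest/v1/topics"


def build_topics_request(text, api_key, url=ROSETTE_TOPICS_URL):
    """Build (but do not send) a POST request carrying the document text."""
    body = json.dumps({"content": text}).encode("utf-8")  # assumed body field
    headers = {
        "X-RosetteAPI-Key": api_key,  # assumed header name
        "Content-Type": "application/json",
        "Accept": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    request = build_topics_request("Text of the document to analyze", "YOUR_API_KEY")
    print(request.get_method(), request.full_url)
    # To send it: urllib.request.urlopen(request), then json.load() the response.
```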

Flash “Gist” Your Documents

Topic extraction goes further in summarizing than entity extraction or categorization because /topics is not constrained by a finite list of recognized entity types or categories.

In its most basic use, topic extraction allows users to quickly review a list of keyphrases and concepts to get the gist of an article or document. On a macro level, the same principle can be applied to a corpus of documents to understand what ideas are most common amongst them. Knowing the keyphrases and concepts in each document enables users to automatically tag, sort, and organize their data, making it more useful to analysts and database managers.
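
As a rough illustration of tagging and corpus-level review, the Python sketch below builds a tag index and counts the most common concepts. It assumes the /topics output for each document has already been reduced to plain lists of phrase strings; the actual response schema is not described in this post, and the function names and document IDs are hypothetical.

```python
from collections import Counter, defaultdict


def build_tag_index(topics_by_doc):
    """Map each lowercased keyphrase or concept to the documents it tags."""
    index = defaultdict(set)
    for doc_id, topics in topics_by_doc.items():
        for phrase in topics.get("keyphrases", []) + topics.get("concepts", []):
            index[phrase.lower()].add(doc_id)
    return {phrase: sorted(doc_ids) for phrase, doc_ids in index.items()}


def most_common_concepts(topics_by_doc, n=3):
    """Count how many documents mention each concept and return the top n."""
    counts = Counter()
    for topics in topics_by_doc.values():
        # A set ensures each concept counts once per document.
        counts.update({concept.lower() for concept in topics.get("concepts", [])})
    return counts.most_common(n)


if __name__ == "__main__":
    # Hypothetical, already-simplified results for two documents.
    corpus = {
        "doc-1": {"keyphrases": ["Opioid Crisis"], "concepts": ["Opioid", "Heroin"]},
        "doc-2": {"keyphrases": ["opioid plague"], "concepts": ["Opioid"]},
    }
    print(build_tag_index(corpus)["opioid"])
    print(most_common_concepts(corpus, n=1))
```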

Taking topic extraction a step further, users can discover trending topics and track how they change over time. For example, marketers and product managers can analyze customer requests and complaints, as well as assess whether their campaigns or new products are shifting customer opinions. Government analysts can follow changes in public opinion and intelligence reports to power anticipatory intelligence systems. Content recommendation engines can automatically rotate suggestions to subscribers according to public interest in addition to personal preferences.
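
One simple way to spot trending topics is to compare how many documents mention each concept in two time periods. The sketch below is an illustrative approach, not a Rosette feature; it assumes each document is represented by a plain list of concept strings, and `concept_trends` is a hypothetical helper.

```python
from collections import Counter


def concept_trends(previous_docs, current_docs):
    """Return (concept, change in document count) pairs, largest rise first."""

    def count(docs):
        counts = Counter()
        for concepts in docs:
            # Count each concept once per document.
            counts.update({concept.lower() for concept in concepts})
        return counts

    before, after = count(previous_docs), count(current_docs)
    changes = {c: after[c] - before[c] for c in set(before) | set(after)}
    # Largest increases first; ties are broken alphabetically for stable output.
    return sorted(changes.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    last_month = [["Heroin"], ["Opioid"]]
    this_month = [["Opioid"], ["Opioid", "Harm reduction"]]
    for concept, change in concept_trends(last_month, this_month):
        print(f"{concept}: {change:+d}")
```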

Topic extraction in action

Take the following excerpt from a Voice of America article about the opioid abuse epidemic:

US Attorney General: The Opioid Crisis is America's 'Top Lethal Issue'

U.S. Attorney General Jeff Sessions called the opioid crisis "America's top lethal issue" Tuesday, saying that a "comprehensive antidote" was needed to address the crisis.

Speaking from the National Alliance for Drug Endangered Children national conference in Green Bay, Wisconsin, Sessions thanked the audience for their work in making the crisis' effects on children known.

“Our country, despite the record deaths, I don’t think has fully recognized the damage this addiction nightmare is doing to us," said Sessions. "And as you understand this epidemic is taking a heavy toll on the most innocent and vulnerable — our children. And yet, in the national conversation about drug abuse, these children are too often forgotten.”

Sessions said that the solution has "three-pillars" — prevention, enforcement, and treatment. Sessions added that the prevention step in particular had been discussed at a meeting with top officials, including State Secretary Rex Tillerson, and White House Chief of Staff John Kelly the day before.

Earlier this month, President Donald Trump vowed that the U.S. would "win" the battle against the heroin and opioid plague, but he stopped short of declaring a national emergency as his handpicked commission had recommended.

The /topics endpoint identifies the following keyphrases:
  • “State Secretary Rex Tillerson”
  • “Top Lethal Issue”
  • “U.S.”
  • “Drug Endangered Children national conference”
  • “crisis”
  • “U.S. Attorney General Jeff Sessions”
  • “US Attorney General”
  • “opioid plague”
  • “Opioid Crisis”
…and the following concepts:
  • “Substance abuse”
  • “Rex Tillerson”
  • “Harm reduction”
  • “Heroin”
  • “Controlled Substances Act”
  • “Opioid”
  • “Jeff Sessions”
  • “Drug policy reform”
  • “Infinite Crisis”

The keyphrases found include entities such as the person “State Secretary Rex Tillerson” and the place “U.S.,” but the endpoint also recognizes that more abstract keyphrases like “Opioid Crisis” are central to the text. The concepts list goes a step further, recognizing that the excerpt is about “substance abuse” and “drug policy reform,” although neither theme is explicitly stated in the text.
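
To see which concepts were inferred rather than quoted, one can check each concept against the input text. The sketch below uses a plain case-insensitive substring match, which is a simplification chosen for illustration and not how Rosette derives concepts; `split_concepts` is a hypothetical helper.

```python
def split_concepts(text, concepts):
    """Separate concepts that appear verbatim in the text from those that do not."""
    lowered = text.lower()
    explicit = [c for c in concepts if c.lower() in lowered]
    implicit = [c for c in concepts if c.lower() not in lowered]
    return explicit, implicit


if __name__ == "__main__":
    # Two fragments taken from the excerpt above.
    text = (
        "in the national conversation about drug abuse, these children are "
        "too often forgotten. the battle against the heroin and opioid plague"
    )
    concepts = ["Substance abuse", "Heroin", "Opioid", "Drug policy reform"]
    explicit, implicit = split_concepts(text, concepts)
    print("explicit:", explicit)
    print("implicit:", implicit)
```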

The returned keyphrases and concepts are ranked by salience, that is, the relative importance of each phrase to the overall topic of the text. Salience scores are not currently exposed to the end user, but expect them in a future release when /topics moves from “labs” to fully supported.

Note: The concept extraction feature of the /topics endpoint is designed for documents, not short strings such as social media posts. Because concept extraction is intended to extrapolate from the given text, calls on short strings return very noisy results with many false positives.
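
A client could guard against this by ignoring concepts for very short inputs. The sketch below shows one way to do that; the 50-word cutoff is a hypothetical value chosen for illustration, not a Rosette recommendation, and the result is assumed to hold a plain list of concept strings.

```python
MIN_WORDS_FOR_CONCEPTS = 50  # hypothetical cutoff; tune it on your own data


def usable_concepts(text, topics, min_words=MIN_WORDS_FOR_CONCEPTS):
    """Return the concepts only when the input is long enough to trust them."""
    if len(text.split()) < min_words:
        return []  # too short: concepts are likely to be noisy
    return topics.get("concepts", [])


if __name__ == "__main__":
    tweet = "Opioid crisis is America's top lethal issue"
    print(usable_concepts(tweet, {"concepts": ["Opioid", "Infinite Crisis"]}))
```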

Try it out

Topic extraction can help you summarize and extract key information from articles and documents, and automatically tag them for improved content management and document search. Ready to try it out? Head over to developer.rosette.com to sign up for a free Rosette Cloud key (no credit card required) for up to 10,000 calls per month. Let us know what you think!


*Clive Thompson, Smarter Than You Think