A Text Analytics Recipe For Document Summarization

Analytics with Rosette doesn’t end with an endpoint; combine multiple capabilities for intriguing insights and deeper functionality.

Basis Technology prides itself on being an agile and responsive organization that always lends its customers an ear. Not surprisingly, one of the most common requests we hear is for “more”: more endpoints, more languages, more capabilities. Often, when speaking with users, we realize that we can already deliver that “more” by combining endpoints.

For example, when used with our Rosette API Python binding, the entities and morphology/lemmas endpoints can be combined to create summaries. By ranking the richness of the content that Rosette extracts from individual sentences, the script can identify the “meat” of the document and its key ideas. While the result may not be the beautifully polished summary you would find on a book jacket, this concoction of Rosette API endpoints provides the gist of the original content: in essence, a summary.

How it works

Our summarization script takes in the sentences, words, and named entities that Rosette API extracts from a document. It then assigns each sentence a score based on the density of “contentful” words and named entity mentions it contains. Scores are normalized by sentence length, so a ten-word sentence with five “contentful” words (50%) ranks lower than a five-word sentence in which every word is high-content (100%).
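
The length normalization described above can be sketched in a few lines of Python. This is an illustration only, not the script's actual scoring formula: the function name `density_score` and the hand-picked `contentful` set are hypothetical, standing in for the lemmas and entity mentions that the real script gets from Rosette API.

```python
def density_score(tokens, contentful):
    """Return the fraction of tokens that count as contentful (0.0 to 1.0).

    Hypothetical helper: `contentful` stands in for the lemmas and
    entity mentions that Rosette API would supply.
    """
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.lower() in contentful)
    return hits / len(tokens)


contentful = {"saturn", "rings", "dust", "cassini", "ice"}

long_sentence = "Cassini saw the rings of Saturn hold ice and dust".split()
short_sentence = "Cassini Saturn rings dust ice".split()

print(density_score(long_sentence, contentful))   # 5 of 10 tokens are contentful
print(density_score(short_sentence, contentful))  # 5 of 5 tokens are contentful
```

Dividing by the token count is what keeps a long sentence from winning simply because it has more words.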

The script also takes into account a sentence’s location. We won’t get into the details of “logarithmic falloff”, but suffice it to say that sentences near the beginning of the document are ranked higher than sentences at the end. The script then outputs the most contentful sentences as a summary paragraph. Check out the code samples below and read on for download and installation instructions.
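
Since the falloff details are skipped here, the following is only one plausible shape for a position weight, not the script's formula. It assumes a weight of 1 / (1 + ln(1 + i)) for the sentence at zero-based index i; the names `position_weight` and `weighted_scores` are hypothetical.

```python
import math


def position_weight(index):
    """Weight for the sentence at zero-based `index`; decays logarithmically.

    Assumed form: 1 / (1 + ln(1 + index)). The first sentence gets 1.0
    and later sentences get progressively smaller weights.
    """
    return 1.0 / (1.0 + math.log1p(index))


def weighted_scores(density_scores):
    """Scale each sentence's density score by its position weight."""
    return [score * position_weight(i) for i, score in enumerate(density_scores)]


# Three sentences with equal density: earlier ones end up with higher scores.
print(weighted_scores([0.5, 0.5, 0.5]))
```

A logarithmic curve drops quickly over the first few sentences and then flattens, so a strong sentence deep in a long document is penalized only slightly more than one in the middle.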

Summarization in action

You can summarize plain text documents…

$ ./summarize.py -k $ROSETTE_USER_KEY -i path/to/your/file.txt

…or URIs. Rosette API automatically extracts the content.

$ ./summarize.py -k $ROSETTE_USER_KEY -u -i "http://www.csmonitor.com/Science/2016/1209/How-dust-changed-scientists-view-of-Saturn-s-C-ring"

In the example above, the document summarized is rather long. By default, the script reduces the document to 50% of its original length (measured in sentences), but you can choose a different percentage with the -p/--percent option, or request an exact number of sentences with the -n/--top-n option.

For example, here we limited the summary to ten sentences:

$ ./summarize.py -k $ROSETTE_USER_KEY -n 10 -u -i "http://www.csmonitor.com/Science/2016/1209/How-dust-changed-scientists-view-of-Saturn-s-C-ring"
The secret to understanding Saturn's C ring? 
Saturn's icy moon Mimas is dwarfed by the planet's enormous rings.
Scientists at Cornell University in Ithaca, N.Y., have been using data from NASA's Cassini mission to Saturn, particularly its microwave passive radiometer, to study the planet's rings. 
The rings are mostly composed of ice, but "it is the small fraction of non-icy material – the dust the ring collects – that is valuable for clues about the ring's origin and age," doctoral candidate Zhimeng Zhang, who led the work, told the Cornell Chronicle.
Dust drifts through space from beyond the Kuiper Belt and hits Saturn's rings. 
The older a ring is, therefore, the more dust it will have time to collect. 
And scientists can analyze the dust to figure out how old the ring is.
It collides with Saturn's rings, and sticks to them. 
Zhang and her fellow researchers believe that the C ring has been "continuously polluted" by these space dust particles.
When instruments like Cassini's microwave passive radiometer measure a ring's thermal emissions, dustier rings will have higher readings.
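
Both the percentage and the top-n options come down to choosing how many sentences to keep. Here is a minimal sketch, assuming the count is rounded up from the percentage and that the kept sentences are returned best first; the script itself may round or order differently, and `select_summary` is a hypothetical name.

```python
import math


def select_summary(sentences, scores, percent=50, top_n=None):
    """Keep the highest-scoring sentences, best first.

    `top_n`, when given, overrides `percent`. Rounding up and keeping
    at least one sentence are assumptions made for this sketch.
    """
    if not sentences:
        return []
    if top_n is None:
        top_n = max(1, math.ceil(len(sentences) * percent / 100))
    top_n = min(top_n, len(sentences))
    # Rank sentence indices by score, best first; ties keep document order.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in ranked[:top_n]]


sentences = ["A", "B", "C", "D"]
scores = [0.9, 0.1, 0.4, 0.7]
print(select_summary(sentences, scores))           # default: 50% of 4 sentences
print(select_summary(sentences, scores, top_n=3))  # fixed count of 3
```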

You can also enable the -v/--verbose option to see how the script scored each sentence in the summary. This option outputs results as an Annotated Data Model (ADM), which models a document in JSON format. The script augments the ADM with a summary attribute that includes a rank score for each sentence, indicating its relative “contentful-ness”:

$ ./summarize.py -k $ROSETTE_USER_KEY -u -i "http://www.csmonitor.com/Science/2016/1209/How-dust-changed-scientists-view-of-Saturn-s-C-ring" -n 10 -v | jq .attributes.summary
{
 "ranked": [
 {
 "startOffset": 0,
 "endOffset": 45,
 "text": "The secret to understanding Saturn's C ring? ",
 "score": 29.100689277811085,
 "tokenLength": 9
 },
 ...,
 {
 "startOffset": 3199,
 "endOffset": 3205,
 "text": "Daily",
 "score": 0,
 "tokenLength": 0
 }
 ],
 "summary": "The secret to understanding Saturn's C ring? \nSaturn's icy moon Mimas is dwarfed by the planet's enormous rings.\nScientists at Cornell University in Ithaca, N.Y., have been using data from NASA's Cassini mission to Saturn, particularly its microwave passive radiometer, to study the planet's rings. \nThe rings are mostly composed of ice, but \"it is the small fraction of non-icy material – the dust the ring collects – that is valuable for clues about the ring's origin and age,\" doctoral candidate Zhimeng Zhang, who led the work, told the Cornell Chronicle.\nDust drifts through space from beyond the Kuiper Belt and hits Saturn's rings. \nThe older a ring is, therefore, the more dust it will have time to collect. \nAnd scientists can analyze the dust to figure out how old the ring is.\nIt collides with Saturn's rings, and sticks to them. \nZhang and her fellow researchers believe that the C ring has been \"continuously polluted\" by these space dust particles.\nWhen instruments like Cassini's microwave passive radiometer measure a ring's thermal emissions, dustier rings will have higher readings. ",
 "info": "maintained 10 sentences (27% of original sentences)"
}
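
Because the verbose output is JSON, it is easy to post-process in Python. The sketch below relies only on field names visible in the sample above (`ranked`, `text`, `score`, `tokenLength`, `info`); the trimmed `sample` string is a stand-in for real output, and `top_sentences` is a hypothetical helper.

```python
import json


def top_sentences(summary, limit=3):
    """Return (score, text) pairs for the highest-scoring ranked sentences."""
    ranked = sorted(summary["ranked"], key=lambda s: s["score"], reverse=True)
    return [(s["score"], s["text"].strip()) for s in ranked[:limit]]


# Trimmed stand-in for the JSON printed by the verbose command shown above.
sample = """
{
  "ranked": [
    {"startOffset": 3199, "endOffset": 3205, "text": "Daily",
     "score": 0, "tokenLength": 0},
    {"startOffset": 0, "endOffset": 45,
     "text": "The secret to understanding Saturn's C ring? ",
     "score": 29.100689277811085, "tokenLength": 9}
  ],
  "info": "maintained 10 sentences (27% of original sentences)"
}
"""

summary = json.loads(sample)
for score, text in top_sentences(summary):
    print(f"{score:8.3f}  {text}")
print(summary["info"])
```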

Try it yourself

With this summarization script in hand, you’re fully equipped to address a range of common data analysis needs. Automatically craft short summaries to speed up intelligence analysts’ workflows, provide a snapshot of product information to e-commerce customers, or quickly extract key information from dense research documents.

Ready to get started? First, sign up for a free API key (no credit card required), good for up to 10,000 calls per month. Next, visit our Community GitHub for step-by-step instructions on installing and running the script. Summarization is supported in 19 languages: Arabic, Chinese (simplified and traditional), Dutch, English, French, German, Hebrew, Indonesian, Italian, Japanese, Korean, Malay, Pashto, Persian, Portuguese, Russian, Spanish, and Urdu.

Thought of another way to combine Rosette API endpoints for a new use case? Let us know and we’ll feature you on our blog!

Header photo credit: Color-enhanced image of Saturn’s atmosphere and rings from Voyager 1, NASA