A smarter approach to linguistic comparison and word clouds

16 Nov 2017
Recipes

New community recipe enables vocabulary comparison and word cloud generation

Every individual has a unique way of speaking and writing based upon their experiences, personal style, and culture. For the data scientist, analyzing, comparing, and visualizing the vocabularies of different texts can reveal valuable insights for applications such as data cleansing and authorship identification or verification.

To help our users quickly and easily dip their toes into the linguistic comparison pool, the Rosette API community team created a recipe for vocabulary comparison as well as a script for word cloud generation enhanced by linguistic analysis. All you need is a Rosette API key!

Linguistic comparison, not character comparison

Word cloud generation is hardly rare. Many open source data science tools offer vocabulary visualization, but they only count word frequency and stop short of any linguistic analysis. Our new recipe uses Rosette API’s morphological analysis to enhance your word cloud with part-of-speech tags, shown through easy-to-parse color coding.

While still interesting for those new to text analytics, visualization, and exploratory NLP, tools that look only at characters, not at words and their meanings, have limited value. For example, variations of the same word (play, playing, and played) are all treated as distinct words and appear separately in the word cloud.

Our new recipe groups different forms of each word through lemmatization, the process of finding the dictionary form of each word. Lemmatizing means that different forms of the same verb, such as “played” and “plays,” are mapped back to the dictionary form, “play,” for cleaner, more accurate results.
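To make the difference concrete, here is a minimal sketch that counts the same tokens twice, once by surface form and once by lemma. The `TOY_LEMMAS` table and both function names are hypothetical stand-ins for illustration only; the recipe itself gets lemmas from Rosette API's morphological analysis, which is not called here.

```python
from collections import Counter

# Hypothetical hand-written lemma table, for illustration only.
# A real run would take lemmas from a morphological analyzer.
TOY_LEMMAS = {"plays": "play", "played": "play", "playing": "play"}


def count_surface_forms(tokens):
    """Count each lowercased token exactly as it appears in the text."""
    return Counter(token.lower() for token in tokens)


def count_lemmas(tokens, lemma_table=TOY_LEMMAS):
    """Count tokens after mapping each one to its dictionary form.

    Tokens missing from the table are counted under their lowercased form.
    """
    return Counter(lemma_table.get(token.lower(), token.lower()) for token in tokens)


tokens = "We played well and he plays hard so keep playing".split()
surface = count_surface_forms(tokens)
lemmas = count_lemmas(tokens)

print(surface.most_common(3))
print(lemmas.most_common(3))
```

Counting by surface form spreads "played," "plays," and "playing" across three separate entries, while counting by lemma merges them into a single "play" entry, which is what lets the most important words stand out in the cloud.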

Putting vocabulary comparison to work

Vocabulary comparison can have many uses beyond word cloud generation. Documents are often syndicated on multiple platforms. For example, a corporate press release may be published on hundreds of websites. Vocabulary comparison can be used to eliminate identical or near-identical documents in an OSINT database.
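One simple way to flag near-identical documents is to compare their vocabularies as sets. The sketch below uses Jaccard similarity, a swapped-in technique chosen for illustration and not necessarily what the recipe computes. The function names, the regular-expression tokenizer, and the 0.8 threshold are all assumptions.

```python
import re


def vocabulary(text):
    """Return the set of lowercased words in a text (naive regex tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(text_a, text_b):
    """Jaccard similarity of two vocabularies: shared words / all words."""
    vocab_a, vocab_b = vocabulary(text_a), vocabulary(text_b)
    if not vocab_a and not vocab_b:
        return 1.0  # two empty documents are treated as identical
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Flag a pair whose vocabulary overlap meets an assumed threshold."""
    return jaccard(text_a, text_b) >= threshold


release = "Acme Corp today announced record quarterly revenue and a new product line"
syndicated = release + ". Read more"
unrelated = "The home team won the game on a late field goal"

print(is_near_duplicate(release, syndicated))
print(is_near_duplicate(release, unrelated))
```

Comparing lemmas rather than raw words would make this check more robust to small edits, such as a syndicated copy that changes a verb tense.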

Vocabulary comparison across multiple pieces of text is also one of a suite of tools used for authorship identification and verification. For example, last year Stack Overflow data scientist David Robinson used a number of linguistic analytics tools to compare tweets from Donald Trump’s Twitter account, determining which were written by the then-presidential candidate and which by his staff.

Vocabulary comparison also allows analysts to better understand the writing or speaking style of different authors – comparing the most common words and types of words used. For example, the following word clouds break down the postgame interviews of Patriots and Falcons head coaches Bill Belichick and Dan Quinn following their October 22nd game:

Football fans may be surprised to see that infamously curmudgeonly Coach Belichick uses the adjective “good” more than any other word, but keep in mind that the Patriots trounced their Super Bowl rival 23-7, plenty of reason for a happy Belichick. We can also see that both Coaches Quinn and Belichick use the verb form of “play” more frequently than they use the noun form of “play.”

Try it yourself

First, sign up for a free Rosette API key (10,000 free calls/month), then head over to our Rosette API Community GitHub for step-by-step instructions on inputting your data and producing vocabulary comparisons.

To see how to create a color-coded word cloud, check out the Jupyter notebook. You can also run the notebook locally with:

(compare-vocabulary) $ jupyter notebook visualize.ipynb

Sample corpora of poems by several famous poets are provided as examples. To analyze your own data, add to or replace those subdirectories with directories of your own plain-text files. The size and order of the words indicate how frequently they appear in the data, while color corresponds to part of speech.
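The mapping from counts to size and from part of speech to color can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's actual code: the `POS_COLORS` palette, the `word_cloud_entries` function, the font-size range, and the hand-tagged input are all hypothetical, and a real run would take lemma and part-of-speech pairs from the morphological analysis.

```python
from collections import Counter

# Hypothetical palette; the notebook's real color scheme may differ.
POS_COLORS = {"NOUN": "blue", "VERB": "red", "ADJ": "green"}


def word_cloud_entries(tagged_lemmas, min_size=10, max_size=48):
    """Turn (lemma, part-of-speech) pairs into drawable word cloud entries.

    Font size scales linearly with frequency, entries are ordered from most
    to least frequent, and color is looked up by part of speech.
    """
    counts = Counter(tagged_lemmas)
    if not counts:
        return []
    top_count = max(counts.values())
    entries = []
    for (lemma, pos), count in counts.most_common():
        size = min_size + (max_size - min_size) * count / top_count
        entries.append({
            "word": lemma,
            "pos": pos,
            "count": count,
            "size": round(size),
            "color": POS_COLORS.get(pos, "gray"),  # unknown tags fall back to gray
        })
    return entries


# Hand-tagged toy input standing in for analyzer output.
tagged = [
    ("good", "ADJ"), ("play", "VERB"), ("good", "ADJ"),
    ("play", "NOUN"), ("play", "VERB"), ("good", "ADJ"),
]
entries = word_cloud_entries(tagged)
for entry in entries:
    print(entry)
```

Keying the counts on the (lemma, part of speech) pair rather than on the word alone is what keeps the verb "play" and the noun "play" as separate, differently colored entries.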

Have an idea for an API recipe? Let us know! We’re always looking for ways to help our users quickly and easily derive value from their text data.