11 Feb 2019
Case Study

Media Monitoring at Global Scale

Sensika’s 360 intelligence monitors 10x the media with 1/10th the human labor


Sensika is a “media seismometer” for its clients, aiming to detect the subtle “tremors” that most media monitoring tools might miss. Whether it’s knowing a company is in trouble two weeks before it files for bankruptcy [1], or surfacing pricing complaints two hours after a product launch [2], clients rely on Sensika to find relevant media mentions in the markets that matter to their business.

Based on past experiences, Sensika’s founders knew that taking a manual approach to media monitoring was a dead end. No number of humans could review the volume of content that clients require to be processed in an accurate, efficient, and timely manner.

“Big companies navigate like submarines. Using sonar, they send out a signal and listen for the feedback,” Christoff said. “Correct data that comes late is useless for them. They need correct, useful, and timely data.”

By taking an AI approach, Sensika can review more media, with higher accuracy, in a time window that is useful to their clients. However, the proposition of developing a full AI stack from scratch creates a high barrier to entry. So they turned to a vendor to help deliver the quality of results that they need, in the languages that matter to their clients. Rosette, with its complete pipeline of text analytics in 30+ languages, offered an ideal foundation for Sensika to build their advanced media monitoring algorithms.

Just What Marketing Needs: Sensika

Prior to founding Sensika, the founders were involved in a project working with a team of people providing social media and 360 intelligence to two large clients, a petrochemical conglomerate and a pharmaceutical giant. The clients needed any negative signals classified and reported in a timely fashion. As the volume of data exploded with the rise of the Internet, the team—hundreds of people strong—could just barely cover a few thousand source websites.

“To aggregate, normalize, and unify this data, we were always late because of the pre-processing and then because people had to analyze the information,” Christoff said. “We had to be picky about the volume of content we let into processing, so we missed a lot and became more and more irrelevant.”

The financial means were available to try every social listening and media intelligence monitoring platform on the market, and Christoff did.


Drawing on that experience, Christoff and his co-founders started Sensika in 2012. Today the company monitors 900,000+ websites, social media, print, radio, and TV in near real-time with a team of only 50. Harvesting data from these sources daily, Sensika provides a wide range of metrics and alerts, including product intelligence, topic analysis, digital channel performance management, early crisis alerting, campaign ROI, R&D intelligence, and Voice of the Customer analysis. Marketing and PR departments and agencies rely on the “360 intelligence” view that Sensika delivers to make decisions ranging from which products to offer vis-à-vis competitors to critical pricing adjustments just days after a product launch.

The Challenge

What Christoff learned when he tried all the commercial solutions was that data and its pre-processing were key to getting the actionable signals that clients wanted. How is my product perceived? Is it affordable or expensive? Which features did we get exactly right? Is there a PR crisis brewing?

Unlike other media monitoring providers, Sensika does their own data harvesting and specialized pre-processing of the source data. Part of this critical pre-processing is metadata extraction, including:

  • Location and timestamp of the news
  • The social media user who posted
  • Mentions of people, places, and organizations in the text
  • Topics mentioned (e.g., Davos conference, G7 Summit, the launch of the newest iPhone, etc.)
  • Key phrases
  • Concepts
  • 60+ more types of metadata

To automate the data collection, analysis, and reporting, Sensika sought:

  • Reliable entity extraction (i.e., finding mentions of people, products, places, and organizations)
  • Foundational text analytics (i.e., the ability to tokenize text into words and normalize characters)
  • Broad language support, particularly for processing complex Arabic script languages

The Solution

In the Sensika pipeline, the entities extracted are used to filter search results and drill down to find insights. For example, a search on a new iPhone model will be displayed with filters dynamically generated based on the brands and companies appearing in the results. Thus still-unknown competitors and comparisons against the iPhone are revealed. The data is then classified and tagged with entity-level sentiment analysis. Sensika’s proprietary knowledge graph uncovers relationships between entities and is constantly updated in near real-time.
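
The dynamically generated filters described above can be pictured as a count of the entities that appear across a set of search results. The following is a minimal sketch, not Sensika's implementation: the result schema, the entity type labels, the `build_filters` name, and the sample data are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def build_filters(results, entity_types=("ORGANIZATION", "PRODUCT"), top_n=5):
    """Return the most frequent entities in the results as (entity, doc_count) pairs."""
    counts = Counter()
    for doc in results:
        # Count each entity once per document, so one long article
        # that repeats a brand name does not dominate the filter list.
        mentioned = {
            (ent["type"], ent["text"])
            for ent in doc["entities"]
            if ent["type"] in entity_types
        }
        counts.update(mentioned)
    return counts.most_common(top_n)

# Hypothetical search results, each carrying already-extracted entities.
sample_results = [
    {"entities": [{"type": "PRODUCT", "text": "Phone X"},
                  {"type": "ORGANIZATION", "text": "Acme"}]},
    {"entities": [{"type": "PRODUCT", "text": "Phone X"},
                  {"type": "PRODUCT", "text": "Phone X"},
                  {"type": "PERSON", "text": "J. Doe"}]},
    {"entities": [{"type": "PRODUCT", "text": "Phone X"},
                  {"type": "ORGANIZATION", "text": "Acme"},
                  {"type": "ORGANIZATION", "text": "Globex"}]},
]

filters = build_filters(sample_results)
```

Because the filters are derived from the results rather than from a fixed list, a brand nobody thought to search for still surfaces as soon as it is mentioned often enough.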

In this way, Sensika is able to uncover sometimes startling revelations. Christoff relates a period when the stock market news was relentlessly negative. But out of thousands of reports from stock market exchange news, Sensika detected one petrochemical firm that was getting positive press—but it was buried as the fourth or fifth section of a multi-topic article hidden in the back pages of search results. Sensika was able to report on this finding in 4-6 hours to the client, who then had human analysts confirm the report. “Stock exchange people want correct, precise and truthful information fast—always,” Christoff said. “This [example] creates huge credibility for our technology.”

When Sensika started looking for entity extraction, they considered open source NLP packages, but while they were good for English, support for other languages wasn’t enough for their needs.

“Our clients are global and especially interested in news that’s related to their business or activity. We have commercial and government customers in Europe and the Middle East, so data harvesting shifts our language analytics requirements,” Christoff said. “The multilingual coverage that Basis Technology provides is perfect for a company like ours who serves clients with global needs.”

“Our clients are global…so data harvesting shifts our language analytics requirements. The multilingual coverage that Basis Technology provides is perfect for a company like ours who serves clients with global needs.”
— Konstantin Christoff, Sensika CEO & Co-Founder

Rosette provided the accuracy and breadth of language coverage—particularly for Arabic, Urdu, Persian, and Turkish—that Sensika needed.

“We looked at the cost of developing it ourselves, and clearly we lost and bought it from you [Basis Technology],” Christoff laughed. “We prefer to benefit from great algorithm providers like you guys, but stop at some point and decide what is really strategic to develop ourselves.”

Rosette’s foundational linguistic analysis allows for much more complex and precise insight extraction.

“These capabilities contribute substantially to distinguish us from the more lightweight providers who rely on feeds from the same source presented only with different ‘pretty UIs,’” Christoff said. “We are a tech provider rather than a UI provider.”

End Notes
1. https://sensika.com/use-case-6-investment-early-crisis-detection/
2. https://sensika.com/use-case-2-product-intelligence-pricing/

Spotlight: Why is Pre-Processing Important for Arabic Search?

In searching English, some basic processing includes tokenizing text into words, usually based on white space and punctuation, with special handling for words like “C++” and “AT&T.” A step further is “stemming” whereby related words are linked by a common stem so that “runs” and “running” stem to “run.”
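
As a rough illustration of these two steps, the sketch below tokenizes English text with a regular expression that keeps “C++” and “AT&T” intact, then applies a deliberately naive suffix-stripping stemmer. The names `tokenize` and `naive_stem` are made up for this example, and the rules cover only the cases mentioned above; a production stemmer handles far more.

```python
import re

# A token is a run of letters/digits, optionally joined by "&" (AT&T)
# and optionally followed by "+" signs (C++).
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:&[A-Za-z0-9]+)*\+*")

def tokenize(text):
    """Split text into tokens, dropping white space and other punctuation."""
    return TOKEN_RE.findall(text)

def naive_stem(word):
    """Strip a few English suffixes. Illustrative only, not a real stemmer."""
    w = word.lower()
    if w.endswith("ing") and len(w) > 5:
        w = w[:-3]
        # Undo consonant doubling: "runn" -> "run".
        if len(w) >= 2 and w[-1] == w[-2]:
            w = w[:-1]
    elif w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        w = w[:-1]
    return w

tokens = tokenize("AT&T runs C++ code.")
stems = [naive_stem(t) for t in ("runs", "running", "run")]
```

Here `tokens` keeps “AT&T” and “C++” as single tokens, and all three forms of “run” reduce to the same stem, so a search for one finds the others.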

In French, the é can be represented by a single precomposed character (U+00E9) or by the “e” (U+0065) followed by the combining acute accent (U+0301). Keep in mind that search engines compare the code point values of characters (not the character you see on screen), so U+00E9 ≠ U+0065 U+0301 unless the engine has been told they are equivalent. Character normalization converts every representation of a character to one standard representation so that all occurrences of that character (e.g., é) will match.
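
The mismatch is easy to reproduce. The sketch below uses Python's standard `unicodedata` module and Unicode Normalization Form C (NFC) as the standard representation. The choice of NFC is an assumption made for this example; the text above does not say which form any particular search engine uses.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" (U+0065) + combining acute accent (U+0301)

def normalize(text):
    """Convert text to NFC so equivalent sequences compare equal."""
    return unicodedata.normalize("NFC", text)

# The two strings look identical on screen but differ code point by code point.
raw_match = precomposed == decomposed

# After normalization they are the same sequence of code points.
normalized_match = normalize(precomposed) == normalize(decomposed)
```

Applying the same normalization to both the indexed text and the query is what lets the two spellings of “café” find each other.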

That’s all fine and well for English and French, but Arabic is literally a whole other world.

Arabic characters may take a different form depending on whether they appear at the beginning, middle, or end of a word, and the various ways that characters combine further confuse the issue.

Rosette’s Arabic character normalization formula encompasses 14 categories of character variations that don’t affect meaning but impede search. One example is the character yeh with hamza above: Rosette normalizes three variant sequences to a fourth, canonical form.

Yeh with hamza above: The following combinations are converted to ئ (U+0626).

  • ی (U+06CC) combined with hamza above (U+0654)
  • ى (U+0649) combined with hamza above (U+0654)
  • ي (U+064A) combined with hamza above (U+0654)
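
The yeh with hamza above rule can be sketched as a simple string replacement. This covers only the one category listed above, it is not Rosette's implementation, and the function and constant names are invented for the example.

```python
HAMZA_ABOVE = "\u0654"      # combining hamza above
YEH_WITH_HAMZA = "\u0626"   # canonical target: yeh with hamza above

# The three base letters listed above: Farsi yeh, alef maksura, Arabic yeh.
YEH_VARIANTS = ("\u06cc", "\u0649", "\u064a")

def normalize_yeh_hamza(text):
    """Replace each yeh variant followed by hamza above with U+0626."""
    for base in YEH_VARIANTS:
        text = text.replace(base + HAMZA_ABOVE, YEH_WITH_HAMZA)
    return text

# All three two-code-point sequences collapse to the same single code point.
normalized = [normalize_yeh_hamza(base + HAMZA_ABOVE) for base in YEH_VARIANTS]
```

A base letter that is not followed by hamza above is left untouched, so the rule changes only the sequences it names.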

Those so motivated can go further and process the Arabic roots, lemmas, and stems returned by Rosette’s Arabic analyzer.

The Arabic language adds prefixes, suffixes, and infixes (which appear in the middle of a word) around a root, which may be just three characters. Distantly related words share the same root. Words that are more semantically similar are likely to share a lemma or stem.

For example, the words for “book” (kitaab) and “books” (kutub) share the same root (k-t-b).

On the other hand, كُتُب (kutub, books) is an irregular form and does not have the same stem as كِتَاب (kitaab, book). But both forms do share the same lemma, which is the singular form كِتَاب (kitaab).

No amount of chopping off prefixes and suffixes will let you search for “books” and “book” at the same time in Arabic without intelligent and sophisticated pre-processing.
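
To make the point concrete, here is a toy sketch of lemma-based matching using the transliterations from the example above. The two-entry lemma table is hand-written for illustration and stands in for what a real Arabic analyzer would return; the extra words in the sample documents, and the names `lemma_of` and `search`, are likewise assumptions of this example.

```python
# Hypothetical lemma table: surface form -> lemma (transliterated).
LEMMAS = {
    "kitaab": "kitaab",  # book (singular, also the lemma)
    "kutub": "kitaab",   # books (irregular plural, same lemma)
}

def lemma_of(token):
    """Look up a token's lemma, falling back to the token itself."""
    return LEMMAS.get(token, token)

def search(query, docs):
    """Return the documents containing any word that shares the query's lemma."""
    target = lemma_of(query)
    return [d for d in docs if any(lemma_of(t) == target for t in d.split())]

docs = ["kitaab jadiid", "kutub kathiira", "qalam"]

# Comparing surface strings finds only the exact form that was typed.
exact_hits = [d for d in docs if "kutub" in d.split()]

# Comparing lemmas finds both the singular and the irregular plural.
lemma_hits = search("kutub", docs)
```

The plural differs from the singular inside the word rather than at its edges, which is why the lookup has to go through the lemma and cannot rely on trimming affixes.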