Rosette Base Linguistics is now Cloudera Certified
April 22, 2014
What is Cloudera?
To explain what Cloudera is, we have to briefly mention Hadoop.
Hadoop is open source software modeled on papers that Google engineers published about their MapReduce infrastructure. It is designed to process very large datasets on inexpensive, distributed hardware. Facebook and many other companies use it, and Google and IBM are funding efforts to teach it to students.
The guys from Cloudera saw an opportunity and released their own version of Hadoop. The software remains free, but Cloudera adds support and consulting services for the software.
Cloudera enables its customer base, which includes top biotech, oil and gas, retail and insurance companies, to make more out of their information for less.
How does Basis Technology fit in?
80% of Big Data is what is often referred to as “Big Text,” and text analytics expertise is where we come in.
Cloudera Search is based on Apache Solr – the enterprise standard for open source search. Cloudera Search brings scale and reliability for a new generation of search – “big text” search. It extends the value of Apache Solr and gains the same fault tolerance, scale, visibility, and flexibility provided to other workloads, like Apache Hive and Cloudera Impala.
Rosette Base Linguistics (RBL) integrates seamlessly with Cloudera Search to provide a complete set of linguistic services in over 40 languages. RBL enriches the original text in its native language for dramatically improved search quality and relevance.
“Without the right solution in place, the challenge of tackling big data search can be very daunting, especially with the vast amounts of data that spans borders and languages,” said Tim Stevens, Vice President, Business Development, Cloudera.
“Basis Technology and its proven Rosette platform are a great complement to our search framework and uniquely position us to ensure our customers continue to receive the best data querying services, regardless of its native language, or whether it is structured or unstructured.”
“We pride ourselves on extracting meaningful intelligence from unstructured multilingual text by developing the industry’s best linguistics software. Partnering with Cloudera aligns us with a proven leader in big data, and keeps us at the forefront of innovation,” said Carl Hoffman, CEO of Basis Technology.
“The need to quickly and accurately analyze unstructured text is paramount for corporations and governments to remain competitive and relevant. We look forward to working with Cloudera to address these constantly evolving challenges.”
What does it mean for Cloudera users?
The partnership and certification bring our respective customers best-in-class, full-text, interactive search and scalable indexing for Apache Hadoop™ on the leading open source platform for Big Data, Cloudera Distributed Hadoop (CDH).
- Advanced Morphological Features
- Simple API
- High-scale and Throughput
- Industrial-strength Support
- Easy Installation
- Flexible and Customizable
- Platform: Unix, Linux, Mac, Windows
- Component of the Rosette SDK
- Customizable user dictionaries, Japanese orthographic normalization, and Chinese scripts
Tokenization
Many search tools index bigrams (overlapping two-character sequences) to handle languages written without spaces between words, such as Chinese and Japanese. This results in a larger index and reduced relevancy. RBL, in contrast, accurately identifies and separates each word through advanced statistical modeling. The resulting token output (also known as segmentation) minimizes index size, enhances search accuracy, and increases relevancy.
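To make the contrast concrete, here is a minimal Python sketch that compares character bigrams with word-level tokens for a short Japanese string. It is not RBL code and does not use the Rosette API: the greedy longest-match segmenter and the three-word `TOY_VOCABULARY` are stand-ins assumed purely for illustration, whereas RBL is described above as using statistical modeling.

```python
def char_bigrams(text):
    """Overlapping two-character tokens, as a bigram indexer would emit."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, vocabulary):
    """Greedy longest-match segmentation against a toy vocabulary.

    A simple stand-in for a real segmenter; unknown characters fall
    through as single-character tokens.
    """
    max_len = max(len(word) for word in vocabulary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical vocabulary: Tokyo, university, library.
TOY_VOCABULARY = {"東京", "大学", "図書館"}
text = "東京大学図書館"

bigrams = char_bigrams(text)
words = max_match(text, TOY_VOCABULARY)

print(bigrams)  # ['東京', '京大', '大学', '学図', '図書', '書館']
print(words)    # ['東京', '大学', '図書館']
```

The bigram index holds six terms where the segmented index holds three. It also contains 京大, a common abbreviation for Kyoto University, so a query for that term would match a text that is actually about a library at the University of Tokyo.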
Lemmatization
Most search engines use a crude method of chopping characters off the end of a word in the hope of finding the root form. This method, called stemming, often matches too much, raising recall at the expense of precision. Instead, RBL finds the true dictionary form of each word, known as a lemma, by using vocabulary, context, and advanced morphological analysis. Indexing the lemma increases search relevancy and slims the search index, because the inflected forms are not all indexed separately. Alternative lemmas are also made available to supplement indexing.
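The difference between the two approaches can be sketched in a few lines of Python. Neither function is RBL's implementation: `crude_stem` is a deliberately naive suffix stripper, and `toy_lemma` is a table lookup over a hypothetical four-entry dictionary that ignores context, which a real lemmatizer would take into account.

```python
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


# Hypothetical lemma dictionary, for illustration only.
TOY_LEMMAS = {"running": "run", "ran": "run", "mice": "mouse", "news": "news"}


def toy_lemma(word):
    """Look up the dictionary form; fall back to the word itself."""
    return TOY_LEMMAS.get(word, word)


for word in ("running", "ran", "mice", "news"):
    print(word, crude_stem(word), toy_lemma(word))
# running runn run
# ran ran run
# mice mice mouse
# news new news
```

The stemmer fails in both directions here: it never connects "ran" with "running", and it conflates "news" with "new". The dictionary lookup maps both verb forms to "run" and leaves "news" alone.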
Noun Phrase Extraction
Certain nouns, especially proper names, can be very tricky to identify as a single entity. RBL groups the nouns and their modifiers, which is useful in document clustering and concept extraction.
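As a rough illustration of grouping nouns with their modifiers, the Python sketch below chunks a sentence that has already been tagged with parts of speech. The tag names (`DET`, `ADJ`, `NOUN`, `PROPN`, `VERB`), the hand-tagged input, and the chunking rule are all assumptions made for this example; they do not reflect RBL's tag set or its method.

```python
NOUN_TAGS = ("NOUN", "PROPN")
PHRASE_TAGS = ("ADJ",) + NOUN_TAGS


def noun_phrases(tagged_tokens):
    """Group consecutive adjectives and nouns into candidate noun phrases.

    A run is kept only if it contains at least one noun.
    """
    phrases, current, has_noun = [], [], False
    for token, tag in tagged_tokens:
        if tag in PHRASE_TAGS:
            current.append(token)
            has_noun = has_noun or tag in NOUN_TAGS
        else:
            if current and has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
    if current and has_noun:
        phrases.append(" ".join(current))
    return phrases


# Hand-tagged input, for illustration only.
sentence = [
    ("The", "DET"), ("New", "PROPN"), ("York", "PROPN"),
    ("Stock", "PROPN"), ("Exchange", "PROPN"), ("opened", "VERB"),
    ("a", "DET"), ("new", "ADJ"), ("office", "NOUN"),
]

print(noun_phrases(sentence))  # ['New York Stock Exchange', 'new office']
```

Treating "New York Stock Exchange" as one unit rather than four separate tokens is what makes the output useful for clustering and concept extraction.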
Parts of Speech Tagging
As part of the lemmatization process, statistical modeling is used to determine the correct part of speech, even with ambiguous words. Each token is then tagged for enhanced comprehension and search relevancy.
Decompounding
RBL breaks compound words down into their sub-components and delivers each individual element to be indexed. This is especially useful for increasing search relevancy in languages such as German and Korean.
For example, the German word Samstagmorgen is a compound formed from Samstag (Saturday) and Morgen (morning). Decompounding allows for an appropriate match when searching for “Samstag”.
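The Samstagmorgen example can be reproduced with a small dictionary-based splitter in Python. This is a sketch under simplifying assumptions, not RBL's algorithm: the two-word `TOY_VOCABULARY` is hypothetical, and the splitter ignores the linking elements that many real German compounds contain.

```python
def decompound(word, vocabulary):
    """Split a word into known parts, preferring the longest first part.

    Returns an empty list when no full split into known words exists.
    """
    word = word.lower()
    if word in vocabulary:
        return [word]
    for i in range(len(word) - 1, 0, -1):
        head, tail = word[:i], word[i:]
        if head in vocabulary:
            rest = decompound(tail, vocabulary)
            if rest:
                return [head] + rest
    return []


def index_terms(word, vocabulary):
    """Index the surface form plus any compound parts found."""
    terms = [word.lower()]
    terms.extend(p for p in decompound(word, vocabulary) if p not in terms)
    return terms


# Hypothetical vocabulary, for illustration only.
TOY_VOCABULARY = {"samstag", "morgen"}

terms = index_terms("Samstagmorgen", TOY_VOCABULARY)
print(terms)               # ['samstagmorgen', 'samstag', 'morgen']
print("samstag" in terms)  # True
```

Because the parts are indexed alongside the full compound, a query for "Samstag" matches the document without giving up exact matches on "Samstagmorgen".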
Sentence Boundary Detection
The start and end of each sentence are automatically identified, even when punctuation use is ambiguous.
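The Python sketch below shows why punctuation alone is not a reliable signal: a period can end a sentence or merely end an abbreviation. The rule and the short `ABBREVIATIONS` list are assumptions chosen for illustration; the article does not describe how RBL detects boundaries, and a hand-written rule like this breaks down quickly on real text.

```python
import re

# Hypothetical abbreviation list, for illustration only.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "e.g.", "i.e."}

# Terminal punctuation followed by whitespace and a capital, or by end of text.
BOUNDARY = re.compile(r"[.!?]+(?=\s+[A-Z]|\s*$)")


def split_sentences(text):
    """Split on terminal punctuation unless it closes a known abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "Dr. Smith visited Acme Inc. Monday. He stayed two days."
print(split_sentences(text))
# ['Dr. Smith visited Acme Inc. Monday.', 'He stayed two days.']
```

A splitter that broke on every period would produce four fragments from this text instead of two sentences.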
Contact us about integrating Rosette Base Linguistics into your Cloudera application: 617 386 2090