How Emoji Reflects Our Evolving Society

Or, what does it mean to lemmatize and normalize emoji for text analytics?
Emoticons and emoji add a bit of the nonverbal communication that humans inherently crave in our electronic communications. The addition of a winking face 😉 softens a potentially harsh statement or expresses shared camaraderie far more succinctly and immediately than words. So it's no surprise that these modern hieroglyphs pop up a lot in social media and text messages [1].
Emoji: a History
Just as language evolves, emoji have, too. Emoji first appeared in 1999 on Japanese mobile phones, and in 2007 emoji were approved for addition to the Unicode Standard. The goal was to sort out the mess of incompatible emoji sets among the various Japanese mobile phone carriers, each of which was trying to use the Unicode private use area [2].
In the beginning, Japanese carriers used a light-colored skin tone for all the emoji depicting people or body parts. However, responding to users' desire to reflect human diversity in their communication, Unicode introduced skin tone modifiers for human emoji in Unicode 8.0 (released in mid-2015).
The set of standard emoji has also expanded to reflect diversity in other ways. The emoji "kiss" 💏 (U+1F48F) usually depicts a man and a woman, and was approved as part of Unicode 6.0 in 2010. From there it wasn't a far leap to users wanting to depict "a man and a man kissing" or "a woman and a woman kissing," and Unicode lets one do that with a zero-width joiner (ZWJ). The ZWJ creates a glyph that looks and is treated as a single character, but is actually multiple characters (👨 Man, ZWJ, ❤ Heavy Black Heart, ZWJ, 💋 Kiss Mark, ZWJ, 👨 Man).
Similarly, the single character "kiss" 💏 (U+1F48F) could be represented as (👩 Woman, ZWJ, ❤ Heavy Black Heart, ZWJ, 💋 Kiss Mark, ZWJ, 👨 Man). This flexibility created situations where a character can be represented in more than one way.
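To make the "one glyph, several characters" point concrete, here is a minimal Python sketch that builds the man-and-man kiss sequence exactly as listed above and prints the code points inside it. The helper name describe is made up for this illustration. Note also that sequences found in real text usually carry a variation selector (U+FE0F) after the heart, which the simplified listing above leaves out.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Man, ZWJ, Heavy Black Heart, ZWJ, Kiss Mark, ZWJ, Man, as listed in the text.
# (Illustrative only: real-world sequences usually add U+FE0F after the heart.)
kiss_man_man = ZWJ.join(["\U0001F468", "\u2764", "\U0001F48B", "\U0001F468"])


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


# A renderer that supports the sequence draws one glyph,
# but the string itself holds seven code points.
for line in describe(kiss_man_man):
    print(line)
print(len(kiss_man_man))
```

Whether the sequence shows up as one glyph or as several separate emoji depends on the font and platform; the underlying string is the same either way.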
Emoji & Text Analytics
For those in the text analytics world, the addition of skin tones as emoji modifiers, and the ability to depict a character in more than one way, mean that we need a way to canonicalize (or normalize) emoji for efficient processing. Suppose that a feedback analysis application is using these tokens downstream. Does the boy emoji 👦 combined with one of five different skin tones 👦🏻👦🏼👦🏽👦🏾👦🏿 really change the fact that it represents a boy? In most cases no, but of course if those modifiers are important to the meaning, the surface form can be used as is.
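As a rough illustration of what removing skin tone modifiers could look like, the Python sketch below strips the five modifier code points (U+1F3FB through U+1F3FF) so that every variant of the boy emoji collapses to the same base token. The function name strip_skin_tone is an assumption for this example and says nothing about how any particular product implements it.

```python
import re

# The five emoji skin tone modifiers occupy U+1F3FB..U+1F3FF.
SKIN_TONE = re.compile("[\U0001F3FB-\U0001F3FF]")


def strip_skin_tone(text):
    """Drop skin tone modifiers, leaving the base emoji (hypothetical helper)."""
    return SKIN_TONE.sub("", text)


boy = "\U0001F466"  # BOY
# Build the five surface forms: boy followed by each modifier.
variants = [boy + chr(cp) for cp in range(0x1F3FB, 0x1F400)]

# Every variant reduces to the same base token.
normalized = {strip_skin_tone(v) for v in variants}
print(len(variants), len(normalized))
```

An application that cares about the modifiers can simply keep the original surface form alongside the stripped one.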
Is there a meaningful difference between "kiss" depicted as one character vs. several? That's about the same as the Japanese katakana "ga" being represented as a single character (ガ) versus two characters (カ plus the combining voiced sound mark ゙). It is probably a meaningless difference in most cases, and one that should be removed.
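The two cases can be sketched side by side in Python. Standard Unicode normalization (NFC) already folds the two-character katakana into the single character, but it leaves emoji ZWJ sequences alone, so those need a lookup table. The table EMOJI_EQUIVALENTS and the function normalize below are hypothetical names holding just the one mapping described above; stripping the variation selector U+FE0F everywhere is a simplification made for this sketch.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
VS16 = "\ufe0f"  # VARIATION SELECTOR-16, often found after the heart

KISS = "\U0001F48F"  # single-character "kiss"

# Hypothetical one-entry table: Woman, ZWJ, Heart, ZWJ, Kiss Mark, ZWJ, Man -> kiss.
EMOJI_EQUIVALENTS = {
    ZWJ.join(["\U0001F469", "\u2764", "\U0001F48B", "\U0001F468"]): KISS,
}


def normalize(text):
    """Fold equivalent spellings into one form (illustrative sketch only)."""
    # NFC composes KA + combining voiced sound mark into GA.
    text = unicodedata.normalize("NFC", text)
    # Simplification: drop every U+FE0F so the table needs one spelling per entry.
    text = text.replace(VS16, "")
    for sequence, single in EMOJI_EQUIVALENTS.items():
        text = text.replace(sequence, single)
    return text


ga_two_chars = "\u30AB\u3099"  # KA followed by the combining voiced sound mark
print(normalize(ga_two_chars) == "\u30AC")
```

A production table would need an entry for every multi-character sequence that has a single-character equivalent; this one covers only the example from the text.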
Rosette Tackles Emoji
So as part of Rosette 1.7, the tokenization and morphological analysis endpoints now support tokenizing and part-of-speech tagging for emoticons and emoji (as well as hashtags, @mentions, emails and URLs). This same functionality is also supported in our on-premise API and SDK. Furthermore, Rosette will lemmatize emoji (removing skin tone and gender modifiers, used with "people" emoji like "surfer") and normalize multi-character emoji in the text stream to single emoji characters where they exist.
We enjoy diversity in our lives, but our language analyzers, not so much.
For additional reading, check out Unicode Technical Report #51 and Emojipedia.org.
[1] See emoji popularity on Twitter tracked at http://www.emojitracker.com/.
[2] The Unicode private use area (PUA) is a series of code points in the Unicode Standard that are not officially assigned characters. Users can therefore assign whatever characters they want to these code points, but if a document uses PUA code points to which a different program has assigned other characters, you get an incompatibility clash.