My friends and I built Emoji Salad, an Emoji Pictionary game played via SMS. Our backend is built in Node.js, and core game functionality requires parsing strings that contain emoji.
The terms you want to digest are the following:
- Code point — A numerical representation of a specific Unicode character.
- Character Code — Another name for a code point.
- Decimal — A way to represent code points in base 10.
- Hexadecimal — A way to represent code points in base 16.
Let’s demonstrate with an example. Take as our specimen the letter A.
Sad sack A. Cheer up pal, we’re about to turn you into a code point!
The letter A is represented by the code point 65 (in decimal), or 41 (in hexadecimal).
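A quick sketch to check those numbers in Node.js (the variable names are just for illustration):

```javascript
// Look up the code point for "A" and print it in both bases.
const codePoint = 'A'.codePointAt(0);
const asDecimal = codePoint.toString(10);
const asHex = codePoint.toString(16);

console.log(asDecimal); // "65"
console.log(asHex);     // "41"

// Going the other way: build the character from its hexadecimal code point.
const roundTrip = String.fromCodePoint(0x41);
console.log(roundTrip); // "A"
```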
codePointAt and fromCodePoint are methods introduced in ES2015 that can handle Unicode characters whose UTF-16 encoding is wider than 16 bits, which includes emoji. Use these instead of charCodeAt and fromCharCode, which only see one 16-bit unit at a time and so don’t handle emoji correctly.
Here’s an example of using these methods, courtesy of xahlee.info:
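A minimal sketch along those lines, written for illustration rather than copied from xahlee.info, contrasting the ES2015 methods with their older counterparts:

```javascript
const grin = '😀';

// codePointAt reads the whole character, even though it spans two UTF-16 units.
const viaCodePointAt = grin.codePointAt(0);
console.log(viaCodePointAt); // 128512 (0x1F600)

// charCodeAt only sees the first 16-bit unit.
const viaCharCodeAt = grin.charCodeAt(0);
console.log(viaCharCodeAt); // 55357 (0xD83D)

// fromCodePoint rebuilds the emoji from its code point.
const rebuilt = String.fromCodePoint(128512);
console.log(rebuilt === grin); // true

// fromCharCode truncates its argument to 16 bits, so the emoji is lost.
const truncated = String.fromCharCode(128512);
console.log(truncated === grin); // false
```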
When writing a \u escape, if the hexadecimal character code is shorter than 4 digits, it must be left padded with zeros.
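Assuming the padding rule refers to \uXXXX escape sequences, here is a small sketch. The helper toEscape is hypothetical, not part of any library:

```javascript
// "A" is code point 0x41, so its escape needs two leading zeros.
const letterA = '\u0041';
console.log(letterA); // "A"

// Hypothetical helper: render a single UTF-16 unit as a \uXXXX escape,
// left padding the hex digits to a width of four.
const toEscape = (char) =>
  '\\u' + char.charCodeAt(0).toString(16).padStart(4, '0');

console.log(toEscape('A'));  // "\u0041"
console.log(toEscape('\n')); // "\u000a"
```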
Originally, the range of code points was 16 bits. That range, which encompasses the English alphabet, is now known as the Basic Multilingual Plane (BMP). In addition to that original range, there are now 16 more planes (17 total) to choose from.
The rest of the planes beyond the BMP are referred to as the “astral planes”, which include emoji. Emoji live on Plane 1, the Supplementary Multilingual Plane.
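Since each plane holds 0x10000 (65,536) code points, the plane number is simply the code point shifted right by 16 bits. A small sketch, with planeOf as an illustrative helper name:

```javascript
// Return the Unicode plane (0 to 16) of the first character in a string.
function planeOf(char) {
  return char.codePointAt(0) >> 16;
}

console.log(planeOf('A'));  // 0 (the BMP)
console.log(planeOf('😀')); // 1 (the Supplementary Multilingual Plane)
```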
And the Consortium said, let there be emoji
What do you think the following will produce?
If you said 1, you are mistaken, my friend! The correct answer is 2.
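You can see this for yourself with a short sketch. String length counts UTF-16 code units, while the string iterator (used by the spread operator) walks code points:

```javascript
const emoji = '😀';

// .length counts 16-bit code units, and this emoji needs two of them.
const unitCount = emoji.length;
console.log(unitCount); // 2

// Spreading a string iterates by code point instead.
const codePointCount = [...emoji].length;
console.log(codePointCount); // 1
```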
So for instance, 0x1F600, which is 😀, is represented by:
(The first half of the pair is called the lead surrogate, and the second half the tail surrogate.)
So, how do we get the surrogate pair? There’s a great explanation here, and here’s a gist illustrating going from emoji to decimal to surrogate pair and back again:
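As a rough sketch of the arithmetic (the function names here are illustrative): subtract 0x10000 from the code point, then split the remaining 20 bits into a high 10 bits added to 0xD800 and a low 10 bits added to 0xDC00.

```javascript
// Convert an astral code point (above 0xFFFF) to its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const lead = 0xD800 + (offset >> 10);   // top 10 bits
  const tail = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [lead, tail];
}

// Reverse the process: combine a lead and tail surrogate into a code point.
function fromSurrogatePair(lead, tail) {
  return (lead - 0xD800) * 0x400 + (tail - 0xDC00) + 0x10000;
}

const decimal = '😀'.codePointAt(0);          // 128512
const [lead, tail] = toSurrogatePair(decimal);
console.log(lead.toString(16), tail.toString(16)); // "d83d" "de00"

const backAgain = String.fromCodePoint(fromSurrogatePair(lead, tail));
console.log(backAgain); // "😀"
```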
Luckily, the internet is awash in smarter folks than I. The lodash library has produced a rock solid emoji regular expression. It is:
Woof, that’s a monster! Still, we’re enterprising programmers, we’re not afraid of a little regex, right? Let’s reverse engineer this.
From the Wikipedia emoji entry, there are a few ranges of emoji (many of which have unassigned values, presumably for future emoji):
- Dingbats (U+2700 to U+27BF, 33 out of 192 of which are emoji)
- Miscellaneous Symbols and Pictographs (U+1F300 to U+1F5FF, 637 of 768 of which are emoji)
- Supplemental Symbols and Pictographs (U+1F900 to U+1F9FF, 80 out of 82 of which are emoji)
- Emoticons (U+1F600 to U+1F64F)
- Transport and Map Symbols (U+1F680 to U+1F6FF, 92 out of 103 of which are emoji)
- Miscellaneous Symbols (U+2600 to U+26FF, 77 out of 256 of which are emoji)
To make this easier, I’m assuming anything in those ranges is emoji. Our audience uses the English alphabet over SMS, so tough luck if I trawl up any other unsuspecting characters.
They range from U+2700 to U+27BF, so the regular expression for that looks like:
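Because this whole block sits in the BMP, a plain character class is enough. A minimal sketch, assuming we treat every code point in the range as emoji (the variable name is illustrative):

```javascript
// Dingbats block: U+2700 to U+27BF, each a single UTF-16 unit.
const dingbats = /[\u2700-\u27BF]/;

const matchesScissors = dingbats.test('\u2702'); // ✂ (U+2702)
const matchesLetter = dingbats.test('A');

console.log(matchesScissors); // true
console.log(matchesLetter);   // false
```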
These range from U+1F300 to U+1F5FF, with the following surrogate pairs:
The regex for this range, from lodash’s implementation, is:
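Here is a sketch of how a pattern for this range can be checked. It assumes the regex is a generic "lead surrogate followed by tail surrogate" match; that is an assumption for illustration, and the helper names are hypothetical:

```javascript
// Assumed pattern: any lead surrogate followed by any tail surrogate.
const surrogatePair = /[\uD800-\uDBFF][\uDC00-\uDFFF]/;

// Hypothetical helper: list the UTF-16 units of a string in hex.
const hexUnits = (str) =>
  str.split('').map((unit) => unit.charCodeAt(0).toString(16));

const blockStart = String.fromCodePoint(0x1F300);
const blockEnd = String.fromCodePoint(0x1F5FF);

console.log(hexUnits(blockStart)); // [ 'd83c', 'df00' ]
console.log(hexUnits(blockEnd));   // [ 'd83d', 'ddff' ]

console.log(surrogatePair.test(blockStart)); // true
console.log(surrogatePair.test(blockEnd));   // true
```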
From U+1F900 to U+1F9FF, with the following surrogate pairs:
We can reuse the same regex as above:
From U+1F600 to U+1F64F, with surrogate pairs:
Also covered by that same regex:
Includes U+1F680 to U+1F6FF, with surrogate pairs:
Also covered by that same regex:
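To double check that one pattern really covers all three of these blocks, here is a sketch that walks every code point in each range. It assumes the shared regex is a generic surrogate pair match, and the block table and function name are illustrative:

```javascript
// Assumed shared pattern: exactly one lead surrogate plus one tail surrogate.
const surrogatePair = /^[\uD800-\uDBFF][\uDC00-\uDFFF]$/;

const blocks = {
  'Supplemental Symbols and Pictographs': [0x1F900, 0x1F9FF],
  'Emoticons': [0x1F600, 0x1F64F],
  'Transport and Map Symbols': [0x1F680, 0x1F6FF],
};

// True only if every code point in [start, end] matches the pattern.
function coversBlock([start, end]) {
  for (let cp = start; cp <= end; cp++) {
    if (!surrogatePair.test(String.fromCodePoint(cp))) return false;
  }
  return true;
}

const results = Object.entries(blocks).map(
  ([name, range]) => [name, coversBlock(range)]
);
console.log(results); // every block reports true

// The trade-off: the pattern also matches astral characters that are not emoji,
// such as U+1D49C (MATHEMATICAL SCRIPT CAPITAL A).
const alsoMatchesNonEmoji = surrogatePair.test(String.fromCodePoint(0x1D49C));
console.log(alsoMatchesNonEmoji); // true
```

That over-matching is the price of the "anything in those ranges is emoji" shortcut taken earlier.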
Includes U+2600 to U+26FF. These code points sit in the BMP, so each one is a single 16-bit unit rather than a surrogate pair:
We can write a regex for this like so:
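A minimal sketch of such a pattern, again assuming the whole block counts as emoji (the variable names are illustrative):

```javascript
// Miscellaneous Symbols block: U+2600 to U+26FF.
const miscSymbols = /[\u2600-\u26FF]/;

const sun = '\u2600'; // ☀

// BMP characters fit in one UTF-16 unit, so no surrogate pair is involved.
console.log(sun.length);              // 1
console.log(miscSymbols.test(sun));   // true
console.log(miscSymbols.test('\u2702')); // false (that one is a Dingbat)
```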
There’s another section in the beginning of that original lodash regex we haven’t looked at yet:
If we examine what those characters represent, we get:
So that’s a good section to keep around. The regex so far is:
I’m relying on Emoji-data’s json to provide a library of every emoji. When we run this regular expression against that list, we get 746 matches, 99 misses. Let’s go through the misses:
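The match and miss counts come from running the regex over every entry in the list. Here is a sketch of that kind of tally, using a tiny hand-picked sample instead of the real Emoji-data JSON and a simplified stand-in for the regex so far (both are assumptions for illustration):

```javascript
// Simplified stand-in: the two BMP blocks plus any surrogate pair.
const soFar = /[\u2600-\u26FF]|[\u2700-\u27BF]|[\uD800-\uDBFF][\uDC00-\uDFFF]/;

// Hypothetical sample; the real list would be loaded from Emoji-data's JSON.
const sample = [
  '\u{1F600}', // 😀 Emoticons
  '\u2702',    // ✂ Dingbats
  '\u2600',    // ☀ Miscellaneous Symbols
  '\u00A9',    // © Latin-1 Supplement
  '\u2197',    // ↗ Arrows
];

const matches = sample.filter((emoji) => soFar.test(emoji));
const misses = sample.filter((emoji) => !soFar.test(emoji));

console.log(matches.length); // 3
console.log(misses.length);  // 2
```

The regex deliberately has no g flag: with g, test() remembers lastIndex between calls and can give surprising results when reused across strings.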
(That middle “\uFE0F” is optional, by the way.)
These are covered by the following:
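One way to handle an optional middle \uFE0F is the ? quantifier. The sketch below assumes the misses in question are keycap sequences (a digit, # or *, then an optional \uFE0F variation selector, then the combining keycap \u20E3); that reading is an assumption, and the pattern is illustrative rather than the one from the original regex:

```javascript
// Keycap sequence: base character, optional variation selector, combining keycap.
const keycap = /[#*0-9]\uFE0F?\u20E3/;

const withSelector = keycap.test('1\uFE0F\u20E3'); // 1️⃣ with \uFE0F
const withoutSelector = keycap.test('1\u20E3');    // same keycap, no \uFE0F
const plainDigit = keycap.test('1');

console.log(withSelector);    // true
console.log(withoutSelector); // true
console.log(plainDigit);      // false
```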
Towards the bottom of the Unicode Block Emoji entry on Wikipedia is the following:
Additional emoji can be found in the following Unicode blocks: Arrows (8 codepoints considered emoji), Basic Latin (12), CJK Symbols and Punctuation (2), Enclosed Alphanumeric Supplement (41), Enclosed Alphanumerics (1), Enclosed CJK Letters and Months (2), Enclosed Ideographic Supplement (15), General Punctuation (2), Geometric Shapes (8), Latin-1 Supplement (2), Letterlike Symbols (2), Mahjong Tiles (1), Miscellaneous Symbols and Arrows (7), Miscellaneous Technical (18), Playing Cards (1), and Supplemental Arrows-B (2).
Why the heck are these other random emoji scattered around like detritus? I believe the reason is: “because of history”. But I don’t really know. If you know, leave a comment and educate us all!
I won’t go through these one by one. You can look in my GitHub repo for a breakdown of the regex for each block. Suffice it to say the regex that covers all these pesky buggers is:
Which means that… drum roll… the final regex for parsing emojis is:
Hopefully that dispels some of the confusion around parsing emoji.