June 9, 2015

Unicode woes in Javascript


At Mixmax, we send a lot of email, so we run into just about every Unicode issue. This post describes a key lesson about iterating over strings in Javascript and why it's so often done wrong.

Javascript strings behave as sequences of UTF-16 code units. This gives first-class support for fun characters like 😜, 🍯, and 🐼. The Basic Multilingual Plane (BMP) includes the characters and symbols used by most modern languages, whereas most emoji live in the Supplementary Multilingual Plane. Characters in the BMP are represented by a single 16-bit code unit. Non-BMP characters are represented by an ordered pair of 16-bit code units (called a surrogate pair in Unicode vocabulary). Even though a non-BMP character reads as a single character to a human, Javascript treats it as two code units. This can lead to unexpected problems when iterating over the characters in a string.
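To make the surrogate pair idea concrete, here is a minimal sketch of the standard UTF-16 arithmetic that splits a non-BMP code point into its two code units. The helper name toSurrogatePair is our own for illustration; it is not a built-in.

```javascript
// Hypothetical helper (not a built-in): split a code point above U+FFFF
// into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;      // leaves a 20-bit value
  var high = 0xD800 + (offset >> 10);    // top 10 bits
  var low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x1F61C);     // U+1F61C is 😜
pair[0].toString(16);                    // => 'd83d'
pair[1].toString(16);                    // => 'de1c'
String.fromCharCode(pair[0], pair[1]);   // => '😜'
```

Both halves fall in ranges reserved for surrogates (0xD800 to 0xDBFF and 0xDC00 to 0xDFFF), which is how code can tell a pair apart from two ordinary BMP characters.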

Let's look at the simple example of a string containing just one character, the HEAR-NO-EVIL MONKEY character: 🙉.

var foo = '🙉';

foo.length // what is this string's length?

You might expect the String.length property to return the number of characters in the string. However, according to the MDN docs on String.length, the property actually returns the "number of code units in the string." Non-BMP characters are stored as two code units, so the length property returns 2 instead of 1. Try it out for yourself.
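If you want a count of code points rather than code units, one option is to lean on the string iterator. This is a rough sketch that assumes an ES2015-capable runtime (for Array.from and codePointAt); countCodePoints is a made-up helper name, not a built-in.

```javascript
// Hypothetical helper (not a built-in): count code points instead of code units.
function countCodePoints(str) {
  // Array.from uses the string iterator, which yields whole code points,
  // so a surrogate pair becomes a single array element.
  return Array.from(str).length;
}

var foo = '🙉';
foo.length;                        // => 2 (code units)
countCodePoints(foo);              // => 1 (code point)
foo.codePointAt(0).toString(16);   // => '1f649'
```

Note that this still counts code points, not what a reader perceives as characters: sequences built from combining marks would count as more than one.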

When you iterate a string by index, you are actually iterating code units, not characters. This will cause your code to produce unexpected results if you aren't careful. Take, for example, this code that reverses a string.

var input = 'This is my string 🙉';

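// Naive reversal: walks the string one UTF-16 code unit at a time.
function reverse(input) {
  var output = '';
  for (var i = 0; i < input.length; i++) {
    // Prepend each code unit, so the last unit ends up first.
    output = input[i] + output;
  }
  return output;
}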

reverse(input) // => '\uDE49\uD83D gnirts ym si sihT' (the monkey's two surrogates are swapped and no longer render)
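To see what that loop actually visits, it can help to compare an index-based loop with the ES2015 for...of loop, which walks a string by code point. This is only an illustrative sketch: it assumes an ES2015-capable runtime, and the helper names codeUnitsOf and codePointsOf are ours.

```javascript
// Hypothetical helper: what an index-based loop sees,
// one entry per UTF-16 code unit (as hex).
function codeUnitsOf(str) {
  var units = [];
  for (var i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16));
  }
  return units;
}

// Hypothetical helper: what for...of sees,
// one entry per code point (ES2015 string iterator).
function codePointsOf(str) {
  var points = [];
  for (var ch of str) {
    points.push(ch);
  }
  return points;
}

codeUnitsOf('g 🙉');   // => ['67', '20', 'd83d', 'de49']
codePointsOf('g 🙉');  // => ['g', ' ', '🙉']
```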

For strings that contain only characters represented by a single code unit (BMP characters), the function works fine. But when a character is represented by two code units, reversing them puts the surrogate pair in the wrong order and the character becomes unreadable. The key to reversing strings correctly is to detect non-BMP characters, swap the two halves of each surrogate pair first, then reverse the entire string. Check out @mathias's great esrever module, inspired by rapper/computer scientist Missy Elliott.
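Here is a minimal sketch of that two-step approach. It is our own illustration, not esrever's actual implementation, and the name reverseAstralSafe is made up. It only handles surrogate pairs and ignores other cases, such as combining marks.

```javascript
// Hypothetical helper: reverse a string without breaking surrogate pairs.
function reverseAstralSafe(input) {
  // Step 1: swap the halves of every surrogate pair (high, low) -> (low, high).
  // Without the `u` flag, these character classes match single code units.
  var swapped = input.replace(/([\uD800-\uDBFF])([\uDC00-\uDFFF])/g, '$2$1');

  // Step 2: reverse code unit by code unit. Each swapped pair is flipped
  // back into (high, low) order by this pass.
  var output = '';
  for (var i = 0; i < swapped.length; i++) {
    output = swapped[i] + output;
  }
  return output;
}

reverseAstralSafe('This is my string 🙉'); // => '🙉 gnirts ym si sihT'
```

Swapping the pairs up front keeps the second step identical to the naive loop, which makes the fix easy to bolt onto existing code.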

To better understand Javascript's handling of non-BMP characters, read Mathias Bynens's excellent post, "JavaScript has a Unicode problem."

Want to work on interesting problems like these? Email careers@mixmax.com and let's grab coffee!
