Splitting Multibyte Strings with `mb_str_split`

Splitting a string seems trivial at first glance. Say we want to split a six-character string into two-character chunks. In an ASCII world this is easy: create three chunks of two bytes each, since six characters equal six bytes. Multibyte character sets are a different story, because the encoding can be variable-length. In UTF-8, for example, one character is represented by one to four bytes. Just as counting bytes is not the same as counting characters in a multibyte string, you cannot simply cut a sequence of bytes into chunks of equal size: you might cut in the middle of a character and end up with garbage strings.
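
The difference between bytes and characters can be made visible with a short sketch. It assumes UTF-8 and the mbstring extension, and it uses the illustrative string "aäbcde" (not taken from the text above), written with explicit byte escapes so that the result does not depend on the encoding of the source file:

```php
<?php
// "aäbcde": six characters, but ä occupies two bytes (0xC3 0xA4) in UTF-8.
$text = "a\xC3\xA4bcde";

$byteCount = strlen($text);             // strlen() counts bytes
$charCount = mb_strlen($text, 'UTF-8'); // mb_strlen() counts characters

// str_split() cuts every two bytes, regardless of character boundaries.
$byteChunks = str_split($text, 2);

// Flag each chunk: is it still valid UTF-8 on its own?
$validity = array_map(
    fn (string $chunk): bool => mb_check_encoding($chunk, 'UTF-8'),
    $byteChunks
);

var_dump($byteCount); // int(7)
var_dump($charCount); // int(6)
var_dump($validity);  // false, false, true, true
```

The first two byte-wise chunks each contain one half of the ä and are therefore no longer valid UTF-8.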

The new function mb_str_split(), which is part of the mbstring extension, splits multibyte strings by characters rather than bytes. Let us look at a simple example. The German word “Straße” (street) contains a non-ASCII character, the ß. Let us do some splitting:

var_dump(mb_str_split("Straße", 3));

The result is:

array(2) {
  [0] => string(3) "Str"
  [1] => string(4) "aße"
}

As we can see, the string has been split into two three-character chunks. The second chunk, however, is four bytes long, because ß takes two bytes to encode in UTF-8.
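
To check the character and byte counts of each chunk, something like the following sketch could be used. It assumes UTF-8 and passes the encoding explicitly as the optional third argument; the ß is written as a byte escape so that the example does not depend on how the file is saved:

```php
<?php
// "Straße" with ß spelled out as its two UTF-8 bytes (0xC3 0x9F).
$word = "Stra\xC3\x9Fe";

// Split into chunks of three characters each.
$chunks = mb_str_split($word, 3, 'UTF-8');

foreach ($chunks as $chunk) {
    printf(
        "%s: %d characters, %d bytes\n",
        $chunk,
        mb_strlen($chunk, 'UTF-8'), // characters in the chunk
        strlen($chunk)              // bytes in the chunk
    );
}
// Str: 3 characters, 3 bytes
// aße: 3 characters, 4 bytes
```

Because no character is cut apart, joining the chunks with implode() yields the original string again.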

Here is an example of what happens if we cut the two-byte character ß in half by accessing its individual bytes:

$str = 'ß';
var_dump($str[0]);
var_dump($str[1]);

The result is not pretty:

string(1) "?"
string(1) "?"

Note that the actual output shows a special replacement character (U+FFFD) in place of each stray byte, which we had to replace with a plain question mark to work around the PDF rendering problems that an invalid character would cause.
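
Since the broken halves cannot be displayed reliably, one way to inspect them is to print their hexadecimal values instead. The following is a minimal sketch, assuming UTF-8; it also shows that mb_str_split() with its default chunk length of one character keeps the ß intact:

```php
<?php
// ß as its two UTF-8 bytes, independent of the source file encoding.
$str = "\xC3\x9F";

// String offsets address single bytes, not characters.
$first  = $str[0];
$second = $str[1];

var_dump(bin2hex($first));  // string(2) "c3"
var_dump(bin2hex($second)); // string(2) "9f"

// Neither half is valid UTF-8 on its own.
var_dump(mb_check_encoding($first, 'UTF-8'));  // bool(false)
var_dump(mb_check_encoding($second, 'UTF-8')); // bool(false)

// Splitting by characters returns the ß as a single element.
$chars = mb_str_split($str, 1, 'UTF-8');
var_dump(count($chars));      // int(1)
var_dump(bin2hex($chars[0])); // string(4) "c39f"
```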
