Splitting Multibyte Strings with mb_str_split
Splitting a string seems trivial at first glance. Say we want to split a string of six characters into two-character chunks. In an ASCII world this is easy: create three chunks of two bytes each, since six characters equal six bytes. Multibyte character sets complicate matters, because their encodings can be variable-length. In UTF-8, for example, one character is represented by one to four bytes. Just as counting bytes is not the same as counting characters in a multibyte string, you cannot simply cut a sequence of bytes into chunks of equal size: you might cut in the middle of a character and end up with garbage strings.
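The difference between bytes and characters can be made visible with a minimal sketch. It assumes the mbstring extension is loaded and UTF-8 is the encoding in use. The variable names are purely illustrative, and the "\u{00DF}" escape (PHP 7.0 and later) stands in for a literal ß so that the example does not depend on how the source file is saved.

```php
<?php
// "\u{00DF}" is ß; in UTF-8 it is encoded as the two bytes C3 9F.
$word = "Stra\u{00DF}e"; // "Straße"

$bytes = strlen($word);             // counts bytes
$chars = mb_strlen($word, 'UTF-8'); // counts characters
printf("%d bytes, %d characters\n", $bytes, $chars);

// str_split() works on bytes, so chunks of three cut through the ß.
$byteChunks = str_split($word, 3);
foreach ($byteChunks as $chunk) {
    printf(
        "%-6s valid UTF-8: %s\n",
        bin2hex($chunk),
        mb_check_encoding($chunk, 'UTF-8') ? 'yes' : 'no'
    );
}
```

The string has seven bytes but six characters, and only the first of the three byte-based chunks is still valid UTF-8.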
The new function mb_str_split(), which is part of the mbstring extension, can split multibyte strings. Let us look at a simple example. The German word “Straße” (street) contains a non-ASCII special character, the ß. Let us do some splitting:
var_dump(mb_str_split("Straße", 3));
The result is:
array(2) {
[0] => string(3) "Str"
[1] => string(4) "aße"
}
As we can see, the string has been split into two three-character chunks. The second chunk, however, is four bytes long! That is because ß takes two bytes to encode in UTF-8.
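To see the character and byte counts side by side, the chunks can be inspected as in the following sketch. It assumes PHP 7.4 or later with mbstring loaded; the encoding is passed explicitly as the third argument so that the result does not depend on the configured internal encoding, and the variable names are only illustrative.

```php
<?php
$word = "Stra\u{00DF}e"; // "Straße", with ß written as an escape

// Split into chunks of three characters, not three bytes.
$chunks = mb_str_split($word, 3, 'UTF-8');
foreach ($chunks as $chunk) {
    printf(
        "%s: %d characters, %d bytes\n",
        $chunk,
        mb_strlen($chunk, 'UTF-8'),
        strlen($chunk)
    );
}

// With a chunk length of 1, every array element holds exactly one character.
$letters = mb_str_split($word, 1, 'UTF-8');
printf("%d single-character chunks\n", count($letters));
```

Both chunks report three characters, while strlen() reports three bytes for the first and four for the second.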
Here is an example of what would happen if we were to cut the two-byte character ß in half:
$str = 'ß';
var_dump($str[0]);
var_dump($str[1]);
The result is not pretty:
string(1) "?"
string(1) "?"
Note that the actual output displays a special replacement character (U+FFFD) in place of each invalid byte. We substituted a plain question mark to work around the PDF rendering problems that an invalid character would cause.
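Because the two halves cannot be displayed on their own, it can help to look at the raw bytes instead. The following is a small sketch, again assuming UTF-8 and a loaded mbstring extension; $first and $second are hypothetical names chosen for illustration.

```php
<?php
$str = "\u{00DF}"; // ß, stored in UTF-8 as the two bytes C3 9F

// String offsets address single bytes, not characters.
$first  = $str[0];
$second = $str[1];

var_dump(bin2hex($first));  // hexadecimal value of the first byte
var_dump(bin2hex($second)); // hexadecimal value of the second byte

// Neither byte is a valid UTF-8 string by itself.
var_dump(mb_check_encoding($first, 'UTF-8'));
var_dump(mb_check_encoding($second, 'UTF-8'));

// Put back together, the two bytes form the original character again.
var_dump($first . $second === $str);
```

The first byte is c3 and the second is 9f. Each one fails the encoding check on its own, and only the two together represent ß.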