Internationalisation and Unicode

The `mbstring.func_overload` Configuration Directive

In the introduction, we have already explained why the successor of PHP 5 is PHP 7 rather than PHP 6. Since the attempt to create a Unicode-based PHP implementation failed, PHP 7, just like PHP 5, does not handle Unicode strings natively. Calculating the string length is trivial for ASCII characters: just count the number of bytes. Calculating the length of a string that is encoded using UTF-8, however, is more challenging, since UTF-8 is a variable-length encoding in which each character (code point, to be exact) is represented by one to four bytes. For ASCII characters, everything works smoothly, because UTF-8 is a superset of ASCII. The problems start with non-ASCII characters:

var_dump(strlen('ö'));

This simple script, at least when saved as UTF-8, will produce a most interesting result:

int(2)

Encoded as UTF-8, this single German umlaut occupies two bytes. Since PHP does not know about UTF-8 (or Unicode in general), the built-in strlen() function just counts bytes, which is the wrong answer if you wanted the number of characters.
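To see where the two bytes come from, the raw bytes can be dumped with bin2hex(). The following is a minimal sketch: the sample characters are our own choice, and the \u{...} escapes (available since PHP 7.0) are used so that the result does not depend on the encoding in which the script file is saved.

```php
<?php
// Sample characters chosen for illustration, one per UTF-8 sequence length.
$samples = [
    'a'         => 'a',          // U+0061, plain ASCII
    'o-umlaut'  => "\u{00F6}",   // U+00F6
    'euro sign' => "\u{20AC}",   // U+20AC
    'emoji'     => "\u{1F600}",  // U+1F600
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes; bin2hex() shows which bytes they are.
    printf("%-10s %d byte(s): %s\n", $label, strlen($char), bin2hex($char));
}
```

The o-umlaut is printed as the two bytes c3b6, while the ASCII letter needs one byte, the euro sign three, and the emoji four.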

There are commonly used PHP extensions, such as iconv or mbstring (“multibyte string”), that offer Unicode-aware string handling functions, for example mb_strlen() (which, of course, requires the mbstring extension):

var_dump(mb_strlen('ö'));

This function counts code points rather than bytes and thus yields the correct result:

int(1)

You can do the same with the iconv extension:

var_dump(iconv_strlen('ö'));

Unsurprisingly, this function yields the same result:

int(1)

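In both cases, we are cheating a little, since we are not specifying that our string is UTF-8 encoded. This works because both functions fall back to PHP's internal encoding when no encoding is given, and since PHP 5.6 that defaults to UTF-8 (via the default_charset setting), in line with the convention used pretty much everywhere on the Internet.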
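Passing the encoding explicitly avoids relying on any default. The following sketch assumes that both the mbstring and iconv extensions are loaded. The ISO-8859-1 conversion is our own addition, included only to show that the same character can have a different byte length in another encoding.

```php
<?php
// "\u{00F6}" is the o-umlaut; the escape keeps the example independent
// of the encoding the script file itself is saved in.
$utf8 = "\u{00F6}";

// The same character in ISO-8859-1 is a single byte.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

var_dump(mb_strlen($utf8, 'UTF-8'));        // int(1)
var_dump(iconv_strlen($utf8, 'UTF-8'));     // int(1)
var_dump(mb_strlen($latin1, 'ISO-8859-1')); // int(1)
var_dump(strlen($latin1));                  // int(1), one byte in ISO-8859-1
```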

Now we will add magic into the mix, and new problems will arise. If you are using the mbstring extension then you can use the php.ini directive mbstring.func_overload to overload built-in PHP functions with the multibyte-enabled mb_ functions. Depending on the value you set mbstring.func_overload to, the mail() function, string functions, and regular expressions (unfortunately not the preg_ ones, but the removed ereg_ ones) can be overloaded.

The problem with this magic is that your program cannot know whether PHP’s string functions operate with or without support for multibyte characters. And you certainly do not want to wrap an if around every string function. So just like with magic quotes, which we wrote about earlier, using mbstring.func_overload is not a good idea. That is why this php.ini directive has been deprecated in PHP 7.2, and will likely be removed in PHP 8.

Even if it potentially means a lot of work, you have to walk through your code and make explicit which encodings you are working with. Do not wait until PHP 8, because that would put you in a situation where you cannot upgrade to PHP 8. You effectively want to get started with your PHP 8 migration right now.
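One way to make the intent explicit is to route every length calculation through a call that names its encoding. The sketch below uses two hypothetical helper functions (byteLength() and charLength() are our names, not part of PHP). The mbstring pseudo-encoding '8bit' treats every byte as one character, so the result does not depend on whether strlen() has been overloaded.

```php
<?php
// Hypothetical helper: number of bytes, independent of mbstring.func_overload.
function byteLength(string $value): int
{
    return mb_strlen($value, '8bit');
}

// Hypothetical helper: number of code points in the given encoding.
function charLength(string $value, string $encoding = 'UTF-8'): int
{
    return mb_strlen($value, $encoding);
}

var_dump(byteLength("\u{00F6}")); // int(2)
var_dump(charLength("\u{00F6}")); // int(1)
```

With helpers like these, a reader of the calling code can tell whether bytes or characters are meant, which a bare strlen() call under func_overload cannot express.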

Third Parameter to `mb_strrpos()`

Similar to the built-in strrpos(), this function finds the last occurrence of a given substring in a multibyte string. In PHP 5.2, an optional third parameter, offset, was introduced, which moved the optional encoding parameter to the fourth position.

To keep backward compatibility, however, it was still possible to specify the encoding as the third parameter. Since the encoding is a string, and the offset is an integer, PHP was able to distinguish between the two.

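With PHP becoming more and more type-safe, these kinds of “hacks” are no longer desirable, because, among other things, they confuse static code analysis tools. Since PHP 7.4, passing the encoding as the third parameter therefore triggers a deprecation message: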

var_dump(mb_strrpos('haystack', 'tac', 'UTF-8'));

PHP Deprecated:
mb_strrpos(): Passing the encoding as third parameter is deprecated. Use an explicit zero offset in ...
int(4)

The fix is simple, and conveniently the deprecation message already tells you what to do: find all calls to mb_strrpos() that pass the encoding as the third parameter, and insert 0 as the third parameter in front of it:

var_dump(mb_strrpos('haystack', 'tac', 0, 'UTF-8'));
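With a non-ASCII haystack, the difference between the byte-based strrpos() and the four-parameter form of mb_strrpos() becomes visible. A small sketch, assuming the mbstring extension is loaded; the sample string is our own:

```php
<?php
// Sample haystack "öl und öl", written with escapes so the result does not
// depend on the encoding of the script file.
$haystack = "\u{00F6}l und \u{00F6}l";
$needle   = "\u{00F6}";

// strrpos() reports a byte offset: the first o-umlaut alone takes two bytes.
var_dump(strrpos($haystack, $needle));                // int(8)

// mb_strrpos() with an explicit zero offset and encoding reports a
// position counted in code points.
var_dump(mb_strrpos($haystack, $needle, 0, 'UTF-8')); // int(7)
```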

Deprecate `money_format()`

Most applications, at least at some point, will have to deal with money. Even if only one currency is supported, number formats still differ. Different thousand separators, decimal points, or even groupings are used. Sometimes the currency sign is shown before the number, sometimes after.

The built-in function money_format() formats monetary values. Technically, this function is a wrapper around the C library function strfmon. The problem is that this C library function is not available on all operating systems. On Windows, for example, it is not available, so money_format() is not even defined there.

var_dump(money_format('%i' , 19995.90));

Since PHP 7.4, PHP will complain:

PHP Deprecated:
Function money_format() is deprecated in ...
string(8) "19995.90"

In this example, the output depends on whatever locale you have set, and may differ on your system. It is generally not a good idea to rely on (system) locales, because you never know which ones are available on a given target system, and how exactly they behave.

Instead of using money_format(), you should use NumberFormatter, which is part of the intl extension. It uses the locales that come with the ICU library rather than system locales. In addition, you can pass the desired locale as a constructor parameter rather than having to set it globally. Setting a locale globally can lead to interesting problems in multi-threaded environments.

$formatter = new NumberFormatter('de_DE', NumberFormatter::CURRENCY);
var_dump($formatter->formatCurrency(19995.90, 'EUR'));

The result is a nicely formatted string

string(14) "19.995,90 €"

without any deprecation warnings, and without the ugly dependencies on system locales and global state.
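Because the locale is a constructor argument, several formatters with different locales can live side by side in the same script. The following sketch assumes the intl extension is loaded. The second locale and currency are our own choice, and the exact output depends on the ICU version installed.

```php
<?php
$amount = 19995.90;

// Each formatter carries its own locale; nothing is set globally.
$german   = new NumberFormatter('de_DE', NumberFormatter::CURRENCY);
$american = new NumberFormatter('en_US', NumberFormatter::CURRENCY);

foreach ([[$german, 'EUR'], [$american, 'USD']] as [$formatter, $currency]) {
    // formatCurrency() returns false on failure, so check before printing.
    $formatted = $formatter->formatCurrency($amount, $currency);
    if ($formatted === false) {
        echo $formatter->getErrorMessage(), PHP_EOL;
        continue;
    }
    echo $formatted, PHP_EOL;
}
```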

By the way, you can list all ICU locales that are available on your system by running:

var_dump(ResourceBundle::getLocales(''));

Deprecate Undocumented Aliases of the `mbstring` Extension

PHP, originally being a procedural programming language by design, did not have namespaces before version 5.3. Thus underscore-separated prefixes were used as a substitute. After all, you do not want functions defined by different PHP extensions to clash.

The functions defined by the mbstring extension are prefixed with mb_. It is a little-known fact that aliases without the underscore after mb exist:

mbregex_encoding()
mbereg()
mberegi()
mbereg_replace()
mberegi_replace()
mbsplit()
mbereg_match()
mbereg_search()
mbereg_search_pos()
mbereg_search_regs()
mbereg_search_init()
mbereg_search_getregs()
mbereg_search_getpos()
mbereg_search_setpos()

Admittedly, we did not know about these aliases either, until the day we read that they were deprecated. Luckily, the fix is extremely simple: just add the missing underscore.
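The renaming rule is mechanical enough to express in one line. The helper below is a hypothetical illustration (canonicalMbFunctionName() is our name, not part of PHP or mbstring). It only maps a function name to its documented spelling and does not rewrite any source files.

```php
<?php
// Hypothetical helper: inserts the missing underscore after the "mb" prefix.
// Names that already start with "mb_" are returned unchanged.
function canonicalMbFunctionName(string $name): string
{
    return preg_replace('/^mb(?!_)/', 'mb_', $name);
}

var_dump(canonicalMbFunctionName('mbsplit'));          // "mb_split"
var_dump(canonicalMbFunctionName('mbregex_encoding')); // "mb_regex_encoding"
var_dump(canonicalMbFunctionName('mb_ereg'));          // "mb_ereg"
```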

Deprecate `Normalizer::NONE`

Unicode is a really complex thing, at least once you get past the ASCII character set, which is a subset of Unicode. Many languages have special characters. In German, for example, we have the so-called Umlaute ä, ö, and ü. (We also have a special version of the s, but we will save the fun with that one for another chapter.)

There are various ways of combining characters (actually, they are called code points in Unicode, as we have previously pointed out). An a-Umlaut can be represented as a combination of an a and the two horizontal dots (the combining diaeresis, U+0308), or directly as a single ‘ä’ character. When sorting or comparing strings, both representations should be considered the same, but since PHP views strings as byte sequences, they are not:

$a = "a\u{0308}";
$b = 'ä';

var_dump($a, $b);
var_dump($a == $b);

This will result in:

string(3) "ä"
string(2) "ä"
bool(false)

As we can see, both strings look the same, but their binary representation differs, as we can clearly tell from the fact that the two strings have different lengths. Remember, by default PHP counts bytes, not code points!

The widely used ICU library, which is available in PHP through the intl extension, features a Normalizer class which can convert strings to a canonical representation. Adding

var_dump(Normalizer::normalize($a) == Normalizer::normalize($b));

to the above example does the trick:

bool(true)

There are various modes of normalization that can be passed to the normalize() method as the second argument. Going into details of normalization modes, however, would be beyond the scope of this book. What is important for you to know is that one normalization mode, namely Normalizer::NONE, has been deprecated in PHP 7.3 (when PHP is built against ICU 56 or later). Trying to use it will yield a deprecation message:

PHP Deprecated:
Normalizer::NONE is obsolete with ICU 56 and above and will be removed in later PHP versions in ...

Since Normalizer::NONE, according to the documentation, does not even normalize the given string, but seems to be a no-operation, it will probably not be missed much.
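Code that wants to state the normalization form explicitly can pass one of the other constants as the second argument. A minimal sketch, assuming the intl extension is loaded: it uses only Normalizer::FORM_C (canonical composition) and Normalizer::FORM_D (canonical decomposition), and the variable names are our own.

```php
<?php
$decomposed = "a\u{0308}"; // a followed by U+0308 COMBINING DIAERESIS
$composed   = "\u{00E4}";  // precomposed a-umlaut

// FORM_C composes, FORM_D decomposes.
var_dump(Normalizer::normalize($decomposed, Normalizer::FORM_C) === $composed); // bool(true)
var_dump(Normalizer::normalize($composed, Normalizer::FORM_D) === $decomposed); // bool(true)

// isNormalized() reports whether a string is already in the given form.
var_dump(Normalizer::isNormalized($composed, Normalizer::FORM_C));   // bool(true)
var_dump(Normalizer::isNormalized($decomposed, Normalizer::FORM_C)); // bool(false)
```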