Internationalisation and Unicode
The mbstring.func_overload Configuration Directive
In the introduction, we have already explained why the successor of PHP 5 is PHP 7 rather than PHP 6. Since the attempt to create a Unicode-based PHP implementation failed, PHP 7, just like PHP 5, does not handle Unicode strings natively. Calculating the string length is trivial for ASCII characters: just count the number of bytes. Calculating the length of a string that is encoded using UTF-8, however, is more challenging, since UTF-8 is a variable-length encoding in which each character (code point, to be exact) is represented by one to four bytes. For ASCII characters, everything works smoothly, because UTF-8 is a superset of ASCII. The problems start with non-ASCII characters:
var_dump(strlen('ö'));
This simple script, at least when saved as UTF-8, will produce a most interesting result:
int(2)
When the German umlaut is encoded as UTF-8, two bytes are used. Since PHP does not know about UTF-8 (or Unicode in general), the built-in strlen() function just counts bytes, which leads to a wrong result.
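To make the variable-length nature of UTF-8 concrete, here is a minimal sketch that reports the byte count of four characters of increasing encoded size. The "\u{...}" escapes (available since PHP 7.0) are used so that the result does not depend on how the script file itself is saved. The helper name utf8ByteLengths() is made up for this illustration.

```php
<?php
// Hypothetical helper for illustration: maps each sample code point
// to the number of bytes that strlen() reports for it.
function utf8ByteLengths(): array
{
    $samples = [
        'U+0061'  => "a",          // plain ASCII letter
        'U+00F6'  => "\u{00F6}",   // Latin small letter o with diaeresis
        'U+20AC'  => "\u{20AC}",   // euro sign
        'U+1F600' => "\u{1F600}",  // an emoji outside the Basic Multilingual Plane
    ];

    // strlen() counts bytes, so this yields the UTF-8 encoded size.
    return array_map('strlen', $samples);
}

foreach (utf8ByteLengths() as $codePoint => $bytes) {
    echo $codePoint, ' => ', $bytes, " byte(s)\n";
}
```

The four samples occupy one, two, three, and four bytes respectively, which is exactly the range mentioned above.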
There are commonly used PHP extensions, for example iconv or mbstring (“multibyte string”), that offer Unicode-enabled string handling functions, for example mb_strlen() (which, of course, requires the mbstring extension):
var_dump(mb_strlen('ö'));
This function counts code points rather than bytes and thus yields the correct result:
int(1)
You can do the same with the iconv extension:
var_dump(iconv_strlen('ö'));
Unsurprisingly, this function yields the same result:
int(1)
In both cases, we are cheating a little, since we are not specifying that our string is UTF-8 encoded. This works because both extensions fall back to PHP's default_charset setting when no encoding is given, and that setting has defaulted to UTF-8 since PHP 5.6.
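Rather than relying on a default, the encoding can be passed explicitly as the second argument to both functions. The following sketch, which assumes that the mbstring and iconv extensions are loaded, also shows what happens when the same bytes are interpreted in the wrong encoding:

```php
<?php
// "\u{00F6}" is 'ö' encoded as UTF-8 (two bytes), independent of the file encoding.
$umlaut = "\u{00F6}";

// Explicit encoding: no reliance on configuration defaults.
var_dump(mb_strlen($umlaut, 'UTF-8'));       // int(1)
var_dump(iconv_strlen($umlaut, 'UTF-8'));    // int(1)

// Interpreted as ISO-8859-1, every byte counts as one character.
var_dump(mb_strlen($umlaut, 'ISO-8859-1'));  // int(2)
```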
Now we will add magic into the mix, and new problems will arise.
If you are using the mbstring extension then you can use the php.ini directive mbstring.func_overload to overload built-in PHP functions with the multibyte-enabled mb_ functions.
Depending on the value you set mbstring.func_overload to, the mail() function, string functions, and regular expressions (unfortunately not the preg_ ones, but the removed ereg_ ones) can be overloaded.
The problem with this magic is that your program cannot know whether PHP’s string functions operate with or without support for multibyte characters. And you certainly do not want to wrap an if around every string function. So just like with magic quotes, which we wrote about earlier, using mbstring.func_overload is not a good idea. That is why this php.ini directive has been deprecated in PHP 7.2, and will likely be removed in PHP 8.
Even if it potentially means a lot of work, you have to walk through your code and make explicit which encodings you work with. Do not wait until PHP 8, because that would put you in a situation where you cannot upgrade to PHP 8. You effectively want to get started with your PHP 8 migration right now.
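One way to make the intent explicit is to route string operations through small helpers that always name the encoding. The sketch below is only an illustration, and the function names characterLength() and byteLength() are hypothetical. It relies on the '8bit' pseudo-encoding of mbstring to count bytes, which keeps working even on a legacy system where strlen() has been overloaded.

```php
<?php
// Hypothetical helper: number of code points in a UTF-8 string.
function characterLength(string $text): int
{
    return mb_strlen($text, 'UTF-8');
}

// Hypothetical helper: number of bytes, regardless of mbstring.func_overload.
function byteLength(string $text): int
{
    return mb_strlen($text, '8bit');
}

$text = "Gr\u{00F6}\u{00DF}e"; // "Größe" as UTF-8

var_dump(characterLength($text)); // int(5)
var_dump(byteLength($text));      // int(7)
```

With helpers like these, the calling code states whether it wants characters or bytes, and no php.ini setting can silently change the answer.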
Third Parameter to mb_strrpos()
Similar to the built-in strrpos(), this function finds the last occurrence of a given substring in a multibyte string. In PHP 5.2, an optional third parameter, offset, was introduced, which moved the optional encoding parameter to the fourth position.
To keep backward compatibility, however, it was still possible to specify the encoding as the third parameter. Since the encoding is a string, and the offset is an integer, PHP was able to distinguish between the two.
With PHP becoming more and more type-safe, these kinds of “hacks” are no longer desirable, because, among other things, they confuse static code analysis tools.
var_dump(mb_strrpos('haystack', 'tac', 'UTF-8'));
PHP Deprecated:  mb_strrpos(): Passing the encoding as third parameter is deprecated. Use an explicit zero offset in ...
int(4)
The fix is simple, and conveniently the deprecation message already told you: find all calls to mb_strrpos() that have three parameters, and insert 0 as third parameter:
var_dump(mb_strrpos('haystack', 'tac', 0, 'UTF-8'));
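The difference between the byte-oriented and the multibyte function shows up as soon as the haystack contains non-ASCII characters. A small sketch, assuming the mbstring extension is loaded and using the four-argument form:

```php
<?php
$haystack = "\u{00E4}\u{00F6}\u{00FC}"; // "äöü" as UTF-8, two bytes per character
$needle   = "\u{00FC}";                 // "ü"

// strrpos() reports a byte offset.
var_dump(strrpos($haystack, $needle));                // int(4)

// mb_strrpos() reports a character offset; offset 0 and encoding are explicit.
var_dump(mb_strrpos($haystack, $needle, 0, 'UTF-8')); // int(2)
```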
Deprecate money_format()
Most applications, at least at some point, will have to deal with money. Even if only one currency is supported, number formats still differ. Different thousand separators, decimal points, or even groupings are used. Sometimes the currency sign is shown before the number, sometimes after.
The built-in function money_format() formats monetary values. Technically, this function is a wrapper around the C library function strfmon. The problem is that this C library function is not available on all operating systems. On Windows, for example, it is not available, so money_format() is not even defined there.
var_dump(money_format('%i' , 19995.90));
Since PHP 7.4, PHP will complain:
PHP Deprecated:  Function money_format() is deprecated in ...
string(8) "19995.90"
In this example, the output depends on whatever locale you have set, and may differ on your system. It is generally not a good idea to rely on (system) locales, because you never know which ones are available on a given target system, and how exactly they behave.
Instead of using money_format(), you should use NumberFormatter, which is part of the intl extension. It uses the locales that come with the ICU library rather than system locales. In addition, you can pass the desired locale as a constructor parameter rather than having to set it globally, which can lead to interesting problems in multi-threaded environments.
$formatter = new NumberFormatter('de_DE', NumberFormatter::CURRENCY);
var_dump($formatter->formatCurrency(19995.90, 'EUR'));
The result is a nicely formatted string
string(14) "19.995,90 €"
without any deprecation warnings, and without the ugly dependencies on system locales and global state.
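Because the locale is a constructor argument, several formatters can coexist in the same process. The following sketch assumes the intl extension is loaded. The exact whitespace and symbol placement depend on the ICU version, so no output is shown here:

```php
<?php
$amount = 19995.90;

// One formatter per locale; no global setlocale() call is involved.
$german   = new NumberFormatter('de_DE', NumberFormatter::CURRENCY);
$american = new NumberFormatter('en_US', NumberFormatter::CURRENCY);

// Each call formats the same amount according to its own locale rules.
echo $german->formatCurrency($amount, 'EUR'), "\n";
echo $american->formatCurrency($amount, 'USD'), "\n";
```

The two formatters differ in the thousands separator, the decimal separator, and the position of the currency symbol, yet neither one affects the other.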
By the way, you can list all ICU locales that are available on your system by running:
var_dump(ResourceBundle::getLocales(''));
Deprecate Undocumented Aliases of mbstring Extension
PHP, originally being a procedural programming language by design, did not have namespaces before version 5.3. Thus underscore-separated prefixes were used as a substitute. After all, you do not want functions defined by different PHP extensions to clash.
The functions defined by the mbstring extension are prefixed with mb_. It is a little-known fact that aliases without the underscore exist:
- mbregex_encoding()
- mbereg()
- mberegi()
- mbereg_replace()
- mberegi_replace()
- mbsplit()
- mbereg_match()
- mbereg_search()
- mbereg_search_pos()
- mbereg_search_regs()
- mbereg_search_init()
- mbereg_search_getregs()
- mbereg_search_getpos()
- mbereg_search_setpos()
Admittedly, we did not know about these aliases either, until the day we read that they were deprecated. Luckily, the fix is extremely simple: just add the missing underscore.
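For a larger code base, the renaming can be scripted. The following is a naive sketch, not a real PHP parser: the helper name addMissingUnderscore() is hypothetical, the match is case-sensitive, and it would also rewrite a method call that happens to share one of these names, so review the result before committing it.

```php
<?php
// Hypothetical helper: inserts the missing underscore into calls to the
// deprecated aliases. A plain text replacement on source code.
function addMissingUnderscore(string $code): string
{
    $aliases = [
        'mbregex_encoding', 'mbereg', 'mberegi', 'mbereg_replace',
        'mberegi_replace', 'mbsplit', 'mbereg_match', 'mbereg_search',
        'mbereg_search_pos', 'mbereg_search_regs', 'mbereg_search_init',
        'mbereg_search_getregs', 'mbereg_search_getpos', 'mbereg_search_setpos',
    ];

    // Match an alias only as a whole word that is followed by "(".
    $pattern = '/\b(' . implode('|', $aliases) . ')(?=\s*\()/';

    $result = preg_replace_callback(
        $pattern,
        static function (array $match): string {
            // "mbsplit" becomes "mb_split", "mbereg" becomes "mb_ereg", and so on.
            return 'mb_' . substr($match[1], 2);
        },
        $code
    );

    // preg_replace_callback() returns null on failure; keep the input then.
    return $result ?? $code;
}

echo addMissingUnderscore('$parts = mbsplit("\s+", $text);'), "\n";
```

Calls that already use the mb_ prefix are left untouched, so running the helper twice does no harm.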
Deprecate Normalizer::NONE
Unicode is a really complex thing, at least once you get past the ASCII character set, which is a subset of Unicode. Many languages have special characters. In German, for example, we have the so-called Umlaute ä, ö, and ü. (We also have a special version of the s, but we will save the fun with that one for another chapter.)
There are various ways of combining characters (actually, they are called code points in Unicode, as we have previously pointed out). An a-Umlaut can be created as a combination of an a and the two horizontal dots, or it can be represented directly as an ‘ä’ character. When sorting or comparing strings, however, both should be considered the same, but since PHP views strings as byte sequences, they are not:
$a = "a\u{0308}";
$b = 'ä';
var_dump($a, $b);
var_dump($a == $b);
This will result in:
string(3) "ä"
string(2) "ä"
bool(false)
As we can see, both strings look the same, but their binary representation differs, as we can clearly tell from the fact that the two strings have different lengths. Remember, by default PHP counts bytes, not code points!
The widely used ICU library, which is available in PHP through the intl extension, features a Normalizer class which can convert strings to a canonical representation. Adding
var_dump(Normalizer::normalize($a) == Normalizer::normalize($b));
to the above example does the trick:
bool(true)
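In application code, the comparison can be wrapped in a small helper. This is a sketch that assumes the intl extension is loaded. The function name sameText() is hypothetical, and Normalizer::FORM_C (the default form of normalize()) is spelled out only to make the choice visible:

```php
<?php
// Hypothetical helper: compares two UTF-8 strings after normalizing both
// to the same canonical form (NFC, which prefers precomposed characters).
function sameText(string $left, string $right): bool
{
    return Normalizer::normalize($left, Normalizer::FORM_C)
        === Normalizer::normalize($right, Normalizer::FORM_C);
}

$decomposed = "a\u{0308}"; // "a" followed by a combining diaeresis, 3 bytes
$composed   = "\u{00E4}";  // precomposed "ä", 2 bytes

var_dump($decomposed === $composed);        // bool(false)
var_dump(sameText($decomposed, $composed)); // bool(true)
```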
There are various modes of normalization that can be passed to the normalize() method as the second argument. Going into details of normalization modes, however, would be beyond the scope of this book. What is important for you to know is that one normalization mode, namely Normalizer::NONE, has been deprecated in PHP 7.3. Trying to use it will yield a deprecation warning:
PHP Deprecated:  Normalizer::NONE is obsolete with ICU 56 and above and will be removed in later PHP versions in ...
Since Normalizer::NONE, according to the documentation, does not even normalize the given string, but seems to be a no-operation, it will probably not be missed much.