New Function mb_scrub()

The most widely used multibyte encoding is UTF-8, since it is the de facto standard of the World Wide Web. Let us blatantly ignore the fact that a variety of other multibyte encodings exist, and just focus on some interesting subtleties of UTF-8 to illustrate why the function mb_scrub() has been introduced into PHP’s mbstring extension.

UTF-8 is a variable-length encoding. ASCII is a subset of UTF-8: every ASCII character is encoded as a single byte, while non-ASCII characters are encoded with two to four bytes. In fact, UTF-8 was designed to strike a balance between length of the binary string and encoding/decoding time. As long as you transmit only ASCII data, you can declare the data to be UTF-8 without any additional overhead. This is how PHP usually cheats its way through not being able to handle multibyte strings internally.
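
To make the variable length tangible, here is a minimal sketch that compares the byte count (strlen()) with the character count (mb_strlen()) for a few characters. The sample characters and the $samples array are chosen purely for illustration, and the sketch assumes PHP 7.0 or later with the mbstring extension loaded:

```php
<?php
// Illustrative sample characters (an assumption, not taken from any real data).
// The "\u{...}" escape yields the UTF-8 encoding of the given code point.
$samples = [
    'ascii' => "a",          // U+0061
    'latin' => "\u{00E9}",   // U+00E9, e with acute accent
    'euro'  => "\u{20AC}",   // U+20AC, euro sign
    'emoji' => "\u{1F600}",  // U+1F600, grinning face
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes, mb_strlen() counts characters in the given encoding.
    printf("%-6s bytes: %d, characters: %d\n", $label, strlen($char), mb_strlen($char, 'UTF-8'));
}
```

Each entry is a single character, but the byte counts are 1, 2, 3, and 4 respectively.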

When dealing with legacy databases, there are usually a few odd database tables or at least columns that contain data of unknown encoding. Legacy PHP code often used to just take whatever input was provided by the client browser and store it into the database. Sometimes, the client would send data in a certain encoding, and the PHP application would store it into a database column that was expecting another encoding. PHP never cared too much, but just moved a sequence of bytes from the browser to the database without ever converting it. This is not to say that every PHP application works or worked that way, but we have seen the “we do not know what the encoding of that data is” problem far too often to consider it an edge case.

The big problem with arbitrary binary data stored as text is that you can only guess what encoding it could be. The only way to determine that something is not UTF-8 is by finding an invalid byte sequence. But what to do with such a string?
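
One way to look for an invalid byte sequence is mbstring’s mb_check_encoding(). The following is only a sketch; the helper name looksLikeUtf8() is made up for this example. Keep in mind that a positive result merely means the bytes are well-formed UTF-8, not that the data was actually meant to be UTF-8:

```php
<?php
// Hypothetical helper: true if $bytes forms a well-formed UTF-8 sequence.
function looksLikeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(looksLikeUtf8("test"));        // bool(true): pure ASCII is valid UTF-8
var_dump(looksLikeUtf8("t\xc3\xa9st")); // bool(true): 0xC3 0xA9 is a complete two-byte sequence
var_dump(looksLikeUtf8("te\xc3st"));    // bool(false): 0xC3 is not followed by a continuation byte
```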

The newly introduced function mb_scrub() tries to “repair” multibyte strings by replacing invalid byte sequences with a special “unknown” character, which by default is a plain question mark. Note that this changes the string, which may lead to undesired effects. It will, however, allow you to take at least the “good” parts of the string and continue to work with them.

Let us create a byte sequence that is not valid UTF-8:

$string = "te\xc3st";

var_dump(mb_scrub($string));

This will result in:

string(5) "te?st"
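
Since scrubbing silently alters the data, it can be useful to know whether anything was replaced. The sketch below passes the encoding explicitly as the optional second argument of mb_scrub() and compares the result with the input. The helper scrubUtf8() and its return format are assumptions made for this example and are not part of mbstring:

```php
<?php
// Hypothetical helper: scrub $input as UTF-8 and report whether it was altered.
function scrubUtf8(string $input): array
{
    // Passing the encoding explicitly avoids relying on the configured default.
    $clean = mb_scrub($input, 'UTF-8');

    return ['clean' => $clean, 'changed' => $clean !== $input];
}

$broken = scrubUtf8("te\xc3st");
var_dump($broken['changed']);                           // bool(true)
var_dump(mb_check_encoding($broken['clean'], 'UTF-8')); // bool(true)

$intact = scrubUtf8("test");
var_dump($intact['changed']);                           // bool(false)
```

An application could, for example, log every changed string so that the affected records can be reviewed later.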