PHP: Going multibytes

spO0q 🐒🎃 - Aug 11 - Dev Community

Multibyte characters can be tricky in programming.

Warning

mbstring is not enabled by default. Make sure the extension is installed and enabled before you use the functions below.
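A quick way to confirm the extension is available before relying on it is sketched below. `extension_loaded()` and `function_exists()` are core PHP functions; the `hasMbstring()` wrapper is a hypothetical name used only for illustration.

```php
<?php
// Hypothetical helper: true when the mbstring extension is loaded.
function hasMbstring(): bool
{
    return extension_loaded('mbstring') && function_exists('mb_strlen');
}

if (!hasMbstring()) {
    // How to enable it depends on your platform (package manager, php.ini, container image).
    echo 'mbstring is missing: mb_* functions are unavailable.' . PHP_EOL;
}
```

On the command line, `php -m` lists the loaded extensions, so you can look for `mbstring` there as well.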

Why bother with multibyte strings?

A document can contain multibyte strings. While PHP has plenty of useful helpers for strings, the standard ones (strlen, strpos, substr, and so on) work on bytes, not on characters, so they are simply not meant for multibyte strings.

Using them on multibyte text will likely cause nasty bugs and other unexpected errors, especially when you count chars.

That's why you should use PHP's Multibyte String Functions instead.

Besides, new multibyte string functions, such as mb_trim, mb_ltrim, and mb_rtrim, will be available in PHP 8.4 (the next release of PHP at the time of writing).
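If your code has to run on versions older than 8.4 as well, one option is a small wrapper that uses the native function when it exists. This is only a sketch: `trimMultibyte()` is a hypothetical name, and the regular-expression fallback is an approximation of `mb_trim`, not an exact reimplementation.

```php
<?php
// Hypothetical wrapper: prefer the native mb_trim() (PHP 8.4+), otherwise
// approximate it with a Unicode-aware regular expression.
function trimMultibyte(string $text): string
{
    if (function_exists('mb_trim')) {
        return mb_trim($text);
    }
    // \s covers ASCII whitespace, \p{Z} covers Unicode separators such as U+3000.
    return preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $text) ?? $text;
}

// "\u{3000}" is the ideographic (full-width) space, which trim() ignores by default.
var_dump(trimMultibyte("\u{3000}チャーミング\u{3000} ") === "チャーミング");
```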

Why do some characters require multiple bytes?

English text fits in the ASCII character set, so letters like r or s only require one byte.

In contrast, many languages use characters that need more than one byte. In UTF-8, a character takes between 1 and 4 bytes: accented Latin letters usually take 2, most Han characters take 3, and emoji typically take 4.
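You can check those sizes yourself, since `strlen` reports bytes. The sample characters below are my own picks for illustration; the `"\u{...}"` escapes always produce UTF-8, whatever the encoding of the source file.

```php
<?php
// Byte length of single characters encoded as UTF-8.
$samples = [
    'ASCII letter'    => 'r',
    'accented letter' => "\u{00E9}",   // é
    'Han character'   => "\u{6F22}",   // 漢
    'emoji'           => "\u{1F600}",  // 😀
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes; bin2hex() shows the raw bytes behind the character.
    printf("%-16s %d byte(s): %s\n", $label, strlen($char), bin2hex($char));
}
```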

A few examples

Count chars

$strings = [
    "😀😃😄😁😆",
    "チャーミング",
    "González",
];

foreach ($strings as $string) {
    echo 'strlen:' . strlen($string) . ' vs. mb_strlen:' . mb_strlen($string) . PHP_EOL;
}
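For reference, here are the numbers to expect, written as a self-checking sketch. It assumes the strings are UTF-8 and that the "á" in "González" is the single precomposed character U+00E1; the encoding is passed explicitly so the result does not depend on configuration.

```php
<?php
// Each case: [string, expected byte count, expected character count].
$cases = [
    ["\u{1F600}\u{1F603}\u{1F604}\u{1F601}\u{1F606}", 20, 5], // five emoji, 4 bytes each
    ["チャーミング", 18, 6],                                    // six katakana, 3 bytes each
    ["Gonz\u{00E1}lez", 9, 8],                                 // only "á" takes 2 bytes
];

foreach ($cases as [$string, $bytes, $chars]) {
    var_dump(strlen($string) === $bytes, mb_strlen($string, 'UTF-8') === $chars);
}

// mb_strlen() counts code points, not what a reader sees as one letter:
// "a" followed by a combining acute accent (U+0301) counts as two.
var_dump(mb_strlen("Gonza\u{0301}lez", 'UTF-8')); // int(9)
```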

Find position

echo strpos("チャーミング", "ャ"); // gives 3
echo mb_strpos("チャーミング", "ャ"); // gives 1 because 1st position is 0
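Two pitfalls follow from this difference; the sketch below illustrates both, assuming UTF-8 input. Byte offsets and character offsets are different units, so they should not be mixed, and a "not found" result has to be told apart from position 0.

```php
<?php
$text = "チャーミング";

$bytePos = strpos($text, "ミ");                // byte offset: 9
$charPos = mb_strpos($text, "ミ", 0, 'UTF-8'); // character offset: 3

// mb_strpos() returns false when nothing is found, and 0 is a valid position,
// so compare with !== false rather than relying on truthiness.
if ($charPos !== false) {
    echo mb_substr($text, $charPos, null, 'UTF-8') . PHP_EOL; // ミング
}

// Mixing units: a byte offset passed to mb_substr() overshoots the 6-character string.
var_dump(mb_substr($text, $bytePos, null, 'UTF-8')); // string(0) ""
```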

Cut string

echo substr("チャーミング", 3) . PHP_EOL;// ャーミング
echo mb_substr("チャーミング", 3);// ミング
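The `substr` call above only looks right because 3 happens to be a character boundary in bytes. A sketch of what happens otherwise, assuming UTF-8 input:

```php
<?php
$text = "チャーミング";

// Cutting at a byte offset that is not a character boundary yields invalid UTF-8.
$broken = substr($text, 1);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// mb_strcut() also works with byte offsets, but moves the cut to a character
// boundary: handy when a storage limit is expressed in bytes.
$fits = mb_strcut($text, 0, 4, 'UTF-8');
var_dump($fits === "チ", strlen($fits)); // bool(true), int(3)
```

In short, `mb_substr` is the usual choice when you think in characters, while `mb_strcut` may fit better when the limit is a byte budget (a database column size, for example).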

Impact on performance

You might read that mbstring functions can have a significant impact on performance.

You may even reproduce it with the following script:

$cnt = 100000;

$strs = [
    'empty' => '',
    'short' => 'zluty kun',
    'short_with_uc' => 'zluty Kun',
    'long' => str_repeat('this is about 10000 chars long string', 270),
    'long_with_uc' => str_repeat('this is about 10000 chars long String', 270),
    'short_utf8' => 'žlutý kůň',
    'short_utf8_with_uc' => 'Žlutý kŮň',
];

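foreach ($strs as $k => $str) {
    // Time $cnt calls to the byte-based strtolower().
    $a1 = microtime(true);
    for ($i = 0; $i < $cnt; ++$i) {
        $res = strtolower($str);
    }
    $t1 = microtime(true) - $a1;
    // echo 'it took ' . round($t1 * 1000, 3) . ' ms for strtolower' . "\n";

    // Time $cnt calls to the multibyte-aware mb_strtolower().
    $a2 = microtime(true);
    for ($i = 0; $i < $cnt; ++$i) {
        $res = mb_strtolower($str);
    }
    $t2 = microtime(true) - $a2;
    // echo 'it took ' . round($t2 * 1000, 3) . ' ms for mb_strtolower' . "\n";

    // Guard against a zero-length interval on very small runs.
    $ratio = $t1 > 0 ? round($t2 / $t1, 2) : INF;
    echo 'strtolower is ' . $ratio . 'x faster than mb_strtolower for ' . $k . "\n\n";
}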

Source: PHP bugs

mb_* functions are slower, but it's always a tradeoff, and only the context should determine whether you should use these helpers or make your own.
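As an illustration of "making your own", one possible approach is to take the byte-based function only when the input is plain ASCII. This is a sketch, not a recommendation: `lowerFastPath()` is a hypothetical name, the input is assumed to be UTF-8, and whether it actually saves time depends on your data, since the ASCII check has a cost of its own.

```php
<?php
// Hypothetical helper: use the cheaper byte-based strtolower() when the input
// is pure ASCII, and fall back to mb_strtolower() otherwise.
function lowerFastPath(string $text): string
{
    // No byte >= 0x80 means the string is plain ASCII.
    if (preg_match('/[\x80-\xff]/', $text) === 0) {
        return strtolower($text);
    }
    return mb_strtolower($text, 'UTF-8');
}

var_dump(lowerFastPath('zluty Kun'));  // ASCII input, handled by strtolower()
var_dump(lowerFastPath('Žlutý kŮň'));  // multibyte input, handled by mb_strtolower()
```

Measure with your own strings before adopting something like this.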

For example, if you replace $cnt = 100000; with $cnt = 100; in the above script, mb_* helpers are still significantly slower in relative terms, but the absolute cost might be fine in your case (e.g., 0.008 ms vs. 0.004 ms).

Wrap up

You must take multibyte characters into account, especially in a multilingual context, and PHP has built-in helpers for that.
