PHP 5.6.0beta1 released

mb_check_encoding

(PHP 4 >= 4.4.3, PHP 5 >= 5.1.3)

mb_check_encodingVérifie si une chaîne est valide pour un encodage spécifique

Description

bool mb_check_encoding ([ string $var = NULL [, string $encoding = mb_internal_encoding() ]] )

Vérifie si le flux d'octets est valide pour l'encodage spécifique.

Liste de paramètres

var

Le flux d'octets à vérifier. Si elle est omise, cette fonction vérifie toutes les entrées depuis le début de la requête.

encoding

Encodage attendu.

Valeurs de retour

Cette fonction retourne TRUE en cas de succès ou FALSE si une erreur survient.

add a note add a note

User Contributed Notes 6 notes

up
3
Stefan W
8 months ago
Note that the algorithm in javalc6's comment checks UTF-8 compliance by the letter of the specs.

This means that overlong byte sequences will pass. For example: 0xC0 0xAF can be used to encode U+002F, the slash character. While legal, this character is more properly encoded as 0x2F. Overlong sequences are unnecessary and should be avoided; they have been - and still are - used in various attacks (like directory traversal attacks).

It also means that high Unicode characters outside the Basic Multilingual Plane will pass; this means characters above U+FFFF, composed of 4+ bytes (hieroglyhps, cuneiform, etc). You need to decide if you want those characters or not. If you do, be aware that they often cause compatibility problems (for example with JSON and some databases).

mb_check_encoding(), mb_detect_encoding(x, y, TRUE), and the other comments up to now all reject characters outside the BMP and overlong sequences.
up
1
jbricci at ya-right dot com
5 years ago
This function does not check for bad byte sequence(s), it only checks if the byte stream is valid. If you want to verify a encoded string is valid, (IE: does not contain any bad byte sequences do the following...

<?php

/* check a strings encoded value */

function checkEncoding ( $string, $string_encoding )
{
   
$fs = $string_encoding == 'UTF-8' ? 'UTF-32' : $string_encoding;

   
$ts = $string_encoding == 'UTF-32' ? 'UTF-8' : $string_encoding;

    return
$string === mb_convert_encoding ( mb_convert_encoding ( $string, $fs, $ts ), $ts, $fs );
}

/* test 1 variables */

$string = "\x00\x81";

$encoding = "Shift_JIS";

/* test 1 mb_check_encoding (test for bad byte stream) */

if ( true === mb_check_encoding ( $string, $encoding ) )
{
    echo
'valid (' . $encoding . ') encoded byte stream!<br />';
}
else
{
    echo
'invalid (' . $encoding . ') encoded byte stream!<br />';
}

/* test 1 checkEncoding (test for bad byte sequence(s)) */

if ( true === checkEncoding ( $string, $encoding ) )
{
    echo
'valid (' . $encoding . ') encoded byte sequence!<br />';
}
else
{
    echo
'invalid (' . $encoding . ') encoded byte sequence!<br />';
}

/* test 2 */

/* test 2 variables */

$string = "\x00\xE3";

$encoding = "UTF-8";

/* test 2 mb_check_encoding (test for bad byte stream) */

if ( true === mb_check_encoding ( $string, $encoding ) )
{
    echo
'valid (' . $encoding . ') encoded byte stream!<br />';
}
else
{
    echo
'invalid (' . $encoding . ') encoded byte stream!<br />';
}

/* test 2 checkEncoding (test for bad byte sequence(s)) */

if ( true === checkEncoding ( $string, $encoding ) )
{
    echo
'valid (' . $encoding . ') encoded byte sequence!<br />';
}
else
{
    echo
'invalid (' . $encoding . ') encoded byte sequence!<br />';
}

?>
up
0
eyecatchup at gmail dot com
3 months ago
Unlike other comments suggest, there's no need to serialize a string to use preg_match's "u" modifier for testing if a string is valid UTF-8. You can just use
<?php
function is_utf8($str) {
    return (bool)
preg_match('//u', $str);
}
up
-1
CertaiN
8 months ago
For supporting non-scalar variables,

<?php
function validate_utf8($input) {
    return (bool)
preg_match('//u', serialize($input));
}
up
-1
CertaiN
8 months ago
The best way to validate UTF-8 sequence.
This works for not only scalar, but also array and object recursively.

<?php

function is_valid_utf8($text) {
    return (bool)
preg_match('//u', serialize($text));
}

?>
up
-2
javalc6 at gmail dot com
4 years ago
In order to check if a string is encoded correctly in utf-8, I suggest the following function, that implements the RFC3629 better than mb_check_encoding():

<?php
function check_utf8($str) {
   
$len = strlen($str);
    for(
$i = 0; $i < $len; $i++){
       
$c = ord($str[$i]);
        if (
$c > 128) {
            if ((
$c > 247)) return false;
            elseif (
$c > 239) $bytes = 4;
            elseif (
$c > 223) $bytes = 3;
            elseif (
$c > 191) $bytes = 2;
            else return
false;
            if ((
$i + $bytes) > $len) return false;
            while (
$bytes > 1) {
               
$i++;
               
$b = ord($str[$i]);
                if (
$b < 128 || $b > 191) return false;
               
$bytes--;
            }
        }
    }
    return
true;
}
// end of check_utf8
?>
To Top