grapheme_extract

(PHP 5 >= 5.3.0, PECL intl >= 1.0.0)

grapheme_extract — Function to extract a sequence of default grapheme clusters from a text buffer, which must be encoded in UTF-8.

Description

Procedural style

string grapheme_extract ( string $haystack , int $size [, int $extract_type [, int $start = 0 [, int &$next ]]] )

Function to extract a sequence of default grapheme clusters from a text buffer, which must be encoded in UTF-8.

Parameters

haystack

String to search.

size

Maximum number of items to return, counted in the units selected by $extract_type.

extract_type

Defines the type of units referred to by the $size parameter:

GRAPHEME_EXTR_COUNT (default) - $size is the number of default grapheme clusters to extract.
GRAPHEME_EXTR_MAXBYTES - $size is the maximum number of bytes returned.
GRAPHEME_EXTR_MAXCHARS - $size is the maximum number of UTF-8 characters returned.
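
The three extraction types can be compared on one short input. The following is a minimal sketch, assuming the intl extension is loaded; the variable names are illustrative only. It relies on the result always ending on a grapheme cluster boundary, so a cluster that does not fit within $size as a whole is left out.

```php
<?php
// Assumption: the intl extension is available.
// Two decomposed (NFD) letters; each is 1 grapheme cluster, 2 code points, 3 bytes.
$text = "a\xCC\x8A" . "o\xCC\x88";

// One grapheme cluster, whatever its length in bytes or code points.
$byCount = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT);

// At most 4 bytes: only the first cluster (3 bytes) fits as a whole.
$byBytes = grapheme_extract($text, 4, GRAPHEME_EXTR_MAXBYTES);

// At most 3 code points: only the first cluster (2 code points) fits as a whole.
$byChars = grapheme_extract($text, 3, GRAPHEME_EXTR_MAXCHARS);

echo urlencode($byCount), "\n"; // a%CC%8A
echo urlencode($byBytes), "\n"; // a%CC%8A
echo urlencode($byChars), "\n"; // a%CC%8A
```

All three calls return the same first cluster here because the second cluster would exceed the byte and code point limits.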

start

Starting position in $haystack in bytes - if given, it must be zero or a positive value that is less than or equal to the length of $haystack in bytes. If $start does not point to the first byte of a UTF-8 character, the start position is moved to the next character boundary.

next

Reference to a value that will be set to the next starting position. When the call returns, this may point to the first byte position past the end of the string.
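
The interplay of $start and $next can be sketched as a loop that feeds each returned $next back in as the following $start. This is an illustrative sketch, assuming the intl extension is loaded; the $offsets array is a hypothetical helper used only to record the byte ranges.

```php
<?php
// Assumption: the intl extension is available.
$text = "a\xCC\x8A" . "o\xCC\x88"; // two grapheme clusters, 3 bytes each
$length = strlen($text);
$offsets = []; // hypothetical helper: [start, next] byte offsets per cluster
$next = 0;

while ($next < $length) {
    $start = $next;
    $cluster = grapheme_extract($text, 1, GRAPHEME_EXTR_COUNT, $start, $next);
    if ($cluster === false || $cluster === '') {
        break; // stop instead of looping forever if nothing was extracted
    }
    $offsets[] = [$start, $next];
    printf("%d -> %d\n", $start, $next);
}
// Prints:
// 0 -> 3
// 3 -> 6
```

After the last cluster, $next equals strlen($text), which is the first byte position past the end of the string.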

Return Values

A string starting at offset $start and ending on a default grapheme cluster boundary that conforms to the $size and $extract_type specified.

Examples

Example #1 grapheme_extract() example


<?php

$char_a_ring_nfd = "a\xCC\x8A";  // 'LATIN SMALL LETTER A WITH RING ABOVE' (U+00E5) normalization form "D"
$char_o_diaeresis_nfd = "o\xCC\x88"; // 'LATIN SMALL LETTER O WITH DIAERESIS' (U+00F6) normalization form "D"

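// Byte offset 2 falls inside the combining ring above (\xCC\x8A), so extraction
// begins at the next character boundary, the "o" at byte offset 3.
print urlencode(grapheme_extract( $char_a_ring_nfd . $char_o_diaeresis_nfd, 1, GRAPHEME_EXTR_COUNT, 2));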

?>

The above example will output:

o%CC%88

See Also

grapheme_substr() - Return part of a string
» Unicode Text Segmentation: Grapheme Cluster Boundaries

User Contributed Notes 2 notes

AJH

12 years ago


Here's how to use grapheme_extract() to loop across a UTF-8 string character by character.

<?php

$str = "سabcक’…";
// if the previous line didn't come through, the string contained:
//U+0633,U+0061,U+0062,U+0063,U+0915,U+2019,U+2026

$n = 0;

for (    $start = 0, $next = 0, $maxbytes = strlen($str), $c = '';
        $start < $maxbytes;
        $c = grapheme_extract($str, 1, GRAPHEME_EXTR_MAXCHARS , ($start = $next), $next)
    )
{
    // Compare strictly: empty() would also skip the valid character "0".
    if ($c === '' || $c === false)
        continue;
    echo "This utf8 character is " . strlen($c) . " bytes long and its first byte is " . ord($c[0]) . "\n";
    $n++;
}
echo "$n UTF-8 characters in a string of $maxbytes bytes!\n";
// Should print: 7 UTF-8 characters in a string of 14 bytes!
?>

yevgen dot grytsay at gmail dot com

3 years ago


Looping through grapheme clusters:

<?php

// Example taken from Rust documentation: https://doc.rust-lang.org/book/ch08-02-strings.html#bytes-and-scalar-values-and-grapheme-clusters-oh-my
$str = "नमस्ते";
// Alternatively:
//$str = pack('C*', ...[224, 164, 168, 224, 164, 174, 224, 164, 184, 224, 165, 141, 224, 164, 164, 224, 165, 135]);
$next = 0;
$maxbytes = strlen($str);

var_dump($str);

while ($next < $maxbytes) {
    $char = grapheme_extract($str, 1, GRAPHEME_EXTR_COUNT, $next, $next);
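    // Stop on failure or an empty result; with "continue", a call that does not
    // advance $next would loop forever, and empty() would also reject "0".
    if ($char === false || $char === '') {
        break;
    }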
    echo "{$char} - This utf8 character is " . strlen($char) . ' bytes long', PHP_EOL;
}

//string(18) "नमस्ते"
//न - This utf8 character is 3 bytes long
//म - This utf8 character is 3 bytes long
//स् - This utf8 character is 6 bytes long
//ते - This utf8 character is 6 bytes long
?>
