(PHP 5 >= 5.3.0, PECL intl >= 1.0.0)
Normalizer::normalize -- normalizer_normalize — Normalizes the input provided and returns the normalized string
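Object oriented style
public static string Normalizer::normalize ( string $input [, string $form = Normalizer::FORM_C ] )
Procedural style
string normalizer_normalize ( string $input [, string $form = Normalizer::FORM_C ] )
Normalizes the input provided and returns the normalized string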
input
The input string to normalize
form
One of the normalization forms: Normalizer::FORM_C (the default), Normalizer::FORM_D, Normalizer::FORM_KC or Normalizer::FORM_KD.
The normalized string, or FALSE if an error occurred (for example, when input is not valid UTF-8).
Example #1 normalizer_normalize() example
<?php
$char_A_ring = "\xC3\x85"; // 'LATIN CAPITAL LETTER A WITH RING ABOVE' (U+00C5)
$char_combining_ring_above = "\xCC\x8A"; // 'COMBINING RING ABOVE' (U+030A)
$char_1 = normalizer_normalize( $char_A_ring, Normalizer::FORM_C );
$char_2 = normalizer_normalize( 'A' . $char_combining_ring_above, Normalizer::FORM_C );
echo urlencode($char_1);
echo ' ';
echo urlencode($char_2);
?>
Example #2 OO example
<?php
$char_A_ring = "\xC3\x85"; // 'LATIN CAPITAL LETTER A WITH RING ABOVE' (U+00C5)
$char_combining_ring_above = "\xCC\x8A"; // 'COMBINING RING ABOVE' (U+030A)
$char_1 = Normalizer::normalize( $char_A_ring, Normalizer::FORM_C );
$char_2 = Normalizer::normalize( 'A' . $char_combining_ring_above, Normalizer::FORM_C );
echo urlencode($char_1);
echo ' ';
echo urlencode($char_2);
?>
The above example will output:
%C3%85 %C3%85
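Both examples above use only FORM_C. As a minimal sketch of how the other forms differ (assuming the intl extension is loaded; the variable names are illustrative only), the following compares composed and decomposed output and shows one compatibility mapping:

```php
<?php
// Requires the intl extension.
$composed   = "\xC3\x85";  // U+00C5: one code point, two UTF-8 bytes
$decomposed = "A\xCC\x8A"; // U+0041 U+030A: two code points, three UTF-8 bytes

$nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C); // compose
$nfd = Normalizer::normalize($composed, Normalizer::FORM_D);   // decompose

var_dump($nfc === $composed);   // bool(true)
var_dump($nfd === $decomposed); // bool(true)
var_dump(strlen($nfc));         // int(2)
var_dump(strlen($nfd));         // int(3)

// The compatibility forms (FORM_KC, FORM_KD) additionally fold
// compatibility characters such as ligatures.
$ligature = "\xEF\xAC\x81"; // U+FB01 LATIN SMALL LIGATURE FI
var_dump(Normalizer::normalize($ligature, Normalizer::FORM_KC));              // string(2) "fi"
var_dump(Normalizer::normalize($ligature, Normalizer::FORM_C) === $ligature); // bool(true)
?>
```

FORM_C is usually the form to choose for storage and comparison, while FORM_D is useful as an intermediate step when the combining marks themselves need to be inspected or removed.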
o_shes01 at uni-muenster dot de (2011-01-23 08:59:55)
This method/function will return boolean false if $input is not a valid UTF-8 string, e.g.
<?php
var_dump(Normalizer::normalize("\xFF"));
// prints "bool(false)"
?>
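Because of this, the return value is worth checking before it is used. A possible sketch follows; normalizeOrFallback() is a hypothetical helper name, not part of the intl extension:

```php
<?php
// Hypothetical helper: returns the NFC form of $input,
// or $fallback when normalization fails.
function normalizeOrFallback($input, $fallback = '')
{
    $result = Normalizer::normalize($input, Normalizer::FORM_C);

    // Compare strictly: an empty string is a legitimate result
    // and must not be mistaken for a failure.
    if ($result === false || $result === null) {
        return $fallback;
    }

    return $result;
}

var_dump(normalizeOrFallback("A\xCC\x8A") === "\xC3\x85"); // bool(true)
var_dump(normalizeOrFallback("\xFF", 'invalid'));           // string(7) "invalid"
?>
```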
akniep at rayo dot info (2009-07-30 09:03:53)
When matching texts against each other or against keywords, it helps to normalize them first.
The following function removes all diacritics (marks such as accents) from a given UTF-8 encoded text and returns ASCII text.
Make sure the PHP intl extension (which provides the Normalizer class and depends on ICU) is installed.
Tip: You may also want to map the text to lower case before running the matching procedures.
<?php
function normalizeUtf8String($s)
{
    // keep the input so it can be returned unchanged if anything fails
    $original_string = $s;

    // Normalizer class missing!
    if (!class_exists('Normalizer', false)) {
        return $original_string;
    }

    // maps German umlauts and other European characters onto two characters before just removing diacritics
    $s = preg_replace('@\x{00c4}@u', "AE", $s); // umlaut Ä => AE
    $s = preg_replace('@\x{00d6}@u', "OE", $s); // umlaut Ö => OE
    $s = preg_replace('@\x{00dc}@u', "UE", $s); // umlaut Ü => UE
    $s = preg_replace('@\x{00e4}@u', "ae", $s); // umlaut ä => ae
    $s = preg_replace('@\x{00f6}@u', "oe", $s); // umlaut ö => oe
    $s = preg_replace('@\x{00fc}@u', "ue", $s); // umlaut ü => ue
    $s = preg_replace('@\x{00f1}@u', "ny", $s); // ñ => ny
    $s = preg_replace('@\x{00ff}@u', "yu", $s); // ÿ => yu

    // maps special characters (characters with diacritics) onto their base character followed by the diacritical mark
    // example: Ú => U + combining acute accent, á => a + combining acute accent
    $s = Normalizer::normalize($s, Normalizer::FORM_D);

    $s = preg_replace('@\pM@u', "", $s); // removes diacritics

    $s = preg_replace('@\x{00df}@u', "ss", $s); // maps German ß onto ss
    $s = preg_replace('@\x{00c6}@u', "AE", $s); // Æ => AE
    $s = preg_replace('@\x{00e6}@u', "ae", $s); // æ => ae
    $s = preg_replace('@\x{0132}@u', "IJ", $s); // Ĳ => IJ
    $s = preg_replace('@\x{0133}@u', "ij", $s); // ĳ => ij
    $s = preg_replace('@\x{0152}@u', "OE", $s); // Œ => OE
    $s = preg_replace('@\x{0153}@u', "oe", $s); // œ => oe

    $s = preg_replace('@\x{00d0}@u', "D", $s); // Ð => D
    $s = preg_replace('@\x{0110}@u', "D", $s); // Đ => D
    $s = preg_replace('@\x{00f0}@u', "d", $s); // ð => d
    $s = preg_replace('@\x{0111}@u', "d", $s); // đ => d
    $s = preg_replace('@\x{0126}@u', "H", $s); // Ħ => H
    $s = preg_replace('@\x{0127}@u', "h", $s); // ħ => h
    $s = preg_replace('@\x{0131}@u', "i", $s); // ı => i
    $s = preg_replace('@\x{0138}@u', "k", $s); // ĸ => k
    $s = preg_replace('@\x{013f}@u', "L", $s); // Ŀ => L
    $s = preg_replace('@\x{0141}@u', "L", $s); // Ł => L
    $s = preg_replace('@\x{0140}@u', "l", $s); // ŀ => l
    $s = preg_replace('@\x{0142}@u', "l", $s); // ł => l
    $s = preg_replace('@\x{014a}@u', "N", $s); // Ŋ => N
    $s = preg_replace('@\x{0149}@u', "n", $s); // ŉ => n
    $s = preg_replace('@\x{014b}@u', "n", $s); // ŋ => n
    $s = preg_replace('@\x{00d8}@u', "O", $s); // Ø => O
    $s = preg_replace('@\x{00f8}@u', "o", $s); // ø => o
    $s = preg_replace('@\x{017f}@u', "s", $s); // ſ => s
    $s = preg_replace('@\x{00de}@u', "T", $s); // Þ => T
    $s = preg_replace('@\x{0166}@u', "T", $s); // Ŧ => T
    $s = preg_replace('@\x{00fe}@u', "t", $s); // þ => t
    $s = preg_replace('@\x{0167}@u', "t", $s); // ŧ => t

    // remove all non-ASCII characters (ASCII ends at 0x7F, not 0x80)
    $s = preg_replace('@[^\0-\x7F]@u', "", $s);

    // possible errors in the UTF-8 regular expressions leave $s null or empty
    if (empty($s)) {
        return $original_string;
    }

    return $s;
}
?>
The above function is mainly based on the following article:
http://ahinea.com/en/tech/accented-translate.html
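The core of the function above is the decompose-then-strip step. A reduced sketch of just that step, combined with the lower-casing suggested in the tip, might look as follows. stripDiacritics() is a hypothetical name, and the use of mb_strtolower() assumes the mbstring extension is available in addition to intl:

```php
<?php
// Hypothetical helper; requires the intl extension.
function stripDiacritics($text)
{
    // Decompose: "é" becomes "e" followed by U+0301 COMBINING ACUTE ACCENT
    $decomposed = Normalizer::normalize($text, Normalizer::FORM_D);
    if ($decomposed === false || $decomposed === null) {
        return $text; // not valid UTF-8: leave the input untouched
    }

    // \pM matches any Unicode mark, which includes combining accents
    $stripped = preg_replace('/\pM/u', '', $decomposed);

    return $stripped === null ? $text : $stripped;
}

// Two spellings of "café" that differ in case and in normalization form
$a = mb_strtolower(stripDiacritics("Caf\xC3\xA9"), 'UTF-8');  // precomposed é (U+00E9)
$b = mb_strtolower(stripDiacritics("CAFE\xCC\x81"), 'UTF-8'); // E + combining acute (U+0301)

var_dump($a);        // string(4) "cafe"
var_dump($b);        // string(4) "cafe"
var_dump($a === $b); // bool(true)
?>
```

Unlike the full function, this sketch does no language-specific transliteration (such as ä to ae) and leaves characters without a decomposition, for example ø or ß, unchanged.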