PHP 转换字符的编码

用户评论:

josip at cubrad dot com (2013-06-27 23:24:17)

For my last project I needed to convert several CSV files from Windows-1250 to UTF-8, and after several days of searching around I found a function that is partially solved my problem, but it still has not transformed all the characters. So I made ??this: function w1250_to_utf8($text) { // map based on: // http://konfiguracja.c0.pl/iso02vscp1250en.html // http://konfiguracja.c0.pl/webpl/index_en.html#examp // http://www.htmlentities.com/html/entities/ $map = array( chr(0x8A) => chr(0xA9), chr(0x8C) => chr(0xA6), chr(0x8D) => chr(0xAB), chr(0x8E) => chr(0xAE), chr(0x8F) => chr(0xAC), chr(0x9C) => chr(0xB6), chr(0x9D) => chr(0xBB), chr(0xA1) => chr(0xB7), chr(0xA5) => chr(0xA1), chr(0xBC) => chr(0xA5), chr(0x9F) => chr(0xBC), chr(0xB9) => chr(0xB1), chr(0x9A) => chr(0xB9), chr(0xBE) => chr(0xB5), chr(0x9E) => chr(0xBE), chr(0x80) => '€', chr(0x82) => '&sbquo;', chr(0x84) => '&bdquo;', chr(0x85) => '…', chr(0x86) => '&dagger;', chr(0x87) => '&Dagger;', chr(0x89) => '&permil;', chr(0x8B) => '&lsaquo;', chr(0x91) => '‘', chr(0x92) => '’', chr(0x93) => '“', chr(0x94) => '”', chr(0x95) => '•', chr(0x96) => '–', chr(0x97) => '—', chr(0x99) => '™', chr(0x9B) => '’', chr(0xA6) => '¦', chr(0xA9) => '©', chr(0xAB) => '«', chr(0xAE) => '®', chr(0xB1) => '±', chr(0xB5) => 'µ', chr(0xB6) => '¶', chr(0xB7) => '·', chr(0xBB) => '»', ); return html_entity_decode(mb_convert_encoding(strtr($text, $map), 'UTF-8', 'ISO-8859-2'), ENT_QUOTES, 'UTF-8'); }

urko at wegetit dot eu (2012-09-11 18:17:29)

If you are trying to generate a CSV (with extended chars) to be opened at Exel for Mac, the only that worked for me was: <?php mb_convert_encoding( $CSV, 'Windows-1252', 'UTF-8'); ?> I also tried this: <?php //Separado OK, chars MAL iconv('MACINTOSH', 'UTF8', $CSV); //Separado MAL, chars OK chr(255).chr(254).mb_convert_encoding( $CSV, 'UCS-2LE', 'UTF-8'); ?> But the first one didn't show extended chars correctly, and the second one, did't separe fields correctly

qdb at kukmara dot ru (2011-11-07 07:44:54)

mb_substr and probably several other functions works faster in ucs-2 than in utf-8. and utf-16 works slower than utf-8. here is test, ucs-2 is near 50 times faster than utf-8, and utf-16 is near 6 times slower than utf-8 here: <?php header('Content-Type: text/html; charset=utf-8'); mb_internal_encoding('utf-8'); $s='укгез??ш?хз?х?шк2049??лдябчсячмииюсит.июб?рарэ' .'лдэфв??у?й?уй??у034928348539857?шаыдларораш??рлоавы'; $s.=$s; $s.=$s; $s.=$s; $s.=$s; $s.=$s; $s.=$s; $s.=$s; $t1=microtime(true); $i=0; while($i<mb_strlen($s)){ $a=mb_substr($s,$i,2); $i+=2; if($i==10)echo$a.'. '; //echo$a.'. '; } echo$i.'. '; echo(microtime(true)-$t1); echo'<br>'; $s=mb_convert_encoding($s,'UCS-2','utf8'); mb_internal_encoding('UCS-2'); $t1=microtime(true); $i=0; while($i<mb_strlen($s)){ $a=mb_substr($s,$i,2); $i+=2; if($i==10)echo mb_convert_encoding($a,'utf8','ucs2').'. '; //echo$a.'. '; } echo$i.'. '; echo(microtime(true)-$t1); echo'<br>'; $s=mb_convert_encoding($s,'utf-16','ucs-2'); mb_internal_encoding('utf-16'); $t1=microtime(true); $i=0; while($i<mb_strlen($s)){ $a=mb_substr($s,$i,2); $i+=2; if($i==10)echo mb_convert_encoding($a,'utf8','utf-16').'. '; //echo$a.'. '; } echo$i.'. '; echo(microtime(true)-$t1); ?> output: ?х. 12416. 1.71738100052 ?х. 12416. 0.0211279392242 ?х. 12416. 11.2330229282

Alexanderdotbooneatmnsudotedu (2011-06-09 10:50:43)

I was having trouble parsing an Excel document to html, this seems to do the trick for character encoding. <?php function decode_characters($info) { $info = mb_convert_encoding($info, "HTML-ENTITIES", "UTF-8"); $info = preg_replace('~^(&([a-zA-Z0-9]);)~',htmlentities('${1}'),$info); return($info); } ?>

alvaro at demogracia dot com (2011-04-06 03:37:32)

You can use the HTML-ENTITIES encoding to insert arbitrary Unicode characters, e.g. "U+25B6 (BLACK RIGHT-POINTING TRIANGLE)": <?php $triangle = mb_convert_encoding('▶', 'UTF-8', 'HTML-ENTITIES'); ?> Be aware though that older PHP versions do not seem to support hexadecimal notation (▶). In such case, use decimal notation (▶): <?php $triangle = mb_convert_encoding('▶', 'UTF-8', 'HTML-ENTITIES'); ?>

gullevek at gullevek dot org (2010-08-25 00:27:44)

If you want to convert japanese to ISO-2022-JP it is highly recommended to use ISO-2022-JP-MS as the target encoding instead. This includes the extended character set and avoids ? in the text. For example the often used "1 in a circle" ① will be correctly converted then.

regrunge at hotmail dot it (2010-05-14 08:00:29)

I've been trying to find the charset of a norwegian (with a lot of ?, ?, ?) txt file written on a Mac, i've found it in this way: <?php $text = "A strange string to pass, maybe with some ?, ?, ? characters."; foreach(mb_list_encodings() as $chr){ echo mb_convert_encoding($text, 'UTF-8', $chr)." : ".$chr."<br>"; } ?> The line that looks good, gives you the encoding it was written in. Hope can help someone

Daniel Trebbien (2009-07-23 11:25:38)

Note that `mb_convert_encoding($val, 'HTML-ENTITIES')` does not escape '\'', '"', '<', '>', or '&'.

me at gsnedders dot com (2009-06-18 15:06:42)

It appears that when dealing with an unknown "from encoding" the function will both throw an E_WARNING and proceed to convert the string from ISO-8859-1 to the "to encoding".

alexandrefelipemuller at gmail dot com (2009-02-18 12:06:20)

I used this function insted mb_convert_encoding, because mbstring wasn't enabled at my comercial server. It only suports utf7, 8 e iso 8859-1: <?php function my_convert_encoding($string,$to,$from) { // Convert string to ISO_8859-1 if ($from == "UTF-8") $iso_string = utf8_decode($string); else if ($from == "UTF7-IMAP") $iso_string = imap_utf7_decode($string); else $iso_string = $string; // Convert ISO_8859-1 string to result coding if ($to == "UTF-8") return(utf8_encode($iso_string)); else if ($to == "UTF7-IMAP") return(imap_utf7_encode($iso_string)); else return($iso_string); } ?>

chzhang at gmail dot com (2009-01-05 00:34:31)

instead of ini_set(), you can try this mb_substitute_character("none");

francois at bonzon point com (2008-11-10 17:05:38)

aaron, to discard unsupported characters instead of printing a ?, you might as well simply set the configuration directive: mbstring.substitute_character = "none" in your php.ini. Be sure to include the quotes around none. Or at run-time with <?php ini_set('mbstring.substitute_character', "none"); ?>

aaron at aarongough dot com (2008-11-07 08:24:46)

My solution below was slightly incorrect, so here is the correct version (I posted at the end of a long day, never a good idea!) Again, this is a quick and dirty solution to stop mb_convert_encoding from filling your string with question marks whenever it encounters an illegal character for the target encoding. <?php function convert_to ( $source, $target_encoding ) { // detect the character encoding of the incoming file $encoding = mb_detect_encoding( $source, "auto" ); // escape all of the question marks so we can remove artifacts from // the unicode conversion process $target = str_replace( "?", "[question_mark]", $source ); // convert the string to the target encoding $target = mb_convert_encoding( $target, $target_encoding, $encoding); // remove any question marks that have been introduced because of illegal characters $target = str_replace( "?", "", $target ); // replace the token string "[question_mark]" with the symbol "?" $target = str_replace( "[question_mark]", "?", $target ); return $target; } ?> Hope this helps someone! (Admins should feel free to delete my previous, incorrect, post for clarity) -A

Edward (2008-09-16 03:54:55)

If mb_convert_encoding doesn't work for you, and iconv gives you a headache, you might be interested in this free class I found. It can convert almost any charset to almost any other charset. I think it's wonderful and I wish I had found it earlier. It would have saved me tons of headache. I use it as a fail-safe, in case mb_convert_encoding is not installed. Download it from http://mikolajj.republika.pl/ This is not my own library, so technically it's not spamming, right? ;) Hope this helps.

StigC (2008-08-13 15:38:47)

For the php-noobs (like me) - working with flash and php. Here's a simple snippet of code that worked great for me, getting php to show special Danish characters, from a Flash email form: <?php // Name Escape $escName = mb_convert_encoding($_POST["Name"], "ISO-8859-1", "UTF-8"); // message escape $escMessage = mb_convert_encoding($_POST["Message"], "ISO-8859-1", "UTF-8"); // Headers.. and so on... ?>

nospam at nihonbunka dot com (2008-05-15 18:51:34)

rodrigo at bb2 dot co dot jp wrote that inconv works better than mb_convert_encoding, I find that when converting from uft8 to shift_jis $conv_str = mb_convert_encoding($str,$toCS,$fromCS); works while $conv_str = iconv($fromCS,$toCS.'//IGNORE',$str); removes tildes from $str.

katzlbtjunk at hotmail dot com (2008-01-25 04:36:30)

Clean a string for use as filename by simply replacing all unwanted characters with underscore (ASCII converts to 7bit). It removes slightly more chars than necessary. Hope its useful. $fileName = 'Test:!"$%&/()=?????ü<<'; echo strtr(mb_convert_encoding($fileName,'ASCII'), ' ,;:?*#!§$%&/(){}<>=`?|\\\'"', '____________________________');

rodrigo at bb2 dot co dot jp (2008-01-15 03:47:52)

For those who can?t use mb_convert_encoding() to convert from one charset to another as a metter of lower version of php, try iconv(). I had this problem converting to japanese charset: $txt=mb_convert_encoding($txt,'SJIS',$this->encode); And I could fix it by using this: $txt = iconv('UTF-8', 'SJIS', $txt); Maybe it?s helpfull for someone else! ;)

mightye at gmail dot com (2007-11-13 09:24:48)

To petruzanauticoyahoo?com!ar If you don't specify a source encoding, then it assumes the internal (default) encoding. ? is a multi-byte character whose bytes in your configuration default (often iso-8859-1) would actually mean ?±. mb_convert_encoding() is upgrading those characters to their multi-byte equivalents within UTF-8. Try this instead: <?php print mb_convert_encoding( "?", "UTF-8", "UTF-8" ); ?> Of course this function does no work (for the most part - it can actually be used to strip characters which are not valid for UTF-8).

volker at machon dot biz (2007-09-24 21:05:34)

Hey guys. For everybody who's looking for a function that is converting an iso-string to utf8 or an utf8-string to iso, here's your solution: public function encodeToUtf8($string) { return mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true)); } public function encodeToIso($string) { return mb_convert_encoding($string, "ISO-8859-1", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true)); } For me these functions are working fine. Give it a try

aofg (2007-08-21 18:49:55)

When converting Japanese strings to ISO-2022-JP or JIS on PHP >= 5.2.1, you can use "ISO-2022-JP-MS" instead of them. Kishu-Izon (platform dependent) characters are converted correctly with the encoding, as same as with eucJP-win or with SJIS-win.

David Hull (2006-12-20 10:52:40)

As an alternative to Johannes's suggestion for converting strings from other character sets to a 7bit representation while not just deleting latin diacritics, you might try this: <?php $text = iconv($from_enc, 'US-ASCII//TRANSLIT', $text); ?> The only disadvantage is that it does not convert "?" to "ae", but it handles punctuation and other special characters better. -- David

phpdoc at jeudi dot de (2006-09-05 06:46:41)

I\'d like to share some code to convert latin diacritics to their traditional 7bit representation, like, for example, - à,ç,é,î,... to a,c,e,i,... - ß to ss - ä,Ä,... to ae,Ae,... - ë,... to e,... (mb_convert \"7bit\" would simply delete any offending characters). I might have missed on your country\'s typographic conventions--correct me then. <?php /** * @args string $text line of encoded text * string $from_enc (encoding type of $text, e.g. UTF-8, ISO-8859-1) * * @returns 7bit representation */ function to7bit($text,$from_enc) { $text = mb_convert_encoding($text,\'HTML-ENTITIES\',$from_enc); $text = preg_replace( array(\'/ß/\',\'/&(..)lig;/\', \'/&([aouAOU])uml;/\',\'/&(.)[^;]*;/\'), array(\'ss\',\"$1\",\"$1\".\'e\',\"$1\"), $text); return $text; } ?> Enjoy :-) Johannes == [EDIT BY danbrown AT php DOT net: Author provided the following update on 27-FEB-2012.] == An addendum to my "to7bit" function referenced below in the notes. The function is supposed to solve the problem that some languages require a different 7bit rendering of special (umlauted) characters for sorting or other applications. For example, the German ß ligature is usually written "ss" in 7bit context. Dutch ÿ is typically rendered "ij" (not "y"). The original function works well with word (alphabet) character entities and I've seen it used in many places. But non-word entities cause funny results: E.g., "©" is rendered as "c", "" as "s" and "&rquo;" as "r". The following version fixes this by converting non-alphanumeric characters (also chains thereof) to '_'. <?php /** * @args string $text line of encoded text * string $from_enc (encoding type of $text, e.g. UTF-8, ISO-8859-1) * * @returns 7bit representation */ function to7bit($text,$from_enc) { $text = preg_replace(/W+/,'_',$text); $text = mb_convert_encoding($text,'HTML-ENTITIES',$from_enc); $text = preg_replace( array('/ß/','/&(..)lig;/', '/&([aouAOU])uml;/','/ÿ/','/&(.)[^;]*;/'), array('ss',"$1","$1".'e','ij',"$1"), $text); return $text; } ?> Enjoy again, Johannes

mac.com@nemo (2006-07-08 07:38:47)

For those wanting to convert from $set to MacRoman, use iconv(): <?php $string = iconv('UTF-8', 'macintosh', $string); ?> ('macintosh' is the IANA name for the MacRoman character set.)

eion at bigfoot dot com (2006-02-20 16:54:52)

many people below talk about using <?php mb_convert_encode($s,'HTML-ENTITIES','UTF-8'); ?> to convert non-ascii code into html-readable stuff. Due to my webserver being out of my control, I was unable to set the database character set, and whenever PHP made a copy of my $s variable that it had pulled out of the database, it would convert it to nasty latin1 automatically and not leave it in it's beautiful UTF-8 glory. So [insert korean characters here] turned into ?????. I found myself needing to pass by reference (which of course is deprecated/nonexistent in recent versions of PHP) so instead of <?php mb_convert_encode(&$s,'HTML-ENTITIES','UTF-8'); ?> which worked perfectly until I upgraded, so I had to use <?php call_user_func_array('mb_convert_encoding', array(&$s,'HTML-ENTITIES','UTF-8')); ?> Hope it helps someone else out

Tom Class (2005-11-11 07:35:53)

Why did you use the php html encode functions? mbstring has it's own Encoding which is (as far as I tested it) much more usefull: HTML-ENTITIES Example: $text = mb_convert_encoding($text, 'HTML-ENTITIES', "UTF-8");

Stephan van der Feest (2005-09-09 04:47:41)

To add to the Flash conversion comment below, here's how I convert back from what I've stored in a database after converting from Flash HTML text field output, in order to load it back into a Flash HTML text field: function htmltoflash($htmlstr) { return str_replace("<br />","\n", str_replace("<","<", str_replace(">",">", mb_convert_encoding(html_entity_decode($htmlstr), "UTF-8","ISO-8859-1")))); }

Stephan van der Feest (2005-09-09 03:50:54)

Here's a tip for anyone using Flash and PHP for storing HTML output submitted from a Flash text field in a database or whatever. Flash submits its HTML special characters in UTF-8, so you can use the following function to convert those into HTML entity characters: function utf8html($utf8str) { return htmlentities(mb_convert_encoding($utf8str,"ISO-8859-1","UTF-8")); }

jamespilcher1 - hotmail (2004-02-01 19:55:57)

be careful when converting from iso-8859-1 to utf-8. even if you explicitly specify the character encoding of a page as iso-8859-1(via headers and strict xml defs), windows 2000 will ignore that and interpret it as whatever character set it has natively installed. for example, i wrote char #128 into a page, with char encoding iso-8859-1, and it displayed in internet explorer (& mozilla) as a euro symbol. it should have displayed a box, denoting that char #128 is undefined in iso-8859-1. The problem was it was displaying in "Windows: western europe" (my native character set). this led to confusion when i tried to convert this euro to UTF-8 via mb_convert_encoding() IE displays UTF-8 correctly- and because PHP correctly converted #128 into a box in UTF-8, IE would show a box. so all i saw was mb_convert_encoding() converting a euro symbol into a box. It took me a long time to figure out what was going on.

lanka at eurocom dot od dot ua (2003-02-07 08:03:56)

Another sample of recoding without MultiByte enabling. (Russian koi->win, if input in win-encoding already, function recode() returns unchanged string) <?php // 0 - win // 1 - koi function detect_encoding($str) { $win = 0; $koi = 0; for($i=0; $i<strlen($str); $i++) { if( ord($str[$i]) >224 && ord($str[$i]) < 255) $win++; if( ord($str[$i]) >192 && ord($str[$i]) < 223) $koi++; } if( $win < $koi ) { return 1; } else return 0; } // recodes koi to win function koi_to_win($string) { $kw = array(128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 254, 224, 225, 246, 228, 229, 244, 227, 245, 232, 233, 234, 235, 236, 237, 238, 239, 255, 240, 241, 242, 243, 230, 226, 252, 251, 231, 248, 253, 249, 247, 250, 222, 192, 193, 214, 196, 197, 212, 195, 213, 200, 201, 202, 203, 204, 205, 206, 207, 223, 208, 209, 210, 211, 198, 194, 220, 219, 199, 216, 221, 217, 215, 218); $wk = array(128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 225, 226, 247, 231, 228, 229, 246, 250, 233, 234, 235, 236, 237, 238, 239, 240, 242, 243, 244, 245, 230, 232, 227, 254, 251, 253, 255, 249, 248, 252, 224, 241, 193, 194, 215, 199, 196, 197, 214, 218, 201, 202, 203, 204, 205, 206, 207, 208, 210, 211, 212, 213, 198, 200, 195, 222, 219, 221, 223, 217, 216, 220, 192, 209); $end = strlen($string); $pos = 0; do { $c = ord($string[$pos]); if ($c>128) { $string[$pos] = chr($kw[$c-128]); } } while (++$pos < $end); return $string; } function recode($str) { $enc = detect_encoding($str); if ($enc==1) { $str = koi_to_win($str); } return $str; } ?>

mb_convert_encoding

说明

参数

返回值

范例

参见

用户评论: