This is a severe hack I used to get around some character translation issues I ran into with non-Latin character sets. Whenever possible you should use a character set that supports international characters, ideally UTF-8 if your application language supports it. That said, sometimes you need a workaround, and this one will at least make your pages readable for your users. None of this is necessary if you are on a single, unified platform.
I was working with mixed platforms. One application used PHP and UTF-8, while the other used .NET and rendered its HTML with the ISO-8859-1 charset. Behind both was a shared Microsoft SQL Server database. The problem arose when UTF-8 data stored by the PHP application was displayed in the .NET app. Changing the .NET app over to UTF-8 wasn’t an option, so the data had to be converted to a character set that would work.
PHP has a number of tools to assist developers in converting from one character set to another. iconv and the multi-byte string (mb_*) library both have utilities for handling exactly this type of conversion. In this case, however, the mb_* functions were not installed on the server, and the iconv() function did not work as well as expected.
In practice, rather common characters like Microsoft’s right single “smart quote” didn’t convert with iconv(). Further, accented characters from languages like French and Spanish didn’t convert properly despite having suitable replacements in the ISO-8859-1 character set, which left my strings littered with ? characters wherever a replacement wasn’t found. So the hack I opted for was to convert all multi-byte characters to their HTML-escaped equivalents. This places the burden on the browser to decide whether it can display a symbol or not. It also turns out that scripting this conversion is much simpler than trying to map the numerous potential characters to their named HTML equivalents. Any hexadecimal Unicode code point, for example \u2019, the right single quotation mark, can be rewritten as the numeric entity &#x2019;, which will then show as a right single quote.
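As a quick illustration (the string here is just a made-up example), echoing the numeric entity is enough and the browser does the rendering:

// The page source stays plain ASCII; the browser renders the entity as U+2019.
echo 'It&#x2019;s done.';   // displays as: It’s done.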
It also turns out that the built-in json_encode() function converts multi-byte characters to Unicode escape notation when serializing, so it’s easy to handle the curly quotes like \u2018 and \u2019. A simple regular expression with preg_replace() searches the string for the \u#### pattern and replaces it with the corresponding HTML escape sequence. This was tested with Western European languages and works well enough for my purposes.
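To make the intermediate steps concrete, here is a rough sketch of what json_encode() and the preg_replace() call produce for a string containing a curly apostrophe (the variable names are just for illustration):

$input = "It’s done.";                                         // UTF-8 string containing U+2019
$json  = json_encode($input);                                  // "It\u2019s done." (wrapped in quotes)
$html  = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $json);
// $html is now "It&#x2019;s done.", still wrapped in the JSON quotes
// until json_decode() strips them in the function below.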
For small strings, such as values entered through a typical web-based tool, the additional processing time is minimal. It may be faster to strip the JSON-encoding quotes from the beginning and end of the string with substr() rather than calling json_decode(), but given the length of the data, the number of times this routine executes, and for the sake of clarity, json_decode() was sufficient.
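For completeness, here is a minimal sketch of that substr() alternative (not what I actually shipped); note that json_encode() also escapes characters such as double quotes and forward slashes inside the string, which is part of why json_decode() is the safer choice:

// Variant: trim the surrounding JSON quotes instead of calling json_decode().
$str     = "It’s done.";                                            // hypothetical input
$working = json_encode($str);
$working = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $working);
$result  = substr($working, 1, -1);   // drop the leading and trailing "
// Caveat: escapes added inside the string by json_encode(), such as \" or \/,
// are left in place; the function below avoids that by using json_decode().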
function unicode_escape_sequences($str) {
    // json_encode() escapes multi-byte characters as \uXXXX sequences.
    $working = json_encode($str);
    // Rewrite each \uXXXX escape as the equivalent HTML numeric entity (&#xXXXX;).
    $working = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $working);
    // Decode back to a plain string, stripping the surrounding JSON quotes.
    return json_decode($working);
}
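And a quick usage sketch; the sample input is hypothetical, but the output shown is what the function should produce for it:

$raw = "Café – it’s “fine”";          // UTF-8 input with an accent, a dash and smart quotes
echo unicode_escape_sequences($raw);
// Caf&#x00e9; &#x2013; it&#x2019;s &#x201c;fine&#x201d;
// which an ISO-8859-1 page can still render correctly.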