This is a severe hack I used to get around some character translation issues I ran into with non-Latin character sets. Whenever possible you should use a character set that supports international characters, ideally UTF-8 if your application language supports it. That said, sometimes you need a workaround, and this one will at least make your pages readable for your users. None of this is necessary if you are on a single, unified platform.
I was working with mixed platforms. One application used PHP and UTF-8, while the other used .NET and rendered its HTML with the ISO-8859-1 charset. Behind both was a shared Microsoft SQL Server database. The problem arose when UTF-8 data stored by the PHP application was displayed in the .NET app. Changing the .NET app over to UTF-8 wasn’t an option, so the data had to be converted to a character set that would work.
PHP has a number of tools to assist developers in converting from one character set to another. iconv and the multi-byte string (mb_*) library both have utilities for handling exactly this type of conversion. In this case, however, the mb_* functions were not installed on the server, and the iconv() function did not work as well as expected.
In practice, rather common characters like Microsoft’s right single “smart quote” didn’t convert with iconv(). Further, accented characters from languages like French and Spanish didn’t convert properly despite having suitable replacements in the ISO-8859-1 character set, which left my strings littered with ? characters wherever a replacement wasn’t found. So the hack I opted for was to convert all multi-byte characters to their HTML-escaped equivalents. This places the burden on the browser to decide whether it can display a symbol or not. It also turns out that scripting this conversion is much simpler than trying to map the numerous potential characters to their named HTML equivalents. Any hexadecimal Unicode code point, for example \u2019, the right single quotation mark, can be rewritten as the numeric entity &#x2019;, which will then show as a right single quote.
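As a quick illustration (the string here is just a made-up example), echoing the numeric entity is enough and the browser does the rendering:

// The page source stays plain ASCII; the browser renders the entity as U+2019.
echo 'It&#x2019;s done.';   // displays as: It’s done.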
It also turns out that the built-in json_encode() function converts multi-byte characters to Unicode escape notation when serializing, so it’s easy to handle the curly quotes like \u2018 and \u2019. A simple regular expression with preg_replace() searches the string for the \u#### pattern and replaces it with the corresponding HTML escape sequence. This was tested with Western European languages and works well enough for my purposes.
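To make the intermediate steps concrete, here is a rough sketch of what json_encode() and the preg_replace() call produce for a string containing a curly apostrophe (the variable names are just for illustration):

$input = "It’s done.";                                         // UTF-8 string containing U+2019
$json  = json_encode($input);                                  // "It\u2019s done." (wrapped in quotes)
$html  = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $json);
// $html is now "It&#x2019;s done.", still wrapped in the JSON quotes
// until json_decode() strips them in the function below.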
For small strings, such as values entered through a typical web-based tool, the additional processing time is minimal. It may be faster to strip the JSON-encoding quotes from the beginning and end of the string with substr() rather than calling json_decode(), but given the length of the data, the number of times this routine executes, and for the sake of clarity, json_decode() was sufficient.
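For completeness, here is a minimal sketch of that substr() alternative (not what I actually shipped); note that json_encode() also escapes characters such as double quotes and forward slashes inside the string, which is part of why json_decode() is the safer choice:

// Variant: trim the surrounding JSON quotes instead of calling json_decode().
$str     = "It’s done.";                                            // hypothetical input
$working = json_encode($str);
$working = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $working);
$result  = substr($working, 1, -1);   // drop the leading and trailing "
// Caveat: escapes added inside the string by json_encode(), such as \" or \/,
// are left in place; the function below avoids that by using json_decode().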
function unicode_escape_sequences($str) {
    // json_encode() escapes multi-byte characters as \uXXXX sequences.
    $working = json_encode($str);
    // Rewrite each \uXXXX escape as the equivalent HTML numeric entity (&#xXXXX;).
    $working = preg_replace('/\\\u([0-9a-z]{4})/', '&#x$1;', $working);
    // Decode back to a plain string, stripping the surrounding JSON quotes.
    return json_decode($working);
}
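And a quick usage sketch; the sample input is hypothetical, but the output shown is what the function should produce for it:

$raw = "Café – it’s “fine”";          // UTF-8 input with an accent, a dash and smart quotes
echo unicode_escape_sequences($raw);
// Caf&#x00e9; &#x2013; it&#x2019;s &#x201c;fine&#x201d;
// which an ISO-8859-1 page can still render correctly.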