This post is archived and probably outdated.

Mind the encodings!

2010-03-28 03:31:00

Dealing with character sets and encodings is tough. As long as you're dealing only with English texts you are in a luxurious situation: you can mix utf-8 and iso-8859-1 encoded texts and most (all?) of your tests will pass. Some of your users, like me, with strange names ("Schlüter") will be annoyed when your application breaks them ("SchlÃ¼ter"), but these will be edge cases. There are bigger issues with mixing encodings, but that's not what I want to talk about now.
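To make the broken name a bit more tangible, here is a minimal sketch of how such garbling can come about. It assumes the mbstring extension is loaded; the variable names are made up for illustration, and the name is written with byte escapes so the example does not depend on the encoding of the script file.

```php
<?php
// "Schlüter" as utf-8 bytes; the ü is the two bytes 0xC3 0xBC.
$utf8Name = "Schl\xC3\xBCter";

// Simulate a consumer that wrongly assumes the bytes are iso-8859-1
// and converts them to utf-8 for display.
$garbled = mb_convert_encoding($utf8Name, 'UTF-8', 'ISO-8859-1');

// Each of the two bytes of the ü is now treated as a character of its own.
echo $garbled, "\n";
echo bin2hex($garbled), "\n";
```

On a utf-8 terminal the first line should show the name with two odd characters in place of the ü, and the hex dump shows that the two original bytes have turned into four.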

Handling these encodings correctly in PHP is tough. PHP currently takes a quite simple approach to the problem: PHP doesn't care about encodings. A string is simply a sequence of bytes. Well, in general. The details are difficult. Core features like JSON handling or XML processing expect your PHP strings to be encoded in utf-8, which is a sane choice: for JSON it is part of the JSON specification, and for XML it's the only way to work with all documents without additional information from you on every XML operation.
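A small sketch of what "a sequence of bytes" means in practice, and of how json_encode() reacts to non-utf-8 input. It assumes a reasonably recent PHP (7 or later) with the mbstring extension; older versions behave differently on invalid input. The variable names are only for illustration.

```php
<?php
$utf8   = "Schl\xC3\xBCter"; // ü as two utf-8 bytes
$latin1 = "Schl\xFCter";     // ü as a single iso-8859-1 byte

// strlen() counts bytes, mb_strlen() counts characters in the given encoding.
$byteCount = strlen($utf8);
$charCount = mb_strlen($utf8, 'UTF-8');

// json_encode() expects utf-8 and refuses anything else.
$okJson       = json_encode($utf8);
$badJson      = json_encode($latin1);
$badJsonError = json_last_error();

var_dump($byteCount, $charCount, $okJson, $badJson);
var_dump($badJsonError === JSON_ERROR_UTF8);
```

The same eight-character name is nine bytes long as far as strlen() is concerned, and the iso-8859-1 variant cannot be turned into JSON at all.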

On the other side we have browsers. A browser has to read documents from all over the world, which can be encoded in any encoding you like. For whatever reason, browser developers decided that iso-8859-1 would make a great default encoding: if a response to a browser's request doesn't specify anything else, the browser assumes the document is encoded in iso-8859-1. The document's encoding is also used when sending form data back to the server. To handle this, PHP has a php.ini setting, default_charset, which, if set, adds the selected encoding to the HTTP Content-Type header. The default value of this setting is empty, so browsers fall back on their default, iso-8859-1.

Recently Rasmus made a commit to PHP trunk which changes the default to utf-8.

In the long run this change is good: utf-8 works for all languages, whereas iso-8859-1 only works for a limited set of European languages, and it fits better with the outside environment, where utf-8 adoption is growing (JSON and XML were examples).

In the short term this might cause trouble for applications depending on the default. The good thing is that development in trunk has just started and the release of PHP.next is still some time away, so you can easily prepare your application. That is a good thing anyway, as it already protects you from administrators making mistakes with current versions.

To set the encoding from within your application you can for example call ini_set('default_charset', $enc); or header("Content-type: text/html; charset=$enc"); at the beginning of your script, where $enc is your preferred encoding. Please mind that this has no effect on the script itself, which, for instance, means you have to configure your database connection accordingly, too!
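The two calls above could be wrapped up roughly like this. This is only a sketch: declare_charset() is a hypothetical helper name, not something PHP provides, and the type declarations assume PHP 7.1 or later.

```php
<?php
// Hypothetical helper (not part of PHP): declare the response charset explicitly
// instead of relying on whatever php.ini happens to say.
function declare_charset(string $enc): void
{
    // Used by PHP when it generates the Content-Type header itself.
    ini_set('default_charset', $enc);

    // Send the header explicitly as well, as long as no output has gone out yet.
    if (!headers_sent()) {
        header("Content-Type: text/html; charset=$enc");
    }
}

// Call this before the script produces any output.
declare_charset('UTF-8');
```

Doing both is redundant in most setups; one of the two is usually enough. Either way this only tells the browser what to expect, so the data you actually send, including whatever comes out of your database, still has to be in that encoding.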