This post is archived and probably outdated.

Unicode identifiers

2009-07-23 00:33:37

When people talk about Unicode and PHP 6, I often see one fact mentioned as a big change: PHP 6 allows almost any Unicode character as (part of) an identifier. So you can have code like this:

function 新日本石油() {
    echo "Let's hope this isn't an offensive function name... ";
    echo "it's copied from some news site";
}

新日本石油();

Well yes, that's funny at first, but it serves a purpose. Suppose your application is tied to an environment with its own special terminology. Translating these terms to English can be extremely confusing (especially as programmers often don't really know the correct terminology of that domain), so it's good to call a thing by its name. That can be quite complicated, too: in a previous job we had such a case and often used the German terms, which produced quite funny names for getters and setters that didn't satisfy us ... but that's not what I wanted to talk about.

What I actually wanted to bring is some bad news: that's nothing new. The relevant scanner rule hasn't changed since 4.0. The only change is that PHP 6 no longer treats an identifier as a random set of bytes but knows about Unicode code points and interprets it as such.
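To make the bytes-versus-code-points distinction concrete, here is a minimal sketch that looks at the function name from above both ways. It assumes the file is saved as UTF-8 and that PCRE was built with UTF-8 support; the variable names are only for illustration.

```php
<?php
// Assumption: this source file is saved as UTF-8.
$name = '新日本石油';

// Byte view: what a scanner without Unicode awareness sees.
$byteCount = strlen($name);

// Code point view: the /u modifier makes PCRE decode the subject as UTF-8,
// so "." matches one whole code point instead of a single byte.
$codePointCount = preg_match_all('/./u', $name, $matches);

echo "bytes: $byteCount, code points: $codePointCount\n";
// prints "bytes: 15, code points: 5"
```

Each of the five characters takes three bytes in UTF-8, which is why the two counts differ.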

Out of interest I did a little digging into the PHP repository's history:

$ svn annotate trunk/Zend/zend_language_scanner.l
...
34779 zeev LABEL [a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*
...
$ svn log -r 34779
------------------------------------------------------------------------
r34779 | zeev | 2000-10-29 15:35:34 +0100 (Sun, 29 Oct 2000) | 2 lines

Unify the names of these last 3 files...

------------------------------------------------------------------------

The result? The rule as it is in PHP 6 hasn't changed since 2000, so there is really nothing new here with PHP 6. And even in that revision, the only change was that the scanner file was renamed...
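As a rough check of why the old rule already accepts such names, the LABEL rule from the annotate output can be rewritten as a PCRE pattern and run in byte mode. This is only a sketch: the real scanner is generated from zend_language_scanner.l and does not use PCRE, the anchors are my addition, and the file is assumed to be saved as UTF-8.

```php
<?php
// The LABEL rule from the scanner, anchored so it must match the whole string.
// No /u modifier: the pattern sees raw bytes, like the pre-PHP 6 scanner.
$label = '/^[a-zA-Z_\x7f-\xff][a-zA-Z0-9_\x7f-\xff]*$/';

// Assumption: this source file is saved as UTF-8.
$name = '新日本石油';

// Every byte of a multi-byte UTF-8 sequence is 0x80 or higher,
// so each byte falls inside the \x7f-\xff range of the rule.
$allHighBytes = true;
for ($i = 0; $i < strlen($name); $i++) {
    if (ord($name[$i]) < 0x80) {
        $allHighBytes = false;
    }
}

var_dump($allHighBytes);                        // bool(true)
var_dump(preg_match($label, $name) === 1);      // bool(true)
var_dump(preg_match($label, '9lives') === 1);   // bool(false): leading digit
```

So a UTF-8 encoded name passes the rule byte by byte, without the scanner knowing anything about Unicode.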