Thursday, November 8, 2007

The Case on encoding - are "utf-8" and "UTF-8" the same?


I am a techie, so some details grow out of proportion for me. One of those details is how to write the encoding in the XML declaration. Since I've written quite a few XML documents, I've thought about it many times. Should it be "utf-8" or "UTF-8"? Does it make any difference? Is it because MS platforms don't care about capital letters and *NIX does? Which is the prettiest?

Actually, I have looked into this before but forgot the answer, so now I'm putting it in a post. There's only one natural source to look at - the specification: Extensible Markup Language (XML) 1.0 (Fourth Edition) [HTML] (W3C Recommendation 16 August 2006, edited in place 29 September 2006). Section "4.3 Parsed Entities" contains a description of the XML declaration ("4.3.1 The Text Declaration") with the encoding-related part in "4.3.3 Character Encoding in Entities":

Encoding Declaration

[80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* /* Encoding name contains only Latin characters */

Yes, it's right there in the regular expression for EncName: both upper- and lower-case letters are allowed. It's also spelled out in the text further down (see the final sentence, about case-insensitive matching):

In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" SHOULD be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) SHOULD be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" SHOULD be used for the various encoded forms of JIS X-0208-1997. It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an "x-" prefix. XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).
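To make production [81] and the case-insensitive matching rule concrete, here is a minimal Python sketch. The helper names `is_valid_enc_name` and `same_encoding_name` are made up for illustration and are not part of any XML library; the comparison simply lowercases both names, which is enough because EncName only allows ASCII characters.

```python
import re

# Production [81]: one ASCII letter, then any number of
# ASCII letters, digits, '.', '_' or '-'.
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_enc_name(name: str) -> bool:
    """Return True if `name` matches the EncName production."""
    return ENC_NAME.fullmatch(name) is not None


def same_encoding_name(a: str, b: str) -> bool:
    """Compare two encoding names case-insensitively (ASCII only)."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    for candidate in ("utf-8", "UTF-8", "Shift_JIS", "8-utf"):
        print(candidate, is_valid_enc_name(candidate))
    print(same_encoding_name("utf-8", "UTF-8"))
```

Both spellings pass the grammar check, and the comparison treats them as the same name, while a name starting with a digit is rejected by the grammar.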

Looking at IANA's "Official Names for Character Sets" gives the same story:

The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters.

So that settles it: upper or lower case isn't important, and "UTF-8" and "utf-8" should be interpreted the same.
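As a quick sanity check, the conclusion can be tried against a real parser. The sketch below assumes Python's standard-library `xml.etree.ElementTree`; the helper `parse_text` and the sample document are invented for this illustration, and one parser agreeing with the spec says nothing about every other processor out there.

```python
import xml.etree.ElementTree as ET


def parse_text(encoding_name: str) -> str:
    """Parse a tiny UTF-8 document that declares `encoding_name`
    and return the text of its root element."""
    doc = (
        f'<?xml version="1.0" encoding="{encoding_name}"?>'
        "<greeting>blåbær</greeting>"
    )
    # The bytes are always UTF-8; only the spelling of the
    # declared encoding name varies between calls.
    return ET.fromstring(doc.encode("utf-8")).text


if __name__ == "__main__":
    lower = parse_text("utf-8")
    upper = parse_text("UTF-8")
    print(lower == upper)
```

If the parser follows the SHOULD in the specification, both declarations decode the non-ASCII characters identically.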