Monday, January 28, 2008

SOAP and the Byte Order Mark (BOM)


I've known about the Byte Order Mark in the context of XML, but what about in the context of SOAP? As I understand it, (SOAP) web services are all about exchanging XML documents (mostly over HTTP), so since the Byte Order Mark is part of the XML Specification, it should also apply to SOAP. I had never thought about it before, nor heard of problems related to it, so I decided to look a bit closer.

Wikipedia defines the Byte Order Mark like this:

A byte-order mark (BOM) is the Unicode character at code point U+FEFF ("zero-width no-break space") when that character is used to denote the endianness of a string of UCS/Unicode characters encoded in UTF-16 or UTF-32. It is conventionally used as a marker to indicate that text is encoded in UTF-8, UTF-16 or UTF-32.

In most character encodings the BOM is a pattern which is unlikely to be seen in other contexts (it would usually look like a sequence of obscure control codes). If a BOM is misinterpreted as an actual character within Unicode text then it will generally be invisible due to the fact it is a zero-width no-break space. Use of the U+FEFF character for non-BOM purposes has been deprecated in Unicode 3.2 (which provides an alternative, U+2060, for those other purposes), allowing U+FEFF to be used solely with the semantic of BOM.
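To make the definition a bit more concrete, here is a small sketch that prints the byte sequences U+FEFF turns into in each encoding form. It only uses constants from Python's standard codecs module; the BOMS dictionary and its labels are just names picked for this illustration.

```python
import codecs

# U+FEFF serialized in each encoding form, using the constants
# that ship with Python's codecs module.
BOMS = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16 (big-endian)": codecs.BOM_UTF16_BE,
    "UTF-16 (little-endian)": codecs.BOM_UTF16_LE,
    "UTF-32 (big-endian)": codecs.BOM_UTF32_BE,
    "UTF-32 (little-endian)": codecs.BOM_UTF32_LE,
}

for name, bom in BOMS.items():
    # bytes.hex(sep) needs Python 3.8 or later
    print(f"{name:24} {bom.hex(' ').upper()}")

# The UTF-8 BOM is nothing more than U+FEFF encoded as UTF-8.
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8
```

Note that the UTF-8 form carries no byte-order information at all; it only acts as a signature for the encoding.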

The next place to look is the XML Specification, where some terms and definitions are needed. From section 2 Documents:

[Definition: A data object is an XML document if it is well-formed, as defined in this specification. In addition, the XML document is valid if it meets certain further constraints.]

Each XML document has both a logical and a physical structure. Physically, the document is composed of units called entities. An entity may refer to other entities to cause their inclusion in the document. A document begins in a "root" or document entity. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup. The logical and physical structures MUST nest properly, as described in 4.3.2 Well-Formed Parsed Entities.

Then, further down, in 4.3.3 Character Encoding in Entities (emphasis mine):

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to character encodings with any other labels, even if the encodings or labels are very similar to UTF-8 or UTF-16.

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.


In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

And a little more in Appendix F.1 Detection Without External Encoding Information.
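The BOM part of that detection can be sketched roughly as follows. This is a simplified illustration, not the full Appendix F.1 procedure: it only looks at the BOM, ignores the encoding declaration and any transport headers, and falls back to UTF-8 when no BOM is found. The function name sniff_bom and the _SIGNATURES table are made up for this example.

```python
import codecs

# UTF-32 signatures are checked first because the UTF-32 LE BOM
# (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) based only on a leading BOM.

    Assumption of this sketch: without a BOM the entity is treated
    as UTF-8; a real parser would also read the encoding declaration.
    """
    for bom, encoding in _SIGNATURES:
        if data.startswith(bom):
            return encoding, len(bom)
    return "utf-8", 0


def decode_entity(data: bytes) -> str:
    """Decode an entity, dropping the BOM since it is not document content."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding)
```

For example, decode_entity(codecs.BOM_UTF16_LE + "<a/>".encode("utf-16-le")) gives back the text without the signature, which matches the spec's statement that the BOM is not part of the markup or the character data.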

An older article in the archived MSDN Library, Web Services Interoperability and SOAP, raises the interoperability aspect under XML Problems:

The second set of possible interop issues are those involving XML parsing and XSD schema handling. SOAP uses XML and XML Schemas at its core, so interoperable handling of both is requisite for SOAP interop.

An interesting example of an interop issue involving both XML parsing and HTTP transports relates to the Byte Order Mark, or BOM. When sending data over HTTP, you can specify the encoding of the data, such as UTF-16 or UTF-8, in the Content-Type header. You can also indicate the encoding of a piece of XML by inserting a set of bytes that specify the encoding used. When sending UTF-16, the BOM is needed, even if the encoding is present in the Content-Type header (to indicate big-endian or little-endian), but for UTF-8 it is unnecessary.

The first three characters here are hex for the Byte Order Mark indicating UTF-8, but as you can see, the Content-Type also stated this. Some implementations send the BOM for UTF-8, even though they don't need to. Others are unable to process XML with any BOM. The solution here is to avoid sending it unless needed, and to correctly handle it. The correct handling of BOM is essential in processing UTF-16 messages, as BOM is required in this case. Although there is no single way to resolve such issues ahead of time, the best solution once issues are recognized is to refer to the actual specifications (usually found at the W3C) that describe the standards; then apply those specifications as the arbiter of any problem.

The sentence "The solution here is to avoid sending it unless needed, and to correctly handle it" sounds a little easy, but it is a variant of "be strict in what you send and lax in what you receive", which is a clever strategy, though it doesn't make things as easy for you as it should!
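For the UTF-8 case, that strategy could look something like the sketch below. It assumes the Content-Type header has already told us the body is UTF-8; the function names serialize_envelope and read_envelope are hypothetical and not taken from any SOAP toolkit.

```python
def serialize_envelope(xml_text: str) -> bytes:
    # Strict on send: the plain "utf-8" codec never writes a BOM
    # (the "utf-8-sig" codec would prepend one).
    return xml_text.encode("utf-8")


def read_envelope(body: bytes) -> str:
    # Lax on receive: "utf-8-sig" drops a leading BOM if there is
    # one and otherwise behaves like plain "utf-8".
    return body.decode("utf-8-sig")
```

A receiver written this way gets the same text whether or not the sender included the signature. UTF-16 is a different matter, since there the BOM is required and tells the receiver the byte order.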

What to conclude?

The WS-I Basic Profile 1.0 addresses this issue in 3.1.3 Unicode BOMs:

XML 1.0 allows UTF-8 encoding to include a BOM; therefore, receivers of envelopes must be prepared to accept them. The BOM is mandatory for XML encoded as UTF-16.

R4001 A RECEIVER MUST accept envelopes that include the Unicode Byte Order Mark (BOM).
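A quick way to see what R4001 asks of a receiver is to feed the same envelope to an XML parser with and without a UTF-8 BOM. The sketch below uses xml.etree.ElementTree from Python's standard library; the envelope and its ping element are invented for the illustration, and only the SOAP 1.1 envelope namespace is real.

```python
import codecs
import xml.etree.ElementTree as ET

# A made-up SOAP 1.1 envelope; the "ping" payload is hypothetical.
ENVELOPE = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">'
    "<soap:Body><ping>hello</ping></soap:Body>"
    "</soap:Envelope>"
)

without_bom = ENVELOPE.encode("utf-8")
with_bom = codecs.BOM_UTF8 + without_bom

for payload in (without_bom, with_bom):
    # The parser receives raw bytes, so it has to deal with the BOM itself.
    root = ET.fromstring(payload)
    print(root.tag, root.find(".//ping").text)
```

Both payloads should parse to the same tree, which is the behaviour the profile requires. Handing raw bytes to the parser, rather than decoding them to a string first, is what lets the parser see and consume the signature.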

So it is as intuition suggests: what goes for XML also goes for XML in SOAP.