Sunday, September 16, 2007

'xs:anyURI': something as strange as a semantic datatype

I'll gladly admit that I'm a data-head, so I focus on syntax, but that is a means to ease development so that the primary objective, getting the job done, succeeds, and there the overall importance lies in semantics. In other words, and the other way round: business builds on semantics, and to enable semantics through Information Technology, syntax comes to the rescue. But it's not always like that, which is kind of bad, and xs:anyURI is an example of that. Actually I found this out some time ago, but I forgot, and it was in a recent project that I rediscovered the disappointing fact. anyURI is defined in part two of the XML Schema specification. Here's an important note that gives it away:

Note: Each URI scheme imposes specialized syntax rules for URIs in that scheme, including restrictions on the syntax of allowed fragment identifiers. Because it is impractical for processors to check that a value is a context-appropriate URI reference, this specification follows the lead of [RFC 2396] (as amended by [RFC 2732]) in this matter: such rules and restrictions are not part of type validity and are not checked by •minimally conforming• processors. Thus in practice the above definition imposes only very modest obligations on •minimally conforming• processors.

I have not investigated what a minimally conforming processor is, but I guess that's what most of us run into. In essence, all of this could be done with xs:string. xs:anyURI offers a choice of six facets, the same six that xs:string offers:

  • length
  • minLength
  • maxLength
  • pattern
  • enumeration
  • whiteSpace
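
To make the point concrete, here is a minimal Python sketch that emulates what a pattern facet does. The helper name matches_pattern_facet and the pattern itself are hypothetical, made up for illustration, and Python's regex dialect is only an approximation of the XML Schema one. Nothing in it knows anything about URIs: the same check would work just as well on an xs:string.

```python
import re

# Hypothetical pattern, of the kind one might put in a pattern facet.
# A schema pattern has to match the whole value, which re.fullmatch approximates.
HTTP_PATTERN = r"https?://[^\s/]+(/\S*)?"


def matches_pattern_facet(value: str, pattern: str = HTTP_PATTERN) -> bool:
    """Return True if the whole value matches the pattern (illustrative helper)."""
    return re.fullmatch(pattern, value) is not None


print(matches_pattern_facet("http://example.com/a"))        # True
print(matches_pattern_facet(""))                            # False
print(matches_pattern_facet("mailto:someone@example.com"))  # False
```

Any URI-specific meaning has to be put in by hand through the pattern, which is exactly what one would do with a plain string.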

An empty URI

Empty URIs are valid according to the RFC "Uniform Resource Identifiers (URI): Generic Syntax", section "4.2. Same-document References":

A URI reference that does not contain a URI is a reference to the current document. In other words, an empty URI reference within a document is interpreted as a reference to the start of that document, and a reference containing only a fragment identifier is a reference to the identified fragment of that document. Traversal of such a reference should not result in an additional retrieval action. However, if the URI reference occurs in a context that is always intended to result in a new request, as in the case of HTML's FORM element, then an empty URI reference represents the base URI of the current document and should be replaced by that URI when transformed into a request.
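
The resolution behaviour described in that section can be tried out with Python's standard urllib.parse.urljoin. This is only a sketch of how one implementation treats such references, not a statement about what the RFC requires, and the base URL is a made-up example.

```python
from urllib.parse import urljoin

# Made-up base document, for illustration only.
base = "http://example.com/docs/page.html"

# An empty reference resolves to the base document itself.
print(urljoin(base, ""))            # http://example.com/docs/page.html

# A fragment-only reference stays within the same document.
print(urljoin(base, "#section-2"))  # http://example.com/docs/page.html#section-2
```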

In the replies to the xml-dev thread "Re: [xml-dev] Can anyURI be empty", first, the reply from Michael Kay is quite clear:

On this, like so many other things, RFC 2396 is a total disaster. An empty string is not valid according to the BNF syntax, but the RFC gives detailed semantics for what it means (detailed semantics, though very imprecise semantics).

And the schema REC doesn't help. It has the famous note saying that the definition places "only very modest obligations" on an implementation, and it doesn't say what those obligations are.

Sperberg-McQueen gives some explanation of why the XML Schema specification is correct:

Yes. This is a direct result of our realization that we have as much trouble understanding RFC 2396 as anyone else. The anyURI type imposes the obligations of RFC 2396, whatever those are. Any attempt to paraphrase them on our part would lead, I fear, to an unsatisfactory result: either we would make some mistake (like believing that since the BNF does not accept the empty string, it must not be legal) or we would make no mistakes. In the one case, we'd be misleading our readers, and in either case, we'd find ourselves mired in a never-ending effort to prove that our paraphrase was, or was not, correct.

As I see it, there are several things that should be done. First off, RFC 2396 should be corrected in whatever direction the majority feels is right, to stop this dancing around the tree, since the problem cascades down to other specifications, and these normally use errata to correct errors and ambiguities.

Conclusion

In its current state I see no real need for xs:anyURI, since xs:string would do just fine. On the other hand, there's no point in leaving some of the most used data types out of the standard when they would ease validation, so I would like to see:

  1. Absolute URL, based on the primary cases for the schemes (protocols) http and https, where the empty URI is excluded
  2. Relative URL
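
As a rough idea of what such types might check, here is a homebrew sketch in Python built on the standard urllib.parse.urlsplit. The function names is_absolute_http_url and is_relative_url are hypothetical, and the rules (a host is required for absolute URLs; the empty string and network-path references like //host/path are not counted as relative) are assumptions of this sketch, not something taken from any specification.

```python
from urllib.parse import urlsplit


def is_absolute_http_url(value: str) -> bool:
    """Illustrative check: http or https scheme plus a non-empty host part."""
    try:
        parts = urlsplit(value)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. an unbalanced "[" in the host.
        return False
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def is_relative_url(value: str) -> bool:
    """Illustrative check: non-empty, with neither a scheme nor a host part."""
    if not value:
        # Assumption of this sketch: the empty reference is excluded here too.
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        return False
    return not parts.scheme and not parts.netloc


print(is_absolute_http_url("https://example.com/path?q=1"))  # True
print(is_absolute_http_url(""))                              # False
print(is_relative_url("../images/logo.png"))                 # True
print(is_relative_url("http://example.com/"))                # False
```

A real datatype would need stricter rules than this (character checks, for instance), but it shows how little is needed to get further than xs:anyURI does today.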

Some of the same arguments can be made for email, where everyone has to create their own. I know that sometimes there's a trade-off between correctness and performance, but these datatypes would not exclude the possibility of a homebrew.
