Saturday, November 8, 2008

The xml:lang attribute, its use in XML Schema and SAML V2.0 Metadata


SAML V2.0, the OASIS standard for Federated/Single Sign-On, has a format for exchanging (meta)data about Identity/Service providers. The syntax is defined in the XML Schema saml-schema-metadata-2.0.xsd and the semantics are further described in saml-metadata-2.0-os.pdf. In this blog post I'll focus on a less important detail of this format, its use of xml:lang, and then look at the general use of that attribute.

Reusing xml:lang in XML Schema

It's the contents of the Organization element that are localized with the xml:lang attribute.

Using the attribute involves importing its namespace into the XML Schema (it's already defined and we just want to use it):

    <import namespace="http://www.w3.org/XML/1998/namespace"
        schemaLocation="http://www.w3.org/2001/xml.xsd"/>

This is then used in the definitions of two complex types:

The localizedNameType complex type extends a string-valued element with a standard XML language attribute.

    <complexType name="localizedNameType">
        <simpleContent>
            <extension base="string">
                <attribute ref="xml:lang" use="required"/>
            </extension>
        </simpleContent>
    </complexType>

The localizedURIType complex type extends a URI-valued element with a standard XML language attribute.

    <complexType name="localizedURIType">
        <simpleContent>
            <extension base="anyURI">
                <attribute ref="xml:lang" use="required"/>
            </extension>
        </simpleContent>
    </complexType>

These types are then put to use in the definition of Organization:

The <Organization> element specifies basic information about an organization responsible for a SAML entity or role. The use of this element is always optional. Its content is informative in nature and does not directly map to any core SAML elements or attributes.

    <element name="Organization" type="md:OrganizationType"/>
    <complexType name="OrganizationType">
        <sequence>
            <element ref="md:Extensions" minOccurs="0"/>
            <element ref="md:OrganizationName" maxOccurs="unbounded"/>
            <element ref="md:OrganizationDisplayName" maxOccurs="unbounded"/>
            <element ref="md:OrganizationURL" maxOccurs="unbounded"/>
        </sequence>
        <anyAttribute namespace="##other" processContents="lax"/>
    </complexType>
    <element name="OrganizationName" type="md:localizedNameType"/>
    <element name="OrganizationDisplayName" type="md:localizedNameType"/>
    <element name="OrganizationURL" type="md:localizedURIType"/>

The documentation contains a metadata example for a Service Provider where the Organization is defined as:

    <Organization>
      <OrganizationName xml:lang="en">
        Academic Journals R US
      </OrganizationName>
      <OrganizationDisplayName xml:lang="en">
        Academic Journals R US, a Division of Dirk Corp.
      </OrganizationDisplayName>
      <OrganizationURL xml:lang="en">
        https://ServiceProvider.com
      </OrganizationURL>
    </Organization>

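Note that the xml prefix is predefined: it is always bound to http://www.w3.org/XML/1998/namespace, so the instance document doesn't need to declare it, and it would be an error to bind it to any other namespace.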
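To see what the predefined prefix means for code that reads such metadata, here is a minimal sketch in Python using the standard library's ElementTree. It is only an illustration: the function name localized_values is made up for this post, and the xmlns declaration on Organization is an assumption added so that the fragment can be parsed on its own (in a full metadata document the namespace would be inherited from an enclosing element).

```python
import xml.etree.ElementTree as ET

# The xml prefix is always bound to this namespace, so ElementTree exposes
# xml:lang under the expanded name "{namespace}lang".
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The Organization example from the documentation. The default namespace
# declaration is an assumption added to make the fragment stand alone.
METADATA = """\
<Organization xmlns="urn:oasis:names:tc:SAML:2.0:metadata">
  <OrganizationName xml:lang="en">
    Academic Journals R US
  </OrganizationName>
  <OrganizationDisplayName xml:lang="en">
    Academic Journals R US, a Division of Dirk Corp.
  </OrganizationDisplayName>
  <OrganizationURL xml:lang="en">
    https://ServiceProvider.com
  </OrganizationURL>
</Organization>
"""


def localized_values(xml_text):
    """Return (local element name, xml:lang value, trimmed text) per child."""
    root = ET.fromstring(xml_text)
    rows = []
    for child in root:
        local_name = child.tag.split("}")[-1]  # drop the "{namespace}" part
        rows.append((local_name, child.get(XML_LANG), (child.text or "").strip()))
    return rows


if __name__ == "__main__":
    for name, lang, text in localized_values(METADATA):
        print(name, lang, text)
```

The parser accepts xml:lang even though the document never declares the xml prefix, which is the point made above.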

xml:lang and its use in XML document schemas

In section 2.12 Language Identification of the XML Specification, it says:

In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by [IETF RFC 3066], Tags for the Identification of Languages, or its successor; in addition, the empty string may be specified.

The FAQ "xml:lang in XML document schemas" explains, in its "By the way" note, how to use xml:lang in XML Schema:

XML Schema requires that the xml namespace be declared and imported before using xml:lang (and other xml namespace values)

For a small discussion on the reuse of attributes from the XML Specification, you can look at this mail thread from the xml-schema list.

There is also a section on "When to use your own element or attribute":

When the language value is really an attribute of or metadata about some external content, then xml:lang is not an appropriate choice. In these cases you want to store language information, but the language doesn't refer to the content of the XML document (or included content, such as images, which are processed as part of the document) directly. In this case you should define an element or attribute of using a different name and not use the xml:lang attribute. The value of the element or attribute should use RFC 3066 (or its successor), just like xml:lang.

Oops, this disqualifies its use on OrganizationURL, since the URL refers to another document and the language describes that document rather than the content of the element. In real life this is not a problem, if for no other reason than that the Organization element is optional.

The value of reuse

A more interesting discussion is the value of reuse, which varies greatly depending on where it's applied and how it's used.

The XML Schema itself reuses the type localizedNameType several times as a syntactic component, since it is very generic: it only constrains an element to name something and carries an attribute describing the language. The URL variant is used only once. In general, since an empty value is allowed for xml:lang, I would have liked the attribute to be optional instead of required.

The value of reusing xml:lang could be argued to be minimal, since it would be easy to redefine it without too much struggle, and the level of generic support for this attribute is, in my opinion, limited and tied to the domain/application use.

Update! Since making this post I've had some experience with it in XMLBeans, which actually has some built-in checks that go beyond the definition in xml.xsd:

    <xs:attribute name="lang">
     <xs:annotation>
      <xs:documentation>Attempting to install the relevant ISO 2- and 3-letter
            codes as the enumerated possible values is probably never
            going to be a realistic possibility.  See
            RFC 3066 at http://www.ietf.org/rfc/rfc3066.txt and the IANA registry
            at http://www.iana.org/assignments/lang-tag-apps.htm for
            further information.

            The union allows for the 'un-declaration' of xml:lang with
            the empty string.</xs:documentation>
     </xs:annotation>
     <xs:simpleType>
      <xs:union memberTypes="xs:language">
       <xs:simpleType>
        <xs:restriction base="xs:string">
         <xs:enumeration value=""/>
        </xs:restriction>
       </xs:simpleType>
      </xs:union>
     </xs:simpleType>
    </xs:attribute>

For one, it checks that the overall syntax is correct; for example, it reports an error if I use da_DK instead of da-DK. It also doesn't allow the value to be empty, as xml.xsd does (which is fine by me, since I personally dislike empty elements and attributes in data-centric scenarios).
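The kind of syntax check described above can be sketched in a few lines of Python. This is not how XMLBeans implements it; it is just an illustration based on the lexical pattern that XML Schema 1.0 gives for xs:language, plus the empty string that the union in xml.xsd permits. The function name is_valid_xml_lang and the allow_empty flag are made up for this example, and the check covers syntax only, not whether a tag is actually registered.

```python
import re

# Lexical pattern for xs:language in XML Schema 1.0: one to eight letters,
# followed by any number of hyphen-separated alphanumeric subtags.
LANGUAGE_PATTERN = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def is_valid_xml_lang(value, allow_empty=True):
    """Check the syntax of an xml:lang value.

    allow_empty=True mirrors the union in xml.xsd, which accepts the empty
    string; allow_empty=False mirrors the stricter behaviour described above.
    """
    if value == "":
        return allow_empty
    return LANGUAGE_PATTERN.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["da-DK", "da_DK", "da", ""]:
        print(repr(candidate), is_valid_xml_lang(candidate))
```

With this sketch da-DK and da pass while da_DK fails, because the underscore is not a legal subtag separator.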

Language Identifiers (RFC 3066)

The page Using Language Identifiers (RFC 3066) is great for a quick brush-up. Being a Dane, my interest is in the three examples for Denmark:

    da-DK (Danish)
    de-DK (German)
    da-DE (Danish)

These cover the majority speaking Danish in Denmark, the minority speaking German in Denmark, and the minority speaking Danish in Germany. Since both Denmark and the number of people who speak Danish are very limited, it doesn't really make sense to me to go for anything other than da, da-DK or de, de-DE.

A great source for a quick trip around Web Internationalization Standards and Practice is the presentation/tutorial [PDF] that can be found on that page.
