Tuesday, August 11, 2009

Resolving the XHTML1 DTD locally - avoiding problems with the W3C 503 access blocking


Currently I'm doing a project that needs some simple web scraping, for which I'm using the CyberNeko HTML Parser (nekoHTML), which relies on Xerces. While running my first simple spike I encountered:

Exception in thread "main" java.io.IOException: 
Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Whoa, my first guess was that the URL was wrong, but it turned out to be okay. After a quick search I found that the cause is a deliberate move by the W3C, as described in the W3C Systems Team Blog post "W3C's Excessive DTD Traffic". The essence of the problem is:

..In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years.

Ouch, that's a lot of bandwidth. I've heard of a similar problem, though on a much smaller scale, with the schemas for OIOUBL Invoice version 0.7 from the former ISB's (Info Structure Base) repository.

Solution

Luckily it is quite easy to do the right thing yourself with XML Catalogs, resolving from a local repository. cbowditch's guide "Resolving DTD system URI with XMLCatalogResolver" helped me a lot, and I didn't run into the problems he had. First you'll need to fetch the DTD and its dependencies (the entity files xhtml-lat1.ent, xhtml-symbol.ent and xhtml-special.ent that the DTD references) into a local directory.
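The one-time fetch could be done with any download tool. As a minimal sketch in Java (11 or later), assuming the files live under the same base URL as in the exception message: the class name `DtdFetcher`, the `fetchAll` helper, and the User-Agent string are made up for illustration, and I'm only assuming that an explicit User-Agent plus a single download is acceptable to the W3C servers.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class DtdFetcher {
    // Base URL taken from the exception message in this post.
    static final String BASE = "http://www.w3.org/TR/xhtml1/DTD/";

    // The transitional DTD plus the three entity sets it references.
    static final List<String> FILES = List.of(
            "xhtml1-transitional.dtd",
            "xhtml-lat1.ent",
            "xhtml-symbol.ent",
            "xhtml-special.ent");

    // Builds the remote URI for one of the file names above.
    static URI sourceFor(String fileName) {
        return URI.create(BASE + fileName);
    }

    // Downloads every file once into targetDir (hypothetical helper).
    static void fetchAll(Path targetDir) throws IOException, InterruptedException {
        Files.createDirectories(targetDir);
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL) // follow an http -> https redirect
                .build();
        for (String name : FILES) {
            HttpRequest request = HttpRequest.newBuilder(sourceFor(name))
                    // Assumption: identify the client instead of sending a default Java agent.
                    .header("User-Agent", "dtd-fetcher-example/0.1")
                    .build();
            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());
            if (response.statusCode() != 200) {
                throw new IOException("HTTP " + response.statusCode() + " for " + request.uri());
            }
            // Only write the file once the download has succeeded.
            Files.write(targetDir.resolve(name), response.body());
        }
    }
}
```

Calling `DtdFetcher.fetchAll(Path.of("etc/dtd"))` once would put the files in the directory used by the code further down; after that the parser never needs to touch the network for them.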

Then create a catalog file in the same directory:

    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
              uri="xhtml1-transitional.dtd"/>
    </catalog>

and add its use to the code like:

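..
// Point Xerces' catalog resolver (org.apache.xerces.util.XMLCatalogResolver)
// at the local catalog file.
XMLCatalogResolver resolver = new XMLCatalogResolver();
resolver.setCatalogList(new String[] {"etc/dtd/xhtml1-catalog.xml"});

// Register the resolver so DTD lookups go through the catalog
// instead of hitting www.w3.org.
DOMParser parser = new DOMParser();
parser.setProperty("http://apache.org/xml/properties/internal/entity-resolver", resolver);
..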
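To check that a catalog file really maps the XHTML DTD to a local copy, the lookup can be exercised on its own. The sketch below does not use Xerces' `XMLCatalogResolver`; it swaps in the JDK's built-in `javax.xml.catalog` API (Java 9 or later), which reads the same OASIS catalog format. The class name `CatalogLookup` and the helper `resolveLocally` are made up for illustration.

```java
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import org.xml.sax.InputSource;

public class CatalogLookup {
    // Public and system identifiers of the XHTML 1.0 Transitional DTD.
    static final String PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";
    static final String SYSTEM_ID =
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";

    // Returns the URI the given catalog file maps the DTD to (hypothetical helper).
    // With the default features, an identifier that has no catalog entry
    // raises a CatalogException instead of falling back to the network.
    static String resolveLocally(Path catalogFile) {
        CatalogResolver resolver = CatalogManager.catalogResolver(
                CatalogFeatures.defaults(),
                catalogFile.toAbsolutePath().toUri()); // the catalog URI must be absolute
        InputSource source = resolver.resolveEntity(PUBLIC_ID, SYSTEM_ID);
        return source.getSystemId();
    }
}
```

For the catalog above, `CatalogLookup.resolveLocally(Path.of("etc/dtd/xhtml1-catalog.xml"))` should return a `file:` URI ending in `xhtml1-transitional.dtd`, because the relative `uri` attribute is resolved against the location of the catalog file. Since `CatalogResolver` also implements `org.xml.sax.EntityResolver`, the same object can be handed to a JAXP `DocumentBuilder` via `setEntityResolver`.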
