Tuesday, August 11, 2009

Resolving the XHTML1 DTD locally - avoiding problems with the W3C 503 access blocking

Currently I'm doing a project that needs some simple web scraping, for which I'm using the CyberNeko HTML Parser (nekoHTML), which relies on Xerces. While running my first simple spike I encountered:

Exception in thread "main" java.io.IOException: 
Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Whoa, my first guess was that the URL was wrong, but it turned out to be okay. After a quick search I found that the cause is a deliberate move by the W3C, as can be read in the W3C Systems Team Blog post "W3C's Excessive DTD Traffic". The essence of the problem is:

..In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years.

Ouch, that's a lot of bandwidth. I've heard of a similar problem, though on a much smaller scale, with the schemas for OIOUBL Invoice version 0.7 from the former ISB (Info Structure Base) repository.


Luckily it is quite easy to do it right yourself with XML Catalogs, pulling from a local repository. cbowditch's guide "Resolving DTD system URI with XMLCatalogResolver" helped me a lot, and I didn't run into any of the problems he had. First you'll need to fetch the DTD and its dependencies (the entity files xhtml-lat1.ent, xhtml-symbol.ent and xhtml-special.ent that the DTD references), then create a catalog file:

    <?xml version="1.0"?>
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
              uri="xhtml1-transitional.dtd"/>
    </catalog>

and add its use to the code like this:

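import org.apache.xerces.parsers.DOMParser;  // assumed; nekoHTML's own DOMParser extends this class
import org.apache.xerces.util.XMLCatalogResolver;

// Resolver backed by the local catalog file
XMLCatalogResolver resolver = new XMLCatalogResolver();
resolver.setCatalogList(new String[] {"etc/dtd/xhtml1-catalog.xml"});

// Register the resolver so the parser looks up DTDs through the catalog
DOMParser parser = new DOMParser();
parser.setProperty("http://apache.org/xml/properties/internal/entity-resolver", resolver);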
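
For readers without Xerces on the classpath, the same catalog lookup can be tried out with the javax.xml.catalog API that ships with the JDK since Java 9. This is a swapped-in technique, not what this post used, and only a minimal sketch: the class name JdkCatalogSketch and the temporary directory are made up for illustration. It writes the catalog from above to a temp folder and asks the resolver where the XHTML 1.0 Transitional DTD should be loaded from.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;

// Hypothetical helper class, not part of the original post.
final class JdkCatalogSketch {
    static final String PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";
    static final String SYSTEM_ID =
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";

    // Same catalog content as shown in the post.
    static final String CATALOG =
            "<?xml version=\"1.0\"?>\n"
            + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
            + "  <public publicId=\"" + PUBLIC_ID + "\"\n"
            + "          uri=\"xhtml1-transitional.dtd\"/>\n"
            + "</catalog>\n";

    /** Looks the identifiers up in the catalog and returns the URI they map to. */
    static String resolve(Path catalogFile, String publicId, String systemId) {
        CatalogResolver resolver = CatalogManager.catalogResolver(
                CatalogFeatures.defaults(), catalogFile.toUri());
        // No network access happens here; this is a pure catalog lookup.
        return resolver.resolveEntity(publicId, systemId).getSystemId();
    }

    /** Writes the catalog to a temp directory and resolves the XHTML DTD against it. */
    static String demo() {
        try {
            Path dir = Files.createTempDirectory("dtd-catalog");
            Path catalogFile = dir.resolve("xhtml1-catalog.xml");
            Files.write(catalogFile, CATALOG.getBytes(StandardCharsets.UTF_8));
            return resolve(catalogFile, PUBLIC_ID, SYSTEM_ID);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints a file: URI next to the catalog instead of the w3.org URL.
        System.out.println(demo());
    }
}
```

The relative uri in the catalog is resolved against the location of the catalog file, which is why the DTD is expected in the same folder. Since CatalogResolver also implements org.xml.sax.EntityResolver, the same object could be handed to a JAXP DocumentBuilder via setEntityResolver.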


flakstad said...

Thank you very much for posting this article. A great tip that aids both me and others who are getting 503s, as well as the W3C. Although I'm not quite sure where you placed your local DTD and the catalog file, I'm sure I'll figure it out. Thanks.


Sweetxml said...

Hi flakstad
Thank you for your comment. I expect you have figured it out easily, but just for anyone who wonders:
The catalog file was just placed relative to my build, and the DTD along with the entity files were placed in the same folder. You can place them anywhere you like as long as you correct the references.
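
As a rough illustration of that layout, the small check below lists which of the expected files are missing from a folder. It is only a sketch under assumptions: the class name DtdFolderCheck is made up, the folder etc/dtd and the catalog name are taken from the code in the post, and the three .ent names are the entity files referenced by the XHTML 1.0 Transitional DTD.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original post.
final class DtdFolderCheck {
    // Catalog, DTD and the entity files the DTD pulls in, all in one folder.
    static final String[] REQUIRED = {
        "xhtml1-catalog.xml",
        "xhtml1-transitional.dtd",
        "xhtml-lat1.ent",
        "xhtml-symbol.ent",
        "xhtml-special.ent"
    };

    /** Returns the names of the required files that are not present in dir. */
    static List<String> missing(Path dir) {
        List<String> result = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(dir.resolve(name))) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "etc/dtd" matches the catalog path used in the post's code.
        System.out.println("Missing: " + missing(Paths.get("etc/dtd")));
    }
}
```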

Best regards

Henry said...

This is a great article and it helped resolve the issue.

But I am not able to find the correct directory to keep the catalog files in. Please note that I am using this in a web application and the Java class is called from a JSP. So how does this work for a web app? I am using Tomcat 1.5 with Java 5.

Sweetxml said...

Hi Henry

Thank you, glad it was of value to you.

It's been some time since I looked at it. I'd suggest you look closer at XMLCatalogResolver for clues. Maybe overriding resolveResource could be a way to go.

In general, if you want to load files in the webapp, you either make them publicly available, just like images, HTML and JSP files, or place them on the classpath, e.g. in /WEB-INF/classes/, and obtain the class loader from the current object. Doing a quick search I found this guide: "URL to load resources from the classpath in Java".

Maybe there's help to be found in this article: "Tip: Load resources from the classpath". I didn't really read it, but it appeared to be relevant.
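
The classpath approach could look roughly like the sketch below. It is untested against Tomcat; the class name CatalogLocator and the resource path dtd/xhtml1-catalog.xml mentioned afterwards are hypothetical. The idea is to ask the class loader for the catalog and turn the result into an absolute URL string.

```java
import java.net.URL;

// Hypothetical helper class, not part of the original post.
final class CatalogLocator {
    /**
     * Returns the absolute URL of a classpath resource, for example a catalog
     * file placed under /WEB-INF/classes/ in a web application.
     */
    static String catalogUrl(ClassLoader loader, String resourcePath) {
        URL url = loader.getResource(resourcePath);
        if (url == null) {
            // Fail early with a readable message instead of a later parse error.
            throw new IllegalArgumentException(
                    "Not found on classpath: " + resourcePath);
        }
        return url.toExternalForm();
    }
}
```

With the catalog stored as /WEB-INF/classes/dtd/xhtml1-catalog.xml, the string returned by CatalogLocator.catalogUrl(getClass().getClassLoader(), "dtd/xhtml1-catalog.xml") could then be passed to resolver.setCatalogList(...). Whether the resolver can read the resulting URL when the application runs from an unexpanded WAR is something I have not verified.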

Best regards

Murali said...

Hi Brian,

That's a good article. May I know how to fetch the DTDs (including dependencies) locally? When I visit the site of the DTDs it shows a redirection URL.


Sweetxml said...

Hi Murali

I'm not sure I really understand your problem, since when I GET it I have no problem and get a clean status of 200:

$ wget -S http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
--09:20:29-- http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
=> `xhtml1-transitional.dtd.1'
Resolving www.w3.org...,,, ...
Connecting to www.w3.org||:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 21 Jul 2010 07:20:29 GMT
Server: Apache/2
Content-Location: xhtml1-transitional.dtd.raw
Vary: negotiate,accept-encoding
TCN: choice
Last-Modified: Thu, 01 Aug 2002 18:37:56 GMT
ETag: "7d6f-3a72ac59d0900;475d1b7e9a540"
Accept-Ranges: bytes
Content-Length: 32111
Cache-Control: max-age=7776000
Expires: Tue, 19 Oct 2010 07:20:29 GMT
P3P: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Connection: close
Content-Type: application/xml-dtd; charset=utf-8
Length: 32,111 (31K) [application/xml-dtd]

Brgds Brian

Murali said...

Never mind, I just tried to open an XHTML file pointing to the strict DTD "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" in IE and it never returned anything.

And when I tried programmatically it gave a 503 error. So I just checked with you.

I have fixed it by downloading the DTD through another browser, and with the local DTD it works like a charm.

Thanks for your response.