I'm currently working on a project that needs some simple web scraping, for which I'm using the CyberNeko HTML Parser (NekoHTML), which relies on Xerces. While running my first simple spike I encountered:
Exception in thread "main" java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
Whoa, my first guess was that the URL was wrong, but it turned out to be fine. After a quick search I found that the 503 is a deliberate move by the W3C, as explained in the W3C Systems Team blog post "W3C's Excessive DTD Traffic". The essence of the problem is:
..In particular, software does not usually need to fetch these resources, and certainly does not need to fetch the same one over and over! Yet we receive a surprisingly large number of requests for such resources: up to 130 million requests per day, with periods of sustained bandwidth usage of 350Mbps, for resources that haven't changed in years.
Ouch, that's a lot of bandwidth. I've heard of a similar problem, though on a much smaller scale, with the schemas for OIOUBL Invoice version 0.7 from the former ISB (Info Structure Base) repository.
Solution
Luckily, it is quite easy to do the right thing yourself with XML Catalogs, resolving the DTD from a local repository. cbowditch's guide "Resolving DTD system URI with XMLCatalogResolver" helped me a lot, and I didn't run into any of the problems he had. First you'll need to fetch the DTD and its dependencies:
- http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
- http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
- http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent
- http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent
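Fetching the four files above is a one-time job. A minimal sketch of doing it from Java follows, assuming Java 7 or later and a target folder of etc/dtd (the folder used in the catalog path further down). The class name DtdFetcher and its methods are made up for illustration. If the server refuses the request, saving the files by hand with a browser works just as well.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: downloads the DTD and its entity files once.
class DtdFetcher {
    // The four resources listed above.
    static final String[] SOURCES = {
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent",
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent"
    };

    // Local file name = last path segment of the URL.
    static String localName(String url) {
        return url.substring(url.lastIndexOf('/') + 1);
    }

    // Downloads every source into targetDir, skipping files already present
    // so the W3C server is hit at most once per file.
    static void fetchAll(Path targetDir) throws IOException {
        Files.createDirectories(targetDir);
        for (String source : SOURCES) {
            Path target = targetDir.resolve(localName(source));
            if (Files.exists(target)) {
                continue;
            }
            try (InputStream in = URI.create(source).toURL().openStream()) {
                Files.copy(in, target);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed default folder; pass another one as the first argument.
        fetchAll(Paths.get(args.length > 0 ? args[0] : "etc/dtd"));
    }
}
```

Keeping all four files in the same folder matters, because the DTD refers to the .ent files by relative name.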
Then create a catalog file:
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
          uri="xhtml1-transitional.dtd"/>
</catalog>
and add its use to the code like this:
..
XMLCatalogResolver resolver = new XMLCatalogResolver();
resolver.setCatalogList(new String[] {"etc/dtd/xhtml1-catalog.xml"});
DOMParser parser = new DOMParser();
parser.setProperty("http://apache.org/xml/properties/internal/entity-resolver", resolver);
..
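If you would rather not depend on XMLCatalogResolver, the same idea can be sketched with nothing but the JDK: a SAX EntityResolver that maps the public ID to the local copy. This is a different technique from the catalog approach above, shown with the JDK's own DocumentBuilder rather than NekoHTML. The class LocalDtdResolver and its map and parse methods are made-up names, and the sketch assumes Java 7 or later.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical resolver: serves known public IDs from local files.
class LocalDtdResolver implements EntityResolver {
    private final Map<String, Path> byPublicId = new HashMap<>();

    // Registers a local file for a public ID, like a <public> catalog entry.
    void map(String publicId, Path localFile) {
        byPublicId.put(publicId, localFile);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) throws IOException {
        Path local = publicId == null ? null : byPublicId.get(publicId);
        if (local == null) {
            // Unknown ID: let the parser fall back to its default behaviour.
            return null;
        }
        InputSource source = new InputSource(Files.newInputStream(local));
        source.setPublicId(publicId);
        // Use the local file as system ID so relative references inside
        // the DTD (the .ent files) are looked up in the same folder.
        source.setSystemId(local.toUri().toString());
        return source;
    }

    // Parses an XML string with the given resolver installed.
    static Document parse(String xml, EntityResolver resolver) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }
}
```

Usage would be along the lines of creating a LocalDtdResolver, calling map("-//W3C//DTD XHTML 1.0 Transitional//EN", Paths.get("etc/dtd/xhtml1-transitional.dtd")) and handing it to the parser. The catalog file is the better choice once more than a handful of IDs are involved, since the mapping then lives outside the code.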
3 comments :
Thank you very much for posting this article. A great tip that aids both me and others who are getting 503s, as well as the W3C. Although I'm not quite sure where you placed your local DTD and the catalog file, I'm sure I'll figure it out. Thanks.
Cheers,
Paul.
Hi flakstad
Thank you for your comment. I expect you have figured it out by now, but just for anyone else who wonders:
The catalog file was simply placed relative to my build, and the DTD along with the entity files were placed in the same folder. You can place them anywhere you like as long as you correct the references.
Best regards
Brian
This is a great article and it helped resolve the issue.
But I am not able to find the correct directory to keep the catalog files in. Please note that I am using this in a web application and the Java class is called from a JSP. So how does this work for a web app? I am using Tomcat 1.5 with Java 5.