Tip: Convert from HTML to XML with HTML Tidy
By Benoit Marchal
2003-12-16

Tool Of The Trade
The basic tool you can use to upgrade a site from HTML to XML is HTML Tidy. Originally developed by Dave Raggett and distributed under an open source license through the W3C Web site, HTML Tidy is now maintained by a group of volunteers at SourceForge. A Java-language version (aptly called JTidy) is also available (see Resources). Last but not least, an API allows you to integrate HTML Tidy as a library in your applications.
HTML and XML are both markup languages derived from SGML, so they have a lot in common. Still, there are two major differences:
XML syntax is far more restrictive; most importantly, in XML you must remember to close the tags.
HTML has often been coded rather carelessly, so existing files are rarely trouble-free to start with.
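To make the first difference concrete, here is a minimal sketch in Python (not part of HTML Tidy) that feeds the same sloppy markup to a lenient HTML parser and to a strict XML parser from the standard library. The sample markup and the names TagCollector and is_well_formed_xml are invented for this illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Records every start tag; HTMLParser does not object to missing end tags."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def is_well_formed_xml(markup: str) -> bool:
    """Return True only if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample markup: the <p> and <br> tags are never closed.
SLOPPY = "<html><body><p>First photo<br><p>Second photo</body></html>"
# The same content with every tag closed, as XML requires.
STRICT = "<html><body><p>First photo<br/></p><p>Second photo</p></body></html>"

collector = TagCollector()
collector.feed(SLOPPY)
collector.close()

print(collector.start_tags)        # the lenient parser reads the sloppy page
print(is_well_formed_xml(SLOPPY))  # the XML parser rejects it
print(is_well_formed_xml(STRICT))  # and accepts the cleaned-up version
```

The lenient parser walks through the unclosed tags without complaint, much as a browser would, while the XML parser stops at the first mismatched tag.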
Early Web browsers encouraged sloppiness among webmasters by being extraordinarily tolerant of errors. At the time, the goal of these browsers was to get as many people on board as possible and to encourage webmasters to publish documents. The strategy worked, and Web content grew exponentially.
Still, poor coding practices caused all kinds of incompatibilities, and HTML Tidy was originally designed to address this. It rewrites HTML pages to be conformant with the latest W3C standards. In the process, it fixes many common errors such as unclosed tags.
Although HTML Tidy primarily works with HTML pages, it also supports XHTML, an XML vocabulary.
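As a rough sketch of how such a conversion might be scripted, the Python snippet below builds and runs a command line for the tidy executable. It assumes that tidy is installed and on the PATH; the file names and the helper names build_tidy_command and convert_to_xhtml are hypothetical, and the options shown (-asxhtml, -numeric, -quiet, -o) should be checked against the documentation for your version of HTML Tidy.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_tidy_command(source: Path, target: Path) -> List[str]:
    """Build a tidy command line that writes XHTML output to target."""
    return [
        "tidy",
        "-quiet",        # suppress nonessential output
        "-asxhtml",      # convert HTML to well-formed XHTML
        "-numeric",      # write numeric rather than named entities
        "-o", str(target),
        str(source),
    ]


def convert_to_xhtml(source: Path, target: Path) -> int:
    """Run tidy and return its exit status (assumes tidy is on the PATH)."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    # check=False because tidy uses nonzero exit codes for warnings too:
    # 0 means clean, 1 means warnings, 2 means errors.
    result = subprocess.run(build_tidy_command(source, target), check=False)
    return result.returncode


if __name__ == "__main__":
    # Hypothetical file names; substitute a page from your own site.
    status = convert_to_xhtml(Path("index.html"), Path("index.xhtml"))
    print("tidy exit status:", status)
```

An exit status of 1 is common with legacy pages, since tidy reports every repair it makes as a warning; only a status of 2 means that no usable output was produced.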
As an example, I will work with a photo gallery generated with Photoshop. You can use other HTML documents, but if you'd like to experiment with the same files I use, the gallery is also available for download in the Resources section. Listing 1 is an excerpt from the gallery -- as you can see, it's plain HTML code.
First published by IBM developerWorks