Tip: Convert from HTML to XML with HTML Tidy

By Benoit Marchal
2003-12-16
Tidying Up

Obviously, the first step is to download and install HTML Tidy (which you'll find in Resources). HTML Tidy is available on most platforms, including Windows, Linux, and MacOS. The default executable is a command-line tool, but GUI versions are available for Windows and MacOS.

To run HTML Tidy, open a terminal and issue the following command:
tidy -asxhtml -numeric < index.html > index.xml


That's it! HTML Tidy immediately converts index.html into index.xml. HTML Tidy will print messages that highlight issues with the original HTML document during the conversion. In most cases, you can safely ignore these messages.
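
If you would rather call the command above from a script than from a terminal, the same conversion can be sketched in Python. This is a minimal sketch and not part of the original article: it assumes a tidy executable is on the PATH, and the names TIDY_CMD, run_filter, and convert_file are invented for illustration. It keeps the converted document (standard output) apart from the messages HTML Tidy prints (standard error), so you can log or ignore the messages as you see fit.

```python
import subprocess

# Flags copied from the command above; "tidy" being on the PATH is an assumption.
TIDY_CMD = ("tidy", "-asxhtml", "-numeric")


def run_filter(data: bytes, cmd=TIDY_CMD):
    """Pipe data through cmd and return (converted_bytes, messages)."""
    proc = subprocess.run(list(cmd), input=data, capture_output=True)
    messages = proc.stderr.decode(errors="replace")
    # HTML Tidy exits with status 1 when it only issued warnings and 2 when
    # it hit errors, so only a status of 2 or more is treated as a failure.
    if proc.returncode >= 2:
        raise RuntimeError(messages)
    return proc.stdout, messages


def convert_file(src_path, dst_path, cmd=TIDY_CMD):
    """File-based wrapper, equivalent to using the < and > redirections."""
    with open(src_path, "rb") as src:
        converted, messages = run_filter(src.read(), cmd)
    with open(dst_path, "wb") as dst:
        dst.write(converted)
    return messages
```

Under these assumptions, convert_file("index.html", "index.xml") would mirror the command line shown above and return the messages as a string.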

HTML Tidy runs as a filter: it reads the document from standard input and prints the result to standard output. The redirection operators (< and >) allow you to work with files. By default, HTML Tidy produces a clean HTML page, but you can set two options to output XML instead:

-asxhtml outputs XHTML documents instead of HTML.

-numeric uses numeric character references instead of named HTML entities. For example, &icirc; is replaced with &#238;.
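
The reason numeric references matter is that an XML parser knows only five predefined entity names (amp, lt, gt, quot, and apos) unless it loads a DTD, so a named entity such as &icirc; is undefined in plain XML. The following Python sketch illustrates the kind of substitution involved. It is an illustration only, not HTML Tidy's implementation: the function name to_numeric is invented here, and leaving the XML-predefined names untouched is a choice made for this sketch.

```python
import re
from html.entities import name2codepoint

# The five entity names every XML parser understands without a DTD.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric(text: str) -> str:
    """Replace named HTML entities with numeric character references."""

    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED:
            return match.group(0)  # already safe in XML
        code = name2codepoint.get(name)
        # Unknown names are left alone rather than guessed at.
        return match.group(0) if code is None else f"&#{code};"

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, text)


print(to_numeric("d&icirc;ner &amp; caf&eacute;"))
```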


XPaths and empty elements

You must be careful when processing XHTML documents with XSL. XHTML is primarily a formatting language and, unlike other XML vocabularies, it adds little structure to documents. To recover the structure, you have to analyze the document and carefully craft the appropriate XPaths. In this example, it was not immediately obvious how to separate the image title from its description: There's only a line break (<br/>) between them. Because the line break is an empty tag, it's not enough to select it to retrieve the text! Ultimately, I used the preceding-sibling axis to load the text before the empty tag.
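
To make the empty-element problem concrete, here is a small Python sketch using the standard library's ElementTree instead of XSL. It is not the author's stylesheet: the fragment and the function name split_on_br are hypothetical, and the fragment omits the XHTML namespace that real HTML Tidy output may declare. ElementTree has no preceding-sibling axis, so the sketch reads the text before the line break from the parent's text and the text after it from the line break's tail.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: a title and a description separated only by <br/>.
# Real XHTML usually declares a namespace, in which case the tag to search
# for would be "{http://www.w3.org/1999/xhtml}br" instead of "br".
FRAGMENT = "<p>Sunset<br/>Taken from the pier</p>"


def split_on_br(paragraph):
    """Return (text before the first <br/>, text after it)."""
    br = paragraph.find("br")
    if br is None:
        return (paragraph.text or "").strip(), ""
    # The empty element carries no text of its own: the text before it
    # belongs to the parent, and the text after it is the element's tail.
    before = (paragraph.text or "").strip()
    after = (br.tail or "").strip()
    return before, after


title, description = split_on_br(ET.fromstring(FRAGMENT))
print(title)
print(description)
```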



The difference between XHTML and HTML might sound trivial (it's only an extra "X" after all), but it is important. XHTML is a version of HTML 4.01 that has been adapted to the XML syntax. The vocabulary is unchanged (XHTML uses the familiar <p>, <b>, and <a> tags, for example), but the syntax is XML, so it fits nicely into an XML workflow.

The main differences between HTML and XHTML are:

  • XML elements must have both opening and closing tags, unless they are empty elements. HTML does not require the closing tag for many elements, such as <p>.

  • Empty elements follow the XML convention. For example, the line break is written as <br /> instead of <br>.

  • Attribute values are always quoted (for example, <a href="http://www.marchal.com"> instead of <a href=http://www.marchal.com>).


Listing 2 is the file that HTML Tidy produces when Listing 1 is provided as input. As you can see, it is a valid XML document, and it takes surprisingly little work to produce it.
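
The differences listed above can be checked with any XML parser. The following Python sketch feeds HTML-style and XHTML-style fragments to the standard library's ElementTree parser; the helper name is_well_formed and the sample fragments are made up for illustration. It only tests well-formedness and does not validate against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


samples = [
    "<div><p>one<p>two</div>",                    # HTML: closing </p> omitted
    "<div><p>one</p><p>two</p></div>",            # XHTML: every tag closed
    "<p>one<br>two</p>",                          # HTML: empty element as <br>
    "<p>one<br />two</p>",                        # XHTML: empty element as <br />
    "<a href=http://www.marchal.com>x</a>",       # HTML: unquoted attribute
    '<a href="http://www.marchal.com">x</a>',     # XHTML: quoted attribute
]

for sample in samples:
    print(is_well_formed(sample), sample)
```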



    First published by IBM developerWorks

