Grab Headlines From A Remote RSS File
By Nicholas Chase
2003-12-19

Adjusting For Multiple Formats
If all RSS files were like this sample, you wouldn't need to do anything else. Unfortunately, this is not the case. Different vendors and toolkits can produce additional information, or can replace core information with RDF information or other namespaced modules, leading to complaints that supporting RSS is complex because of all the variations. But with the use of XSL transformations, it doesn't have to be that way.
For example, an RSS 2.0 feed might also contain RDF information, like this feed from Typographica:
Listing 6. Excerpt from sample RSS 2.0 message with RDF
<?xml version="1.0" encoding="iso-8859-1"?>
<rss version="2.0"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:admin="http://webns.net/mvcb/"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:content="http://purl.org/rss/1.0/modules/content/">
<channel>
<title>Typographica</title>
<link>http://typographi.ca/</link>
<description>A daily journal of typography featuring news, observations,
and open commentary on fonts and typographic design.</description>
<dc:language>en-us</dc:language>
<dc:creator>Stephen Coles</dc:creator>
<dc:rights>Copyright 2003</dc:rights>
<dc:date>2003-07-24T00:00:52-08:00</dc:date>
<admin:generatorAgent rdf:resource="http://www.movabletype.org/?v=2.63" />
<admin:errorReportsTo rdf:resource="mailto:scoles@gomakecontact.com" />
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>1</sy:updateFrequency>
<sy:updateBase>2000-01-01T12:00+00:00</sy:updateBase>
<item>
<title>Hot and Cold Fonts</title>
<link>http://typographi.ca/000643.php</link>
<description>LettError have developed a multiple master font
for the Design Institute of the University of Minnesota that varies
along three...</description>
<guid isPermaLink="false">643@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p><a href="http://www.letterror.com/">
LettError</a> have developed a multiple master font for the
<a href="http://design.umn.edu/">Design Institute</a> of the University of
Minnesota that varies along three dimensions: formality, informality, and
"weirdness." (It's apparently possible to be 100% formal and 100% informal at
the same time.) As the New York Times...]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-24T00:00:52-08:00</dc:date>
</item>
<item>
<title>Textura Digita</title>
<link>http://typographi.ca/000642.php</link>
<description>CNN reports that the Gutenberg Bible is now available
on the web via the Ransom Center at the University of...</description>
<guid isPermaLink="false">642@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p><a href=
"http://www.cnn.com/2003/TECH/internet/07/23/digital.scripture.ap/index.html">
CNN reports</a> that the Gutenberg Bible is now available on the web via the
<a href="http://www.hrc.utexas.edu/exhibitions/permanent/gutenberg/">Ransom
Center</a> at the University of Texas.</p>
...]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-23T13:16:15-08:00</dc:date>
</item>
<item>
<title>Fight! Fight! Fight!</title>
<link>http://typographi.ca/000640.php</link>
<description>Angry because you had to miss TypeCon &#8217;03?
Work out that aggression with Helvetica vs. Arial....</description>
<guid isPermaLink="false">640@http://typographi.ca/</guid>
<content:encoded><![CDATA[<p>Angry because you had to miss
<a href="http://www.typecon2003.com/">TypeCon ’03</a>? Work out that
aggression with <a href="http://www.engagestudio.com/helvetica/">Helvetica vs.
Arial</a>.</p>]]></content:encoded>
<dc:subject></dc:subject>
<dc:date>2003-07-22T08:52:36-08:00</dc:date>
</item>
...
</channel>
</rss>
Notice that this feed actually contains two different descriptions of the content. The first is in the description element, and the second is in the content:encoded element, which belongs to the http://purl.org/rss/1.0/modules/content/ namespace. Here you see how differently feeds can handle the same information. Adam Curry's blog simply encodes markup such as links and drops it into the description element, whereas Typographica (or rather the toolkit that produces Typographica's feed) provides a markup-free summary in the description element and the full version, wrapped in a CDATA section, in the content:encoded element.
Although it is preferable to create a custom presentation for each feed type in order to take advantage of any extra information, this is not always practical from an application development standpoint. But that doesn't mean you have to give up. Instead, you can create a transformation that simply takes different feeds and converts them to a standard structure, which you can then feed to the final transformation.
For example, you can create a stylesheet that takes an RSS 2.0 feed and, if an item contains a content:encoded element, uses it in place of the item's description element:
Listing 7. Transforming RDF information
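<?xml version="1.0"?>
<!-- The content prefix is declared so that content:encoded can be matched;
     an unprefixed "encoded" test would never match a namespaced element. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:content="http://purl.org/rss/1.0/modules/content/"
    exclude-result-prefixes="content" version="1.0">
  <xsl:template match="/">
    <rss>
      <channel>
        <xsl:apply-templates select="rss/channel" />
      </channel>
    </rss>
  </xsl:template>
  <!-- Channel-level elements the final stylesheet needs are copied as-is -->
  <xsl:template match="title|link|/rss/channel/description|image|text()">
    <xsl:copy-of select="." />
  </xsl:template>
  <!-- Default: keep the item's own description -->
  <xsl:template match="item">
    <item>
      <title><xsl:value-of select="title" /></title>
      <link><xsl:value-of select="link" /></link>
      <description><xsl:value-of select="description" /></description>
    </item>
  </xsl:template>
  <!-- More specific match: the full content replaces the description -->
  <xsl:template match="item[content:encoded]">
    <item>
      <title><xsl:value-of select="title" /></title>
      <link><xsl:value-of select="link" /></link>
      <description><xsl:value-of select="content:encoded" /></description>
    </item>
  </xsl:template>
</xsl:stylesheet>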
This stylesheet makes copies of the elements that the final stylesheet will need, such as the channel's title and description, and makes a copy of the item with the appropriate description information.
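One way to check how such a stylesheet behaves is to run it against a small in-memory feed. The sketch below is not from the original article: the class name InterimStylesheetDemo, the cut-down stylesheet, and the two-item feed are all made up for illustration. It handles only the item description, and it declares the content prefix in the stylesheet because the encoded element lives in a namespace.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper class, not part of the article's RSSProcessor.
class InterimStylesheetDemo {

    // Cut-down interim stylesheet (an assumption for this sketch): it keeps
    // only each item's description, preferring content:encoded when present.
    static final String STYLE =
          "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
        + " xmlns:content='http://purl.org/rss/1.0/modules/content/'"
        + " exclude-result-prefixes='content'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<items><xsl:apply-templates select='rss/channel/item'/></items>"
        + "</xsl:template>"
        + "<xsl:template match='item'>"
        + "<item><description><xsl:value-of select='description'/></description></item>"
        + "</xsl:template>"
        + "<xsl:template match='item[content:encoded]'>"
        + "<item><description><xsl:value-of select='content:encoded'/></description></item>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Made-up feed: the first item carries both forms, the second only a description.
    static final String FEED =
          "<rss version='2.0'"
        + " xmlns:content='http://purl.org/rss/1.0/modules/content/'>"
        + "<channel>"
        + "<item><description>short summary</description>"
        + "<content:encoded><![CDATA[<p>full text</p>]]></content:encoded></item>"
        + "<item><description>plain only</description></item>"
        + "</channel></rss>";

    // Applies STYLE to the given feed and returns the serialized result.
    static String normalize(String feedXml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLE)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(feedXml)),
                new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws TransformerException {
        System.out.println(normalize(FEED));
    }
}
```

The template with the predicate is more specific than the plain item template, so XSLT's default priority rules select it whenever an item has a content:encoded child. The markup inside the CDATA section arrives in the description as escaped text, which is the same form Adam Curry's feed uses.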
Now you just have to weave that new document into the final transformation:
Listing 8. Chaining the transformation
...
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.dom.DOMResult;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
public class RSSProcessor {
...
    public void setRSSFile(String fileName) {
        try {
            // First pass: normalize the feed with the interim stylesheet (Listing 7).
            StreamSource interimSource = new StreamSource(fileName);
            String XSLSheetName = "2.0.xsl";
            StreamSource style = new StreamSource(XSLSheetName);

            // The interim result is held in memory as a DOM Document, not written to a file.
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            DocumentBuilder db = dbf.newDocumentBuilder();
            Document interimDoc = db.newDocument();
            DOMResult interimResult = new DOMResult(interimDoc);

            TransformerFactory transFactory = TransformerFactory.newInstance();
            Transformer interimTransformer = transFactory.newTransformer(style);
            interimTransformer.transform(interimSource, interimResult);

            // Second pass: the interim Document becomes the source for the final stylesheet.
            DOMSource source = new DOMSource(interimDoc);
            StreamSource finalStyle = new StreamSource("final.xsl");
            String outputURL = "headlines.html";
            Transformer transformer = transFactory.newTransformer(finalStyle);

            // try-with-resources closes the output file even if the transformation fails.
            try (FileOutputStream out = new FileOutputStream(outputURL)) {
                transformer.transform(source, new StreamResult(out));
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Take a look at this one step at a time. First, you create an interim transformation that takes the initial feed and transforms it according to the interim stylesheet in Listing 7, named 2.0.xsl. The result of this first transformation goes not to a file but to a DOM Document object, which is then passed as the source for the second transformation.
The name of the interim stylesheet, 2.0.xsl, was deliberate. By naming it after the version, you can create a more flexible system.
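The text at this point does not show how that flexibility would be implemented, so the following is only one possible reading: derive the stylesheet name from the version attribute on the feed's root element. The class StylesheetChooser, the method names, and the fallback file default.xsl are all hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the article's RSSProcessor.
class StylesheetChooser {

    // Assumed fallback for feeds without a usable version attribute.
    static final String FALLBACK = "default.xsl";

    // Maps <rss version="2.0"> to "2.0.xsl", <rss version="0.91"> to "0.91.xsl".
    static String stylesheetFor(Document feed) {
        Element root = feed.getDocumentElement();
        String version = root.getAttribute("version"); // "" when the attribute is absent

        // The version comes from a remote file, so accept only dotted digits
        // before using it as part of a file name.
        if (!"rss".equals(root.getNodeName()) || !version.matches("[0-9]+(\\.[0-9]+)*")) {
            return FALLBACK;
        }
        return version + ".xsl";
    }

    // Convenience parser for trying the method out on inline XML.
    static Document parse(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return db.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document feed = parse("<rss version='2.0'><channel/></rss>");
        System.out.println(stylesheetFor(feed));
    }
}
```

With something like this in place, the hard-coded "2.0.xsl" in Listing 8 could be replaced by the computed name, and supporting another feed version would mean adding one more interim stylesheet rather than changing the Java code.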
First published by
IBM developerWorks