RSS Nightmare

·

One of the main reasons for developing Atom was the problem with RSS 2.0 where the type of content in some elements is unknown: the title and description elements can contain either text or escaped HTML, and there's no way of knowing which is which. In Atom there's an explicit type attribute on elements (with 'text' being the default for titles and content).

Generally aggregators treat the body of an RSS item as HTML and the title as text, though it varies (in NetNewsWire the title of an item can be shown as text in the title list pane and HTML in the content pane, both at the same time) and some are intelligent, only unescaping where it seems appropriate. Combine this with feed parsers that might double-unescape escaped HTML entities, plus the need to escape or strip dangerous elements for displaying on a web page, and you get all kinds of problems. Here's a great example:

  1. Someone posts to the Yahoo UI group with a message containing <script> in the title.
  2. The Yahoo group has an RSS 2.0 feed which encodes that as <title>&lt;script&gt; tag inside yui panel</title>. At this point it's already ambiguous whether the author was inserting a <script> element in the title, or just writing '<script>'.
  3. Firefox decides the title is text rather than escaped HTML, so displays that as <script> tag inside yui panel.
  4. However another message has The latest post with &quot;Re: &lt;script&gt; ... &quot; in the description field. Firefox believes this to be HTML, so displays " for the quote entity and strips out the script tag, leaving The latest post with "Re: ... " is not being escaped on the page
  5. As that message suggests, the YUI developer front page aggregated the feed and didn't escape or strip the script elements, so the script element got embedded in the page and broke the rest of it (also a security hole as the script element could have contained extra code).
  6. 2lmc noticed and wrote about it, escaping the content properly as &lt;script&gt; for the HTML page and producing an RSS 1.0 feed which (as RSS 1.0 elements are also ambiguous about text vs HTML) double-escaped the angle brackets to make &amp;lt;script&amp;gt;
  7. Firefox deals with that as escaped HTML again and displays <script> in the HTML preview of the item body.
  8. Subscribing to the feed directly in NetNewsWire would presumably have correctly displayed the same as Firefox, but because I have syncing with Newsgator turned on, all my subscriptions get routed through Newsgator's server which normalises them all to RSS 2.0 (argh!), unescaping the HTML entities once in the process so that by the time it gets to NetNewsWire the description field contains <script> again, where it's treated as a script element in HTML (including everything after it, as the tag isn't closed). Even without Javascript enabled in NetNewsWire, that leaves this:

    2007-07-31-Escaping

  9. Newsgator Online has given up completely, and isn't showing any of the items that are currently in the 2lmc feed.

If everyone used Atom (and took care to strip dangerous elements from HTML before displaying entries), this wouldn't be a problem.