The Complexities of Really Simple Syndication

The RSS (Really Simple Syndication) Logo. RSS is so useful but abused.

Shouldn't an internet feature with “Really Simple” in its name in fact be really simple? After working with Really Simple Syndication (RSS), I have found that, unfortunately, this is not true, or is no longer true. In fact, the widening variations in how this very useful tool is used threaten to reduce its effectiveness.

Wikipedia has a great definition of RSS:

RSS (most commonly expanded as “Really Simple Syndication”) is a family of web feed formats used to publish frequently updated works—such as blog entries, news headlines, audio, and video—in a standardized format.[2] An RSS document (which is called a “feed”, “web feed”,[3] or “channel”) includes full or summarized text, plus metadata such as publishing dates and authorship. Web feeds benefit publishers by letting them syndicate content automatically. They benefit readers who want to subscribe to timely updates from favored websites or to aggregate feeds from many sites into one place.

For content publishers and webmasters not familiar with RSS, there is a great tutorial online.

Surprisingly, the last nearly “approved” standard was RSS 2.0, agreed upon and published by the W3C in August 2002. Harvard Law also published this standard, but dated July 2003. I will not nitpick these dates, given that we are now in the year 2010 and RSS has splintered badly. It has been 8 years since, and it shows.

There has been some undated work by Aaron Swartz on RSS 3.0. I see no widespread agreement for his proposal. While this is disconcerting, I note that his proposed spec also greatly diverges from how RSS is used today. In fact, you really need to dig deep into Google to find his document.

So how does RSS work? When you publish a document to the web, RSS allows you to also publish a “teaser”, which includes your document’s title, a short description and a link back to your document. These teasers are collected to make a “feed” for your web site. Feeds are collected or aggregated when other readers, search engines or others subscribe to your feed. As you publish a new document, all those who have subscribed to your feed get informed and can then read your document. You need not tell others you have created a new document. In a way RSS is like a one way document messaging system.
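To make the “teaser” idea concrete, here is a minimal Python sketch that builds a feed from (title, description, link) teasers using only the standard library. The function name `build_feed`, the example.com URLs and the sample text are all made up for illustration; real feeds usually also carry a publication date and other optional fields.

```python
import xml.etree.ElementTree as ET


def build_feed(site_title, site_link, site_description, teasers):
    """Build a bare-bones RSS 2.0 document from (title, description, link) teasers.

    Hypothetical helper for illustration only.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_link
    ET.SubElement(channel, "description").text = site_description

    # One <item> per published document: just the teaser, not the document.
    for title, description, link in teasers:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "description").text = description
        ET.SubElement(item, "link").text = link

    return ET.tostring(rss, encoding="unicode")


# Example data is invented for this sketch.
feed_xml = build_feed(
    "Example Blog",
    "https://example.com/",
    "Posts from an example blog",
    [(
        "First post",
        "A couple of sentences that tell the reader what the post is about.",
        "https://example.com/first-post",
    )],
)
print(feed_xml)
```

A subscriber's feed reader fetches this document periodically and shows any item it has not seen before, which is all the “messaging” amounts to.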

Unfortunately RSS has evolved away from a simple and efficient messaging system to a content delivery system. Instead of providing a short and sweet title, description and link of a document, many web publishers are providing their whole document in the RSS feed. This often includes images, HTML markup and a whole lot of superfluous content. Many RSS feeds are now very heavy, and as a result, this slows down the free flow of information to subscribers. If the original document changes there is also no way to go back to all RSS subscribers and change their content.

Long ago a feed would include a title, short description and a link to the document. To me a short description is at most a couple of sentences. Today’s feeds can have text descriptions of 500 characters, far longer than a 140-character tweet. Can we have some restraint here, people? Write more efficiently. Some webmasters copy their content:encoded field (see below) into the description field, so now you not only get tons of HTML markup but also images and unnecessary weight. Remember, people, the description should be long enough for your subscriber to decide whether your article is worth reading, and then follow the LINK to your document. It’s not there to deliver your 40k document. Short and sweet is best.

The content:encoded field is most contentious and makes me the most upset. Firstly, it is not part of ANY RSS standard that I can find, including the proposed RSS 3.0. Who created this is unknown, but it seems to have spread faster than the swine flu H1N1. I would estimate that about 50% of all blog RSS feeds now have the field item.content:encoded. In fact, even my own blog, the one you are reading, has this field in the RSS feed. Did I add this field to my feed just to provoke myself? Why no, all WordPress blogs do RSS this way. No wonder this field is spreading like wildfire. It is hidden underneath the beautiful exterior of the WordPress facade.

The purpose of item.content:encoded is to slam the whole document, HTML markup, images, attachments and all, into the RSS feed. No wonder that the RSS feeds are big, fat and of course slow to read. Even worse than putting your whole document into the content:encoded field is copying all of it, verbatim, into the item.description field as well. Why provide your complete document twice? Why do you feel the need to provide your complete document even once? When someone like myself reads your RSS feed, parses your description, decides a 50k description is too long and therefore truncates it to the first 200 characters, all that is left is useless HTML markup, resulting in no description being sent to your subscriber. Thankfully WordPress truncates by itself, providing only the first 100 characters of text, sans images, in item.description.

To those that like to aggregate many web sites using RSS I have some pointers:

  • You need to examine each new feed to determine if the description field is too long, if content:encoded is present, and if there are any surprises in the RSS feed. Expect surprises, as in the World Bank blogs (see below).
  • Truncate item.description at 200 characters. For the feeds that simply copy their content:encoded field to description you will get mostly HTML markup, which means no description. So be it.
  • Delete the content:encoded field, or rename it to something non-standard so that the feed reader drops it. The feed will lose a LOT of weight and processing will be much faster.
  • Delete any other non-standard field, including all those that provide the whole document. Delete any HTML markup, as it is useless in RSS.
  • Hope that a new RSS standard appears in the future.
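The pointers above could be sketched roughly as follows in Python, using only the standard library. This is one possible approach, not a finished tool: the names `slim_feed` and `plain_text` are invented here, the tag stripping is a crude regular expression rather than a real HTML parser, and the sketch strips markup before truncating (an assumption on my part, so that a markup-heavy description still yields some text). The namespace URI is the one commonly declared for the `content:` prefix.

```python
import html
import re
import xml.etree.ElementTree as ET

# Namespace commonly bound to the "content:" prefix in feeds.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
MAX_DESCRIPTION = 200  # character limit suggested in the list above


def plain_text(markup):
    """Crudely strip tags and collapse whitespace (hypothetical helper)."""
    without_tags = re.sub(r"<[^>]+>", " ", markup or "")
    return " ".join(html.unescape(without_tags).split())


def slim_feed(feed_xml):
    """Drop content:encoded and shorten each item description."""
    root = ET.fromstring(feed_xml)
    for item in list(root.iter("item")):
        # Remove every content:encoded element from the item.
        for encoded in item.findall(f"{{{CONTENT_NS}}}encoded"):
            item.remove(encoded)
        # Reduce the description to plain text, then cut it to size.
        description = item.find("description")
        if description is not None:
            description.text = plain_text(description.text)[:MAX_DESCRIPTION]
    return ET.tostring(root, encoding="unicode")
```

Calling `slim_feed` on the text of a downloaded feed returns a lighter XML string. Cutting at exactly 200 characters can split a word; a real aggregator might prefer to cut at the last space before the limit.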

What was designed as a highly efficient internet document messaging system has now bogged down. We need not see your whole document in the RSS feed, as the reader can follow your provided link and get to your original document. Put the “S” of simple back into “Really Simple Syndication”, and the world will be a better place.

Apart from certain feeds not providing a published date, the most “infamous” RSS feed has got to be blogs from the World Bank. The field description has the whole document in HTML markup, including photos and attachments. The World Bank introduces three new fields: itunes:keyword (tags), itunes:subtitle (a 200 character summary of the article), itunes:summary (the whole article in text without HTML markup or images). Why provide new and non-standard RSS feed items for only this new technology? Why is there a need to provide content twice and not put the summary in the standard RSS field? A note to anyone that mashes the World Bank blogs RSS: delete the itunes:summary field, copy the itunes:subtitle field to item.description (then delete the other itunes fields) and lighten the RSS feed by at least half.
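For anyone who wants to try the note above, here is a rough Python sketch of that remapping. The function name `lighten_itunes_feed` is made up, and the namespace URI is the one normally declared for the `itunes:` prefix; I am assuming the feed declares it that way, so check the feed's own `xmlns:itunes` attribute first.

```python
import xml.etree.ElementTree as ET

# Namespace normally bound to the "itunes:" prefix (assumed for this feed).
ITUNES_NS = "http://www.itunes.com/dtds/podcast-1.0.dtd"


def lighten_itunes_feed(feed_xml):
    """Copy itunes:subtitle into description, then drop all itunes:* fields."""
    root = ET.fromstring(feed_xml)
    prefix = f"{{{ITUNES_NS}}}"
    for item in list(root.iter("item")):
        subtitle = item.find(f"{prefix}subtitle")
        if subtitle is not None and subtitle.text:
            description = item.find("description")
            if description is None:
                description = ET.SubElement(item, "description")
            # The short summary replaces the full-document description.
            description.text = subtitle.text.strip()
        # Remove itunes:summary, itunes:subtitle and any other itunes field.
        for child in list(item):
            if child.tag.startswith(prefix):
                item.remove(child)
    return ET.tostring(root, encoding="unicode")
```

Items without an itunes:subtitle keep their original description, so they may still need the truncation treatment described earlier.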

RSS remains a very useful internet document messaging system that allows for the free flow of documents throughout the internet. Please let us not abuse the system.

2 thoughts on “The Complexities of Really Simple Syndication”

  1. Aaron Swartz

    In fairness, RSS 3.0 was mostly a joke. I was making fun of the fact that Dave Winer renamed RSS 0.92 as RSS 2.0 so that people would use it instead of RSS 1.0. I do have to take the blame for content:encoded, though.

  2. David Ing

    @dontai While I appreciate your plea for shorter content in an RSS feed, I prefer to read my blogs in a feed reader rather than in a browser. Thus, I find truncated entries a chore … and have to keep them in a separate category in my feed reader.
