Increasing Efficiency of Drupal 6’s Feed Aggregator

Drupal 6's Feed Aggregator works well

I really love Drupal 6’s feed aggregator. It can aggregate numerous RSS (Really Simple Syndication) news or blog feeds, categorize them, and keep them current, all to save you the trouble of going to each news source and hunting for relevant news. All you have to do is choose a category, read the summary, and click the link for the article you want.

What the feed aggregator will not do is filter your feeds by keyword. Also, when RSS feeds are displayed, they can carry either short descriptions or full article descriptions, and the full article descriptions clutter the screen. All I wanted was a link and a short description, no more. When you are looking at hundreds of articles, less is indeed more.

My solution was to use Yahoo Pipes. My current Pipes aggregate 40 blogs (mostly from China), 40 news feeds (worldwide), and about 100 other blog sites fed from yet another Drupal site (mostly from Asia). Yes, you can daisy-chain Drupal news aggregators! With Yahoo Pipes I was able to filter by keyword and go down to the RSS field level to truncate full article descriptions. I was also able to eliminate duplicate entries and sort by article date.
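Yahoo Pipes does all of this through its visual editor, so there is no code to show, but the four operations it performs here can be sketched in a few lines of Python. This is a minimal illustration, not anything Pipes actually runs; the entry dicts with `title`, `link`, `description`, and `date` keys are a hypothetical stand-in for parsed RSS items.

```python
from datetime import datetime

def clean_feed(entries, keywords, max_desc=300):
    """Filter, truncate, de-duplicate, and date-sort RSS entries.

    `entries` is a list of dicts with `title`, `link`,
    `description`, and `date` (datetime) keys -- a hypothetical
    structure standing in for parsed RSS items.
    """
    seen = set()
    cleaned = []
    for e in entries:
        text = (e["title"] + " " + e["description"]).lower()
        # Keep only entries mentioning at least one keyword.
        if not any(k.lower() in text for k in keywords):
            continue
        # Eliminate duplicates, keyed on the article link.
        if e["link"] in seen:
            continue
        seen.add(e["link"])
        # Truncate full-article descriptions to a short summary.
        e = dict(e, description=e["description"][:max_desc])
        cleaned.append(e)
    # Sort by article date, newest first.
    cleaned.sort(key=lambda e: e["date"], reverse=True)
    return cleaned
```

In Pipes these correspond to the Filter, Regex (or Truncate), Unique, and Sort modules wired together in the editor.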

The difficulty was that my cron jobs were not completing and therefore not updating my feeds. I would get a "MySQL server has gone away" error, which meant that my Yahoo Pipes were taking too long to complete. My host, Site5, said that MySQL has a timeout of 15 seconds. One timeout kills the cron job, leaving the remaining news feeds not updated. The cron jobs were returning 1.2MB to 2.5MB error logs of not very helpful information. Web searches yielded very little on the MySQL error message, on getting my cron jobs to complete, or on increasing the efficiency of Yahoo Pipes.

My original cron strategy was that I only needed to update my feeds every 3 hours, so I scheduled cron to run on that interval. The result: all the feeds come due for updating at once, the cron job gets overwhelmed and dies, and most of the feeds never update.

An alternative cron strategy proved to be the solution. I scheduled cron to run hourly, even though my feeds needed updating only every 3 hours. A staggered pattern organically emerged. Feeds that completed easily were done quickly and did not need updating for the next two cron runs. Feeds that took longer initially timed out, but on the next run, with the easy feeds already current, there were far fewer feeds to fetch, so cron could concentrate on the stubborn few. In the end they were also completed.
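The hourly schedule described above might look like this as a crontab entry. The URL is a placeholder; Drupal 6's cron is typically triggered by fetching the site's cron.php.

```
# Run Drupal cron hourly, on the hour. Feeds refreshed on the
# previous pass are skipped, so each run retries only the slow ones.
0 * * * * wget -O - -q -t 1 http://example.com/cron.php
```

Because Drupal tracks when each feed was last refreshed, the hourly run costs little when everything is current; the staggering comes for free.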

I used some other Yahoo Pipes strategies. I broke one larger Yahoo Pipe into two, though I may try merging them back together later. These larger Pipes should be run hourly, even though they may only change every 3 hours: run hourly, they return fewer updates, finish more quickly, and are therefore less likely to time out.

Take a look at my feed aggregator at DonTai.com and my Yahoo Pipes if you’re interested.

I tried other alternative RSS solutions, but they did not fulfill my requirements. Google Reader will aggregate RSS feeds but provides no filtering capability. Feedrinse will aggregate and filter RSS feeds, but there is no way to truncate full-length RSS descriptions. I could use a feed reader, but then I would not be able to share my aggregated feeds with multiple people on the internet. A feed reader may be better for those who WANT the full article anyway.
