I've now spent some time playing with Yahoo Pipes. I now have a couple of great Pipes that aggregate and filter worldwide sources for news on China and Japan, and I have also found a huge number of blogs about China. They are all aggregated on my website DonTai.com. Putting all these worldwide sources, filtered by keyword, onto one simple web page, each item with a linked title and a short description, is great. It saves so much time: you read the title and the short description and decide whether to delve in deeper or to pass. It is brilliant.
It was with some loss of productivity yesterday that I found out that if a Yahoo Pipes construct will not run, for whatever reason, you cannot save the pipe. This is not intuitive. In any programming language, word processor or other user-initiated computer work, one can save interim work, correct or not. Imagine a word processor that, while you work on an essay, counts your spelling errors (or what it feels are spelling errors), decides there are too many, and thus denies you the ability to save your interim work. “You must fix these errors right now,” it would scream. I felt like whupping Pipes in the butt yesterday, but unfortunately one hole looks much like the other. Indecision ensued.
It is only in juggling that interim work cannot be saved. You either finish gracefully, or you, eh, drop the ball, so to speak. Maybe the Pipes developers also juggle.
Not only can you not save your work if a construct does not run, but the error message is quite misleading: “Oops: System error. Problem parsing response.” Talk about your stereotypically unhelpful error message. I thought we IT people had evolved much further than this. An error message should be displayed in response to a user action, but the disconnect here is that my action was saving my Pipe, and the error is not directly connected to that act. Is it too much to ask that the error message indicate the origin of the error? How about displaying it when I run the Pipe and it fails? And even if something is not correct, I'll fix it later. Just let me save my work. But I digress.
As I continued to work on my Pipes I kept running them and looking at the Really Simple Syndication (RSS) output feeds. For some reason, some news articles in my feeds were coming out not with summaries but with the complete article, some multiple pages long. This is not only counterproductive but annoying. I researched RSS 2.0 feeds and found that the article summary should be in a field called item.description, so I put in a command that truncates item.description to no more than 400 characters. This did not work, and full-length articles still appeared.
After a couple more hours trying to crack this nut, I tried to find a tool on the internet that could display an RSS feed record by record, field by field. I could not. It turns out that Pipes has such a display at the bottom of the screen: if you click on an individual news item, all of its field names and contents are shown. This function is as useful as it is impressive.
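For the record, the same field-by-field dump can be had outside Pipes. Here is a minimal sketch using Python's feedparser library; the feed URL is a placeholder, not one of my actual sources.

```python
# A minimal sketch of field-by-field feed inspection, using Python's
# feedparser library. The URL is a placeholder, not one of my sources.
import feedparser

FEED_URL = "http://example.com/feed"  # hypothetical feed

feed = feedparser.parse(FEED_URL)
for i, entry in enumerate(feed.entries):
    print(f"--- item {i} ---")
    # Each entry behaves like a dict, so we can dump every field name
    # and a shortened version of its value, much like the Pipes pane.
    for name, value in entry.items():
        print(f"{name}: {str(value)[:80]}")
```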
To my chagrin, I found out that not all RSS 2.0 feeds are created equal; feeds do not always adhere to the supposed standard. Looking feed by feed, I found that some feeds put their summary into item.description, while others use the field item.content:encoded. Those feeds that use item.description, as the RSS standard prescribes, have no item.content:encoded field; conversely, those that use item.content:encoded have no field called item.description. I was following the standard and truncating item.description for all feeds. If a feed had no such field, a Pipes error occurred, which in turn kept me from saving my interim work. To complicate matters, some feeds had both fields but only used one.
I therefore needed two different rules for the summary: for feeds that use item.description, limit that field to 400 characters; similarly, for feeds that use item.content:encoded, limit that field. Finally, both Pipes and I were satisfied, and I was able to save my work.
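For those working outside Pipes, the two-rule fix looks roughly like this in Python. This is only a sketch, assuming feedparser's conventions, where item.description surfaces as entry.summary and item.content:encoded as entry.content; inside Pipes the fix was two separate truncation rules, not this code.

```python
# A rough Python equivalent of the two rules, assuming feedparser's
# mapping: item.description -> entry.summary, and
# item.content:encoded -> entry.content[0].value.
import feedparser

MAX_LEN = 400

def summary_for(entry):
    # Rule 1: feeds written to the RSS 2.0 standard carry the summary
    # in item.description. Checking for a non-empty value also covers
    # feeds that declare both fields but only populate one.
    if entry.get("summary"):
        return entry.summary[:MAX_LEN]
    # Rule 2: feeds that put the article in item.content:encoded instead.
    if entry.get("content"):
        return entry.content[0].value[:MAX_LEN]
    return ""  # neither field present: nothing to show

feed = feedparser.parse("http://example.com/feed")  # placeholder URL
for entry in feed.entries:
    print(entry.title, "-", summary_for(entry))
```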
While I've progressed greatly, there are still issues I am working on. My DonTai.com cron job seems to fail when pulling in news feeds from my China Japan News Yahoo Pipes; it seems I need to try a couple of times before the cron job can suck in all the news. It is possible that this happens because I am playing with the feeds while also sucking in the fully aggregated data; once the initial pull is done there should be less residual load. There is also the question of how frequently each news feed aggregates and how this synchronizes with the cron jobs: a news feed that is not yet scheduled to run will not be picked up by a cron job, and a news feed that is scheduled to run is then dependent on the next cron job.
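In the meantime my workaround is simply to retry. Here is a hedged sketch of that retry loop, with a placeholder URL and made-up attempt counts rather than my actual cron setup:

```python
# A sketch of retrying a flaky feed pull. The URL, attempt count and
# wait time are placeholders, not my actual cron configuration.
import time
import urllib.request

PIPE_URL = "http://example.com/pipe.rss"  # placeholder for the Pipes feed URL

def fetch_feed(url, attempts=3, wait=60):
    for _ in range(attempts):
        try:
            data = urllib.request.urlopen(url, timeout=60).read()
            if data:  # an empty body suggests the pipe timed out mid-run
                return data
        except OSError:
            pass  # network hiccup or Pipes timeout; wait and retry
        time.sleep(wait)
    return None  # give up; the next cron run will try again
```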
I question the robustness of Pipes. Will it finish a query and run to completion, or will it time out and leave me hanging? Should I cut up a large Pipes query into smaller ones? Maybe I need to research how to get Pipes to run as efficiently as possible. On my Pipes diagrams I've already eliminated any redundancy I could find. I'll let it run for a while and see.
I'm not surprised that you're having problems with RSS. Way back when I hadn't been blogging for long (Blogspot problems, back in March 2006!), I discovered that RSS isn't the best-specified protocol.
From listening to podcasts, my understanding is that Atom is a well-designed standard, much better than RSS 2.0. That being said, an Atom feed isn't always available, and in my experience there are sometimes issues with getting an Atom feed recognized by offline feed readers.
I don't know Yahoo Pipes well, but I see that Yahoo says Pipes offers output in RSS 2.0, RSS 1.0 (RDF), JSON and Atom formats for maximum flexibility. I wasn't sure which you chose to aggregate into Google, or whether it makes a difference, which is exactly the sort of detail I've been ducking by not engaging with the mashup tools.
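For what it's worth, on the consuming side the choice may matter less than it seems, since a parser such as Python's feedparser normalizes RSS 1.0, RSS 2.0 and Atom into one structure. A small sketch, with hypothetical URLs:

```python
# feedparser normalizes RSS 1.0/2.0 and Atom into one structure, so
# the same code reads whichever format a pipe emits. URLs are hypothetical.
import feedparser

for url in ("http://example.com/feed.rss",
            "http://example.com/feed.atom"):
    feed = feedparser.parse(url)
    print(feed.version)  # e.g. 'rss20' or 'atom10'
    for entry in feed.entries[:3]:
        print(entry.title, entry.link)
```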