...making Linux just a little more fun!
Last month I wrote about screen scraping with Perl, but all of my examples used RSS. Linux Gazette also offers an Atom feed, as well as a JavaScript file that can be used to include each month's headlines, so I thought it would be nice to write about generating these formats.
I described Atom before, in my article about feed readers. I haven't received any complaints about the description I gave, so I assume it's OK, but I'm not going to repeat myself; I'll just say that Atom is the main competition of RSS and is gaining popularity.
Atom, in my opinion, is a more difficult format to work with than RSS, and this is reflected in Perl's modules: RSS can be generated from a single module, but even the simplest tasks in Atom require the use of two modules (three if you need to use links!). To be fair, this is Yet Another Example of TMTOWTDI -- Atom could easily be generated using an interface like that of XML::RSS, and the Atom interface is as easy to use, if a bit more verbose.
I'm using my User Friendly scraper as my example, because it's the simplest scraper I have.
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
use XML::Atom::Feed;
use XML::Atom::Entry;
use XML::Atom::Link;
use Date::Format;

# These regexes taken from Dailystrips.
# Single-quoted (q{}) so that the \. escapes reach the regex engine intact.
my $patternpre = q{<img.+?src="(http://www\.userfriendly\.org/cartoons/archives/%y.+?/uf.+?\.gif)"};
my $urlpre = "http://ars.userfriendly.org/cartoons/?id=%Y%m%d&mode=classic";

# Fill in today's date in both the pattern and the URL
my $pattern = time2str($patternpre, time);
my $url = time2str($urlpre, time);

my $page = get($url);
die "Couldn't fetch $url\n" unless defined $page;

my $atom = XML::Atom::Feed->new;
my $entry = XML::Atom::Entry->new;
$atom->title('User Friendly');

# Feed-level link back to the site
my $link = XML::Atom::Link->new;
$link->type('text/html');
$link->rel('alternate');
$link->href('http://userfriendly.org/');
$atom->add_link($link);

# Only add an entry if today's cartoon is on the page
if ($page =~ /$pattern/i) {
    $entry->title(time2str("CARTOON FOR %a %b, %Y", time));
    my $itemlink = XML::Atom::Link->new;
    $itemlink->type('text/html');
    $itemlink->rel('alternate');
    $itemlink->href($url);
    $entry->add_link($itemlink);
    $atom->add_entry($entry);
}

print $atom->as_xml;
OK, so the $link->type and ->rel calls are probably not necessary, but the script is still a bit longer than its RSS-generating equivalent. Happily, both Atom- and RSS-generating code can live in the same script.
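As a rough illustration of that last point, here is a minimal sketch that fills an RSS feed and an Atom feed from one list of items. The @items data, the example.org URLs, the feed title, and the html_link helper are all made up for the example; only the XML::RSS and XML::Atom calls belong to the modules themselves.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::RSS;
use XML::Atom::Feed;
use XML::Atom::Entry;
use XML::Atom::Link;

# Hypothetical data: a real scraper would pull these from the page.
my @items = (
    { title => 'First item',  url => 'http://example.org/1' },
    { title => 'Second item', url => 'http://example.org/2' },
);

# Helper of our own (not part of XML::Atom): build an "alternate" HTML link.
sub html_link {
    my ($href) = @_;
    my $link = XML::Atom::Link->new;
    $link->type('text/html');
    $link->rel('alternate');
    $link->href($href);
    return $link;
}

# Set up both feeds with the same channel-level information.
my $rss = XML::RSS->new(version => '1.0');
$rss->channel(
    title       => 'Example feed',
    link        => 'http://example.org/',
    description => 'Items scraped from example.org',
);

my $atom = XML::Atom::Feed->new;
$atom->title('Example feed');
$atom->add_link(html_link('http://example.org/'));

# One loop fills both feeds.
for my $item (@items) {
    $rss->add_item(title => $item->{title}, link => $item->{url});

    my $entry = XML::Atom::Entry->new;
    $entry->title($item->{title});
    $entry->add_link(html_link($item->{url}));
    $atom->add_entry($entry);
}

my $rss_xml  = $rss->as_string;
my $atom_xml = $atom->as_xml;

# Printed one after the other here only for demonstration.
print $rss_xml, "\n", $atom_xml, "\n";
```

In practice you would probably write each string to its own file rather than print them together, but the shape of the loop stays the same.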
There are also modules available that let you generate JavaScript from feeds. XML::RSS::JavaScript is a subclass of XML::RSS, so the JavaScript generation step happens at the same time as RSS generation: you simply change modules and add a call to $rss->save_javascript(file) or $rss->as_javascript.
#!/usr/bin/perl -w
use strict;
use XML::RSS::JavaScript;
use LWP::Simple;
use HTML::Entities;
use HTML::TokeParser::Simple;

my $rss = XML::RSS::JavaScript->new;
my $url = "http://www.linux.org.uk/~telsa/Diary/diary.html";
my $page = get($url);
my $stream = HTML::TokeParser::Simple->new(\$page);
my $tag;

$rss->channel(title => "The more accurate diary. Really.",
              link => $url,
              description => "Telsa's diary of life with a hacker:"
                             . " the current ramblings");

while ($tag = $stream->get_tag('a')) {
    next unless $tag->return_attr("name") ne "";
    my $link = $tag->return_attr("name");
    $tag = $stream->get_tag('strong');
    $tag = $stream->get_token;
    my $title = $tag->as_is;
    $tag = $stream->get_tag('dd');
    my $content = "";
    $tag = $stream->get_token;
    until ($tag->is_end_tag('/dd')) {
        $content .= $tag->as_is;
        $tag = $stream->get_token;
        next;
    }
    $rss->add_item(title => $title,
                   link => "$url#$link",
                   description => encode_entities($content));
}

print $rss->as_javascript;

# We can also use $rss->save('file.xml')
# as well as $rss->save_javascript('file.js')
# to have this script write files.
If you want to generate JavaScript from an existing RSS feed, it's simply done. This script gives me a JavaScript version of my del.icio.us feed:
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use XML::RSS::JavaScript;

my $feed = get('http://del.icio.us/rss/jimregan');
my $rss = XML::RSS::JavaScript->new;
$rss->parse($feed);
print $rss->as_javascript;
XML::Atom::Feed::JavaScript works like the last example: it converts an existing Atom feed to JavaScript. This isn't a problem; it can simply be called at the end of the Atom generation phase.
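A minimal sketch of calling it at the end of generation might look like the following. The feed title, entry text, and example.org URL are invented for illustration, and I am assuming the module is content with a title, a link, and some content on each entry; treat it as a starting point rather than a tested recipe.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Atom::Feed;
use XML::Atom::Entry;
use XML::Atom::Link;
use XML::Atom::Feed::JavaScript;   # provides asJavascript for feed objects

# Build a small feed by hand (hypothetical data).
my $feed = XML::Atom::Feed->new;
$feed->title('Example feed');

my $entry = XML::Atom::Entry->new;
$entry->title('First item');
$entry->content('A short description of the first item.');

my $link = XML::Atom::Link->new;
$link->type('text/html');
$link->rel('alternate');
$link->href('http://example.org/1');
$entry->add_link($link);

$feed->add_entry($entry);

# Generation is finished: the same object can now be rendered as JavaScript.
my $js = $feed->asJavascript();
print $js;
```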
This script converts my blog's Atom feed to JavaScript:
#!/usr/bin/perl
use strict;
use warnings;
use XML::Atom::Client;
use XML::Atom::Feed::JavaScript;

my $client = XML::Atom::Client->new();
my $feed = $client->getFeed('http://xpko.blogspot.com/atom.xml');
print $feed->asJavascript();
I did run into a slight problem using version 0.4 of the module: the item links from my blog weren't being fully converted -- I was getting hashrefs to XML::Atom::Link objects instead of the links themselves. With a simple patch I was soon getting the correct output. (David Jacobs, the module's author, managed to beat publication time with the release of 0.5. Thanks, David.)
Next month, it's back to the task of scraping: Ben and I will be taking a look at WWW::Mechanize. (I can say this for certain, because the bulk of the article has already been written.)
Until then, take care!
Jimmy has been using computers from the tender age of seven, when his father
inherited an Amstrad PCW8256. After a few brief flirtations with an Atari ST
and numerous versions of DOS and Windows, Jimmy was introduced to Linux in 1998
and hasn't looked back.
Jimmy is a father of one, a techno-savvy seven-year-old called Mark.
When he isn't cutting pieces off himself at work, he likes to play guitar and
read -- not normally at the same time, but the picks make handy bookmarks.
Jimmy likes to collect the off-topic threads that crop up on The Answer
Gang's mailing list, mainly because it's a good excuse for keeping these
threads going.
Jimmy is currently job hunting.