WWW::Harvester - Module for harvesting web pages.
    use WWW::Harvester;

    my $h = WWW::Harvester->new(
        Pages      => [ 'http://www.site.com/', 'http://www.other-site.com/' ],
        Email      => 'your@address.org',
        FetchDelay => 10,
        DepthLimit => 5,
    );

    # Handler invoked for each page that comes back with status 200.
    my $cr = sub {
        my ($response) = @_;
        print $response->status_code, "\n";
    };

    $h->handler('200', $cr, 'response');
    $h->harvest();
WWW::Harvester is a module that, at its core, simply spiders the web. It provides a mechanism for supplying custom page handlers, one for each HTTP 1.1 status code. If no page handlers are supplied, the harvester performs dead-link detection and writes a report to standard output when it finishes.
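To make the default dead-link behaviour concrete, here is a minimal sketch of how such a report could be assembled from recorded status codes. It does not use WWW::Harvester itself: the %status_for table, the dead_links helper, and the rule that any 4xx or 5xx status counts as dead are all assumptions made for illustration, and the module's real report format may differ.

```perl
use strict;
use warnings;

# Hypothetical crawl results: URL => HTTP status code.
my %status_for = (
    'http://www.site.com/'        => 200,
    'http://www.site.com/missing' => 404,
    'http://www.site.com/gone'    => 410,
    'http://www.site.com/error'   => 500,
);

# Assumption of this sketch: any 4xx or 5xx status marks a dead link.
sub dead_links {
    my (%status) = @_;
    my @dead = grep { $status{$_} >= 400 } keys %status;
    return sort @dead;
}

# Write a simple report to standard output once all pages are known.
my @dead = dead_links(%status_for);
print "Dead links found: ", scalar(@dead), "\n";
print "  $status_for{$_}  $_\n" for @dead;
```

Collecting the results first and reporting at the end mirrors the description above, where the report is written only once the harvest has finished.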
The harvester performs a breadth-first traversal, and the depth to which it traverses can be limited.
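The traversal order described above can be sketched with a plain queue. This is an illustration of depth-limited breadth-first search in general, not the module's own code: the in-memory %links_on graph stands in for fetched pages, and the breadth_first function and its arguments are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical link graph: page => links found on that page.
my %links_on = (
    'A' => [ 'B', 'C' ],
    'B' => [ 'D' ],
    'C' => [ 'D', 'A' ],
    'D' => [ 'E' ],
    'E' => [],
);

# Visit pages breadth first, never following links from a page that is
# already $depth_limit hops away from the start page.
sub breadth_first {
    my ($graph, $start, $depth_limit) = @_;
    my %seen  = ($start => 1);
    my @queue = ([ $start, 0 ]);          # FIFO queue of [page, depth] pairs
    my @order;
    while (my $item = shift @queue) {
        my ($page, $depth) = @$item;
        push @order, $page;
        next if $depth >= $depth_limit;   # do not expand pages at the limit
        for my $next (@{ $graph->{$page} || [] }) {
            next if $seen{$next}++;       # skip pages already queued or visited
            push @queue, [ $next, $depth + 1 ];
        }
    }
    return @order;
}

my @visited = breadth_first(\%links_on, 'A', 2);
print join(' ', @visited), "\n";   # prints "A B C D"
```

With a limit of 2, page E is never reached because it is three hops from the start. Pages nearer the start are always processed before deeper ones, which is what distinguishes this from a depth-first crawl.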
    $h = WWW::Harvester->new(%options);
    $h->handler(code, \&subroutine, argspec);
    $h->handler(code, method_name, argspec);
Code is an HTTP 1.1 status code.
Subroutine is a reference to a subroutine which is called to process the page.
Method_name is the name of a method of $h which is called to process the page.
Argspec is a string containing a comma-separated list that describes the information reported by the event. The following argspec identifier names can be used:
    $h->harvest( uri );
    $h->headRequest( $uri );
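The relationship between a status code, its handler, and the argspec can be illustrated with a small dispatch table. This is only a sketch of the general idea and makes no claim about how WWW::Harvester works internally: %handler_for, register_handler, and dispatch are hypothetical names, and the argspec identifiers 'uri' and 'code' are invented for the example (the only identifier shown in this document is 'response').

```perl
use strict;
use warnings;

# Hypothetical handler table: status code => { callback, args }.
my %handler_for;

# Register a callback for a status code together with a comma-separated
# argspec naming the event fields the callback wants, in order.
sub register_handler {
    my ($code, $callback, $argspec) = @_;
    $handler_for{$code} = {
        callback => $callback,
        args     => [ split /\s*,\s*/, $argspec ],
    };
}

# Look up the handler for the event's status code and call it with the
# fields named in its argspec. Does nothing if no handler is registered.
sub dispatch {
    my (%event) = @_;
    my $entry = $handler_for{ $event{code} } or return;
    return $entry->{callback}->( map { $event{$_} } @{ $entry->{args} } );
}

my @log;
register_handler('200', sub { my ($uri, $code) = @_; push @log, "$code $uri" }, 'uri, code');
register_handler('404', sub { my ($uri) = @_; push @log, "dead: $uri" }, 'uri');

dispatch(code => 200, uri => 'http://www.site.com/');
dispatch(code => 404, uri => 'http://www.site.com/missing');
dispatch(code => 500, uri => 'http://www.site.com/error');   # no handler registered

print "$_\n" for @log;
```

Letting each handler declare the fields it needs keeps callbacks short, since a handler that only cares about the URI never has to unpack a full event record.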
LWP::UserAgent, http://www.perldoc.com/perl5.6.1/lib/LWP/UserAgent.html
HTML::Parser, http://www.perldoc.com/perl5.6.1/lib/HTML/Parser.html
WWW::RobotRules, http://www.perldoc.com/perl5.6.1/lib/WWW/RobotRules.html
RFC 2616 (Hypertext Transfer Protocol -- HTTP/1.1), http://www.ietf.org/rfc/rfc2616.txt
Devon Smith <smithde@oclc.org>
Copyright 2001 OCLC. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.