NAME

WWW::Harvester - Module for harvesting web pages.


SYNOPSIS

  use WWW::Harvester;
  my $h = WWW::Harvester->new( Pages      => [ 'http://www.site.com/', 'http://www.other-site.com/' ],
                               Email      => 'your@address.org',
                               FetchDelay => 10,
                               DepthLimit => 5
                             );
  my $cr = sub {
    my ($response) = @_;
    print $response->code, "\n";
  };
  $h->handler('200', $cr, 'response');
  $h->harvest();


DESCRIPTION

WWW::Harvester is a module that, at its core, simply spiders the web. It provides a mechanism for supplying custom page handlers, one for each HTTP 1.1 status code. If no page handlers are supplied, the harvester performs dead-link detection and writes a report to standard output when it has finished.

The harvester performs a breadth-first traversal, and the depth to which it traverses can be limited.
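The traversal order can be illustrated with a small, self-contained sketch. It is not the module's own code: the in-memory %links graph and the crawl_order routine are hypothetical, and the sketch assumes that a limit of 1 means "seed pages only" and a limit of 0 means "no limit".

```perl
use strict;
use warnings;

# Hypothetical in-memory link graph standing in for fetched pages.
my %links = (
    'a' => [ 'b', 'c' ],
    'b' => [ 'd' ],
    'c' => [ 'd', 'a' ],
    'd' => [ 'e' ],
    'e' => [],
);

# Breadth-first traversal starting from @$seeds.
# Assumption: $limit == 0 means no limit, $limit == 1 visits only the seeds.
sub crawl_order {
    my ($graph, $seeds, $limit) = @_;
    my %seen  = map { $_ => 1 } @$seeds;
    my @queue = map { [ $_, 1 ] } @$seeds;     # [ page, depth ]
    my @order;
    while (my $item = shift @queue) {
        my ($page, $depth) = @$item;
        push @order, $page;
        next if $limit && $depth >= $limit;    # do not follow links past the limit
        for my $next (@{ $graph->{$page} || [] }) {
            next if $seen{$next}++;            # skip pages already queued
            push @queue, [ $next, $depth + 1 ];
        }
    }
    return @order;
}

print join(' ', crawl_order(\%links, ['a'], 2)), "\n";   # prints "a b c"
```

Because the queue is processed first-in, first-out, every page at one depth is visited before any page at the next depth.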


METHODS

$h = WWW::Harvester->new(%options);
AgentName
The agent name to use in HTTP request headers. Defaults to ``WWW::Harvester/$VERSION''.

DepthLimit
Limits how far down the link tree the harvester goes. When set to 1, the harvester fetches only the pages in the supplied list. The default is 0, which means no limit.

Email
Email address of the contact person for this harvester.

FetchDelay
How long to pause between requests, in seconds, measured from the time each request is sent. The default is 30.

Handlers
A hash reference mapping HTTP status codes to code reference/argument pairs, that is, { '200' => [ <code ref>, ``args'' ], '404' => [ <code ref>, ``args'' ], ... }. See handler for information about the format of ``args''.

UriTest
A reference to a routine that determines whether a specific URI should be followed. The routine is passed the Harvester object and the URI to be tested. If the URI fails the test, the routine should return undef. If the test succeeds, the routine should return a URI, which need not be the URI passed in. The default routine allows URIs matching /\.s?html?$/i and any other URI whose content type is 'text/html', as determined by a HEAD request.

Pages
A reference to an array of initial URIs.
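The options above can be combined as in the following sketch. The handler bodies, the $same_host routine and the choice of host are hypothetical, and the sketch assumes that the URI passed to UriTest stringifies to a plain URL. The constructor is called only if WWW::Harvester is installed, so the block runs either way.

```perl
use strict;
use warnings;

# Hypothetical handlers; each receives the response object ('response' argspec).
my $ok        = sub { my ($response) = @_; print "OK: ",      $response->status_line, "\n" };
my $not_found = sub { my ($response) = @_; print "Missing: ", $response->status_line, "\n" };

# Hypothetical UriTest: stay on one host and drop any fragment.
# Assumption: the URI argument stringifies to a plain URL.
my $same_host = sub {
    my ($harvester, $uri) = @_;
    return undef unless "$uri" =~ m{^https?://www\.site\.com(?:[:/]|\z)}i;
    (my $clean = "$uri") =~ s/#.*\z//s;
    return $clean;                      # need not be the URI passed in
};

my %options = (
    Pages      => [ 'http://www.site.com/' ],
    Email      => 'your@address.org',
    FetchDelay => 10,
    DepthLimit => 2,
    Handlers   => {
        '200' => [ $ok,        'response' ],
        '404' => [ $not_found, 'response' ],
    },
    UriTest    => $same_host,
);

# Construct the harvester only when the module is available.
if (eval { require WWW::Harvester; 1 }) {
    my $h = WWW::Harvester->new(%options);
}
```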

$h->handler(code, \&subroutine, argspec);
$h->handler(code, method_name, argspec);
This method sets a handler for a given HTTP status code.

Code is an HTTP 1.1 status code.

Subroutine is a reference to a subroutine which is called to process the page.

Method_name is the name of a method of $h which is called to process the page.


Argspec is a string containing a comma-separated list that describes the information passed to the handler. The following argspec identifier names can be used:
self
Self causes the current object to be passed to the handler. If the handler is a method, this must be the first element in the argspec.

response
Response causes the HTTP::Response object returned from the request to be passed to the handler.
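One way to picture how an argspec string maps to handler arguments is the sketch below. It is an illustration only, not the module's implementation: build_args is a hypothetical helper, and plain strings stand in for the harvester and response objects.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an argspec string into the argument list
# for a handler. Only 'self' and 'response' are recognised.
sub build_args {
    my ($argspec, $self, $response) = @_;
    my %value = ( self => $self, response => $response );
    my @args;
    for my $name (split /\s*,\s*/, $argspec) {
        die "Unknown argspec identifier '$name'\n" unless exists $value{$name};
        push @args, $value{$name};
    }
    return @args;
}

# Hypothetical handler written for the argspec 'self,response'.
my $handler = sub {
    my ($self, $response) = @_;
    return "$self saw $response";
};

# Plain strings stand in for the real objects here.
print $handler->( build_args('self,response', 'harvester', 'response-object') ), "\n";
# prints "harvester saw response-object"
```

With the module itself, such a handler would be registered with $h->handler('200', $handler, 'self,response').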

$h->harvest( uri );
This method starts the harvest. A URI can be passed in as an initial harvest page. This URI can be in addition to or instead of URIs passed at object initialization. This method returns nothing when finished.

$h->headRequest( $uri );
This method performs a HEAD request for the given URI and returns the HTTP::Response object generated from that request. It is provided for use in the UriTest routine.
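A UriTest routine built on headRequest might look like the sketch below. It roughly mirrors the default behaviour described under UriTest, but html_only is a hypothetical name and this is not the module's actual routine. It relies only on the is_success and content_type methods of HTTP::Response.

```perl
use strict;
use warnings;

# Hypothetical UriTest routine: accept obvious HTML file names directly,
# and fall back to a HEAD request for everything else.
sub html_only {
    my ($harvester, $uri) = @_;
    return $uri if "$uri" =~ /\.s?html?$/i;

    # headRequest returns an HTTP::Response object.
    my $response = $harvester->headRequest($uri);
    return undef unless $response && $response->is_success;
    return $response->content_type eq 'text/html' ? $uri : undef;
}

# Assumed usage, following the constructor options documented above:
#   my $h = WWW::Harvester->new( Pages => [ ... ], UriTest => \&html_only );
```

Checking the file name first avoids a HEAD request for the common case, which matters when FetchDelay is large.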


SEE ALSO

LWP::UserAgent, http://www.perldoc.com/perl5.6.1/lib/LWP/UserAgent.html

HTML::Parser, http://www.perldoc.com/perl5.6.1/lib/HTML/Parser.html

WWW::RobotRules, http://www.perldoc.com/perl5.6.1/lib/WWW/RobotRules.html

http://www.ietf.org/rfc/rfc2616.txt


AUTHOR

Devon Smith <smithde@oclc.org>


COPYRIGHT

Copyright 2001 OCLC. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.