NAME

HTML::MetaExtor - Extracts metadata from webpages.


SYNOPSIS

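  use LWP::Simple;
  use HTML::MetaExtor;

  my $ex = HTML::MetaExtor->new();
  my $content = get('http://some-webpage.com');
  die "Could not fetch page\n" unless defined $content;
  $ex->parse($content);
  $ex->eof;    # signal end of document, as with any HTML::Parser
  for my $pred ( @{ $ex->predicates } ) {
    # each predicate is [ property, object ] or [ property, object, scheme ]
    print join( "\t", @{$pred} ), "\n";
  }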


DESCRIPTION

HTML::MetaExtor is an HTML parser that extracts metadata from an HTML document. It is a subclass of HTML::Normalizer, which is itself a subclass of HTML::Parser. As with any HTML::Parser, the document is handed to the parser by calling $ex->parse() or $ex->parse_file().


METHODS

$ex = HTML::MetaExtor->new(%options);
%options can have any or all of:
keyword_file
URI of a file containing keywords to look for in the document. The file contents are split on newlines, giving one keyword per line. This implies 'normalize'.

keyword_list
A comma-separated list of keywords to look for in the document. This implies 'normalize'.

normalize
Instructs the object to normalize the text without doing keyword searching. Normalization is off by default.

keyword_property
A property name to be used for keywords found in the document. This applies to keywords found in meta elements and keywords specified in 'keyword_file' and 'keyword_list'. The default is 'keyword'.
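
As a rough illustration of how the two keyword sources described above could be turned into a single set of keywords, the following sketch splits a comma-separated string and a newline-separated block of text. It is not the module's actual implementation: the variable names, the sample data, and the decision to drop empty lines are all assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical inputs: a 'keyword_list' value and the text of a keyword file.
my $keyword_list = 'word,term,silly,example,with spaces';
my $file_text    = "alpha\nbeta\n\ngamma\n";

# Split the list on commas; split the file contents on newlines.
my @from_list = split /,/, $keyword_list;
my @from_file = grep { length } split /\n/, $file_text;    # drop empty lines (assumption)

# Collect everything as hash keys so that duplicates collapse.
my %keywords = map { $_ => 1 } @from_list, @from_file;

print "$_\n" for sort keys %keywords;
```

Note that a keyword such as 'with spaces' survives intact, because only commas and newlines act as separators.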

$ex->predicates()
Returns a reference to an array of predicates. Each predicate is an array, with one of two structures:

[ property, object ] or [ property, object, scheme ]

If 'scheme' is present, the object is expected to conform to the specified scheme.
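
A minimal sketch of walking such a structure is shown here. The data is hypothetical and hand-built to match the two shapes described above; the property names and the scheme value are invented for the example and are not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical data shaped like the return value of $ex->predicates().
my $predicates = [
    [ 'title',   'An Example Page' ],
    [ 'keyword', 'example' ],
    [ 'date',    '2001-05-14', 'ISO8601' ],    # third element is the scheme
];

my @lines;
for my $pred ( @{$predicates} ) {
    # $scheme is undef for two-element predicates.
    my ( $property, $object, $scheme ) = @{$pred};
    my $line = "$property: $object";
    $line .= " (scheme: $scheme)" if defined $scheme;
    push @lines, $line;
}

print "$_\n" for @lines;
```

Unpacking into three variables handles both shapes with one code path, since the missing third element simply comes back undefined.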

$ex->content()
If the text was normalized, returns the processed text as an array reference. Each element of the array is a line of normalized text from the page. See HTML::Normalizer for more information.

$ex->matched_keywords()
Returns a hash reference whose keys are the keywords found in the document.

$ex->all_keywords()
Returns a hash reference whose keys are the keywords specified at object initialization.
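
Because both methods return hash references keyed by keyword, the two can be compared directly. The sketch below lists the keywords that were requested but not found. The hash references are hand-built stand-ins for the return values, and the hash values (1) are placeholders, since the documentation only specifies the keys.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for $ex->all_keywords() and $ex->matched_keywords().
# Only the keys are documented; the values used here are placeholders.
my $all     = { word => 1, term => 1, silly => 1, 'with spaces' => 1 };
my $matched = { word => 1, 'with spaces' => 1 };

# Keep every requested keyword that does not appear among the matches.
my @missing = grep { !exists $matched->{$_} } keys %{$all};
@missing = sort @missing;

print "Not found: @missing\n";
```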

$ex->clear()
Clears the data structures so that another HTML document can be processed. The list of search keywords is kept, but the keywords found in the previous document are cleared.


EXAMPLES

The following creates an object that will, in addition to extracting metadata, normalize the text and look for the keywords 'word', 'term', 'silly', 'example', 'with spaces' and the terms listed in the file '/home/dir/list.txt'.

  my $ex = HTML::MetaExtor->new( keyword_file => 'file:/home/dir/list.txt',
                                 keyword_list => 'word,term,silly,example,with spaces' );

The next example creates an object that will normalize the text and look for the keywords listed in the file at 'http://homedir.com/list.txt'. The property name used in the result set for the keywords will be 'index_term'.

  my $ex = HTML::MetaExtor->new( keyword_file     => 'http://homedir.com/list.txt',
                                 keyword_property => 'index_term' );

In this example, no keyword search is performed, but the text is normalized.

  my $ex = HTML::MetaExtor->new( normalize => 1 );


ISSUES

Metadata linked to by the LINK element is (currently) ignored.


SEE ALSO

HTML::Normalizer documentation, HTML::Parser documentation


AUTHOR

Devon Smith <smithde@oclc.org>


COPYRIGHT

Copyright 2001 OCLC. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.