HTML::MetaExtor - Extracts metadata from webpages.
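use strict;
use warnings;
use LWP::Simple;
use HTML::MetaExtor;
my $ex = HTML::MetaExtor->new();
my $content = get('http://some-webpage.com');
die "Could not fetch the page\n" unless defined $content;
$ex->parse($content);
for my $predicate ( @{ $ex->predicates } ) {
    # each entry is [ property, object ] or [ property, object, scheme ]
    print join( ' | ', @$predicate ), "\n";
}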
HTML::MetaExtor is an HTML parser that extracts metadata from
an HTML document. HTML::MetaExtor is a subclass of
HTML::Normalizer, which is itself a subclass of HTML::Parser. This means
that the document should be given to the parser by calling the $ex->parse()
or $ex->parse_file() methods.
predicates()
Returns a reference to an array of predicates, each of the form [ property, object ] or [ property, object, scheme ].
If 'scheme' is present, the object is expected to conform to the specified scheme.
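To illustrate the two shapes an entry can take, here is a minimal sketch that formats both the two-element and the three-element form. The helper name format_predicate and the sample data are hypothetical and are not part of HTML::MetaExtor; in real use the list would come from $ex->predicates.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::MetaExtor): render one predicate,
# which is [ property, object ] or [ property, object, scheme ].
sub format_predicate {
    my ($pred) = @_;
    my ( $property, $object, $scheme ) = @$pred;
    my $line = "$property = $object";
    # The scheme element is optional, so only mention it when present.
    $line .= " (scheme: $scheme)" if defined $scheme;
    return $line;
}

# Made-up sample data standing in for @{ $ex->predicates }.
my @predicates = (
    [ 'DC.title', 'An Example Page' ],
    [ 'DC.date',  '2001-05-01', 'W3CDTF' ],
);

print format_predicate($_), "\n" for @predicates;
```

Checking whether the third element is defined, rather than counting elements, keeps the loop body the same for both forms.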
content()
matched_keywords()
all_keywords()
clear()
The following creates an object that will, in addition to extracting metadata, normalize the text and look for the keywords 'word', 'term', 'silly', 'example', 'with spaces' and the terms listed in the file '/home/dir/list.txt'.
my $ex = HTML::MetaExtor->new(
    keyword_file => 'file:/home/dir/list.txt',
    keyword_list => 'word,term,silly,example,with spaces',
);
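The keyword_list value above is a single comma-separated string in which a keyword may itself contain spaces. As a rough sketch of how such a string breaks down into the five keywords named above, the following splits on commas only. The helper split_keyword_list is hypothetical; this documentation does not say how the module parses the string internally, and the whitespace trimming around commas is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::MetaExtor): split a comma-separated
# keyword string, keeping spaces inside a keyword such as 'with spaces'.
sub split_keyword_list {
    my ($list) = @_;
    my @keywords;
    for my $word ( split /,/, $list ) {
        # Assumption: whitespace next to a comma is not significant.
        $word =~ s/^\s+//;
        $word =~ s/\s+$//;
        push @keywords, $word if length $word;
    }
    return @keywords;
}

my @keywords = split_keyword_list('word,term,silly,example,with spaces');
print "$_\n" for @keywords;
```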
The next example creates an object that will normalize the text and look for the keywords listed in the file at 'http://homedir.com/list.txt'. The property name used in the result set for the keywords will be 'index_term'.
my $ex = HTML::MetaExtor->new(
    keyword_file     => 'http://homedir.com/list.txt',
    keyword_property => 'index_term',
);
In this example, no keywords will be looked for, but the text will be normalized.
my $ex = HTML::MetaExtor->new( normalize => 1 );
Metadata linked to by the LINK element is (currently) ignored.
HTML::Normalizer documentation, HTML::Parser documentation
Devon Smith <smithde@oclc.org>
Copyright 2001 OCLC. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.