HTML::Normalizer - Removes HTML markup and normalizes the text.
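use LWP::Simple;
use HTML::Normalizer;

my $n = HTML::Normalizer->new;
my $content = get('http://some-webpage.com/');
# LWP::Simple's get() returns undef when the fetch fails.
die "Could not fetch the page\n" unless defined $content;
$n->parse($content);
for ( @{ $n->content } ) { print "$_\n"; }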
HTML::Normalizer is an HTML parser that removes the markup from an HTML
document, and normalizes the remaining text. Normalization is done
according to either a default routine, or one supplied to the object.
HTML::Normalizer is a subclass of HTML::Parser. This means that the
document should be given to the parser by calling the $n->parse() or
$n->parse_file() methods.
$n = HTML::Normalizer->new(%options);
$n->content;
$n->clear;
This is the list of operations performed on each line of text, in the order they are performed. See the source for more detail.
Get rid of all comment markup, malformed comment markup and carriage returns.
Decode all entities that can be decoded.
Remove all entities that cannot be decoded.
Remove all leading whitespace and all trailing tabs/spaces.
Replace internal sequences of tab/space with one space.
Compact multiple newlines into one.
Remove \n from the middle of strings.
Make everything lowercase.
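
The steps above can be sketched in plain Perl. This is only an illustration of the listed order, not the module's actual source. The names normalize_text, decode_entity and %ENTITY are hypothetical, and the tiny entity table stands in for whatever decoder the module really uses (most likely HTML::Entities). The list does not say whether an interior newline is deleted or replaced by a space; this sketch deletes it. It also assumes that "leading" and "trailing" refer to the start and end of the whole text chunk.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny entity table (an assumption for this
# sketch). nbsp is mapped to a plain space for simplicity.
my %ENTITY = (amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ');

# Decode one entity name (the part between '&' and ';').
# Returns the empty string for anything that cannot be decoded.
sub decode_entity {
    my ($name) = @_;
    return chr($1)        if $name =~ /^#(\d+)$/;                # decimal
    return chr(hex $1)    if $name =~ /^#[xX]([0-9a-fA-F]+)$/;   # hex
    return $ENTITY{$name} if exists $ENTITY{$name};              # named
    return '';
}

# normalize_text() is an illustrative name, not part of HTML::Normalizer.
sub normalize_text {
    my ($text) = @_;

    # Drop comments, stray comment delimiters (a stand-in for
    # "malformed comment markup"), and carriage returns.
    $text =~ s/<!--.*?-->//gs;
    $text =~ s/<!--|-->//g;
    $text =~ s/\r//g;

    # Decode entities; undecodable ones are removed in the same pass so
    # that freshly decoded text is never rescanned.
    $text =~ s/&(#?\w+);/decode_entity($1)/ge;

    # Remove leading whitespace and trailing tabs/spaces.
    $text =~ s/^\s+//;
    $text =~ s/[ \t]+$//;

    # Collapse internal runs of tabs/spaces to a single space.
    $text =~ s/[ \t]+/ /g;

    # Compact multiple newlines into one.
    $text =~ s/\n{2,}/\n/g;

    # Remove newlines that sit between two other characters.
    $text =~ s/(?<=[^\n])\n(?=[^\n])//g;

    # Lowercase everything.
    return lc $text;
}

print normalize_text("  <!-- note -->Fish &amp;\t\tChips&bogus;&#33;  "), "\n";
# prints "fish & chips!"
```

Decoding and dropping entities in a single substitution avoids a subtle problem with doing them as two passes: text such as "&amp;lt;" would first decode to "&lt;" and then be treated as an entity by the second pass.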
HTML::Parser, http://www.perldoc.com/perl5.6.1/lib/HTML/Parser.html
Devon Smith <smithde@oclc.org>
Copyright 2001 OCLC. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.