NAME

HTML::Normalizer - Removes HTML markup and normalizes the text.


SYNOPSIS

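  use strict;
  use warnings;
  use LWP::Simple;
  use HTML::Normalizer;

  my $n = HTML::Normalizer->new;
  my $content = get('http://some-webpage.com/')
      or die "Could not fetch the page\n";
  $n->parse($content);
  $n->eof;    # inherited from HTML::Parser; flushes any buffered text
  for ( @{ $n->content } ) { print "$_\n"; }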


DESCRIPTION

HTML::Normalizer is an HTML parser that removes the markup from an HTML document and normalizes the remaining text. Normalization is done either by a default routine or by one supplied to the object. Because HTML::Normalizer is a subclass of HTML::Parser, the document is given to the parser by calling the $n->parse() or $n->parse_file() methods.


METHODS

$n = HTML::Normalizer->new(%options);
%options may contain any or all of the following keys:
normalizer
A reference to a routine to normalize the text. The routine will be passed two arguments. The first will be a reference to the Normalizer object. The second will be the text to be processed. The routine must return the text after processing it.

do_br
Setting this option will result in <br> tags being replaced by newlines.

entities
Setting this option causes entities in the text to be decoded.
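
A custom normalizer can be sketched as follows. This is only an illustration of the calling convention described for the normalizer option (object first, text second, processed text returned); the routine name collapse_and_upcase and what it does to the text are hypothetical and not part of the module.

```perl
use strict;
use warnings;

# Hypothetical custom normalizer. As described for the "normalizer"
# option, it receives the HTML::Normalizer object first and the text
# second, and must return the processed text.
sub collapse_and_upcase {
    my ($normalizer, $text) = @_;
    $text =~ s/\s+/ /g;    # collapse every whitespace run to one space
    $text =~ s/^ //;       # drop a leading space, if any
    $text =~ s/ $//;       # drop a trailing space, if any
    return uc $text;
}

# Called directly here so the sketch runs without the module installed.
# This routine does not use the object argument, so undef is passed.
print collapse_and_upcase( undef, "  some\ttext \n here " ), "\n";
```

With the module installed, the routine would be handed to the constructor along these lines. The value 1 given for do_br and entities is an assumption, since the option descriptions only say that the options are set:

  my $n = HTML::Normalizer->new(
      normalizer => \&collapse_and_upcase,
      do_br      => 1,
      entities   => 1,
  );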

$n->content;
Returns a reference to an array. Each element of the array is a line of normalized text.

$n->clear;
Clears the data structures so that another HTML document can be processed.


NORMALIZATION

This is the list of operations that are performed on each line of text, in the order they are performed. See the source for more detail.

  Get rid of all comment markup, malformed comment markup and carriage returns.
  Decode all the entities that can be decoded.
  Get rid of all the entities that can't be decoded.
  Remove all leading whitespace and all trailing tabs/spaces.
  Replace internal sequences of tab/space with one space.
  Compact multiple newlines into one.
  Remove \n from the middle of strings.
  Make everything lowercase.
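
The steps listed above can be sketched in plain Perl. This is an illustration under stated assumptions, not the module's source: the names normalize_sketch, decode_entity and %ENTITY are hypothetical, the small table stands in for a full decoder such as HTML::Entities, "comment markup" is read as stray comment delimiters, and newlines in the middle of a string are deleted outright rather than replaced.

```perl
use strict;
use warnings;

# Hypothetical, deliberately tiny entity table; a real decoder such as
# HTML::Entities knows far more names.
my %ENTITY = ( amp => '&', lt => '<', gt => '>', quot => '"', nbsp => ' ' );

# Decode one entity name (without '&' and ';'); return '' if unknown.
sub decode_entity {
    my ($name) = @_;
    return chr($1) if $name =~ /^#(\d+)$/;    # decimal form, e.g. #65
    return exists $ENTITY{$name} ? $ENTITY{$name} : '';
}

sub normalize_sketch {
    my ($text) = @_;

    # Comment delimiters (including over-long ones) and carriage returns.
    $text =~ s/<!--+|--+>//g;
    $text =~ s/\r//g;

    # Decode what can be decoded and drop the rest, in a single pass so
    # that decoded text is never mistaken for another entity.
    $text =~ s/&(#?\w+);/decode_entity($1)/ge;

    $text =~ s/^\s+//;          # all leading whitespace
    $text =~ s/[ \t]+$//;       # trailing tabs and spaces
    $text =~ s/[ \t]+/ /g;      # internal tab/space runs become one space
    $text =~ s/\n{2,}/\n/g;     # compact multiple newlines into one
    $text =~ s/\n(?!\z)//g;     # remove newlines that are not at the end
    return lc $text;            # make everything lowercase
}

print normalize_sketch("  Some&nbsp;TEXT &amp; more\r\n");
```

Decoding and removal are folded into one substitution here because running them as two separate passes would delete text such as "&amp;lt;" after its first decoding; whether the module guards against that is not stated above.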


SEE ALSO

HTML::Parser, http://www.perldoc.com/perl5.6.1/lib/HTML/Parser.html


AUTHOR

Devon Smith <smithde@oclc.org>


COPYRIGHT

Copyright 2001 OCLC. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.