LWP Quick Reference Guide
Perl can fetch web pages in several ways: the GET command, LWP::Simple, LWP::UserAgent, and LWP::RobotUA.

The simplest way is from the shell using the GET command, which is typically installed in the same directory as perl. Next easiest is to use LWP::Simple. For the most features use LWP::UserAgent, or LWP::RobotUA. It is a very good idea to put delays in your programs so that you do not overwhelm a web server with requests.

From the shell, run:
GET "url"

Don't forget the quotes; they protect the URL from the shell, which would otherwise interpret characters such as & and ?.

If you are behind a firewall and you have a proxy, use:
GET -p"proxyurl" "url"

LWP::Simple may be easier than the rest of the LWP examples on this page.
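
For comparison, here is a minimal LWP::Simple sketch. The sub name fetch_page is made up for illustration and is not part of LWP; the only library call used is LWP::Simple's get(), which returns undef on failure. The URL is taken from the command line.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use LWP::Simple qw(get);

# fetch_page is a hypothetical helper name, not an LWP function.
# It returns the page content, or dies if get() returned undef.
sub fetch_page
{
  my ($url) = @_;
  my $content = get($url);
  die "Could not get $url\n" unless defined $content;
  return $content;
}

# Only fetch when a URL is given on the command line
print fetch_page($ARGV[0]) if @ARGV;
```

Note that get() gives no reason for a failure; when you need the status code or headers, use LWP::UserAgent instead.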

Simple LWP Example

#!/usr/bin/perl

use strict;
use LWP::UserAgent;
my $content;

my $ua = LWP::UserAgent->new;

# Various enhancement possibilities:
# $ua->max_size(100000);  # 100k byte limit
# $ua->timeout(3);        # 3 sec timeout (the default is 180 sec)
# $ua->proxy(['http'], 'http://myproxy.mycorp.com/');  # set proxy
# $ua->env_proxy();       # load proxy info from environment variables
# $ua->no_proxy('localhost', 'mycorp.com');   # No proxy for local machines

$ua->agent("Mozilla/6.0");  # Or something equally mysterious

my $req = HTTP::Request->new(GET => 'http://mycorp.com/');
my $res = $ua->request($req);
 
if ($res->is_success) 
{
  $content= $res->content;
} 
else 
{
  die "Could not get content: " . $res->status_line . "\n";
}
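
The commented-out enhancements above can be switched on as in the following sketch. The sub names make_agent and fetch are made up for illustration, and the 10 second timeout is an arbitrary choice; timeout, max_size, env_proxy, get, status_line and decoded_content are LWP::UserAgent and HTTP::Response methods.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use LWP::UserAgent;

# Build a user agent with a few of the optional settings switched on.
# (make_agent is a hypothetical helper name.)
sub make_agent
{
  my $ua = LWP::UserAgent->new;
  $ua->timeout(10);        # give up after 10 seconds (arbitrary value)
  $ua->max_size(100_000);  # keep at most about 100k bytes of content
  $ua->env_proxy;          # honor http_proxy, no_proxy, ... if they are set
  return $ua;
}

# Fetch one URL; die with the status line if the request failed.
sub fetch
{
  my ($ua, $url) = @_;
  my $res = $ua->get($url);
  die "Could not get $url: " . $res->status_line . "\n" unless $res->is_success;
  return $res->decoded_content;
}

# Only fetch when a URL is given on the command line
print fetch(make_agent(), $ARGV[0]) if @ARGV;
```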

Extracting Links

This program fetches a web page and extracts its links. It expands relative links to absolute ones and prints only those that stay on the starting site.
#!/usr/bin/perl

use LWP::UserAgent;
use HTML::LinkExtor;
use URI::URL;
use Getopt::Std;
use strict;
our $opt_u;  # getopts() sets the package variable $opt_u, so it must not be a "my" variable
my $url;

getopts("u:");
if ($opt_u)
{
  $url= $opt_u
}
else
{
  $url = "http://mycorp.com/";
}

my $ua = LWP::UserAgent->new;
 
# Set up a callback that collects links
my @links = ();
my @abs_links;

sub callback 
{
  my($tag, %attr) = @_;
  return if $tag ne 'a';  # we only look closer at <a ...>
  return if $attr{href} =~ /^mailto:/;  # ignore mailto:
  push(@links, values %attr);
}
 
# Make the parser.  Unfortunately, we don't know the base yet
# (it might be different from $url)
my $p = HTML::LinkExtor->new(\&callback);
 
# Request document and parse it as it arrives
my $res = $ua->request(HTTP::Request->new(GET => $url),
                    sub {$p->parse($_[0])});
 
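# Expand all URLs to absolute ones, keeping only those on this site
my $base = $res->base;
for (@links)
{
  my $abs = url($_, $base)->abs;
  # don't go outside this site: keep only links that begin with the
  # start URL (\Q...\E stops the dots in $url acting as regex wildcards)
  if ($abs =~ /^\Q$url\E/)
  {
    push @abs_links, $abs;
  }
}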
 
# Print them out
print join("\n", @abs_links), "\n";
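
The expansion and same-site steps can also be tried without a network connection. The sketch below is a variation, not the program above: it hands HTML::LinkExtor a base URL so that the links come back as absolute URI objects, and it compares host names instead of matching against the start URL. The sub name same_site_links, the sample HTML and the example.org link are made up for illustration.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Return the absolute URLs of all <a href> links in $html that are on
# the same host as $base.  (same_site_links is a hypothetical name.)
sub same_site_links
{
  my ($html, $base) = @_;
  my $host = URI->new($base)->host;

  # No callback: links are collected internally.  Because a base URL
  # is given, each link is returned as an absolute URI object.
  my $p = HTML::LinkExtor->new(undef, $base);
  $p->parse($html);
  $p->eof;

  my @found;
  for my $link ($p->links)
  {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' && defined $attr{href};
    my $uri = $attr{href};
    next unless ($uri->scheme || '') =~ /^https?$/;  # skips mailto: etc.
    next unless $uri->host eq $host;                 # stay on this site
    push @found, "$uri";
  }
  return @found;
}

my $html = <<'HTML';
<a href="faq.html">FAQ</a>
<a href="../images/logo.gif">Logo</a>
<a href="mailto:me@foo.com">Mail</a>
<a href="http://example.org/">Elsewhere</a>
<img src="pic.png">
HTML

print "$_\n" for same_site_links($html, 'http://mycorp.com/docs/index.html');
```

With the sample input this prints http://mycorp.com/docs/faq.html and http://mycorp.com/images/logo.gif; the mailto link, the off-site link and the image are skipped.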

LWP::RobotUA

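LWP::RobotUA is the polite LWP: it obeys robots.txt and waits between requests to the same server. Just use:
require LWP::RobotUA;
$ua = LWP::RobotUA->new('my-robot/0.1', 'me@foo.com');
$ua->delay(10);  # be very nice: delay is in minutes, so at most one hit every 10 minutes per server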
...
# use it just like a normal LWP::UserAgent
$res = $ua->request($req); 
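
A slightly fuller sketch of a polite robot follows. The robot name and e-mail address are the placeholders used above, and the 10 second delay is an arbitrary choice. delay() takes minutes, so fractions are needed for anything shorter than a minute.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use LWP::RobotUA;
use HTTP::Request;

my $ua = LWP::RobotUA->new('my-robot/0.1', 'me@foo.com');
$ua->delay(10/60);  # delay is in minutes: wait 10 seconds between hits on one server

# Fetch each URL given on the command line.  The robot sleeps between
# requests to the same host and refuses URLs that robots.txt disallows.
for my $url (@ARGV)
{
  my $res = $ua->request(HTTP::Request->new(GET => $url));
  print "$url: ", $res->status_line, "\n";
}
```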

Documentation

The lwpcook man page (man lwpcook) has more information and examples.

HTTP Error codes are found in the module HTTP::Status.
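
As a small sketch of what HTTP::Status offers, the following turns numeric codes into messages and classifies them. The sub name describe is made up for illustration; status_message, is_success, is_error and the HTTP_* constants come from HTTP::Status.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use HTTP::Status qw(:constants status_message is_success is_error);

# describe is a hypothetical helper: "code message (class)"
sub describe
{
  my ($code) = @_;
  my $class = is_success($code) ? 'success'
            : is_error($code)   ? 'error'
            :                     'other';
  return sprintf("%d %s (%s)", $code, status_message($code), $class);
}

print describe($_), "\n" for (HTTP_OK, HTTP_NOT_FOUND, HTTP_INTERNAL_SERVER_ERROR);
```

This prints "200 OK (success)", "404 Not Found (error)" and "500 Internal Server Error (error)".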


Perldoc guide:
        LWP::MemberMixin   -- Access to member variables of Perl5 classes
          LWP::UserAgent   -- WWW user agent class
            LWP::RobotUA   -- For developing robot applications
          LWP::Protocol          -- Interface to various protocol schemes
            LWP::Protocol::http  -- http:// access
            LWP::Protocol::file  -- file:// access
            LWP::Protocol::ftp   -- ftp:// access
            ...
 
        LWP::Authen::Basic -- Handle 401 and 407 responses
        LWP::Authen::Digest
 
        HTTP::Headers      -- MIME/RFC822 style header (used by HTTP::Message)
        HTTP::Message      -- HTTP style message
          HTTP::Request    -- HTTP request
          HTTP::Response   -- HTTP response
        HTTP::Daemon       -- A HTTP server class
 
        WWW::RobotRules    -- Parse robots.txt files
          WWW::RobotRules::AnyDBM_File -- Persistent RobotRules
 
        The following modules provide various functions and
        definitions.
 
        LWP                -- Library version number and documentation
        LWP::MediaTypes    -- MIME types configuration (text/html etc.)
        LWP::Debug         -- Debug logging module
        LWP::Simple        -- Simplified procedural interface for common functions
        HTTP::Status       -- HTTP status code (200 OK etc)
        HTTP::Date         -- Date parsing module for HTTP date formats
        HTTP::Negotiate    -- HTTP content negotiation calculation
        File::Listing      -- Parse directory listings
        HTTP::Request::Common -- Construct common HTTP::Request objects

12/18/2000

By toma