Style and Spelling Checker in Perl
I wanted my program to work with URLs in the same way that it works with files. I think that file names and URLs really answer the same question: "Where is my stuff?"
Why do I have to go to the trouble of thinking about files and URLs as different sorts of things? Why can't the program just handle the differences between file names and URLs without manual intervention? This program accepts an argument that is either a file name or a URL. The program tries to figure out which it is and do the right thing. I would like someone to write a module that does a better job of detecting whether a string is a file name or a URL.
I tried to use CPAN modules instead of my own code wherever possible. I was surprised that I couldn't find a module to make handling URLs and files work the same way.
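As a rough illustration of the guessing step described above, here is a minimal sketch that uses only core Perl. The subroutine name `guess_source_type` and the exact rules are my own assumptions and are not part of style.pl, which simply tests `-e` and otherwise treats the argument as a URL.

```perl
use strict;
use warnings;

# Hypothetical helper (not in style.pl): guess whether a command-line
# argument names a local file or a URL.
# Returns 'file', 'url', or 'unknown'.
sub guess_source_type {
    my ($arg) = @_;
    return 'unknown' if !defined $arg or $arg eq '';

    # An existing path on disk wins, even if it happens to look like a URL.
    return 'file' if -e $arg;

    # A scheme followed by "://" is a strong hint of a URL.
    return 'url' if $arg =~ m{^(?:https?|ftp)://}i;

    # A bare host name such as "www.example.com/page" is a weaker hint.
    return 'url' if $arg =~ m{^www\.[^/\s]+}i;

    return 'unknown';
}

print guess_source_type('http://www.example.com/'), "\n";
```

Returning 'unknown' instead of assuming a URL lets the caller print a usage message rather than attempt a network fetch for a mistyped file name.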
Update: Sun Aug 1 17:05:43 PDT 2004
URL and file manipulation has been nicely unified in the Perl modules IO::All and IO::All::LWP. I will enhance this tool to take advantage of this capability in the future.
Since this program analyzes text, and text on the web is often in HTML format, the program automatically converts its input from HTML into plain text if necessary. I would like a CPAN module to detect whether a document is written in HTML.
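Lacking such a module, the detection can be approximated with a few regular expressions. The sketch below is only a heuristic under my own assumptions: the name `looks_like_html`, the 2048-character window, and the threshold of three tags are all invented for illustration. The script itself checks the file extension and looks for `<html` in the content.

```perl
use strict;
use warnings;

# Hypothetical helper (not in style.pl): guess whether a string holds HTML.
# Returns 1 for "probably HTML", 0 otherwise.
sub looks_like_html {
    my ($content) = @_;
    return 0 if !defined $content;

    # Only inspect the start of the document to keep the check cheap.
    my $head = substr($content, 0, 2048);

    # A doctype or a top-level structural tag is treated as decisive.
    return 1 if $head =~ /<!DOCTYPE\s+html/i;
    return 1 if $head =~ /<(?:html|head|body)\b/i;

    # Otherwise count a few common tags; fragments often lack <html>.
    my $tags = () = $head =~ /<\/?(?:p|div|a|br|table|h[1-6])\b[^>]*>/gi;
    return $tags >= 3 ? 1 : 0;
}

print looks_like_html('<html><body>Hello</body></html>') ? "HTML\n" : "text\n";
```

A heuristic like this can still misjudge plain text that quotes markup, so it is a convenience and not a replacement for a real content-type check.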
I used a nice module for turning HTML into text, but this module doesn't really handle tables. This is a problem since many web sites have pages that are based on a hierarchy of tables.
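One possible workaround, sketched here under my own assumptions, is to strip the table markup before formatting so that the cell text survives as ordinary content. The subroutine `flatten_tables` is hypothetical and regex-based, so it is only suitable for reasonably tidy HTML; HTML::TableExtract would be the more robust route.

```perl
use strict;
use warnings;

# Hypothetical helper (not in style.pl): drop table markup but keep the
# text inside the cells, so a text formatter sees ordinary content.
sub flatten_tables {
    my ($html) = @_;

    # The end of a row becomes a line break.
    $html =~ s{</tr\s*>}{<br>}gi;

    # The end of a cell becomes a space, so adjacent cells don't run together.
    $html =~ s{</t[dh]\s*>}{ }gi;

    # Remove every remaining table-related tag, including its attributes.
    $html =~ s{</?(?:table|thead|tbody|tfoot|tr|td|th|caption|colgroup|col)\b[^>]*>}{}gi;

    return $html;
}

my $sample = '<table border="1"><tr><td>Name</td><td>Value</td></tr></table>';
print flatten_tables($sample), "\n";
```

Unlike deleting only the string `<table`, this removes the whole tag, so attributes such as `border="1"` do not leak into the text that gets analyzed.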
I would like to improve this program. Feel free to improve it and send me the changes. I am particularly interested in CPAN modules which could replace my own code.
#!/usr/bin/perl
#
# Tom Anderson
# Sun Feb 17 15:45:09 PST 2002
# Sun Mar 3 00:00:43 PST 2002
# Program to evaluate text complexity and spelling.
#
# Copyright Tom Anderson 2002, All rights reserved.
# This program may be copied under the same terms as Perl itself.
# Please send modifications to t@tomacorp.com
#

=pod

Enhancement Ideas:

A more verbose option to monitor progress
Be able to specify an output file
Be able to use the plain text output and ignore the analysis,
  or maybe send the analysis to STDERR and the plain text to STDOUT.
Maybe there should be options for getting URL or text file - is there a module?
Handle HTML tables properly (perhaps just remove table tags or use HTML::TableExtract)
Work as a CGI, as a client for posting.
Use a module or put file/URL guessing in a subroutine.

=cut

use strict;
use warnings;
use diagnostics;
use Lingua::EN::Fathom;
use Text::FormatTable;
use Math::Round qw(nearest);
use LWP::Simple;
use HTML::TreeBuilder;
use HTML::FormatText;
use Lingua::Ispell qw( :all );
use Data::Dumper;

$Lingua::Ispell::path= "/usr/bin/ispell";
my $VERSION=0.01;
$|++;

# Fog index ranges: [lower bound (inclusive), upper bound (exclusive)]
my %fog_description = (
  'unreadable' => [18,1e12],
  'difficult'  => [14,18],
  'ideal'      => [11,14],
  'acceptable' => [8,11],
  'childish'   => [-1e12,8]);

my $file= shift;
die "Usage: $0 file_or_url\n" if not defined $file;
my $query= $file;

my $content;
if (-e $file)
{
  # Slurp the whole file; local restores $/ when the block ends.
  local $/;
  open(IN, '<', $file) or die "Can't open $file: $!";
  $content= <IN>;
  close IN;
}
else
{
  # Not an existing file, so treat it as a URL. get() returns undef on failure.
  $content = get($file);
  if (not defined $content or $content eq "")
  {
    die "No content at $file";
  }
  $file= "/tmp/html_scan.";
  open(TMP, '>'.$file) or die "Can't create temporary html file";
  print TMP $content;
  close TMP;
}

# If the content is HTML, format it as plain text first.
my $content_type="plain text";
if ($query =~ /\.html?$/i or $content =~ /<html/i)
{
  $content_type="HTML";
  # Crude workaround for tables: break the opening table tags.
  $content =~ s/<table//gi;
  # print STDERR $content;
  my $tree = HTML::TreeBuilder->new_from_content($content);
  # my $tree = HTML::TreeBuilder->new->parse_file($file);
  my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 78);
  $file= "/tmp/txt_scan.";
  open(TMP, '>'.$file) or die "Can't create temporary text file";
  print TMP $formatter->format($tree);
  close TMP;
}

my $text = Lingua::EN::Fathom->new;
$text->analyse_file($file);
my $accumulate = 1;
my $text_string= "";
$text->analyse_block($text_string,$accumulate);

my %words = $text->unique_words;
# my $wordlist= join ' ', sort keys %words;
my @wordlist= sort keys %words;

my $fog = nearest(0.1, $text->fog);
my $flesch = nearest(0.1, $text->flesch);
my $kincaid = nearest(0.1, $text->kincaid);

my $table = Text::FormatTable->new('r l l l');

# Find the description whose range contains the fog index.
my $fog_descr;
for (keys %fog_description)
{
  if ($fog >= $fog_description{$_}[0] and $fog < $fog_description{$_}[1])
  # print Dumper($_),"\n";
  {
    $fog_descr= $_;
  }
}

my $percent_complex_words = nearest(0.1,$text->percent_complex_words);
$table->row("Fog", $fog, $fog_descr, "");
$table->row("Grade Level", $kincaid, "(Flesch-Kincaid)", "");
$table->row("Flesch", $flesch, "", "");
$table->row("Complex words", "$percent_complex_words %", "", "");
$table->row("Chars", $text->num_chars, "Words", $text->num_words);
$table->row("Lines", $text->num_text_lines, "Blank Lines", $text->num_blank_lines);
$table->row("Sentences", $text->num_sentences, "Paragraphs", $text->num_paragraphs);

# Break up the wordlist before calling spellcheck since
# spellcheck seems to have trouble with a large input string.
my @missing_words;
my $sublist="";
my $cnt=0;
while (@wordlist > 0)
{
  $sublist .= shift @wordlist;
  # Flush a full batch, or whatever is left once the list is empty.
  if ($cnt > 50 or @wordlist == 0)
  {
    for my $r (spellcheck($sublist))
    {
      push @missing_words, $r->{'term'};
    }
    $sublist="";
    $cnt=0;
  }
  else
  {
    $sublist .= ' ';
    $cnt++;
  }
}

# Print the misspelled words three to a row.
my $colname= "Spelling";
my @cols;
my $col=0;
for ( @missing_words )
{
  $cols[$col]= $_;
  if ($col >= 2)
  {
    $table->row($colname, $cols[0], $cols[1], $cols[2]);
    $col=0;
    $colname="";
    @cols=();
  }
  else
  {
    $col++;
  }
}
# Print a final, partly filled row so that no words are dropped.
if ($col > 0)
{
  $table->row($colname, map { defined $_ ? $_ : "" } @cols[0..2]);
}

print "Analysis of $content_type: $query\n", $table->render();

style.pl - download the code
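For readers curious about the "Fog" row in the report, here is a small sketch of the usual Gunning fog formula, 0.4 * (words per sentence + percent complex words), together with an ordered version of the description lookup. The subroutines `fog_index` and `fog_band` are hypothetical and not part of style.pl; in the script the number comes from Lingua::EN::Fathom, and what counts as a "complex word" is decided by that module.

```perl
use strict;
use warnings;

# Hypothetical helper: Gunning fog index computed from raw counts.
sub fog_index {
    my ($words, $sentences, $complex_words) = @_;
    return 0 if !$words or !$sentences;    # avoid division by zero
    my $words_per_sentence    = $words / $sentences;
    my $percent_complex_words = 100 * $complex_words / $words;
    return 0.4 * ($words_per_sentence + $percent_complex_words);
}

# The same bands as %fog_description in the script, kept in ascending
# order of their exclusive upper bound so the first match wins.
my @fog_bands = (
    [ 8,  'childish'   ],
    [ 11, 'acceptable' ],
    [ 14, 'ideal'      ],
    [ 18, 'difficult'  ],
);

# Hypothetical helper: map a fog index to its description.
sub fog_band {
    my ($fog) = @_;
    for my $band (@fog_bands) {
        return $band->[1] if $fog < $band->[0];
    }
    return 'unreadable';
}

# 100 words in 5 sentences with 10 complex words:
# 20 words per sentence and 10 percent complex words.
my $fog = fog_index(100, 5, 10);
printf "%.1f %s\n", $fog, fog_band($fog);
```

An ordered list avoids relying on hash iteration order and makes the boundaries easy to read at a glance.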
By toma