I need a perl script to take text from a wiki page:
Sent to Computer Experts March 14 12:17 PM

I need a Perl script to take text from a wiki page: http://www.financialmathematics.com/wiki/index.php/Matlab_example_with_dependency.m and write to a file only the preformatted text in which the first word of the preformatted text is "function". In other words, I want to extract the code embedded on the wiki page.

 

Optional Information:
OS: Linux; Browser: Firefox

Customer (name blocked for privacy)
Answer
March 15 4:45 AM (16 hours and 27 minutes and 21 seconds later)
         
ACCEPTED
Hi Customer (name blocked for privacy), thanks for the question. I hope this code will help you. Save it, for example, in a file called "fetch.pl" and start it from the command line with "perl fetch.pl", or even just "fetch.pl" (depending on your system).

#-----------------------START OF CODE
#!/usr/bin/perl -w

use strict;

package HTMLStrip;

use base "HTML::Parser";

use LWP::Simple;

# here you define the file that the program will write the output into
open(my $out, ">", "output.txt") or die "Cannot open output.txt: $!";

my $pre_flag = 0;
my $html = "";

# used to react to the tag that the wiki uses for the code start
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    # set the flag if we find the tag
    $pre_flag = 1 if $tag =~ /^pre$/i;
}

# everything within the pre tags will simply be written to file
sub text {
    my ($self, $text) = @_;
    if ($pre_flag == 1) {
        print $text;        # for debug purposes, write the output to console as well
        print $out $text;   # write the data to file
    }
}

# used to react to the tag that the wiki uses for the code end
sub end {
    my ($self, $tag, $origtext) = @_;
    # reset the flag if we find the closing tag
    $pre_flag = 0 if $tag =~ /^pre$/i;
}

my $p = HTMLStrip->new;

# load the whole html page
$html = get("http://www.financialmathematics.com/wiki/index.php/Matlab_example_with_dependency.m");
die "Could not fetch the page" unless defined $html;

# parse the page and tell the parser that the input is complete
$p->parse($html);
$p->eof;

# close the file
close($out);

#------------------ END OF CODE -----------------
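
The script above writes every preformatted block to the file, while the question asks only for blocks whose first word is "function". One way this could be handled is to buffer each block until its closing pre tag and keep it only if it passes that test. The following is a minimal sketch, not a tested replacement for the script: it assumes HTML::Parser version 3 or later and uses its handler-style interface, where the "dtext" argument delivers text with entities already decoded. The sub name extract_function_blocks and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: returns the text of every <pre> block whose
# first word is "function".
sub extract_function_blocks {
    my ($html) = @_;
    my @blocks;    # blocks that passed the filter
    my $buffer;    # text of the <pre> block being read; undef outside a block

    my $parser = HTML::Parser->new(
        api_version => 3,
        # tag names arrive lowercased with the "tagname" argument spec
        start_h => [ sub { $buffer = '' if $_[0] eq 'pre' }, 'tagname' ],
        text_h  => [ sub { $buffer .= $_[0] if defined $buffer }, 'dtext' ],
        end_h   => [ sub {
            return unless $_[0] eq 'pre' && defined $buffer;
            # keep the block only if its first word is "function"
            push @blocks, $buffer if $buffer =~ /^\s*function\b/;
            undef $buffer;
        }, 'tagname' ],
    );
    $parser->parse($html);
    $parser->eof;
    return @blocks;
}

# Made-up sample input, not taken from the wiki page
my $sample = "<p>intro</p><pre>function y = f(x)\ny = x + 1;</pre><pre>some other text</pre>";
print "$_\n" for extract_function_blocks($sample);
```

To use it on the real page, the string returned by get(...) would be passed in place of $sample and the returned blocks printed to the output file.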

Edited by Kerim on March 15 2007 at 4:49 AM

1 Other Expert Agrees with this!
Reply
March 19 6:03 PM (4 days and 13 hours later)
         
Relist: I still need help.
This script garbles some special characters when run on Linux. For example, > becomes &gt;, & becomes &amp;, and so forth. Is there a way to fix this little glitch?
Answer
March 20 4:44 AM (10 hours and 41 minutes and 46 seconds later)
         
REPLIED
Sadly, I do not have a Unix system running at home, so I cannot give you a solution that is guaranteed to work.

I would assume that it is an encoding issue: the problem may arise either when the string from the web page is converted to Perl's internal format, or when it is converted from Perl's internal format to the output format on your system.

I would suggest that you try the Encode module, documented here:
http://www.ayni.com/perldoc/perl5.8.0/lib/Encode.html
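
As a rough illustration of the two conversion steps mentioned above, here is a minimal sketch using the Encode module. It assumes the page is served as UTF-8, which has not been verified in this thread, and the sample bytes are made up. The first step turns raw bytes into Perl's internal character string; the second lets an encoding layer on the filehandle turn characters back into bytes on output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Made-up sample: "caf" followed by the two UTF-8 bytes for e-acute (0xC3 0xA9)
my $bytes = "caf\xC3\xA9";

# Step 1: bytes -> Perl's internal character string
my $chars = decode('UTF-8', $bytes);
print length($bytes), " bytes, ", length($chars), " characters\n";

# Step 2: character string -> bytes on output, done by the filehandle layer
open(my $out, '>:encoding(UTF-8)', 'output.txt')
    or die "Cannot open output.txt: $!";
print {$out} $chars, "\n";
close($out);
```

Whether this is needed depends on what the fetching code already does with the response, so it is worth checking the output before and after.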

I will open the question for others so they might be able to help.

Reply
March 20 11:22 PM (18 hours and 37 minutes and 44 seconds later)
         
Reply to Kerim's Post: Any thoughts? Anything which works on linux is fine. E.g. could use wget, sed etc.
Answer
March 22 3:44 AM (1 day and 4 hours later)
         
ACCEPTED
I have XP with Perl 5.8, and the script doesn't produce that problem here (Germany).
The text is read directly by the HTML::Parser class.
I scanned through the API again.
It normally assumes UTF-8 encoding (which is the case on your site).
Older versions, it seems, support only Latin-1.

So one solution that would most probably work (if you have Perl < 5.8) would be to upgrade your Perl distribution.
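
Another possibility worth checking before upgrading: the symptom may be HTML entities written out undecoded rather than a character-set problem. With the subclass-style interface used in the script, HTML::Parser hands text to the text method unmodified, so entities are not expanded. A minimal sketch of decoding them with decode_entities from HTML::Entities, which ships with HTML::Parser; the sample string is made up.

```perl
use strict;
use warnings;
use HTML::Entities qw(decode_entities);

# Made-up sample of what the text callback might receive inside a <pre> block
my $raw = 'if (a &gt; b &amp;&amp; c &lt; d)';

# Called in non-void context, decode_entities returns the decoded copy
my $decoded = decode_entities($raw);
print "$decoded\n";
```

In the script, this would mean decoding $text inside sub text before printing it.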