How to fetch data from multiple pages and write it to a file

2023/09/04 12:16

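Key points of this Q&A

Explains how to fetch data from multiple pages, addressed by URL, and write the results to a file.
Uses the Parallel::ForkManager module to fetch pages in several processes in parallel.
Output goes through file handles. To keep parallel processes from contending for a single file, each process writes to its own file and the parent joins the files in page order once all children have finished.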

Best answer selected

Parallel::ForkManager(2)

2013/01/22 18:32

Part of my previous post was written completely wrong and made little sense as posted, but I could not edit or repost it, so I am posting the question again. The program is:

    use Web::Scraper;
    use WWW::Mechanize::Firefox;
    use Parallel::ForkManager;
    use URI;
    binmode STDOUT,":utf8";
    sub func ;
    ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime();
    $year += 1900 ;
    $mon += 1 ;
    $File = "yuma-$year:$mon:$mday:$hour.txt" ;
    open (file,"> $File") or die 'fail to open file\n';
    print "HOW MUCH PAGE\n" ;
    my $page = <STDIN> ;
    print "WITEING...\n" ;
    my $MAX_PROCESSES = 5;
    my $pm = new Parallel::ForkManager($MAX_PROCESSES);
    for ($i = 1;$i <= $page;$i++) {
        $pm->start and next;
        my $uri = URI -> new ("www.目的とするＵＲＬ$i.html");   # placeholder for the target URL
        my $mech = WWW::Mechanize::Firefox->new();
        $mech->get($uri) ;
        print file $s->scrape($mech->content) ;
        print file "\n" ;
        print file $r->scrape($mech->content) ;
        print file "\n" ;
        $pm->finish;
    }
    print "WITEID\n" ;
    my $s = scraper { process 'font',sen => 'TEXT'; result 'sen'; };
    my $r = scraper { process 'div#content',ren => 'TEXT'; result 'ren'; };
    close (file) ;

I want the results written to the file in order of $i, but I do not know how to make a process wait when the previous one has not finished yet. Please explain at a beginner's level.

pipopipoid
Thanks rate: 70% (39/55)

Perl
Answers: 3
Thanks: 3

All answers (3)

Best answer chosen by the asker

t-okura
Best-answer rate: 75% (253/335)

2013/01/22 22:19 Answer No.2

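Answer No.1 is right, but here is an alternative. Have each of the parallel processes write to a different file, and once all the child processes have finished, have the parent process join those files together in order. How about that?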
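The approach described in this answer can be sketched roughly as follows. This is a minimal illustration rather than the answerer's own code: `fetch_page` is a hypothetical stand-in for the Firefox fetch-and-scrape step, and the page count, the limit of 5 processes, and the `page-N.txt` file names are assumptions chosen for the demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Parallel::ForkManager;

# Hypothetical stand-in for the Firefox fetch-and-scrape step in the question.
sub fetch_page {
    my ($page) = @_;
    return "text of page $page\n";
}

my $pages   = 12;           # assumed page count for the demo
my $workdir = tempdir();    # scratch directory (assumption), left in place afterwards
my $pm      = Parallel::ForkManager->new(5);

for my $i (1 .. $pages) {
    $pm->start and next;    # parent: go on to the next page

    # Child: write only to its own file, so no two processes share a handle.
    open my $out, '>:encoding(UTF-8)', "$workdir/page-$i.txt"
        or die "cannot open part file for page $i: $!";
    print {$out} fetch_page($i);
    close $out or die "cannot close part file for page $i: $!";

    $pm->finish;            # child exits here
}

$pm->wait_all_children;     # parent blocks until every child has exited

# Parent: join the parts in page order, not in completion order.
my $combined = "$workdir/combined.txt";
open my $all, '>:encoding(UTF-8)', $combined
    or die "cannot open $combined: $!";
for my $i (1 .. $pages) {
    open my $in, '<:encoding(UTF-8)', "$workdir/page-$i.txt"
        or die "missing part file for page $i: $!";
    print {$all} $_ while <$in>;
    close $in;
}
close $all or die "cannot close $combined: $!";

print "combined file: $combined\n";
```

Because the parent loops over the page numbers itself, the order of the combined file does not depend on which child happened to finish first, and the join needs no external command.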

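Asker

Thanks 2013/01/22 22:46

I assume the joining step would call out to the system. My worry is that if the pieces are joined piecemeal, the resulting large file may take a long time to read, and that deleting the part files after collecting them might affect the HDD (its lifespan). The whole thing only comes to 1.2 MB, so arguably it is not a problem... Sorry, I am not good with computers, software or hardware, so I am at a loss. Is this what everyone does?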

Asker

Supplement 2013/01/23 01:15

For now I will go ahead and try this approach.

Other answers (2)

kmee
Best-answer rate: 55% (1857/3366)

2013/01/22 23:18 Answer No.3

Isn't this slow simply because my $mech = WWW::Mechanize::Firefox->new(); runs on every iteration? If you change it to

    my $mech = WWW::Mechanize::Firefox->new();
    for ($i = 1;$i <= $page;$i++) {
        my $uri = URI -> new ("www.目的とするＵＲＬ$i.html");   # placeholder for the target URL
        $mech->get($uri) ;
        print file $s->scrape($mech->content) ;
        print file "\n" ;
        print file $r->scrape($mech->content) ;
        print file "\n" ;
    }

might that already be fast enough? And if you do not need to go through Firefox, you could use LWP instead.
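The LWP suggestion might look something like the sketch below, which builds one LWP::UserAgent outside the loop and reuses it for every page. The `page_url` helper and its URL pattern are hypothetical, since the thread does not give the real address, and the 30-second timeout is an assumption. Note that LWP does not execute JavaScript, so this only applies when the page content is present in the raw HTML.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical URL scheme; the real site's pattern is not given in the thread.
sub page_url {
    my ($base, $i) = @_;
    return "$base$i.html";
}

# Fetch pages 1..$pages with one shared user agent and return their bodies
# in page order. Failed pages are reported and stored as empty strings.
sub fetch_pages {
    my ($ua, $base, $pages) = @_;
    my @bodies;
    for my $i (1 .. $pages) {
        my $res = $ua->get(page_url($base, $i));
        if ($res->is_success) {
            push @bodies, $res->decoded_content // '';
        }
        else {
            warn "page $i failed: ", $res->status_line, "\n";
            push @bodies, '';
        }
    }
    return @bodies;
}

# Build the client once, outside the loop, and reuse it for every request.
my $ua = LWP::UserAgent->new(timeout => 30);

# Only touch the network when a base URL and a page count are supplied.
if (@ARGV == 2) {
    my ($base, $pages) = @ARGV;
    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for fetch_pages($ua, $base, $pages);
}
```

Run with a base URL and a page count as arguments, it prints the page bodies in page order; with no arguments it only defines the helpers and makes no requests.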

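Asker

Thanks 2013/01/23 00:57

Thank you. I do not really understand object-oriented programming, but on the assumption that this is what it looks like, I create an object to do the work in each fork. Without Firefox it would be fast, but the pages contain JavaScript, so I have no choice but to use it. Perhaps because my connection is poor, Firefox takes roughly 4 seconds to read one page, which is too slow for 806 pages. The program I modified is misbehaving, so I will post it in the supplement. Could someone please help (..)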

Asker

Supplement 2013/01/23 00:56

The program is:

    use Web::Scraper;
    use WWW::Mechanize::Firefox;
    use URI;
    use Parallel::ForkManager;
    use utf8;
    my $s = scraper { process 'font',sen => 'TEXT'; result 'sen'; };
    my $r = scraper { process 'div#content',ren => 'TEXT'; result 'ren'; };
    ($sec, $min, $hour, $mday, $mon, $year, $wday, $yday, $isdst) = localtime();
    $year += 1900 ;
    $mon += 1 ;
    print "HOW MUCH PAGE\n" ;
    my $page = <STDIN> ;
    chomp $page ;
    print "YA=WITEING...\n" ;
    my $Max = 10 ;
    my $pm = new Parallel::ForkManager($Max) ;
    for ($i = 1;$i <= $page;$i++) {
        $pm->start and next;
        $File = "current-ya$i.txt" ;
        open (file2, ">:utf8","$File") or die 'fail to open file\n';
        my $uri = URI -> new ("目的とするURL$i");   # placeholder for the target URL
        my $mech = WWW::Mechanize::Firefox->new();
        $mech->get($uri) ;
        print file2 $s->scrape($mech->content) ;
        print file2 "\n" ;
        print file2 $r->scrape($mech->content) ;
        print file2 "\n" ;
        print "this page is $i\n" ;
        close (file2) ;
        $pm->finish;
    }
    wait_all_children ;
    print "YASASIISEKAI=NOVEL=WITEID\n" ;
    print "CAT...\n" ;
    sleep 3 ;
    #system ("cat current-ya* > ya=$year:$mon:$mday:$hour.txt") ;
    #system ("rm -r current-ya*") ;
    print "PERFECT\n";

The result is:

    ai@ubuntu:~/Documents/ya$ perl test.pl
    HOW MUCH PAGE
    14
    YASASIISEKAI=NOVEL=WITEING...
    this page is 10
    this page is 7
    this page is 9
    this page is 3
    this page is 8
    YA=WITEID
    CAT...
    this page is 6
    this page is 4
    this page is 5
    this page is 2
    this page is 1
    PERFECT
    ai@ubuntu:~/Documents/yasasii$ this page is 11
    this page is 12
    this page is 14
    this page is 13
    ^C

It stops partway through and is completely broken.
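Two details in the program above may be worth checking; neither is raised by the answers in this thread. First, `wait_all_children ;` is written as a bare word rather than as the method call `$pm->wait_all_children`. Without `use strict`, Perl accepts a bare word as a plain string and does nothing with it, which would fit a log where the parent prints its closing messages while children are still running. Second, the commented-out `cat current-ya*` relies on shell glob order, which is typically lexical, so `current-ya10.txt` sorts before `current-ya2.txt`. The sketch below illustrates only the second point, ordering part files by page number; `by_page_number` is a hypothetical helper name, and the file-name pattern is taken from the program above.

```perl
use strict;
use warnings;

# Sort part-file names by the page number embedded in them.
# Names that do not match "current-ya<N>.txt" are skipped.
sub by_page_number {
    my @names = @_;
    my %page_of;
    for my $name (@names) {
        my ($n) = $name =~ /current-ya(\d+)\.txt\z/;
        $page_of{$name} = $n if defined $n;
    }
    return sort { $page_of{$a} <=> $page_of{$b} } keys %page_of;
}

# Build sample names for pages 1..12 (assumed count for the demo).
my @names   = map { sprintf 'current-ya%d.txt', $_ } 1 .. 12;
my @lexical = sort @names;                 # plain string order
my @numeric = by_page_number(@lexical);    # page-number order

print "string order:  @lexical[0 .. 3]\n";
print "numeric order: @numeric[0 .. 3]\n";
```

In a real run the list of names could come from `glob 'current-ya*.txt'`, with the sorted result used to drive the join in Perl instead of relying on `cat`.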

teketon
Best-answer rate: 65% (141/215)

2013/01/22 21:17 Answer No.1

If all the processing has to happen in order, why not simply do without Parallel(?)?

Asker

Thanks 2013/01/22 22:36

There are 807 pages, and when I ran it previously without fork or anything similar it took about 30 minutes, so I am trying to parallelize it.

