Regular expression search

2007/09/13 04:08

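I want to search for a specific word in Java source code, subject to the following conditions:

(1) Ignore text inside block comments
(2) Ignore text inside line comments
(3) Ignore text inside string literals

01 /*
02 * ここの abc はブロックコメント内なので無視する
03 *
04 */
05 public class Foo() {
06 　private int abc = 0;
07
08 　public Foo() {
09 　　// 行コメント内なのでここの abc を無視
10 　　abc = 1;
11 　　String s = "変数文字列内の abc これも無視";
12 　}
13
14 　public String get() {
15 　　return " 1'23\" abc " + abc; // この場合後ろの abc のみヒット
16 　}
17 }

For example, when searching the text above for abc, I want hits in only three places: line 6, line 10, and the second abc on line 15. The abc on line 2 is inside a block comment, the one on line 9 is inside a line comment, and the one on line 11 and the first one on line 15 are inside string literals, so all of those must be ignored. How should this be written as a regular expression?

A block comment starts with /* and runs until */ appears.
A line comment starts with // and runs to the end of that line.
A string literal is whatever is enclosed in ". A \" inside a string does not end the string.

Thank you in advance.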

HarukaV49
Thank-you rate: 33% (31/93)

Perl
Answers: 11
Thanks: 6

All answers (11)
Best answer chosen by the asker

sakusaker7
Best-answer rate: 62% (800/1280)

2007/09/19 17:52 Answer No. 11

To slurp the contents of a file into a single string, manipulating $/ as in answer #6 is faster and wastes less memory than reading the file one line at a time and concatenating. Also, using $` $& $' in regular expression matching carries a speed penalty.

From the documentation, perlvar.pod:

$MATCH
$&
The string matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval() enclosed by the current BLOCK). (Mnemonic: like & in some editors.) This variable is read-only and dynamically scoped to the current BLOCK.
The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. See "BUGS".
See "@-" for a replacement.

$PREMATCH
$`
The string preceding whatever was matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval enclosed by the current BLOCK). (Mnemonic: "`" often precedes a quoted string.) This variable is read-only.
The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. See "BUGS".
See "@-" for a replacement.

$POSTMATCH
$'
The string following whatever was matched by the last successful pattern match (not counting any matches hidden within a BLOCK or eval() enclosed by the current BLOCK). (Mnemonic: "'" often follows a quoted string.) Example:
local $_ = 'abcdefghi';
/def/;
print "$`:$&:$'\n"; # prints abc:def:ghi
This variable is read-only and dynamically scoped to the current BLOCK.
The use of this variable anywhere in a program imposes a considerable performance penalty on all regular expression matches. See "BUGS".
See "@-" for a replacement.

#!/usr/bin/perl
use strict;
use warnings;

# Slurp the file
my $content = do { local $/ = undef; <DATA> };

# Offsets at which each line starts (line 1 starts at offset 0)
my @linestarts = (0);
push @linestarts, pos($content) while ($content =~ m/\n/g);

# The search word is the first command-line argument, defaulting to 'abc'
my $searchword = $ARGV[0] ? $ARGV[0] : 'abc';

my $comment = qr<(?: /\* .*? \*/ )>xms;
my $line_comment = qr<(?: // .*? $ )>xms;
# Only \" is treated as an escape inside a string
my $string = qr<(?: "[^\\\"]* (?: \\\" [^\"]* )*" )>xms;
my $skip = qr<(?: $comment | $line_comment | $string )>xms;

# 1-based number of the line that contains offset $pos
sub get_lineno {
    my $pos = shift;
    my $elems = scalar @linestarts;
    my $idx;
    for ($idx = 1; $idx < $elems; $idx++) {
        last if $linestarts[$idx] > $pos;
    }
    return $idx;
}

# Offset at which the line containing offset $pos starts
sub get_start_pos {
    my $pos = shift;
    return $linestarts[ get_lineno($pos) - 1 ];
}

# Comments and strings are consumed by $skip; only a match of the
# search word outside them sets $1
while ($content =~ m/(?: $skip | ($searchword) )/xmsg) {
    next unless defined $1;
    my $start = $-[1];    # offset where the search word begins
    printf "%4d:%4d\t%s\n",
        get_lineno($start), $start - get_start_pos($start) + 1, $1;
}
__END__
/*
* ここの abc はブロックコメント内なので無視する
*
*/
public class Foo() {
　private int abc = 0;

　public Foo() {
　　// 行コメント内なのでここの abc を無視
　　abc = 1;
　　String s = "変数文字列内の abc これも無視";
　}

　public String get() {
　　return " 1'23\" abc " + abc; // この場合後ろの abc のみヒット
　}
}
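The perlvar text quoted above points to @- as the replacement for $`, $& and $'. A minimal sketch of that substitution follows; match_parts is a hypothetical helper name introduced only for this illustration, not something from the answer itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (prematch, match, postmatch) for the first
# match of $re in $text, computed from @- and @+ instead of $`, $& and $'.
sub match_parts {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    my ($start, $end) = ($-[0], $+[0]);    # offsets of the whole match
    return (
        substr($text, 0, $start),              # what $` would hold
        substr($text, $start, $end - $start),  # what $& would hold
        substr($text, $end),                   # what $' would hold
    );
}

# Same data as the perlvar example
print join(':', match_parts('abcdefghi', qr/def/)), "\n";    # prints abc:def:ghi
```

The offsets are copied out of @- and @+ immediately after the match, because any later successful match would overwrite them.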

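Asker

Thanks 2007/09/20 17:38

Thank you for your answer. Compilers, and editors too, decide in an instant whether something is a comment, a string, or some other keyword and color it accordingly, but implementing this yourself turns out to be rather tricky. I had wondered why compilers do not recognize nested block comments, and I am starting to understand that recognizing them is difficult. It will take me a little longer to implement and verify this in Java myself, so for now I just wanted to say thank you.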

Other answers (10)

kumoz
Best-answer rate: 64% (120/185)

2007/09/19 09:27 Answer No. 10

The regular expressions in the supplement to #9 are beyond me, so this may not be what the asker wants. Here is a very short, simple piece of code.

use strict;

my $code = join '', <DATA>;
my @abc_idx;
# Record the offset of every abc in the text
push @abc_idx, length $` while $code =~ /abc/g;
foreach my $i (@abc_idx) {
    # Replace every other abc with xyz so that only the one under test remains
    my $pre = substr $code, 0, $i;
    $pre =~ s/abc/xyz/g;
    my $aft = substr $code, $i + 3;
    $aft =~ s/abc/xyz/g;
    my $line_no = $pre =~ tr/\n// + 1;
    $pre =~ /(.*)$/;
    my $line_pos = length($1) + 1;
    # X: inside a comment or string, O: a real hit
    if ("${pre}abc$aft" =~ m#/\*.*?abc.*?\*/|//[^\n]*?abc|[^\\]"([^\\"]|\\")*?abc([^\\"]|\\")*?"#s) {
        print "X: line $line_no, pos $line_pos\n";
    } else {
        print "O: line $line_no, pos $line_pos\n";
    }
}
__DATA__
/*
* ここの abc はブロックコメント内なので無視する
*
*/
public class Foo() {
　private int abc = 0;

　public Foo() {
　　// 行コメント内なのでここの abc を無視
　　abc = 1;
　　String s = "変数文字列内の abc これも無視";
　}

　public String get() {
　　return " 1'23\" abc " + abc; // この場合後ろの abc のみヒット
　}
}

The output is as follows:

X: line 2, pos 10
O: line 6, pos 15
X: line 9, pos 33
O: line 10, pos 5
X: line 11, pos 32
X: line 15, pos 21
O: line 15, pos 29
X: line 15, pos 52

moon_piyo
Best-answer rate: 60% (88/146)

2007/09/15 17:21 Answer No. 9

Hi there.

#!perl
use strict;

my $target = "abc"; # search word
my $file = "";
my $p1 = 0;
my $row = 0;
my @line = ('');
my @rc = ();
while (<DATA>) {
    $row++;
    # Store the whole file in a single string
    $file .= $_;
    foreach my $col (1..length($_)) {
        # Build a table that maps an offset in the string to [row, column]
        $rc[$p1+$col-1] = [$row, $col];
    }
    $p1 += length($_);
    chomp;
    # Store each line in an array
    push(@line, $_);
}
my $p2 = 0; # where this round of searching starts
while ($file =~ m%/\*(?:.*?)\*/|//.*?$|"(?:[^"\\]|\\.)*"|\z%smg) {
    # Find the next part to exclude (/*..*/ or //...(end of line) or "..." or end of input)
    # Look at the text from the search start position up to just before the match (the excluded part)
    # When the search word is found, convert it to a row and column in the original file and print it
    my $str = substr($`, $p2);
    while ($str =~ /$target/og) {
        my ($row, $col) = @{$rc[$p2 + length($`)]};
        print "line $row, char $col: $line[$row]\n";
    }
    $p2 = pos($file);
}
__DATA__
01 /*
02 * ここの abc はブロックコメント内なので無視する
03 *
04 */
05 public class Foo() {
06 　private int abc = 0;
07
08 　public Foo() {
09 　　// 行コメント内なのでここの abc を無視
10 　　abc = 1;
11 　　String s = "変数文字列内の abc これも無視";
12 　}
13
14 　public String get() {
15 　　return " 1'23\" abc " + abc; // この場合後ろの abc のみヒット
16 　}
17 }

Asker

Supplement 2007/09/16 00:27

Thank you for your help. From the site 正規表現パズル (Regex Puzzle), http://oraclesqlpuzzle.hp.infoseek.co.jp/regex/, I obtained the following two methods.

<1> Search for abc that is not enclosed in parentheses

abc(((?=([^(]*$){4}[^(]*$)(?=([^)]*$){4}[^)]*$))|
((?=([^(]*$){3}[^(]*$)(?=([^)]*$){3}[^)]*$))|
((?=([^(]*$){2}[^(]*$)(?=([^)]*$){2}[^)]*$))|
((?=([^(]*$){1}[^(]*$)(?=([^)]*$){1}[^)]*$))|
(?=[^()]*$))

<2> Search for abc outside string data

abc(?=(((((?<=\\)|(?!(\\{2})*")).)*(?<!\\)(?=(\\{2})*").){2})*
(((?<=\\)|(?!(\\{2})*")).)*$)

Using method <1>, I tried to rewrite it so that it searches for abc in the parts not enclosed by /* ... */, but I did not succeed. Is it possible to apply the methods above to build a search that combines all three conditions: (1) ignore block comments, (2) ignore line comments, and (3) ignore string literals?

sakusaker7
Best-answer rate: 62% (800/1280)

2007/09/15 14:58 Answer No. 8

I haven't stress-tested this with nasty data, so there are probably gaps, but how about something like this? (For one thing, I expect it to misbehave when /* or */ appears inside a string.)

#!/usr/bin/perl
use strict;
use warnings;

# The search word is the first command-line argument, defaulting to 'abc'
my $searchword = $ARGV[0] ? $ARGV[0] : 'abc';
my $comment_start = qr</\*>x;
my $comment_end = qr<\*/>x;
my $line_comment = qr<// .* $>x;
my $string = qr<"[^\\"]* (?: \\" [^"]* )*">x;
my $incomment = 0;

while (my $line = <DATA>) {
    my $start_pos = 0;
    chomp $line;
    # Remove a line comment
    $line =~ s/$line_comment//;
    # Determine whether we are inside a multi-line comment
    if ($line =~ m/$comment_start/) {
        $start_pos = $-[0];
        $incomment = 1;
    }
    if ($line =~ m/$comment_end/) {
        $incomment = 0;
        # In case real code follows the multi-line comment,
        # replace only the comment part with spaces
        my $replace_length = $+[0];
        substr $line, 0, $+[0], " " x $replace_length;
    }
    if ($incomment==1) {
        next if $start_pos == 0;
        my $replace_length = length($line) - $start_pos + 1;
        substr $line, $start_pos, $replace_length, (" " x $replace_length);
    }
    while (my $word = ($line =~ m{\G [^"]* (?:$string)? [^"]*? ($searchword)}gx)) {
        print "'$searchword' is at line ", $., ", column ", $-[1]+1, " : ", $line, "\n";
    }
}
__END__
/*
* ここの abc はブロックコメント内なので無視する
*
*/
public class Foo() {
　private int abc = 0;
　abc = 3;
/* */abc=1;

　public Foo() {
　　// 行コメント内なのでここの abc を無視
　　abc = 1;
　　String s = "変数文字列内の abc これも無視";
　}

　public String get() {
　　return " 1'23\" abc " + abc + abc; // この場合後ろの abc のみヒット
　}
}

Output:

'abc' is at line 6, column 16 : private int abc = 0;
'abc' is at line 7, column 4 : abc = 3;
'abc' is at line 8, column 6 : abc=1;
'abc' is at line 12, column 8 : abc = 1;
'abc' is at line 17, column 32 : return " 1'23\" abc " + abc + abc;
'abc' is at line 17, column 38 : return " 1'23\" abc " + abc + abc;

The line number comes from the special variable $. and the column from the special array variable @-. For a detailed explanation of these variables, see the manual with perldoc perlvar. Note that a character such as 'あ' is not counted as one character: the input is read as raw bytes, so it counts as 2 or 3 depending on the encoding in use.
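To illustrate the closing remark about 'あ', here is a rough sketch of how the column could be counted in characters instead of bytes. It assumes, purely for illustration, that the input is UTF-8 and decodes it with the core Encode module; char_column is a hypothetical helper name that does not appear in the answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: 1-based column of $word in a raw input line, counted
# in characters. Assumption: the raw bytes are UTF-8 encoded.
sub char_column {
    my ($raw_line, $word) = @_;
    my $line = decode('UTF-8', $raw_line);    # bytes -> characters
    return unless $line =~ /\Q$word\E/;
    return $-[0] + 1;
}

# "\xE3\x81\x82" is the UTF-8 byte sequence for the single character 'あ'
my $raw = "\xE3\x81\x82 abc";

# Matching against the undecoded bytes counts 'あ' as 3
$raw =~ /abc/ and print "byte column: ", $-[0] + 1, "\n";    # prints 5

# Matching against the decoded string counts 'あ' as 1
print "char column: ", char_column($raw, 'abc'), "\n";       # prints 3
```

With Shift_JIS input the same character would occupy 2 bytes, so the encoding name passed to decode would have to match the actual file.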

sakusaker7
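Best-answer rate: 62% (800/1280)

2007/09/13 19:28 Answer No. 7

Oh, I don't have enough tricks up my sleeve to be holding anything back. Keeping the original data stored separately is the first thing that comes to mind, but I'm still racking my brain over whether there is a better way. I had overlooked the case of text following */.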

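Asker

Supplement 2007/09/14 22:43

My apologies. The answer notification e-mails never arrived, so I thought no answers had been posted yet. I am also sorry that the question was vague. What I want as a result is the position of the search word. For the sample, that is:

　line 6, character 17
　line 10, character 6
　line 15, character 30

I am currently programming in Java and plan to read the data one line at a time and process it (because I thought there were no answers at all). So it would be enough to be able to

　detect, within one line of data, at which character position the given
　word occurs outside comments and strings
　(the position counts the characters inside comments and strings as well).

Thank you in advance.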
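One possible way to approach the per-line requirement described above is a small scanner that carries an "inside a block comment" flag from one line to the next. The following is only a sketch under assumptions: find_in_line is a hypothetical helper name, Java char literals such as '"' are not handled, the search word is matched as a plain substring, and it is written in Perl like the rest of the thread rather than in Java.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: scans one line and returns a reference to the list of
# 1-based columns where $word occurs outside comments and string literals,
# plus the updated "inside a block comment" flag for the next line.
sub find_in_line {
    my ($line, $word, $in_block) = @_;
    my @cols;
    return (\@cols, $in_block) unless length $word;
    pos($line) = 0;
    while (pos($line) < length $line) {
        if ($in_block) {
            # Look for the end of the block comment on this line
            last unless $line =~ m{\G.*?\*/}gc;    # continues on the next line
            $in_block = 0;
        }
        elsif ($line =~ m{\G//}gc)  { last }             # line comment: ignore the rest
        elsif ($line =~ m{\G/\*}gc) { $in_block = 1 }    # block comment starts
        elsif ($line =~ m{\G"(?:[^"\\]|\\.)*"?}gc) { }   # skip a string literal
        elsif ($line =~ m{\G\Q$word\E}gc) { push @cols, $-[0] + 1 }
        else { $line =~ m{\G.}gcs }                      # any other character
    }
    return (\@cols, $in_block);
}

# Usage: feed the lines in order, passing the flag along
my @java = (
    '/* abc in a block comment',
    ' */ int abc = 0; // abc',
    q{return " 1'23\" abc " + abc;},
);
my $in_block = 0;
for my $n (0 .. $#java) {
    (my $cols, $in_block) = find_in_line($java[$n], 'abc', $in_block);
    print "line ", $n + 1, ", column $_\n" for @$cols;
}
# prints:
# line 2, column 9
# line 3, column 25
```

Because nothing is deleted from the line, the reported column still counts the characters inside comments and strings, which is what the supplement asks for.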

dontoittem
Best-answer rate: 0% (0/1)

2007/09/13 19:07 Answer No. 6

Well, #5 is probably more efficient than #1 to #4, though it bothers me that the line number disappears when something is written after */. The asker may also want to use the comment and string information when searching, so the best approach would be one that can search while keeping the original information. Perhaps #5 should just post the answer without holding back.

#!/usr/bin/perl
# Read everything in
undef $/;
$_ = <DATA>;
# Remove comments while preserving their newlines
s%(//.*?$|/\*.*?\*/)%"\n" x $1=~ tr/\n//%egms;
# Remove strings (a backslash escapes the character that follows it)
s/"[^"\\]*(?:\\.[^"\\]*)*"//g;
my $pat = qr/\babc\b/;
my @lines = split "\n";
# Keyword search
for ($i = 0; $i < @lines; ++$i) {
    my @ids = $lines[$i] =~ m/$pat/g;
    print $i + 1 , ": @ids\n" if (@ids);
}
__END__
/*
* ここの abc はブロックコメント内なので無視する
* // 行番号情報が消える
*/ abc = cde;
public class Foo() {
　private int abc = 0;

　public Foo() {
　　// 行コメント内なのでここの abc を無視
　　abc = 1;
　　String s = "変数文字列内の abc これも無視";
　}

　public String get() {
　　return " 123\" abc " + abc + abc; // この場合後ろの abc のみヒット
　}
}

Asker

Supplement 2007/09/14 22:46

sakusaker7
Best-answer rate: 62% (800/1280)

2007/09/13 16:57 Answer No. 5

You say you want to search for a given word (pattern?), but is the output format shown in #1 to #4 acceptable for the search results? I was a little curious, so I'm asking.

#!/usr/bin/perl
use strict;
use warnings;

# The search word is the first command-line argument, defaulting to 'abc'
my $searchword = $ARGV[0] ? $ARGV[0] : 'abc';
my $i;
# Prefix every line with its line number
my $contents = join '', map {++$i . ":$_"} <DATA>;
# Replace a line comment with a newline and a block comment with a space
$contents =~ s{(// .*? $)|/\* .*? \*/}{defined $1 ? "\n" : ' '}xmseg;
# Remove strings
$contents =~ s/"[^\\"]* (?: \\" [^"]* )*"//gx;

foreach my $line (split "\n", $contents) {
    my ($ln) = $line =~ m/^(\d+)/;
    my @ids = $line =~ m/\b$searchword\b/g;
    print "$ln: @ids\n" if (@ids);
}
__END__
/*
* ここの abc はブロックコメント内なので無視する
*
*/
public class Foo() {
　private int abc = 0;

　public Foo() {
　　// 行コメント内なのでここの abc を無視
　　abc = 1;
　　String s = "変数文字列内の abc これも無視";
　}

　public String get() {
　　return " 1'23\" abc " + abc + abc; // この場合後ろの abc のみヒット
　}
}

I only meant to fix the string-removal part, but I ended up changing the whole thing...

mikaemi
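Best-answer rate: 50% (33/65)

2007/09/13 12:32 Answer No. 4

Ah, right, you were looking for abc. If you change the pattern to

my $pat = qr/\babc\b/; # search for the identifier abc

it does find them, more or less ^^

===
$ ./cprog.pl
6: abc
10: abc
15: abc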

mikaemi
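Best-answer rate: 50% (33/65)

2007/09/13 12:22 Answer No. 3

Since I'm using \G anyway, it would have been more efficient to write it like this:

# Strip the comments
while (m%(//|/\*)%g) {
    my $p = pos;
    s%//\G.*\n%\n% if $1 eq "//";
    s%/\*\G.*?\*/% %s if $1 eq "/*";
    pos = $p - 1; # step back 2 characters, since the removed comment is replaced by 1 character
}

If I was going to reset the position to 0, there was no need to use /g ^^;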

Asker

Supplement 2007/09/14 22:44

mikaemi
Best-answer rate: 50% (33/65)

2007/09/13 12:08 Answer No. 2

Oh, sorry. The output I showed a moment ago assumes the script is saved in a file named cprog.pl ^^

=== cprog.pl
#!/usr/bin/perl
# Write the line numbers in (not needed if line-number information is not required)
# and read everything into $file
my $file;
$file .= $. . ":" . $_ while <DATA>;
$_ = $file;
# Strip the comments
while (m%(//|/\*)%g) {
    s%//\G.*\n%\n% if $1 eq "//";
    s%/\*\G.*?\*/% %s if $1 eq "/*";
    pos($_) = 0; # is this even necessary?
}
# Strip the strings
s/"(?:\\"|[^"])*"//g;
my $pat = qr/(p[\w\d]+)/; # search for identifiers starting with p
my @lines = split "\n", $_;
# Keyword search
for (@lines) {
    my ($ln) = m/^(\d+)/;
    my @ids = m/$pat/g;
    print "$ln: @ids\n" if (@ids);
}
__END__
/*
* ここの abc はブロックコメント内なので無視する
*
*/
public class Foo() {
　private int abc = 0;

　public Foo() {
　　// 行コメント内なのでここの abc を無視
　　abc = 1;
　　String s = "変数文字列内の abc これも無視";
　}

　public String get() {
　　return " 1'23\" abc " + abc; // この場合後ろの abc のみヒット
　}
}
====
$ ./cprog.pl
5: public
6: private
8: public
14: public

mikaemi
Best-answer rate: 50% (33/65)

2007/09/13 12:01 Answer No. 1

Would something like this do?

====
#!/usr/bin/perl
# Write the line numbers in (not needed if line-number information is not required)
my $file;
$file .= $. . ":" . $_ while <DATA>;
$_ = $file;
# Strip the comments
while (m%(//|/\*)%g) {
    s%//\G.*\n%\n% if $1 eq "//";
    s%/\*\G.*?\*/% %s if $1 eq "/*";
    pos($_) = 0; # is this even necessary?
}
# Strip the strings
s/"(?:\\"|[^"])*"//g;
my $pat = qr/(p(?:\w|\d)+)/; # search for identifiers starting with p
my @lines = split "\n", $_;
# Keyword search
for (@lines) {
    my ($ln) = m/^(\d+)/;
    my @ids = m/$pat/g;
    print "$ln: @ids\n" if (@ids);
}
__END__
/*
* ここの abc はブロックコメント内なので無視する
*
*/
public class Foo() {
　private int abc = 0;

　public Foo() {
　　// 行コメント内なのでここの abc を無視
　　abc = 1;
　　String s = "変数文字列内の abc これも無視";
　}

　public String get() {
　　return " 1'23\" abc " + abc; // この場合後ろの abc のみヒット
　}
}
===
$ ./cprog.pl
5: public
6: private
8: public
14: public
