Tackling 『言語処理 100 本ノック』 (NLP 100 Exercise) in PHP (Problems 70-72)

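I am working through the NLP 100 Exercise in PHP. After I finished Chapter 5 last September, a long gap opened up. I am skipping Chapters 6 and 7 and starting on Chapter 8 with this post. This article goes up to Problem 72, which corresponds to the preprocessing stage for machine learning.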
www.cl.ecei.tohoku.ac.jp

70. Obtaining and Formatting the Data

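Using the gold-standard data for sentence polarity analysis, create the labeled data (sentiment.txt) as follows.

1. Prepend the string "+1 " to each line of rt-polarity.pos (the polarity label "+1" and a space, followed by the text of a positive sentence).
2. Prepend the string "-1 " to each line of rt-polarity.neg (the polarity label "-1" and a space, followed by the text of a negative sentence).
3. Concatenate the results of 1 and 2 above and shuffle the lines randomly.

After creating sentiment.txt, check the number of positive examples (positive sentences) and negative examples (negative sentences).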

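First, download and extract the data used in this chapter. Extraction produces an rt-polaritydata directory containing two data files, rt-polarity.neg and rt-polarity.pos*1. In each file, every line is one review. As the run below shows, the text appears to have been lightly preprocessed already: everything is lowercased, and punctuation such as commas and periods is separated by spaces.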

$ wget http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz
$ tar xf rt-polaritydata.tar.gz
$ head -n 3 rt-polaritydata/rt-polarity.*
==> rt-polaritydata/rt-polarity.neg <==
simplistic , silly and tedious .
it's so laddish and juvenile , only teenage boys could possibly find it funny .
exploitative and largely devoid of the depth or sophistication that would make watching such a graphic treatment of the crimes bearable .

==> rt-polaritydata/rt-polarity.pos <==
the rock is destined to be the 21st century's new " conan " and that he's going to make a splash even greater than arnold schwarzenegger , jean-claud van damme or steven segal .
the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson's expanded vision of j . r . r . tolkien's middle-earth .
effective but too-tepid biopic

Next, write a program that processes the files as the problem statement describes. I wrote it as follows.

<?php

main();

function main()
{
    $pos = read_polarity_data('rt-polaritydata/rt-polarity.pos', '+1');
    $neg = read_polarity_data('rt-polaritydata/rt-polarity.neg', '-1');

    $data = array_merge($pos, $neg);
    shuffle($data);

    foreach ($data as $line) {
        echo $line, "\n";
    }
}

function read_polarity_data($filename, $label)
{
    $lines = file($filename, FILE_IGNORE_NEW_LINES);

    $data = [];
    foreach ($lines as $line) {
        $data[] = "$label $line";
    }

    return $data;
}

Run it as follows and check the output.

$ php main.php >sentiment.txt
$ head -n 5 sentiment.txt
+1 beautifully observed , miraculously unsentimental comedy-drama .
+1  . . . a series of tales told with the intricate preciseness of the best short story writing .
-1 a broad , melodramatic estrogen opera that's pretty toxic in its own right .
+1 features one of the most affecting depictions of a love affair ever committed to film .
+1 michael moore has perfected the art of highly entertaining , self-aggrandizing , politically motivated documentary-making , and he's got as potent a topic as ever here .
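
Because shuffle() is random, the first lines of sentiment.txt differ from run to run. If a reproducible order is wanted, one option is to seed the generator before shuffling. The following is a minimal sketch and not part of the original program: seeded_shuffle is a hypothetical helper, and it assumes PHP 7.1 or later, where shuffle() draws from the generator seeded by mt_srand().

```php
<?php

// Hypothetical helper (not in the original program): return a shuffled copy
// of $items whose order depends only on $seed.
// Assumption: PHP 7.1+, where shuffle() uses the generator seeded by mt_srand().
function seeded_shuffle(array $items, $seed)
{
    mt_srand($seed);   // fix the state of the random number generator
    shuffle($items);   // shuffles the local copy in place
    return $items;
}
```

In main(), shuffle($data) could then be replaced by $data = seeded_shuffle($data, 42), where 42 is an arbitrary seed chosen for illustration.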

The numbers of positive and negative examples can be checked as follows.

$ cut -f1 -d' ' sentiment.txt | sort | uniq -c
   5331 +1
   5331 -1
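
The same check can also be done in PHP. The following is a minimal sketch: count_labels is a hypothetical helper that is not part of the original program, and it assumes every line has the form "<label> <sentence>" as in sentiment.txt.

```php
<?php

// Hypothetical helper: count how often each polarity label occurs.
// Assumption: each line starts with the label, followed by a single space.
function count_labels(array $lines)
{
    $labels = [];
    foreach ($lines as $line) {
        // Split off the first space-separated field, which is the label.
        $labels[] = explode(' ', $line, 2)[0];
    }

    // Note: PHP stores the key "-1" as the integer -1, while "+1" stays a string.
    return array_count_values($labels);
}

// Usage (assumes sentiment.txt exists in the current directory):
// print_r(count_labels(file('sentiment.txt', FILE_IGNORE_NEW_LINES)));
```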

71. Stop Words

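Create a suitable list of English stop words (a stop list). Then implement a function that returns true if the word (string) given as its argument is in the stop list and false otherwise. Also write tests for that function.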

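Stop words are words that carry no useful information for natural language processing analysis. In English, common words that appear in almost any text, such as "a" and "the", qualify. Removing stop words from the input before processing can improve both analysis accuracy and processing speed.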

This time I used the "Default English stopwords list" published at http://www.ranks.nl/stopwords*2. I copied the contents of the web page into a text file with one word per line.

$ head stopwords.txt
#
# Default English stopwords list
#
# http://www.ranks.nl/stopwords
#

a
about
above
after

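I wrote a function that returns whether a string is in the stop word list, as shown below. The implementation is rough (it uses a global variable, for example), but I will not worry about that here. The read_stopwords function skips comment lines and blank lines, but otherwise it simply reads the contents of the text file into an array.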

<?php

$stopwords = read_stopwords('stopwords.txt');

function read_stopwords($filename)
{
    $stopwords = [];

    $file = fopen($filename, 'rb');
    while (($line = fgets($file)) !== false) {
        $line = trim($line);
        if (strlen($line) !== 0 && $line[0] !== '#') {
            $stopwords[] = $line;
        }
    }
    fclose($file);

    return $stopwords;
}

function is_stopword($word)
{
    global $stopwords;
    return in_array($word, $stopwords);
}

I implemented tests for this function using assert().

<?php

require_once __DIR__ . '/is_stopword.php';

assert(is_stopword('am') === true);
assert(is_stopword('main') === false);

assert(is_stopword('') === false);
assert(is_stopword('#') === false);
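
As for the global variable in is_stopword, one possible way to avoid it is to capture the list in a closure. The following is only a sketch under that assumption: make_stopword_checker is a hypothetical name, the word list is a toy sample, and the lookup uses array keys instead of in_array, which is a different technique from the original implementation.

```php
<?php

// Hypothetical alternative to is_stopword(): build a checker from a word list.
function make_stopword_checker(array $words)
{
    // Flip the list so that each word becomes an array key;
    // isset() on a key is a hash lookup rather than a linear scan.
    $set = array_flip($words);

    return function ($word) use ($set) {
        return isset($set[$word]);
    };
}

// Toy list for illustration; the article reads the real list from stopwords.txt.
$is_stopword = make_stopword_checker(['a', 'about', 'above', 'after', 'am']);
```

The returned closure can be called like the original function, for example $is_stopword('am').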

72. Feature Extraction

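Design features that seem useful for polarity analysis yourself, and extract them from the training data. As a minimum baseline, the features would be the words of each review after removing stop words and stemming each word.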

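Features (素性) are the characteristic quantities used in natural language processing analysis. Appropriate features have to be chosen to suit the analysis, but here I follow the problem statement and produce the features by removing stop words and stemming.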

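Stemming is the process of obtaining a word's stem by removing the part that changes with inflection. For English, Porter's algorithm is well known*3. The web page below publishes implementations in a variety of programming languages. It includes a PHP implementation, so I decided to use that one.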
Porter Stemming Algorithm

Get the PHP code from the website. The file is named php.txt, so rename it to PorterStemmer.php.

$ wget https://tartarus.org/martin/PorterStemmer/php.txt
$ mv php.txt PorterStemmer.php

As the comments in the file also describe, it can be used as follows. As the "beautiful" and "beauty" examples show, the output string is not always an English word.

<?php

require_once 'PorterStemmer.php';

echo PorterStemmer::stem('running'), "\n";    // run
echo PorterStemmer::stem('having'), "\n";     // have
echo PorterStemmer::stem('beautiful'), "\n";  // beauti
echo PorterStemmer::stem('beauty'), "\n";     // beauti

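Using is_stopword.php from Problem 71 and the downloaded PorterStemmer.php, I implemented feature extraction from an input sentence as follows. The input string is split on spaces into an array, and then stop word removal and stemming are applied. Finally, the result is converted into an associative array with words as keys*4 and occurrence counts as values. This format is called bag-of-words and is a representation commonly used in natural language processing.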

<?php

require_once __DIR__ . '/PorterStemmer.php';
require_once __DIR__ . '/is_stopword.php';

function extract_feature($sentence)
{
    $words = explode(' ', $sentence);
    $words = array_filter($words, 'is_feature_word');
    $words = array_map('stem', $words);
    $feature = array_count_values($words);
    ksort($feature, SORT_STRING);

    return $feature;
}

function is_feature_word($word)
{
    return strlen($word) != 0 && !is_stopword($word);
}

function stem($word)
{
    return PorterStemmer::Stem($word);
}
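
To see the bag-of-words shape in isolation, the following minimal sketch counts words without any stemming and with a toy stop list passed in as an argument. Both simplifications are assumptions made for illustration, and toy_bag_of_words is a hypothetical name that is not part of the original program.

```php
<?php

// Hypothetical, simplified variant of extract_feature(): no stemming,
// and the stop list is passed in instead of being read from stopwords.txt.
function toy_bag_of_words($sentence, array $stoplist)
{
    $words = explode(' ', $sentence);

    // Drop empty tokens (produced by consecutive spaces) and stop words.
    $words = array_filter($words, function ($word) use ($stoplist) {
        return $word !== '' && !in_array($word, $stoplist, true);
    });

    // Map each remaining word to its number of occurrences.
    $bag = array_count_values($words);
    ksort($bag, SORT_STRING);

    return $bag;
}

// Prints an array with 'cat' => 2 and 'dog' => 1.
print_r(toy_bag_of_words('the cat and the  dog and cat', ['the', 'and']));
```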

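Using the extract_feature function above, extract features from the sentiment.txt created in Problem 70. In each line of sentiment.txt, the first column is the label and the second and later columns are the review. Taking these as $label and $sentence, the program extracts features from $sentence and outputs them together with $label.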

<?php

require_once __DIR__ . '/extract_feature.php';

main();

function main()
{
    $data = read_sentiment_data('sentiment.txt');

    foreach ($data as list($label, $sentence)) {
        $feature = extract_feature($sentence);
        echo encode($label, $feature), "\n";
    }
}

function read_sentiment_data($filename)
{
    $lines = file($filename, FILE_IGNORE_NEW_LINES);

    $data = [];
    foreach ($lines as $line) {
        $data[] = explode(' ', $line, 2);
    }

    return $data;
}

function encode($label, $feature)
{
    $encoded = $label;
    foreach ($feature as $word => $count) {
        $encoded .= " $word $count";
    }

    return $encoded;
}

A sample run of the program is shown below.

$ php main.php >feature.txt
$ head -n 5 feature.txt
+1 , 1 . 1 beautifulli 1 comedy-drama 1 miracul 1 observ 1 unsentiment 1
+1 . 4 best 1 intric 1 precis 1 seri 1 short 1 stori 1 tale 1 told 1 write 1
-1 , 1 . 1 broad 1 estrogen 1 melodramat 1 opera 1 pretti 1 right 1 toxic 1
+1 . 1 affair 1 affect 1 commit 1 depict 1 ever 1 featur 1 film 1 love 1 on 1
+1 , 3 . 1 art 1 documentary-mak 1 entertain 1 ever 1 got 1 highli 1 michael 1 moor 1 motiv 1 perfect 1 polit 1 potent 1 self-aggrand 1 topic 1
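
Each output line is a label followed by alternating word and count fields, so it can be read back into a label and an associative array. The following is a minimal sketch of such a reader. The decode function is hypothetical and not part of the original program, and it assumes that fields are separated by single spaces, as encode produces them.

```php
<?php

// Hypothetical inverse of encode(): parse "label word count word count ...".
// Assumption: fields are separated by single spaces and words contain no spaces.
function decode($line)
{
    $tokens = explode(' ', $line);
    $label = array_shift($tokens);

    $feature = [];
    // Walk the remaining tokens two at a time: a word, then its count.
    for ($i = 0; $i + 1 < count($tokens); $i += 2) {
        $feature[$tokens[$i]] = (int) $tokens[$i + 1];
    }

    return [$label, $feature];
}

list($label, $feature) = decode('+1 , 1 . 4 best 1');
// $label is '+1'; $feature maps ',' to 1, '.' to 4 and 'best' to 1.
```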

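*1: In addition to the data files, a README file is also included.

*2: It was the first one I found by searching Google for the keyword "stopwords".

*3: Chapter 6 of the NLP 100 Exercise includes a problem on stemming with Porter's algorithm.

*4: Since the words have been stemmed, strictly speaking the stems are the keys.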

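y_uti's blog (y_uti のブログ)

A technical blog by an engineer interested in statistics, machine learning, natural language processing, and related topics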
