Implementing IBM Model 1 - y_uti's Blog

As part of learning about statistical machine translation, I have been reading Professor Koehn's slides, which are available on this page:
http://www.statmt.org/book/

The Word-Based Models slides explain IBM Model 1 clearly, so I tried implementing the pseudocode from slide 29 myself.
I tried to write the code so that it is easy to match against the pseudocode on the slide.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import codecs
import sys

from collections import defaultdict
from itertools import islice
from itertools import product

def main():
    sentence_pairs = load_sentence_pairs(sys.argv[1])
    translation_prob = init_translation_prob(sentence_pairs)
    for i in range(0, 10):
        # initialize
        count = defaultdict(float)
        total = defaultdict(float)
        for e_sentence, f_sentence in sentence_pairs:
            # compute normalization
            s_total = defaultdict(float)
            for e, f in product(e_sentence, f_sentence):
                s_total[e] += translation_prob[e, f]
            # collect counts
            for e, f in product(e_sentence, f_sentence):
                value = translation_prob[e, f] / s_total[e]
                count[e, f] += value
                total[f] += value
        # estimate probabilities
        for e, f in translation_prob.keys():
            translation_prob[e, f] = count[e, f] / total[f]
    dump_translation_prob(translation_prob)

def load_sentence_pairs(filename):
    sentence_pairs = []
    fh = codecs.open(filename, 'r')
    for line in fh:
        e, f = line.strip().split('\t')
        sentence_pairs.append((e.split(), f.split()))
    fh.close()
    return sentence_pairs

def init_translation_prob(sentence_pairs):
    translation_prob = {}
    uniform_prob = calc_uniform_prob(sentence_pairs)
    for e_sentence, f_sentence in sentence_pairs:
        for e, f in product(e_sentence, f_sentence):
            translation_prob[e, f] = uniform_prob
    return translation_prob

def calc_uniform_prob(sentence_pairs):
    e_vocabulary = set()
    for e_sentence, _ in sentence_pairs:
        e_vocabulary |= set(e_sentence)
    return 1.0 / len(e_vocabulary)

def dump_translation_prob(translation_prob, n = 10):
    for (e, f), prob in islice(sorted(
            translation_prob.items(), key=lambda x: x[1], reverse=True), n):
        print e, f, prob

if __name__ == '__main__':
    main()
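To make the correspondence between the loop body and the EM steps easier to see, here is a minimal sketch of a single iteration pulled out into a function. It assumes Python 3 (the script above uses the Python 2 print statement), and the names em_step, pairs, t0 and t1 are mine, not from the slides. The starting value 0.25 is 1 divided by the four English words of the toy corpus used later in this post.

```python
from collections import defaultdict
from itertools import product


def em_step(sentence_pairs, t):
    """Run one EM iteration and return the updated table t[e, f] = t(e|f)."""
    count = defaultdict(float)
    total = defaultdict(float)
    for e_sentence, f_sentence in sentence_pairs:
        # Normalization: how much probability mass each e receives in this pair.
        s_total = defaultdict(float)
        for e, f in product(e_sentence, f_sentence):
            s_total[e] += t[e, f]
        # Fractional counts: share each e among the f words of the pair.
        for e, f in product(e_sentence, f_sentence):
            value = t[e, f] / s_total[e]
            count[e, f] += value
            total[f] += value
    # Re-estimate t(e|f) from the collected counts.
    return {(e, f): count[e, f] / total[f] for (e, f) in t}


# Toy corpus (same three sentence pairs as the sample file in this post).
pairs = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]
# Uniform start over every co-occurring (e, f) pair.
t0 = {(e, f): 0.25 for es, fs in pairs for e, f in product(es, fs)}
t1 = em_step(pairs, t0)
print(t1["the", "das"], t1["house", "das"], t1["book", "das"])
```

After this first iteration, "the" already takes half of the mass for "das" (0.5, against 0.25 each for "house" and "book"), because "the" and "das" co-occur in two sentence pairs. Repeating the step sharpens that preference further.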

I create the same corpus as the one on slide 30.

$ cat ./sample.txt
the house       das Haus
the book        das Buch
a book  ein Buch
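Note that load_sentence_pairs splits each line on a tab character, and the cat output above does not show whether the gap is a tab or spaces. As a small sketch (assuming Python 3; write_corpus and SAMPLE are hypothetical names of my own), the file can be written with explicit tabs like this:

```python
# The three sentence pairs of the sample corpus, first column then second column.
SAMPLE = [
    ("the house", "das Haus"),
    ("the book", "das Buch"),
    ("a book", "ein Buch"),
]


def write_corpus(path, pairs):
    """Write one sentence pair per line, with the two sides separated by a tab."""
    with open(path, "w", encoding="utf-8") as fh:
        for left, right in pairs:
            fh.write("%s\t%s\n" % (left, right))
```

Calling write_corpus("sample.txt", SAMPLE) should produce a file that the script can read.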

When I ran it, the model appeared to learn the expected word pairs.

$ ./model1.py sample.txt
the das 0.993305339717
book Buch 0.993305339717
house Haus 0.917235918993
a ein 0.917235918993
book ein 0.0827640810072
the Haus 0.0827640810072
house das 0.00461108875222
a Buch 0.00461108875222
book das 0.00208357153124
the Buch 0.00208357153124
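Since the sample corpus has exactly ten co-occurring word pairs, the ten lines above are the whole table. One quick sanity check is that the values in the third column sum to 1 for each word in the second column, which is what the final division by total[f] should guarantee. Here is a small sketch of that check using the printed values (assuming Python 3; the variable names are my own):

```python
from collections import defaultdict

# Values copied from the output above: (first column, second column, probability).
printed = [
    ("the", "das", 0.993305339717),
    ("book", "Buch", 0.993305339717),
    ("house", "Haus", 0.917235918993),
    ("a", "ein", 0.917235918993),
    ("book", "ein", 0.0827640810072),
    ("the", "Haus", 0.0827640810072),
    ("house", "das", 0.00461108875222),
    ("a", "Buch", 0.00461108875222),
    ("book", "das", 0.00208357153124),
    ("the", "Buch", 0.00208357153124),
]

# Sum the probabilities for each word of the second column.
column_sums = defaultdict(float)
for e, f, prob in printed:
    column_sums[f] += prob

for f in sorted(column_sums):
    print(f, round(column_sums[f], 6))
```

Each sum comes out as 1 up to the rounding of the printed digits.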

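The sample data alone is not very interesting, so I also apply the script to a Japanese-English parallel corpus. I use the Tanaka Corpus, which can be downloaded from this page: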
http://www.edrdg.org/wiki/index.php/Tanaka_Corpus

$ wget http://www.csse.monash.edu.au/~jwb/examples.utf.gz

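Then I decompress it, cut out the parts I need, shrink it to a manageable size, segment the Japanese into words, and paste the two sides back together.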

$ gzip -dc examples.utf.gz >examples.utf
$ grep ^A examples.utf | sed 's/^A: \([^#]*\).*/\1/' >je.txt
$ awk '{ if (NR % 100 == 0) print $0; if (NR / 100 == 1000) exit; }' je.txt >je_tmp
$ cut -f1 je_tmp | mecab -Owakati >j_tmp
$ cut -f2 je_tmp >e_tmp
$ paste j_tmp e_tmp >je_corpus.txt
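For readers who find the grep/sed/awk steps hard to follow, here is a rough Python 3 sketch of what they do. It assumes that each sentence line has the form "A: <Japanese>, tab, <English>#ID=..." (which is what the sed pattern suggests), and the function names extract_pair and subsample are my own. The MeCab segmentation step is not reproduced here.

```python
def extract_pair(line):
    """Return 'Japanese<TAB>English' from an 'A: ' line, or None for other lines."""
    if not line.startswith("A: "):
        return None
    body = line[len("A: "):].rstrip("\n")
    # Drop everything from the first '#' onward, as the sed expression does.
    return body.split("#", 1)[0]


def subsample(lines, step=100, limit=1000):
    """Yield every step-th line and stop once limit lines have been yielded."""
    kept = 0
    for number, line in enumerate(lines, start=1):
        if number % step == 0:
            yield line
            kept += 1
            if kept == limit:
                return
```

With the default arguments, subsample keeps lines 100, 200, and so on up to line 100000, so at most 1000 sentence pairs remain, matching the awk command.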

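Now I run it. In the output, the first column is the Japanese word and the second is the English word. The results look mostly reasonable, but some entries, such as those for 「が」 and 「と」, are clearly wrong. Also, because the English side is only split on whitespace, punctuation stays attached to some words (today., tomorrow., accident.).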

$ ./model1.py je_corpus.txt
が There 0.900053906459
今日 today. 0.856005988828
明日 tomorrow. 0.842112007093
と island 0.802765158999
家 house 0.776382802877
先生 teacher 0.772713368431
カメラ camera 0.771160642094
船 ship 0.760161341032
父 father 0.747681897549
事故 accident. 0.744584534283