I wanted to try something MapReduce-like, so I implemented it in Python. The reference I followed:
Writing An Hadoop MapReduce Program In Python @ Michael G. Noll
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
It doesn't mean much, but I also compared it against a version that uses defaultdict.
Code
Below is the code for mapper.py and reducer.py. It is mostly copy-paste.
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import sys

    # read from standard input
    for line in sys.stdin:
        # strip the trailing newline
        line = line.rstrip('\n')
        # split into fields
        title, author, moji, rubi = line.split('\t')
        # write to standard output
        print '%s\t%s\t%s' % (moji, rubi, 1)
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import sys

    current_word = None
    current_count = 0
    word = None

    # read from standard input
    for line in sys.stdin:
        # strip the trailing newline
        line = line.rstrip('\n')
        # parse the input produced by mapper.py
        moji, rubi, count = line.split('\t')
        word = '%s\t%s' % (moji, rubi)
        # convert the count to an integer
        try:
            count = int(count)
        except ValueError:
            # ignore the line if the count is not a number
            continue
        # this IF-switch only works because Hadoop sorts map output
        # by key (here: word) before it is passed to the reducer
        if current_word == word:
            current_count += count
        else:
            if current_word:
                # write to standard output
                print '%s\t%s' % (current_word, current_count)
            current_count = count
            current_word = word

    # do not forget to output the last word if needed!
    if current_word == word:
        print '%s\t%s' % (current_word, current_count)
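The reducer depends on its input arriving sorted by key, so that equal keys sit on adjacent lines. As a rough illustration, the same map, sort, reduce flow can be sketched in a single process with `itertools.groupby`, which likewise merges only adjacent equal keys. This is a Python 3 sketch, not the Python 2 scripts used for the measurements; the function names `map_line`, `reduce_sorted`, `count_rubi` and the sample rows are made up for illustration.

```python
from itertools import groupby


def map_line(line):
    """Mapper step: turn 'title\\tauthor\\tmoji\\trubi' into ((moji, rubi), 1)."""
    title, author, moji, rubi = line.rstrip('\n').split('\t')
    return (moji, rubi), 1


def reduce_sorted(pairs):
    """Reducer step: sum counts of adjacent equal keys (input must be sorted)."""
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(count for _, count in group)


def count_rubi(lines):
    """Map, sort by key (the part Hadoop or `sort` does), then reduce."""
    mapped = sorted(map_line(line) for line in lines)
    return list(reduce_sorted(mapped))


# Hypothetical sample rows in the title/author/moji/rubi layout.
sample = [
    "t1\ta1\tCafe'\tカフエ\n",
    "t2\ta2\tBonjour Monsieur\tボンジュール・ムッシュウ\n",
    "t3\ta3\tCafe'\tカフエ\n",
]
result = count_rubi(sample)
```

If the `sorted` call is skipped, `reduce_sorted` emits the `Cafe'` key twice with a count of 1 each, which is the failure mode the comment in reducer.py warns about.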
Below is the defaultdict version.
    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import sys
    from collections import defaultdict

    # (moji, rubi) -> number of occurrences
    d = defaultdict(int)
    for line in sys.stdin:
        title, author, moji, rubi = line.rstrip('\n').split('\t')
        d[moji, rubi] += 1

    for (moji, rubi), v in d.iteritems():
        print "%s\t%s\t%d" % (moji, rubi, v)
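For comparison, here is a hedged Python 3 sketch of the same in-memory count using `collections.Counter` in place of `defaultdict(int)`. It is a variation for illustration, not the script that was timed; `count_pairs`, `format_counts`, and the sample rows are hypothetical names and data.

```python
from collections import Counter


def count_pairs(lines):
    """Count (moji, rubi) pairs from tab-separated title/author/moji/rubi lines."""
    counts = Counter()
    for line in lines:
        title, author, moji, rubi = line.rstrip('\n').split('\t')
        counts[moji, rubi] += 1
    return counts


def format_counts(counts):
    """Render the counts as sorted 'moji\\trubi\\tcount' lines."""
    return ['%s\t%s\t%d' % (moji, rubi, n)
            for (moji, rubi), n in sorted(counts.items())]


# Hypothetical sample rows in the title/author/moji/rubi layout.
sample = [
    "t1\ta1\tCafe'\tカフエ\n",
    "t2\ta2\tCafe'\tカツフエ\n",
    "t3\ta3\tCafe'\tカフエ\n",
]
lines_out = format_counts(count_pairs(sample))
```

Because the whole table is held in memory, no sort is needed before counting; sorting here only makes the output order deterministic.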
Results
https://github.com/downloads/satomacoto/Playground/count.zip
    % head count.txt
    obscene house	ナンバー・ナイン	1
    A l'odeur du soleil sur les lavandes douces.	もうむらさきにうれているげな	1
    Abaisse'〕	アベッセ	4
    Autant de pluie autant de tristesse, Paris qui m'oppresse!	くさくさするほどあめがふる	1
    Aux figuiers qui 〔mu^riront〕, au vent qui passera,	みなみのくにではいちじくが	1
    Belle-vue de Tombeau	ベル・ビュウ・ド・トンボウ	2
    Bonjour Monsieur	ボンジュール・ムッシュウ	1
    But this fold flow'ret climbs the hill	この花こそは山にも攀ぢよ	1
    Cafe'	カフエ	1
    Cafe'	カツフエ	1
    % wc count.txt
    260652 834471 6590672 count.txt
Speed comparison
    % time cat ruby_rev.txt | python mapper.py | sort -k1,1 | python reducer.py > count_mapreduce.txt
    cat ruby_rev.txt  0.00s user 0.07s system 0% cpu 59.480 total
    python mapper.py  7.74s user 0.06s system 13% cpu 59.485 total
    sort -k1,1  79.92s user 0.43s system 86% cpu 1:33.18 total
    python reducer.py > count_mapreduce.txt  10.48s user 0.12s system 11% cpu 1:33.17 total

    % time cat ruby_rev.txt | python count_defaultdict.py | sort -k1,1 > count_defaultdict.txt
    cat ruby_rev.txt  0.00s user 0.07s system 1% cpu 6.228 total
    python count_defaultdict.py  6.90s user 0.12s system 98% cpu 7.131 total
    sort -k1,1 > count_defaultdict.txt  6.21s user 0.06s system 46% cpu 13.438 total
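To put the two runs side by side, the elapsed times reported for the slowest stage of each pipeline can be converted to seconds and divided. The snippet below is a small Python 3 sketch of that arithmetic; `to_seconds` is a hypothetical helper, and the two input strings are copied from the `time` output of the two runs.

```python
def to_seconds(elapsed):
    """Convert a `time` elapsed value such as '1:33.18' or '13.438' to seconds."""
    seconds = 0.0
    for part in elapsed.split(':'):
        seconds = seconds * 60 + float(part)
    return seconds


# Elapsed ("total") time of the longest-running stage in each pipeline.
mapreduce_total = to_seconds('1:33.18')   # sort -k1,1 in the mapper/reducer run
defaultdict_total = to_seconds('13.438')  # sort -k1,1 in the defaultdict run
ratio = mapreduce_total / defaultdict_total
print('%.2f s vs %.2f s, ratio %.1f' % (mapreduce_total, defaultdict_total, ratio))
```

On these numbers the mapper/reducer pipeline took roughly 6.9 times as long. Most of the gap shows up in the `sort` stage (79.92s user versus 6.21s), presumably because the first pipeline sorts every mapped line while the second sorts only the already aggregated counts.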
I haven't run this on Hadoop. In fact, I haven't even installed it. Maybe I'll give it a try…