文本处理中的MapReduce技术

李 锐1,2,王 斌1

PDF(1762 KB)
PDF(1762 KB)
中文信息学报 ›› 2012, Vol. 26 ›› Issue (4) : 9-21.
综述

文本处理中的MapReduce技术

  • 李 锐1,2,王 斌1
作者信息 +

MapReduce in Text Processing

  • LI Rui1,2,WANG Bin1
Author information +
History +

摘要

用于文本处理的很多数据集已经达到TB、PB甚至更大规模,传统的单机方法难以对这些数据进行有效处理。近年来出现的MapReduce计算框架能够以简洁的形式和分布式的方案来解决大规模数据的并行处理问题,得到了学术界和工业界的广泛认可和使用。目前,MapReduce已经被用于自然语言处理、机器学习及大规模图处理等领域。该文首先对MapReduce做了简单的介绍,并分析了其特点、优势还有不足;然后对MapReduce近年来在文本处理各个方面的应用进行分类总结和整理;最后对MapReduce的系统和性能方面的研究也做了一些介绍与展望。

Abstract

With the development of the internet, the text processing area is challenged to deal with web scale dataset. It is intractable for traditional approaches computing effectively on peta-scale data volumes. MapReduce emerged to address this issue with distributed and parallel processing methods, which has been widely recognized and studied both in the academic and in industry. In natural language processing, machine learning, large-scale graph processing and statistical machine translation, there have been many successful application of this technique. In this paper we first give a brief introduction to MapReduce, revealing its advantages, limitations, and differences with traditional techniques. Then we present a classification and summary to MapReduce applications in some aspects of text processing. Finally, we introduce the system and performance research of MapReduce and analyze possible applications of MapReduce in the future.
Key wordstext processing; MapReduce; distributed computing; survey; Hadoop

关键词

文本处理 / MapReduce / 分布式计算 / 综述 / Hadoop

Key words

text processing / MapReduce / distributed computing / survey / Hadoop

引用本文

导出引用
李 锐1,2,王 斌1. 文本处理中的MapReduce技术. 中文信息学报. 2012, 26(4): 9-21
LI Rui1,2,WANG Bin1. MapReduce in Text Processing. Journal of Chinese Information Processing. 2012, 26(4): 9-21

参考文献

[1] J. Lin, D. Metzler, T. Elsayed, et al. Of Ivory and Smurfs: Loxodontan MapReduce Experiments for Web Search[C]//Proceedings of TREC, 2010.
[2] Michael, Daniel Abadi, David J. DeWitt, et al. MapReduce and Parallel DBMSs: Friends or Foes[J] Communications of the ACM, 2010, 53(1).
[3] Michael Isard, Mihai Budiu, Yuan Yu, et al. Dryad: distributed data-parallel programs from sequential building blocks[C]//Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems, Lisbon, Portugal, 2007.
[4] J. Dean, S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters[J]. Communications of the ACM, 2008, 51 (1): 107-113.
[5] Jeffrey Dean, Sanjay Ghemawat. MapReduce: a flexible data processing tool[J]. Communications of the ACM, January 2010, 53(1).
[6] Jimmy Lin, Chris Dyer. Data-Intensive Text Processing with MapReduce[M]. 2010.
[7] R. M. C. McCreadie, C. Macdonald, I. Ounis. On single-pass indexing with MapReduce[C]//Proceedings of 32nd SIGIR, 2009: 742-743.
[8] R. McCreadie, C. Mcdonald, I. Ounis. Comparing Distributed Indexing: To MapReduce or Not?[C]//Proceedings of LSDS-IR, 2009, CEUR Workshop Proceedings, 80: 41-48.
[9] Jimmy Lin. Brute force and indexed approaches to pairwise document similarity comparisons with Map-Reduce[C]//Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, Boston, MA, USA: 2009.
[10] Daniel Peng, Frank Dabek. Large-scale Incremental Processing Using Distributed Transactions and Notifications[C]//Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010.
[11] Michal Laclavik, Martin Seleng, Ladislav Hluchy. Towards Large Scale Semantic Annotation Built on MapReduce Architecture[C]//Proceedings of ICCS 2008, M. Bubak et al. (Eds.), 2008. Part III, LNCS 5103: 331-338.
[12] Jurgen Van Gael, Andreas Vlachos, Zoubin Ghahramani, et al. The infinite HMM for unsupervised PoS tagging[C]//Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing(EMNLP), 2009.
[13] Christopher Dyer, Aaron Cordova, Alex Mont, et al. Fast, easy, and cheap: construction of statistical machine translation models with MapReduce[C]//Proceedings of the 3rd Workshop on Statistical Machine Translation, Columbus, Ohio, 2008: 199-207.
[14] Ashish Venugopal, Andreas Zollmann. Grammar based statistical MT on Hadoop: An end-to-end toolkit for large scale PSCFG based MT[J]. The Prague Bulletin of Mathematical Linguistics. 2009, 91:67-78.
[15] Qin Gao, Stephan Vogel. Training Phrase-Based Machine Translation Models on the Cloud: Open Source Machine Translation Toolkit Chaski[J]. The Prague Bulletin of Mathematical Linguistics. 2010,93: 37-46.
[16] Joo Gra a, Kuzman Ganchev, Ben Taskar. PostCAT-Posterior Constrained Alignment Toolkit[J]. The Prague Bulletin of Mathematical Linguistics. 2009,91: 27-36.
[17] Christopher D. Manning, Prabhakar Raghavan, Hinrich Schutze,王斌(译). 信息检索导论[M]. 北京: 人民邮电出版社, 2010.
[18] Tamer Elsayed, Jimmy Lin, Douglas W. Oard. Pairwise document similarity in large collections with MapReduce[C]//Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies. Columbus, Ohio: 2008.
[19] Zhang Yuanfeng, Dong Shoubin, Zhang Ling, et al. Algorithm of Parallelized Elimination of Duplicated Web Pages Based on Map/Reduce[J]. Journal of Guangxi Normal University, 2007, 02.
[20] Qi Zhang, Yue Zhang, Haomin Yu, et al. Efficient partial-duplicate detection based on sequence matching[C]//Proceeding of the 33rd international ACM SIGIR conference on research and development in information retrieval, 2010: 675-682.
[21] C. Chu, S. Kim, Y. A. Lin, et al. Map-reduce for machine learning on multicore[C]//Proceedings of NIPS 19, 2007.
[22] Bryan Catanzaro, Narayanan Sundaram, Kurt Keutzer. Fast support vector machine training and classification on graphics processors[C]//Proceedings of the 25th international conference on machine learning, Helsinki, Finland, 2008: 104-111.
[23] B. Catanzaro, N. Sundaram, K. Keutzer. A map reduce framework for programming graphics processors[C]//Proceedings of Workshop on Software Tools for MultiCore Systems, 2008.
[24] Papadimitriou, S., Sun, J. Disco: Distributed co-clustering with map-reduce[C]//Proceedings of the IEEE International Conference on Data Mining (ICDM), 2008.
[25] M. Kearns. Efficient noise-tolerant learning from statistical queries[J]. Journal of the ACM, 1999: 392-401.
[26] Arthur P. Dempster, Nan M. Laird, Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm[J]. Journal of the Royal Statistical Society, 1977, Series B (Methodological), 39(1): 1-38.
[27] A. Das, M. Datar, A. Garg, et al. Google news personalization: Scalable online collaborative filtering[C]//Proceedings of the 16th International Conference on World Wide Web, New York, 2007: 271-280.
[28] David Newman, Arthur Asuncion, Padhraic Smyth, et al. Distributed Algorithms for Topic Models[J]. Journal of Machine Learning Research, 2009, 10: 1801-1828.
[29] Yi Wang, Hongjie Bai, Matt Stanton, et la. PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications[C]//Procceedings of AAIM, 2009, 5564: 301-314.
[30] Edward Y. Chang, Hongjie Bai, Kaihua Zhu. Parallel algorithms for mining large-scale rich-media data[C]//Proceedings of the 17th ACM international conference on Multimedia, 2009.
[31] Xiance Si, Maosong Sun. Tag-lda for scalable real-time tag recommendation[J]. Journal of Computational Information Systems, 2008.
[32] ChengXiang Zhai. Statistical Language Models for Information Retrieval[M]. Morgan & Claypool Publishers, 2008.
[33] Chao Liu, Hung-chih Yang, Jinliang Fan, et al. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce[C]//Proceedings of the 19th international conference on World Wide Web, 2010: 681-690.
[34] Flavio Chierichetti, Ravi Kumar, Andrew Tomkins. Max-cover in map-reduce[C]//Proceedings of the 19th international conference on World Wide Web, 2010: 231-240.
[35] J. Cohen. Graph twiddling in a MapReduce world[J]. Computing in Science and Engineering, 2009, 11(4): 29-41.
[36] de Kruijf, M, Sankaralingam, K. MapReduce for the Cell Broadband Engine Architecture[J]. IBM Journal of Research and Development, 2009, 53: 1-10.
[37] Torsten Hoefler, Andrew Lumsdaine, Jack Dongarra. Towards Efficient MapReduce Using MPI[J]. Lecture Notes in Computer Science, 2009, 5759: 240-249.
[38] Bingsheng He, Wenbin Fang, Qiong Luo, et al. Mars: a MapReduce framework on graphics processors[C]//Proceedings of the 17th international conference on parallel architectures and compilation techniques, Toronto, Ontario, Canada, 2008.
[39] Jaliya Ekanayake, Shrideep Pallickara, Geoffrey Fox. MapReduce for Data Intensive Scientific Analyses[C]//Proceedings of the 2008 4th IEEE International Conference on eScience, 2008: 277-284.
[40] Michael C. Schatz. CloudBurst[J]. Bioinformatics, 2009: 25(11): 1363-1369.
[41] R. McCreadie, C. Macdonald, I. Ounis, et al.. University of Glasgow at TREC 2009: Experiments with Terrier[C]//Proceedings of the 18th Text Retrieval Conference, Maryland: Gaithersburg, 2009.
[42] Andrew Pavlo, Erik Paulson, Alexander Rasin, et al. A comparison of approaches to large-scale data analysis[C]//Proceedings of the 35th SIGMOD International Conference on Management of Data, Providence, Rhode Island, USA, 2009.
[43] Jiang, BC Ooi, L Shi, et al. The Performance of MapReduce: An Indepth Study[C]//Proceedings of the VLDB Endowment, 2010.
[44] Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, et al. Evaluating MapReduce for Multi-core and Multiprocessor Systems[C]//Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, 2007: 13-24.
[45] Feuerlicht, G. Database Trends and Directions: Current Challenges and Opportunities[C]//Proceedings of DATESO 2010, tědrotín-Plazy, 2010: 163-174.
[46] Chen Y., Ganapathi A., Katz R.. To compress or not to compress-compute vs. io tradeoffs for mapreduce energy efficiency[R]. Electrical Engineering and Computer Sciences University of California at Berkeley, UC Berkeley, 2010.
[47] B. Cooper, A. Silberstein, E. Tam, et al. Benchmarking Cloud Serving Systems with YCSB[C]//Proceedings of the 1st ACM Symposium on Cloud Computing, 2010: 143-154.
[48] Hung-chih Yang, Ali Dasdan, Ruey-Lung Hsiao, et al. Map-reduce-merge: simplified relational data processing on large clusters[C]//Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, Beijing, China, 2007.
[49] Mohamad Al Hajj Hassan, Mostafa Bamha. Semi-join computation on distributed file systems using map-reduce-merge model[C]//Proceedings of the 25th Symposium On Applied Computing. ACM, 2010.
[50] E. Friedman, P. M. Pawlowski, J. Cieslewicz. SQL/MapReduce: A Practical Approach to Self-describing, Polymorphic, and Parallelizable User-defined Functions[C]//PVLDB, 2009, 2(2): 1402-1413.
[51] Andrew W. McNabb, Christopher K, et al. MRPSO: MapReduce particle swarm optimization[C]//Proceedings of the 9th annual conference on genetic and evolutionary computation, London, England, 2007.
[52] Tyson Condie, Neil Conway, Peter Alvaro. Online aggregation and continuous query support in MapReduce[C]//Proceedings of the 2010 International Conference on Management of Data, 2010: 1115-1118.
[53] J. Lin. The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce[C]//Proceedings of LSDS-IR, 2009, 80: 57-62.
[54] I. Roy, H. Ramadan, S. Setty, et al. Airavat: Security and Privacy for MapReduce[C]//Proceedings of NSDI, 2010.
[55] Jimmy Lin. Scalable language processing algorithms for the masses: a case study in computing word co-occurrence matrices with MapReduce[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing, Honolulu, Hawaii, 2008.
[56] Rob Pike, Sean Dorward, Robert Griesemer, et al. Interpreting the data: Parallel analysis with Sawzall[J]. Scientific Programming, 2005, 13(4): 277-298.
[57] Laclavík, M.-eleng, M.-Ciglan, et al. Ontea: Platform for Pattern Based Automated Semantic Annotation[J]. Computing and Informatics, 2009,28(4): 555-579.
[58] J. Urbani, S. Kotoulas, E. Oren, et al. Scalable distributed reasoning using mapreduce[C]//Proceedings of the International Semantic Web Conference (ISWC), 2009.
[59] Y. Chen, L. Keys, R. Katz. Towards energy efficient MapReduce[J]. Technical Report UCB/EECS-2009-109, UC Berkeley, 2009.
[60] M. Zaharia, A. Konwinski, A.D. Joseph, et al. Improving MapReduce Performance in Heterogeneous Environments[C]//Proceedings of San Diego, CA: Proc. OSDI, 2008: 29-42.
[61] W Zhao, H Ma, Q He. Parallel K-Means Clustering Based on MapReduce[J]. Lecture Notes in Computer Science, 2009, 5931: 674-679.
[62] Bacchiani M, Beaufays F, Schalkwyk J, et al. Deploying GOOG-411: Early Lessons in Data, Measurement, and Testing[C]//Proceedings of Acoustics, Speech and Signal Processing, ICASSP 2008: 5260-5263.

基金

自然科学基金资助项目(61070111)
PDF(1762 KB)

682

Accesses

0

Citation

Detail

段落导航
相关文章

/