2012 Volume 26 Issue 4 Published: 15 August 2012
  

    Review
  • Review
    LI Benyang, GUAN Yi, DONG Xishuang, LI Sheng
    2012, 26(4): 3-9.
    Classification is the main approach to analyzing document-level sentiment polarity, but it is weak at integrating structural features. To address this issue, a cascaded model for sentiment polarity analysis is proposed that consists of two levels: the clause level and the document level. The document is first segmented into clauses, which are classified as positive or negative by a Maximum Entropy model. These categories are then combined with the types and positions of the clauses as features for document classification with a Support Vector Machine model. In addition, a Single-label Cascade Model based on cross-validation is proposed. Experimental results show that the proposed method improves accuracy by 2.53 over traditional sentiment classification methods.
    Key words: sentiment analysis; sentiment classification; cascade model; ME; SVM
  • Review
    LI Rui1,2, WANG Bin1
    2012, 26(4): 9-21.
    With the development of the Internet, the text processing field is challenged to handle web-scale datasets. Traditional approaches cannot compute effectively on peta-scale data volumes. MapReduce emerged to address this issue with distributed and parallel processing, and it has been widely recognized and studied in both academia and industry. There have been many successful applications of this technique in natural language processing, machine learning, large-scale graph processing, and statistical machine translation. In this paper, we first give a brief introduction to MapReduce, covering its advantages, limitations, and differences from traditional techniques. We then classify and summarize MapReduce applications in several areas of text processing. Finally, we introduce research on MapReduce systems and performance and analyze possible future applications of MapReduce.
    Key words: text processing; MapReduce; distributed computing; survey; Hadoop
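To make the MapReduce programming model mentioned in this abstract concrete, the following is a minimal single-process sketch of word counting in the map, shuffle, and reduce style. It is an illustration only: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, nothing here is taken from the surveyed paper, and a real Hadoop job would distribute these phases across machines.

```python
from collections import defaultdict


def map_phase(doc):
    """Map step: emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in doc.split()]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: sum the counts collected for one word."""
    return key, sum(values)


def word_count(docs):
    """Run the three phases in sequence on a list of documents."""
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
```

Because each call to `map_phase` depends only on its own document and each call to `reduce_phase` only on its own key, both phases can in principle run in parallel, which is the property the model relies on.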
  • Review
    ZHANG Jianfeng1,2, XIA Yunqing1, YAO Jianmin2
    2012, 26(4): 21-28.
    Microblogging is a platform based on user relationships that helps users share and obtain information. Through client tools such as Web and WAP, users can post short messages of fewer than 140 characters. As microblogging booms, microtext is produced on a large scale, and research on microtext has become an important topic. In this paper, a definition of microtext is first given, and the significance of this research is summarized. The state-of-the-art research on microtext is then presented, along with microtext datasets and microtext systems.
    Key words: Twitter; language analysis; text processing
  • Review
    LI Qingsheng1,2,3, WU Qinxia1,3, WANG Lei1,3
    2012, 26(4): 28-34.
    This paper analyzes the deficiencies in the encoding of oracle bone script and in its corresponding input methods. A method for dynamically describing oracle strokes is then presented. Based on the modern Chinese character coding and writing standards, this method uses directed strokes and meta-strokes to describe oracle bone script. By employing extended coding regions and an external glyph-description library, the method solves the problem of inputting and outputting variant forms and unidentified oracle bone characters.
    Key words: Oracle stroke; glyph description; input; encoding
  • Review
    DUAN Lei, HAN Fang, SONG Jihua
    2012, 26(4): 34-43.
    Word extraction is of great importance in research fields such as natural language generation, computational lexicography, parsing, and corpus linguistics. To address the automatic extraction of two-character words from ancient Chinese, this paper takes the “Records of the Grand Historian” corpus as an example and applies statistical methods based on frequency, mutual information, and hypothesis testing, respectively, to extract two-character words. It then compares and analyzes the results in detail against manually annotated results. This work paves the way for designing two-character word extraction schemes for ancient Chinese in different applications.
    Key words: Chinese information processing; Ancient Chinese; Records of the Grand Historian; two-character word; statistical model
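As an illustration of the mutual information criterion named in this abstract, the sketch below scores adjacent character pairs by pointwise mutual information (PMI), log2 of P(ab) / (P(a) P(b)). It is a simplified example under stated assumptions, not the authors' procedure: the function name `bigram_pmi` and the `min_count` threshold are hypothetical, and whitespace is simply dropped, so characters on either side of a space are treated as adjacent.

```python
import math
from collections import Counter


def bigram_pmi(text, min_count=2):
    """Score adjacent character pairs by pointwise mutual information.

    Pairs seen fewer than `min_count` times are skipped, since PMI is
    unreliable for rare events.
    """
    chars = [c for c in text if not c.isspace()]
    unigrams = Counter(chars)
    bigrams = Counter(zip(chars, chars[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[a + b] = math.log2(p_ab / (p_a * p_b))
    return scores
```

Candidates with a high score co-occur more often than their individual character frequencies would predict, which makes them plausible two-character words. A practical system would also need to handle punctuation and choose a score threshold.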
  • Review
    ZHANG Ruixia1, ZHUANG Jinlin1, YANG Guozeng2
    2012, 26(4): 43-50.
    The Chinese Message Structure Database, an important component of HowNet, can be treated as a rule base for Chinese semantic analysis. Disambiguating Chinese message structures is the first step in bringing the base into practical application. In this paper, the Chinese message structures are first formalized and then divided into different priority levels. Four disambiguation approaches are then proposed: syntax list judgment, graph compatibility matching, graph compatibility computation, and example-based semantic similarity computation. Finally, different disambiguation processes are designed according to the priority levels. Experimental results show that the disambiguation accuracy exceeds 90%.
    Key words: HowNet; Chinese message structure; disambiguation; graph compatibility; semantic similarity
  • Review
    PENG Weiming1, SONG Jihua2, WANG Ning1, KANG Mingji2
    2012, 26(4): 50-61.
    Traditional Chinese grammar was first represented by Li Jinxi's A New Chinese Grammar. Li's grammar system, characterized mainly by the sentence layout and the sentence component, is called the sentence-based grammar. This paper first briefly reviews the development of Chinese grammar and summarizes the main ideas and theoretical features of two schools: traditional grammar and structural grammar. It then analyzes the advantages and disadvantages of the main grammar systems in Chinese Information Processing (CIP) from the perspective of Chinese treebanks and compares them with traditional grammar to show the necessity of applying traditional grammar to the CIP field. Finally, the paper discusses some key issues to be addressed in future applications.
    Key words: Chinese information processing; traditional Chinese grammar; Li Jinxi's grammar; sentence-based; sentence layout; sentence component
  • Review
    Arzugul·XERIP1,3, Zokre·KADER2, Turghun·IBRAYIM2
    2012, 26(4): 61-66.
    The verb aspect category is one of the most complicated categories in the Uyghur language and thus remains one of the hardest problems in Uyghur language processing. Computer processing of the verb aspect category is possible only after grammatical categories such as tense, person, and negation have been resolved. The overlapping of verb aspects, however, is hard to handle. Uyghur verb aspect suffixes are attached to the verb stem according to specific rules, which makes it possible to describe the overlapping forms of Uyghur verb aspect with a finite state machine. An FSM is first generated from right to left according to the overlapping rules and is then transformed into a DFA from left to right, thereby realizing the formal description of Uyghur verb aspect.
    Key words: Uyghur language; verb; aspect category; finite state machine; formalization
  • Review
    YU Hongzhi, GAO Lu, LI Yonghong, ZHENG Wensi
    2012, 26(4): 66-73.
    This paper investigates three representative dialects from the three main Tibetan dialect branches: the Lhasa dialect of the Tibet branch, the Xiahe dialect of the Ando branch, and the Dege dialect of the Kang branch. It summarizes the phonetic systems of the three dialects, including single consonants, compound consonants, monophthongs, compound vowels, and consonant tails, as well as their tones. In accordance with SAMPA rules, it establishes a machine-readable IPA suited to these three Tibetan dialects and designs the SAMPA_ST automatic labeling system to realize text-speech conversion. The work provides necessary data for research on speech prosodic feature analysis and for speech projects.
    Key words: SAMPA_ST; IPA; Tibetan
  • Review
    ZHANG Lupeng1, YI Mianzhu2, ZHOU Yun3
    2012, 26(4): 73-85.
    Over the past 25 years, ambiguity studies in the field of Chinese language processing have made significant progress and produced many valuable findings. Based on research articles published in the Journal of Chinese Information Processing, we explore the state of the art, characteristics, and trends of ambiguity studies in terms of their research objects and methodology. By grouping the research articles by time span, we conduct a quantitative analysis and critical review of existing work from multiple perspectives and offer some suggestions for future studies in this field.
    Key words: ambiguity; disambiguation; Journal of Chinese Information Processing; statistical analysis; research object; research methodology
  • Review
    SU Yan, JU Shengfeng, WANG Zhongqing, LI Shoushan, ZHOU Guodong
    2012, 26(4): 85-91.
    Recently, sentiment classification has become a hot research topic in natural language processing. In this paper, we focus on the semi-supervised learning paradigm for this task, in which only a small amount of labeled data and many unlabeled samples are available for learning. Specifically, we propose a novel semi-supervised approach to sentiment classification based on the random subspace method. First, various random subspaces of the feature space are dynamically generated. Then, a co-training algorithm is applied to select high-confidence samples from the unlabeled data, using the subspaces as different views. Finally, the trained model is updated with the newly obtained high-confidence samples. Experiments across four product domains show that our approach clearly outperforms static subspace generation and performs much better than many existing approaches to semi-supervised sentiment classification. In addition, this paper explores the effect of the number of feature subspaces.
    Key words: sentiment classification; semi-supervised learning; feature subspace method
  • Review
    GU Zhengjia1, YAO Tianfang2
    2012, 26(4): 91-98.
    Opinion mining on subjective text is a language technology widely used in various fields. This paper studies the evaluation morpheme, employing the SBV polarity transfer algorithm, anaphora resolution, the ATT chain algorithm, and the mutual information algorithm to extract evaluated objects from the corpus results of LTP. Different sentence types are taken into consideration to identify the orientation of sentiment words. The effects of adverbs and conjunctions, especially normal adverbs, negative adverbs, and the adverb “Tai”, are discussed in detail. Finally, an overall solution is presented that has low algorithmic complexity and a clear, easy-to-understand structure. However, because it adopts basic syntactic analysis and experience-based language patterns, the proposed solution depends on the syntactic analysis results.
    Key words: evaluated object; orientation; SBV polarity transfer algorithm; anaphora resolution
  • Review
    WANG Suge1,2, YIN Xueqian3, LI Ru1,2, ZHANG Jie3, LV Yunyun1
    2012, 26(4): 98-103.
    Based on evaluation objects extracted from product review texts via a domain ontology, an incomplete information system for product performance is established, which handles feature sentiment orientation through feature weighting. A heuristic feature dimension reduction method based on the discernibility matrix is proposed to reduce redundancy and data sparsity. The K-Means clustering algorithm is used to cluster the evaluation objects. On a car review corpus, the proposed method achieves its best performance in sentiment clustering of the evaluation objects after a certain degree of feature dimension reduction.
    Key words: incomplete information systems; evaluation object; ontology; feature dimension reduction; clustering
  • Review
    DAI Daming, WANG Zhongqing, LI Shoushan, LI Peifeng, ZHU Qiaoming
    2012, 26(4): 103-109.
    Sentiment classification aims to assign a text to one of the expressed sentiment categories, such as positive vs. negative or agree vs. disagree. This paper aims to perform unsupervised sentiment classification with only unlabeled data and a small set of emotion words. Specifically, we first use the emotion words to extract automatically labeled samples with high precision, and then use these samples together with the unlabeled samples to perform semi-supervised learning for sentiment classification. Experimental results demonstrate that this approach achieves good performance on sentiment classification in both the product and hotel domains.
    Key words: sentiment classification; emotion words; unsupervised learning; co-training
  • Review
    ZHANG Yang, LU Rong, YANG Qing
    2012, 26(4): 109-115.
    Retweeting is a key mechanism for information diffusion in microblogging services such as Twitter, and it is what makes information spread quickly and widely in microblogs. Research on the characteristics of retweeting is also of vital importance for fields such as viral marketing, political campaigns, and breaking news detection. In this paper, taking Twitter as an example, we investigate the retweeting mechanism in microblogs by predicting whether a tweet will be retweeted. We analyze the importance of different features and apply a classification method with weighted features. The experiments show that the proposed method correctly predicts a large fraction of tweets (nearly 86%), outperforming previous work.
    Key words: Twitter; retweeting; feature-weighted model
  • Review
    CHEN Qingzhang, TANG Zhongzhe, WANG Kai, YAO Min, PEI Yujie
    2012, 26(4): 115-122.
    With the rapid development of the Internet, data of various types have become huge and scattered. Searching these data with traditional keywords is increasingly time-consuming. Automatic recommender systems have therefore emerged to reduce users' search time and provide them with more appropriate information. Using an ART neural network and data mining technology, this study builds a typical online recommendation system that can automatically cluster population characteristics and mine the associated characteristics. In addition, the MART algorithm is proposed as a modified ART clustering algorithm, which produces more reasonable and flexible clustering results.
    Key words: automatic recommender system; adaptive resonance theory; data mining technology; association rules
  • Review
    ZHANG Yue, ZHANG Hongli, ZHANG Weizhe
    2012, 26(4): 122-129.
    With the wide application of BBS, blogs, microblogs, etc., how to rank users has become a well-recognized research issue, especially in the social network area. This paper analyzes the user relationship graph of an online BBS, in which the graph is formed by users and the replies between them, and reveals the power-law distribution of its in- and out-degrees. Because users' posting and replying behavior accords with PageRank's characteristics of random walking and mutual reinforcement, user influence is ranked with the PageRank algorithm. This paper further addresses the time and space cost of the computation caused by the exponential growth in the number of users. Based on the observation that over 80 percent of users have an in-degree of 0, an efficient set-division ranking algorithm (SD-Rank) is designed using a list structure after dividing users into two sets: 0-in-degree users in set0 and non-0-in-degree users in set1. Through set partitioning according to the degree distribution, the time-space complexity of SD-Rank is reduced from O(V+E) to O(V′), where V′ is the size of set1. Experiments on the TIANYA BBS dataset show that SD-Rank is more efficient than PageRank.
    Key words: power law; in-degree; set division; quick rank
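The set-division idea in this abstract can be illustrated with a small sketch. It is not the authors' SD-Rank, whose details are not given here. It only shows why splitting users by in-degree can save work, assuming the simplified PageRank update PR(v) = (1 - d)/N + d * sum over in-neighbors u of PR(u)/outdeg(u), with no redistribution from dangling nodes. Under that assumption, a node with in-degree 0 keeps the constant score (1 - d)/N, so only set1 needs iterating. All function names and the damping value `d=0.85` are hypothetical choices.

```python
def _graph(nodes, edges):
    """Build out-degree counts and in-neighbor lists from (source, target) edges."""
    out_deg = {v: 0 for v in nodes}
    in_nbrs = {v: [] for v in nodes}
    for u, v in edges:
        out_deg[u] += 1
        in_nbrs[v].append(u)
    return out_deg, in_nbrs


def pagerank_plain(nodes, edges, d=0.85, iters=50):
    """Reference version: update every node in every iteration."""
    out_deg, in_nbrs = _graph(nodes, edges)
    base = (1 - d) / len(nodes)
    rank = {v: base for v in nodes}
    for _ in range(iters):
        rank = {
            v: base + d * sum(rank[u] / out_deg[u] for u in in_nbrs[v])
            for v in nodes
        }
    return rank


def pagerank_split(nodes, edges, d=0.85, iters=50):
    """Set-division version: iterate only over nodes with in-degree > 0."""
    out_deg, in_nbrs = _graph(nodes, edges)
    base = (1 - d) / len(nodes)
    set0 = {v for v in nodes if not in_nbrs[v]}   # in-degree 0, score is constant
    set1 = [v for v in nodes if in_nbrs[v]]       # in-degree > 0, needs iteration
    rank = {v: base for v in nodes}

    # Contribution of set0 neighbors never changes, so compute it once.
    fixed = {
        v: sum(base / out_deg[u] for u in in_nbrs[v] if u in set0)
        for v in set1
    }
    for _ in range(iters):
        updated = {
            v: base + d * (
                fixed[v]
                + sum(rank[u] / out_deg[u] for u in in_nbrs[v] if u not in set0)
            )
            for v in set1
        }
        rank.update(updated)
    return rank
```

With this update rule, both functions produce the same scores up to floating-point rounding. The split version skips the 0-in-degree nodes in each iteration, which matters when they are the large majority, as the abstract reports for the BBS data.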