2015 Volume 29 Issue 2 Published: 10 March 2015
  

  • Survey
  • Survey
    LI Yegang, HUANG Heyan, SHI Shumin, FENG Chong, SU Chao
    2015, 29(2): 1-9.
    This paper presents an overview of multi-strategy machine translation (MT). According to the level at which systems are combined, approaches to multi-strategy MT are classified into system-level combination and module-level combination. Representative methods for each combination type are discussed, together with the future development prospects of multi-strategy MT.
  • Survey
    WEI Bingjie, WANG Bin, ZHANG Shuai, LI Peng
    2015, 29(2): 10-23.
    With the rapid development of microblogging, microblog retrieval has become one of the hot research areas in recent years. Firstly, we analyze microblog documents and queries based on the TREC Microblog dataset. We find that, in contrast to traditional text retrieval, microblog search differs significantly in two ways: microblog posts have their own characteristics compared with webpages, and microblog queries are time-sensitive, which means temporal information should be used in addition to traditional text similarity. Owing to these two differences, traditional text retrieval methods cannot be directly applied to microblog search. We then summarize the related work on these two aspects of microblog retrieval, describing microblog-specific features and the retrieval methods based on them. Following the information retrieval process, we also introduce search models that use temporal information as a document prior, for query expansion, or for text representation. Finally, we conclude and discuss future work.
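    As a minimal illustration of using time as a document prior, the sketch below combines a text-similarity score with an exponential recency decay; the decay rate and the log-linear combination are illustrative assumptions, not the exact form of any surveyed model.

```python
import math

def recency_prior(query_time, tweet_time, rate=0.01):
    """Exponential decay prior: fresher tweets get more prior mass.
    `rate` is a hypothetical decay constant (per hour)."""
    age_hours = max(0.0, (query_time - tweet_time) / 3600.0)
    return math.exp(-rate * age_hours)

def score(text_sim, query_time, tweet_time):
    # Log-linear combination of text similarity and the temporal prior
    # (equal weights here purely for illustration).
    return math.log(max(text_sim, 1e-9)) + \
           math.log(recency_prior(query_time, tweet_time))
```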
  • Survey
    SHI Liang, ZHANG Hong, LIU Xinran, WANG Yong, WANG Bin
    2015, 29(2): 24-32.
    The inverted index is widely used as the core data structure in search engines, desktop search, and digital libraries. To compress it effectively via d-gaps and integer coding, document identifier reassignment is usually adopted to achieve high locality in the inverted index. This paper first introduces the basic principles of index compression, then surveys state-of-the-art techniques for document identifier reassignment with an analysis of their pros and cons, and finally summarizes the related work and discusses future directions for document identifier reassignment.
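    For readers unfamiliar with the compression being optimized, here is a minimal textbook sketch of the d-gap transformation plus variable-byte coding; reassigning document identifiers so that similar documents cluster together shrinks the gaps and hence the codes. This is a generic scheme, not any specific system's implementation.

```python
def dgap_encode(doc_ids):
    """Turn a sorted posting list into d-gaps; small gaps compress well."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps

def varbyte(n):
    """Variable-byte code: 7 payload bits per byte, high bit marks the last byte."""
    out = bytearray()
    while True:
        out.insert(0, n & 0x7F)
        if n < 128:
            break
        n >>= 7
    out[-1] |= 0x80
    return bytes(out)

# A reassignment that clusters similar documents yields smaller gaps,
# hence shorter varbyte codes -- the motivation for docID reassignment.
postings = [3, 5, 20, 21, 23]
encoded = b"".join(varbyte(g) for g in dgap_encode(postings))
```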
  • Syntactic, Semantic Analysis
  • Syntactic, Semantic Analysis
    ZHU Muhua, WANG Huizhen, ZHU Jingbo
    2015, 29(2): 33-39.
    In practical applications such as parsing the Web, the shift-reduce parser is often preferred due to its linear time complexity. To make it comparable to publicly available state-of-the-art parsers, this paper adopts the uptraining approach to improve the performance of the shift-reduce parser. The basic idea of uptraining is to apply a high-accuracy parser (the Berkeley parser in this paper) to automatically analyze unlabeled data; the newly labeled data is then used as additional training data to build a POS tagger and the shift-reduce parser. Experimental results on the Penn Chinese Treebank show that the approach improves shift-reduce parsing to 82.4% (an absolute improvement of 2.3%), which is comparable to the Berkeley parser on the same data and outperforms other state-of-the-art parsers.
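    A schematic of the uptraining loop described above; the function names (`berkeley_parse`, `train_tagger`, `train_shift_reduce`) are hypothetical stand-ins for the paper's actual toolchain.

```python
def uptrain(labeled_trees, unlabeled_sents, berkeley_parse,
            train_tagger, train_shift_reduce):
    # 1) Let the slow but accurate parser label the raw text.
    auto_trees = [berkeley_parse(s) for s in unlabeled_sents]
    # 2) Pool gold trees with the automatically labeled ones.
    training_set = labeled_trees + auto_trees
    # 3) Retrain the fast components on the enlarged set.
    tagger = train_tagger(training_set)
    parser = train_shift_reduce(training_set, tagger)
    return tagger, parser
```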
  • Syntactic, Semantic Analysis
    QIAN Xiaofei, HOU Min
    2015, 29(2): 40-48.
    This paper proposes an operational definition of the Maximal Noun Phrase (MNP), and then analyzes its structural and distributional features. An MNP recognition method based on baseNP reduction is designed, which exploits the structural characteristics of MNPs as well as linguistic features such as initial definite references and semantic heads. This method eases the conflict between the long-distance dependencies of MNPs and the limited observation windows of classical models. The experiment shows a good precision of 88.68% and a recall of 89.21%. The reduction method improves system performance across the board; in particular, it improves the F1-score by 1% and the optimal margin by 6% on multiword MNPs, demonstrating its effectiveness in complex MNP recognition.
  • Syntactic, Semantic Analysis
    DING Changlin, BAI Yu, CAI Dongfeng
    2015, 29(2): 49-57.
    Semantic annotation is a promising solution for processing the free texts of Ancient Chinese Medical Literature (ACML). Terms in such texts are further divided into Named Terms (NTs) and Descriptive Terms (DTs) in this paper. By analyzing DTs, this paper treats the annotation of DTs as a sequence labeling problem or a short-sentence classification problem based on supervised learning. Two pre-processing methods, NT reduction and HowNet-based substitution, are proposed. The experiments compare three learning models and four feature selection methods, demonstrating the feasibility of the proposed method.
  • Semantic Computing: Method and Application
  • Semantic Computing: Method and Application
    ZHANG Tao, LIU Kang, ZHAO Jun
    2015, 29(2): 58-67.
    Entity linking is the task of mapping entity mentions in a document to the corresponding entities in a knowledge base (KB). In this paper, we briefly introduce the traditional entity linking system and point out its key problem: measuring the semantic similarity between the context of an entity mention and the document of the candidate entity. We then propose a novel semantic relatedness measure between Wikipedia concepts based on the graph structure of Wikipedia. With this similarity measure, we present a novel learning-to-rank framework that leverages the rich semantic information derived from Wikipedia for the entity linking task. Experimental results show that the performance of the system is comparable to the state-of-the-art.
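    The abstract does not spell out the paper's exact graph-based measure; as one common instance of link-structure relatedness over Wikipedia, here is a Milne-Witten style sketch computed from incoming-link sets. Treat it as an assumption, not the authors' formula.

```python
import math

def link_relatedness(in_links_a, in_links_b, num_articles):
    """Milne-Witten style relatedness between two Wikipedia concepts,
    from the sets of articles linking to each (non-empty sets assumed)."""
    a, b = set(in_links_a), set(in_links_b)
    common = a & b
    if not common:
        return 0.0
    big, small = max(len(a), len(b)), min(len(a), len(b))
    return 1.0 - (math.log(big) - math.log(len(common))) / \
                 (math.log(num_articles) - math.log(small))
```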
  • Semantic Computing: Method and Application
    ZHANG Zhifei, MIAO Duoqian, YUE Xiaodong, NIE Jian-Yun
    2015, 29(2): 68-78.
    Some frequent sentiment words have strong semantic fuzziness, i.e., their sentiment polarities are ambiguous. These words are particularly problematic in word-based sentiment analysis. In this paper, we design an approach to this problem by combining rough set theory and Bayesian classification. To determine the sentiment polarity of a fuzzy word, we use a set of features extracted from its context of use. Decision rules based on these features are derived using rough sets; when the rules fail to classify an instance, a Bayes classifier is used as a complement. We investigate the case of “HAO” in Chinese, a very frequent sentiment word with many different meanings. The experimental results on several datasets show that our combined method can effectively cope with the semantic fuzziness of the word and improve the quality of sentiment analysis.
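    A minimal sketch of the rule-first, Bayes-fallback decision flow, assuming `decision_rules` is a list of (feature-condition, polarity) pairs mined by rough sets and `bayes` is any trained classifier exposing `predict`; both interfaces are illustrative.

```python
def classify_polarity(features, decision_rules, bayes):
    """Apply rough-set decision rules first; fall back to the Bayes
    classifier when no rule covers the instance."""
    for condition, polarity in decision_rules:
        # A rule fires only if every feature condition it states is met.
        if all(features.get(k) == v for k, v in condition.items()):
            return polarity
    return bayes.predict(features)  # complement classifier
```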
  • Semantic Computing: Method and Application
    LI Ning, LUO Wenjuan, ZHUANG Fuzhen, HE Qing, SHI Zhongzhi
    2015, 29(2): 79-86.
    PLSA (Probabilistic Latent Semantic Analysis) is a typical topic model. To enable distributed computation of PLSA for ever-increasing large datasets, a parallel PLSA algorithm based on MapReduce is proposed in this paper. Applied to text clustering and semantic analysis, the algorithm is shown by experiments to scale well on large datasets.
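    A schematic of one EM round of PLSA cast as MapReduce: the mapper performs the per-document E-step and emits expected counts, and the reducer re-normalizes them into P(w|z); the P(z|d) update is analogous. The data layout is illustrative, not the paper's implementation.

```python
from collections import defaultdict

def map_step(d, doc_counts, p_w_z, p_z_d, K):
    """Mapper: E-step for one document d, where doc_counts[w] = n(d, w),
    p_w_z[z][w] = P(w|z) and p_z_d[d][z] = P(z|d)."""
    for w, n_dw in doc_counts.items():
        post = [p_w_z[z][w] * p_z_d[d][z] for z in range(K)]
        s = sum(post) or 1e-12
        for z in range(K):
            yield (z, w), n_dw * post[z] / s   # expected count of (z, w)

def reduce_step(pairs):
    """Reducer: M-step, re-normalizing expected counts into P(w|z)."""
    totals, by_z = defaultdict(float), defaultdict(dict)
    for (z, w), c in pairs:
        by_z[z][w] = by_z[z].get(w, 0.0) + c
        totals[z] += c
    return {z: {w: c / totals[z] for w, c in ws.items()}
            for z, ws in by_z.items()}
```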
  • Machine Translation
  • Machine Translation
    WANG Kun, ZONG Chengqing, SU Keh-Yih
    2015, 29(2): 87-94.
    Under a framework combining translation memory (TM) and statistical machine translation (SMT), this paper proposes to dynamically add new phrase pairs found in the TM. During decoding, the integrated model adds TM-matched segments to the SMT phrase table as candidates dynamically, and incorporates the corresponding TM information into each hypothesis to guide SMT decoding. Our experimental results show that the proposed approach improves translation quality significantly: compared with the TM system, the integrated model achieves a 21.15 BLEU point improvement and a 21.47 TER point reduction; compared with the SMT system, it achieves a 5.16 BLEU point improvement and a 4.05 TER point reduction.
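    A minimal sketch of the dynamic augmentation step, assuming `tm_matches` holds the TM-matched source/target segments for the current input; the phrase-table layout and the feature marking TM origin are illustrative assumptions, not the paper's decoder integration.

```python
def augment_phrase_table(phrase_table, tm_matches, tm_feature=1.0):
    """Add TM-matched segments as extra phrase-pair candidates so the
    decoder can weight TM evidence through a dedicated feature."""
    for src, tgt in tm_matches:
        entry = phrase_table.setdefault(src, [])
        if all(t != tgt for t, _ in entry):   # avoid duplicate targets
            entry.append((tgt, {"from_tm": tm_feature}))
    return phrase_table
```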
  • Machine Translation
    SUN Shuihua,DING Peng,HUANG Degen
    2015, 29(2): 95-102.
    The phrase table lies at the core of a phrase-based statistical machine translation system. Phrase tables extracted by heuristic methods suffer from incorrect word alignments, unaligned words, and the absence of syntactic information. This paper presents a bilingual syntactic phrase extraction method based on the expectation-maximization (EM) algorithm, which optimizes all parameters iteratively. Three techniques are examined for integrating bilingual syntactic phrases into the phrase-based machine translation system: direct augmentation of bilingual phrases, adding new features, and re-training. Experiments show that all three methods improve the BLEU score to varying degrees, with a top increase of 0.64 BLEU achieved by adding new features.
  • Machine Translation
    HAN Fang,YANG Tianxin,SONG Jihua
    2015, 29(2): 103-110.
    This paper presents a rule-based machine translation method for ancient Chinese under the framework of Li Jinxi's sentence-focused syntax theory. The rule base also includes ancient Chinese dictionary knowledge and word sense disambiguation knowledge. The whole translation process consists of word sense selection and sentence syntax reordering. Utilizing a bi-gram model, sentences in the “Analects of Confucius” are translated and evaluated in the experiment.
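    As a sketch of bi-gram-driven word sense selection, the greedy pass below picks, for each ancient word, the modern rendering that best follows its predecessor. The paper may well use a full Viterbi search rather than this greedy approximation, and all names here are hypothetical.

```python
def choose_senses(words, candidates, bigram_prob):
    """Greedy left-to-right sense selection with a bi-gram model.
    `candidates[w]` lists modern renderings of ancient word `w`;
    `bigram_prob(prev, cur)` is an assumed language-model score."""
    prev, out = "<s>", []
    for w in words:
        best = max(candidates.get(w, [w]),
                   key=lambda c: bigram_prob(prev, c))
        out.append(best)
        prev = best
    return out
```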
  • Other Language in/around China
  • Other Language in/around China
    Maimaitiyiming Hasimu, Wushouer Silamu, Weinila Mushajiang, Nuermaimaiti Youliwasi
    2015, 29(2): 111-117.
    In the Unicode standard, Uyghur, Kazakh, and Kyrgyz characters are arranged in the Arabic characters area and mixed with Arabic characters. Some characters in these languages share the same code points and carry no language ID, which brings difficulties to information retrieval and natural language processing. After analyzing the unique characters, compound characters, and the special behavior of some characters in certain language contexts, this paper designs a language identification algorithm for Uyghur, Kazakh, and Kyrgyz. The experimental results show that the accuracy reaches 96.67% for texts of 70 words or more.
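    A minimal sketch of identification by voting over language-unique characters; the code-point sets below are placeholders, not the paper's actual character inventories.

```python
# Placeholder code-point sets: a real implementation would use the
# complete unique/compound character inventories analyzed in the paper.
UNIQUE = {
    "Uyghur": set("\u06D5\u06D0"),
    "Kazakh": set("\u0675\u06C9"),
    "Kyrgyz": set("\u04E9"),
}

def identify(text):
    """Count language-unique characters and vote; no hit -> 'unknown'."""
    scores = {lang: sum(ch in chars for ch in text)
              for lang, chars in UNIQUE.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```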
  • Other Language in/around China
    XU Baolong, Nuermaimaiti Youluwasi, Wushouer Silamu
    2015, 29(2): 118-124.
    A good speech training corpus is essential for the wide application of continuous speech recognition, so covering more phonetic phenomena in the corpus is of substantial importance for improving recognition performance. In this paper, we collect a large number of spoken sentences from a variety of Uyghur spoken language communication platforms. We then refine the corpus according to an evaluation function that considers the effect of co-articulation and the applicability of common words. The final corpus contains more balanced and efficient tri-phones and covers more phonetic phenomena, laying a solid foundation for training a more accurate and reliable acoustic model.
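    One plausible reading of the refinement step is a greedy selection that maximizes unseen tri-phone coverage, sketched below; the paper's actual evaluation function also weighs co-articulation effects and common-word applicability.

```python
def select_sentences(candidates, triphones_of, budget):
    """Greedy corpus refinement: repeatedly take the sentence that adds
    the most unseen tri-phones. `triphones_of(s)` returns the set of
    tri-phones in sentence s (assumed given)."""
    covered, chosen, pool = set(), [], list(candidates)
    while pool and len(chosen) < budget:
        best = max(pool, key=lambda s: len(triphones_of(s) - covered))
        gain = triphones_of(best) - covered
        if not gain:        # nothing new left to cover
            break
        covered |= gain
        chosen.append(best)
        pool.remove(best)
    return chosen
```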
  • Other Language in/around China
    ZHU Jie, LI Tianrui
    2015, 29(2): 125-132.
    Stop word processing is a key preprocessing step in text mining. In this paper, a statistics-based method for selecting Tibetan stop words is studied in combination with existing techniques. Through experiments, the TF, DF, and entropy calculation methods for selecting Tibetan stop words are analyzed. An approach for the selection of Tibetan stop words is presented that combines Tibetan function words, special verbs, and the automatic approach. The experimental results show that the proposed method can determine a reasonable Tibetan stop word list.
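    As an illustration of the entropy criterion among the three analyzed, the sketch below scores a term by the entropy of its distribution over documents; stop words, spreading evenly across documents, score high. This is the generic formulation, not necessarily the paper's exact computation.

```python
import math

def term_entropy(term_doc_counts):
    """Entropy of a term's occurrence distribution over documents.
    `term_doc_counts` lists the term's count in each document."""
    total = sum(term_doc_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in term_doc_counts if c > 0]
    return -sum(p * math.log(p) for p in probs)

# High-TF, high-DF, high-entropy candidates would then be merged with
# the function-word and special-verb lists, as the abstract describes.
```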
  • Other Language in/around China
    TIAN Shengwei, ZHONG Jun, YU Long
    2015, 29(2): 133-141.
    Multi-word domain term extraction is an important issue in natural language processing. Combining the linguistic features of Uyghur, a method for Uyghur multi-word domain term extraction based on rules and statistics is proposed. The method is divided into four phases: ① corpus pre-processing, including stop word filtering and part-of-speech (POS) tagging; ② obtaining N-gram substrings as term candidates, using POS information and internal associative strength calculated from modified mutual information and the log-likelihood ratio; ③ enlarging the term candidates by utilizing the relative frequency difference; ④ deciding the final terms by C_value. The experimental results show the efficiency of the proposed method, with an 85.08% precision and a 73.19% recall in Uyghur multi-word domain term extraction.
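    Phase ④ relies on the standard C_value measure; here is a minimal sketch under its usual definition, where a candidate's frequency is discounted by the average frequency of longer candidates containing it. The data structures are illustrative.

```python
import math

def c_value(term, freq, longer_terms):
    """C_value of a multi-word candidate (at least two words).
    `freq` maps candidate -> corpus frequency; `longer_terms` maps a
    candidate to the longer candidates that contain it."""
    length = len(term.split())
    containing = longer_terms.get(term, [])
    if not containing:                      # not nested in anything
        return math.log2(length) * freq[term]
    nested = sum(freq[t] for t in containing) / len(containing)
    return math.log2(length) * (freq[term] - nested)
```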
  • Text Information Processing
  • Text Information Processing
    LI Bo, WANG Jiangqing, Wei Hongyun, SUN Yangguang, WANG Xinnian, XU Ling
    2015, 29(2): 142-149.
    The Women’s Script is a unique written language found in China, which has no generally accepted normalized font yet. To address the low efficiency of traditional manual font normalization, this paper proposes a new automatic font normalization method for handwritten Women's Script. Firstly, the feature points and stroke segments of a handwritten character are extracted to establish a correlation matrix of connectivity and angles between the segments. Secondly, the strokes are restored from the stroke-segment correlation matrix, and the writing order of the strokes is analyzed. Finally, the normalized font is constructed by Bezier curve fitting based on the key-point sequences along each stroke path. Experimental results show that this method greatly improves efficiency compared with manual methods, generating fonts with smooth contours, undistorted strokes, and uniform thickness. Furthermore, the method can be applied to the normalization of other kinds of handwritten scripts.
  • Text Information Processing
    MO Liping, ZHOU Kaiqing , JIANG Xiaohui
    2015, 29(2): 150-156.
    Square Hmong characters are the typical representative of folk Hmong writing, and research on their information processing is of great significance for protecting folk Hmong cultural heritage and carrying forward Hmong culture. Font development is an important part of this research. According to the actual demands of font development for square Hmong characters, and taking structural analysis as the foundation, a design for a square Hmong character encoding scheme based on the Unicode standard is proposed, and the basic steps of making matrix fonts are introduced. Focusing on the definition of labels, operators, and transformation rules, the methods of designing and developing square Hmong character fonts based on OpenType technology are discussed. Test results illustrate that the square Hmong OpenType font file has the advantages of small size and easy extension, and can solve the hybrid layout problem of English, Chinese, and square Hmong characters.
  • Information Extraction and Text Mining
  • Information Extraction and Text Mining
    FU Yan, XU Zhaobang, XIA Hu, ZHOU Junlin
    2015, 29(2): 157-162.
    Generally, the subject information of a Web page is concentrated in one region, and this characteristic can be exploited to extract the subject information automatically. Because the HTML labels in page source code are often not well formed, it is difficult to construct a DOM tree with accurate structure through forward matching. This article presents a new method that applies reverse matching to construct a complete DOM tree. By deleting insignificant nodes from the DOM tree, we can manually select among the remaining information node labels to finalize the template, deciding whether they are unique. This is a general, semi-automatic method, and experiments on e-commerce webpages are reported in this paper.
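    One plausible sketch of reverse matching: scan the tag stream from the end, pushing closing tags and pairing each with its matching opening tag, so unclosed tags in malformed HTML do not derail the pairing. This is an interpretation of the abstract, not the authors' code.

```python
import re

TAG = re.compile(r"<(/?)(\w+)[^>]*>")

def reverse_match(html):
    """Pair tags by scanning from the end of the document; opening tags
    left unpaired (missing closers) are simply skipped."""
    pairs, stack = [], []
    for tok in reversed(list(TAG.finditer(html))):
        closing, name = tok.group(1) == "/", tok.group(2).lower()
        if closing:
            stack.append((name, tok.end()))
        elif stack and stack[-1][0] == name:
            _, end = stack.pop()
            pairs.append((tok.start(), end))  # well-formed element span
    return pairs  # spans from which to build the DOM tree
```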
  • Information Extraction and Text Mining
    CHENG Nanchang, HOU Min, TENG Yonglin
    2015, 29(2): 163-169.
    This paper takes online product reviews as samples to investigate the characteristics of and strategies for attitude analysis of short texts. According to the different ways in which the decisive factors of attitude polarity appear, online review texts can be divided into four categories: texts containing an overt summary sentence, texts containing a covert summary sentence, texts containing characteristic words, and normal texts. Different strategies are established for the different types of texts, and a text attitude analysis system, CUCsas, is constructed based on dictionaries and rules. The system achieved promising results in the Fourth Chinese Opinion Analysis Evaluation (COAE2012).
  • Information Extraction and Text Mining
    ZHANG Hengcai, LU Feng, QIU Peiyuan
    2015, 29(2): 170-178.
    Micro-blog messages usually contain a great amount of real-time traffic information, which can be expected to become an important data source for city traffic. In this paper, we propose an approach for extracting traffic information from massive micro-blogs based on D-S (Dempster-Shafer) evidence theory, to solve the data fusion problem brought about by the micro-blog characteristics of high dynamics, uncertainty, and ambiguous narration. Firstly, an evaluation index system is built for the traffic information collected from mass micro-blog messages, with its accuracy enhanced by a Wikipedia semantic model. Secondly, a basic probability assignment function is defined for the micro-blog messages with the help of word similarity. Finally, D-S theory is adopted to judge and fuse the extracted traffic information through evidence combination and decision. An experiment on the Beijing road network and the Sina Micro-blog platform shows that the presented approach can effectively judge the reliability of the traffic information contained in mass micro-blog messages, and can make the utmost use of the message contents delivered by different micro-blog users. Meanwhile, compared with a traditional text clustering algorithm, the proposed approach is more accurate.
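    The fusion step rests on the classical Dempster rule of combination; here is a minimal self-contained sketch over frozenset-keyed mass functions, with an illustrative (invented) traffic example.

```python
def dempster_combine(m1, m2):
    """Dempster's rule for two basic probability assignments whose keys
    are frozensets of hypotheses and whose values are masses."""
    combined, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb     # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict; evidence cannot be combined")
    k = 1.0 - conflict                  # normalization constant
    return {h: v / k for h, v in combined.items()}

# e.g. two micro-blog messages as evidence about a road's state:
m1 = {frozenset({"jam"}): 0.7, frozenset({"jam", "clear"}): 0.3}
m2 = {frozenset({"jam"}): 0.5, frozenset({"clear"}): 0.2,
      frozenset({"jam", "clear"}): 0.3}
fused = dempster_combine(m1, m2)
```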
  • Information Extraction and Text Mining
    ZHANG Jian, LI Fang
    2015, 29(2): 179-189.
    Automatic extraction of semantic information and its evolution from large-scale corpora has attracted many experts and scholars in recent years. Topics are regarded as the latent semantic meanings underlying a document collection, and topic evolution describes how the contents of topics change over time. This paper proposes a novel extraction method for topic evolution and topic relations based on topic context. Since a topic often co-occurs with other topics in the same documents, this co-occurrence information is defined as the context of a topic. Topics with their contexts are used not only to calculate the semantic relations among topics in the same period, but also to identify the same topics across different time periods. Experiments on NPC&CPPCC news reports from 2008 to 2012 and NIPS scientific literature from 2007 to 2011 show that the method not only improves the results of topic evolution but also mines semantic relations among topics.
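    A minimal sketch of topic contexts as co-occurrence vectors and their cosine comparison, which is one natural reading of how contexts could relate topics within a period and link them across periods; the paper's exact similarity may differ.

```python
def topic_context(topic_id, doc_topics):
    """Context of a topic: counts of other topics co-occurring with it
    in the same documents. `doc_topics` yields per-document topic sets."""
    ctx = {}
    for topics in doc_topics:
        if topic_id in topics:
            for t in topics:
                if t != topic_id:
                    ctx[t] = ctx.get(t, 0) + 1
    return ctx

def context_similarity(c1, c2):
    """Cosine similarity between two topic-context vectors."""
    dot = sum(c1[t] * c2[t] for t in set(c1) & set(c2))
    n1 = sum(v * v for v in c1.values()) ** 0.5
    n2 = sum(v * v for v in c2.values()) ** 0.5
    return dot / (n1 * n2) if n1 and n2 else 0.0
```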
  • Information Extraction and Text Mining
    DUAN Jianyong,YAN Qiwei,ZHANG Mei,HU Yi
    2015, 29(2): 190-198.
    Bilingual translation pairs play an important role in many NLP applications, such as cross-language information retrieval and machine translation. The translation of proper names, out-of-vocabulary words, idioms, and technical terminologies is one of the key factors affecting system performance, yet these translations can hardly be found in traditional bilingual dictionaries. This paper proposes a new method to automatically extract high-quality translation pairs from Wikipedia, exploiting its wide coverage and data structure; the method can learn not only common patterns but also many patterns that humans can hardly find. The method contains three steps: 1) extract translation pairs from the language toolbox of Wikipedia, which serve as heuristics for the next step; 2) learn patterns of translation pairs using the PAT-Array from previous work; 3) extract further translation pairs automatically using the learned patterns. Our experimental results show that the accuracy can reach 90.4%.
  • Information Extraction and Text Mining
    SHEN Xiaowei,LI Peifeng,ZHU Qiaoming
    2015, 29(2): 199-206.
    Pattern matching has been confirmed to be a simple and effective approach in traditional information extraction, and the dependency path is one of the most common patterns. Many researchers have applied pattern matching based on dependency paths to the Slot Filling task. Focusing on the issues of pattern balance, pattern extraction mode, and pattern selection strategy in this task, this paper proposes several optimization strategies, namely pattern cutting, pattern reversing, pattern expansion, and pattern semantic definition, and implements a complete system. Tested on the TAC-KBP2010 target corpus, the F-value of the proposed method reaches 20.8%, a 6.5% improvement over the baseline system's 14.3%.