2016 Volume 30 Issue 3 Published: 15 June 2016
  

  • Review
    WANG Chaochao,XIONG Deyi
    2016, 30(3): 1-8.
    Despite the rapid development of natural language processing, studies at the syntactic level cannot fully satisfy the needs of applications. Semantics is becoming a hot research topic in the NLP community. As a key part of semantics, computational semantic compositionality is receiving growing attention. In this paper, we classify the approaches to computational semantic compositionality into two categories: linguistic methods and distributional methods. We present typical studies for each category, including recent deep learning approaches. After introducing the two approaches, we compare linguistic methods with distributional methods. We then introduce several applications of computational semantic compositionality in sentiment analysis and machine translation. Finally, we provide some suggestions on future directions of computational semantic compositionality.
  • Review
    WU Juan,LI Ru,WANG Zhiqiang
    2016, 30(3): 9-15.
    Semantic role labeling is traditionally concerned with finding the fillers of explicit roles appearing within sentence boundaries, ignoring the implicit arguments known as null instantiations. This paper addresses automatic null instantiation identification based on Chinese FrameNet. We first apply a rule-based approach to detect null instantiations, followed by post-filtering to remove errors. Then, to resolve the arguments, we present an approach that combines the semantic types of frame elements with frame-to-frame relations. We conduct experiments on 164 Chinese discourses, and compared with other methods, our method achieves better results.
  • Review
    GU Jingjing,ZHOU Guodong
    2016, 30(3): 16-22.
    With the progress of discourse analysis, punctuation has become an important entry point for discourse analysis and disambiguation. Effective identification of the role of a punctuation mark in a sentence helps the development of syntactic analysis, discourse analysis and other natural language processing technologies. The main task of this paper is to annotate and identify the Chinese colon automatically. We adopt a rule-based method and a maximum entropy method. The rule-based method is relatively simple and easy to implement; the maximum entropy method incorporates these rules as statistical features and achieves better results in the experiments.
  • Review
    LIU Dongning,DENG Chunguo,TENG Shaohua,ZHANG Wei,LIANG Lu
    2016, 30(3): 23-29.
    Natural language processing has now shifted from the syntactic/lexical level to a lightweight semantic level. For the processing of Chinese narrative sentences, the traditional method uses the Lambek calculus to handle Chinese statements with flexible word order, and the existing remedies, such as adding modal words or new conjunctions, are not suitable for computer processing because they increase the complexity of the already NP-hard Lambek calculus. In response, this paper puts forward the λ-Lambek calculus, which uses the Lambek calculus for the syntactic calculus of Chinese statements and builds a lightweight semantic model of Chinese statements via the Curry-Howard correspondence and the λ-calculus. The λ-Lambek calculus can not only perform lightweight semantic calculus for Chinese statements, but also handle statements with flexible word order in Chinese.
  • Review
    PENG Weiming,SONG Jihua,WANG Ning
    2016, 30(3): 30-35.
    This paper discusses the basic concepts of formal syntactic analysis, exploring some formalization issues in Chinese syntactic analysis from multiple linguistic perspectives: language/speech, description/explanation, hierarchical/linear, phrases/sentence patterns, and lexical/syntactic. It also introduces some experiences, principles and problems summarized from the formalization practice of sentence-based syntactic analysis.
  • Review
    HUANG Lan,DU Youfu
    2016, 30(3): 36-45.
    Semantic word relatedness measures are fundamental to many text analysis tasks such as information retrieval, classification and clustering. As the largest online encyclopedia today, Wikipedia has been successfully exploited as background knowledge to overcome lexical differences between words and derive accurate semantic word relatedness measures. However, the Chinese Wikipedia covers only about ten percent of its English counterpart, and this sparseness in concept space and associated resources adversely impacts word relatedness computation. To address the sparseness problem, we propose a method that utilizes different types of structured information automatically extracted from various resources in Wikipedia, such as articles' full text and their associated hyperlink structures. We use machine learning algorithms to learn the best combination of the different resources from manually labeled training data. Experiments on three standard benchmark datasets in Chinese show that our method is 20%-40% more consistent with an average human labeler than the state-of-the-art methods.
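A minimal sketch of the resource-combination idea above, assuming precomputed relatedness scores derived from article full text and hyperlink structure plus manually labeled word pairs; the abstract does not name the learning algorithm, so a plain linear regression stands in for it and all names and numbers are illustrative.

```python
# Learn how to combine relatedness scores derived from different Wikipedia
# resources (here: article text and hyperlinks) into one word-relatedness measure.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row holds [text_based_score, link_based_score] for one word pair.
X = np.array([[0.82, 0.74],
              [0.15, 0.30],
              [0.55, 0.60]])
# Human-labeled relatedness for the same pairs (averaged annotator judgments).
y = np.array([0.9, 0.2, 0.6])

model = LinearRegression().fit(X, y)      # best linear combination of the resources

def relatedness(text_score, link_score):
    """Combined relatedness score for a new word pair."""
    return float(model.predict([[text_score, link_score]])[0])

print(relatedness(0.70, 0.65))
```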
  • Review
    CHE Chao,ZHENG Xiaojun
    2016, 30(3): 46-51.
    It is difficult to extract term translation pairs from parallel corpora of historical classics because of the lack of proper word segmentation for ancient Chinese. In this paper we introduce a term alignment method using a maximum entropy model based on sub-words. In our approach, we first extract word pairs as sub-words by chi-square statistics and the log-likelihood ratio test, and apply them to segment the Chinese text. Then we build transliteration features according to the transliteration model of classical terms, and perform term alignment with the maximum entropy model. The use of sub-words addresses the lack of a word segmentation method for ancient Chinese, and the maximum entropy model, integrating three kinds of features, deals with the polysemy of terms. Experiments on the parallel corpora of Shi Ji show the effectiveness of the sub-words, with a large improvement in performance compared to IBM Model 4.
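A minimal sketch of the sub-word extraction step, assuming the sub-words are adjacent character pairs scored with a chi-square association test; the corpus fragment and the 3.84 threshold (0.05 significance) are illustrative, and the actual method also uses the log-likelihood ratio test over much larger counts.

```python
# Score adjacent characters with a chi-square test and keep strongly associated
# pairs as sub-word units for segmenting ancient Chinese.
def chi_square(corpus, x, y):
    """Chi-square association between character x and a following character y."""
    bigrams = list(zip(corpus, corpus[1:]))
    n = len(bigrams)
    a = sum(1 for u, v in bigrams if u == x and v == y)   # x followed by y
    b = sum(1 for u, v in bigrams if u == x and v != y)   # x, then something else
    c = sum(1 for u, v in bigrams if u != x and v == y)   # something else, then y
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

corpus = "太史公曰先人有言自周公卒五百歲而有孔子"
candidates = set(zip(corpus, corpus[1:]))
subwords = [x + y for x, y in candidates if chi_square(corpus, x, y) > 3.84]
print(subwords)
```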
  • Review
    LIU Ying, CAO Xiang
    2016, 30(3): 52-59.
    An entropy model is used to align English-Chinese person names in an English-Chinese parallel corpus. The model makes use of a person name dictionary, a surname dictionary, word alignment probability, a co-occurrence feature, transliteration similarity based on minimum edit distance, and transliteration similarity based on Metaphone. The experimental results show that this method achieves good precision and recall on a large parallel corpus. We also investigate the alignment errors in English-Chinese person names and suggest possible solutions.
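A minimal sketch of one listed feature, the transliteration similarity based on minimum edit distance; the normalization to [0, 1] is an assumption, not necessarily the paper's exact formula.

```python
# Edit-distance based transliteration similarity between two romanized names.
def edit_distance(s, t):
    """Standard Levenshtein distance."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[m][n]

def transliteration_similarity(name1, name2):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    dist = edit_distance(name1.lower(), name2.lower())
    return 1.0 - dist / max(len(name1), len(name2), 1)

print(transliteration_similarity("zhang wei", "chang wei"))
```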
  • Review
    Seyyare Imam, Hussein Yusuf, Abdusalam Dawut
    2016, 30(3): 60-67.
    A rule-based normalization method for the Latin transcriptions of Uyghur characters popular on the Web is presented. First, we establish a large-scale text corpus comprising four different types of datasets: the set of fixed words, the set of word-initial letter sequences, the set of suffix letter sequences, and the set of special words. Then we normalize the Uyghur Latin transcriptions via minimum edit distance, using the characteristics of the letter sequence within a word and the context information of adjacent letters. Finally, a detailed analysis of the experimental results and directions for further research are given.
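A minimal sketch of the normalization idea, assuming a hypothetical fixed-word set; difflib's similarity ratio stands in for the minimum edit distance criterion, and the letter-sequence and context rules of the actual method are not shown.

```python
# Map a noisy web Latin-Uyghur token to the closest entry in a canonical word set.
import difflib

CANONICAL_WORDS = {"yaxshimusiz", "rehmet", "mektep"}   # hypothetical fixed-word set

def normalize(token, cutoff=0.75):
    """Return the closest canonical word, or the token itself if nothing is close."""
    best = difflib.get_close_matches(token.lower(), CANONICAL_WORDS, n=1, cutoff=cutoff)
    return best[0] if best else token

print(normalize("yahshimusiz"))   # -> "yaxshimusiz"
print(normalize("kompyuter"))     # unchanged: no close canonical entry
```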
  • Review
    Muheyat·Niyazbek,Kunsaule·Talp
    2016, 30(3): 68-73.
    This paper reports a statistical method for the identification of IT terms in Kazakh. It builds a maximum entropy model, followed by rule-based post-processing. The experimental results reveal an accuracy of 82.6% in the closed test.
  • Review
    CHEN Xiaoying, AI Jinyong
    2016, 30(3): 74-78.
    With the rapid development of Tibetan information technology, the transcription of Tibetan into Latin has become an important issue. This article designs and implements such a transcription system based on a small character set. Drawing on Tibetan orthography, the paper proposes a transcription algorithm according to the characteristics of the small Tibetan/Latin character set encoding. The implementation of the Tibetan-Latin transcription system can resolve compatibility issues between different Tibetan encodings.
  • Review
    Nurmemet Yolwas,ZHANG Liwen,Wushour Silamu
    2016, 30(3): 79-84.
    Research shows that pronunciation differences between speakers seriously affect Uyghur speech recognition systems. Focusing on speaker adaptation technology, this paper applies the MLLR, MAP and MLLR+MAP methods to the training of acoustic models for a Uyghur continuous speech recognition system. Experimental results show that with the three speaker adaptation methods, the word error rate is reduced by 0.6%, 2.34% and 2.57%, respectively.
  • Review
    XUAN Longyun,CUI Rongyi
    2016, 30(3): 85-89.
    After discussing the necessity of Korean language information processing, this paper analyzes the status quo of Korean IT standardization at home and abroad. It suggests that the unified information technology infrastructure standard system for Chinese minority languages deserves improvement. Korean IT standardization has far-reaching significance for the cultural heritage and development of the Korean ethnic group in China, and is an indispensable part of a complete, unified Chinese language information processing platform.
  • Review
    Aishan Wumaier, Tuergen Yibulayin, Kahaerjiang Abiderexiti, Zaokere Kadeer, Maihemuti Maimaiti, Yashen Aizezi
    2016, 30(3): 90-95.
    This paper proposes a Uyghur chunk parsing scheme and extracts chunks from 3000 annotated sentences. According to the characteristics of the Uyghur language, additional features based on stems, affixes, synonyms, etc. are added. With the 3000 annotated sentences, cross-validation experiments at training/testing ratios of 9∶1, 8∶2 and 2∶1 yield recall rates of 80.34%, 76.87% and 66.76%, respectively.
  • Review
    WANG Dandan,HUANG Degen,GAO Yang
    2016, 30(3): 96-102.
    English-Chinese name transliteration can be described as syllable-based translation, which can be addressed by a current phrase-based statistical machine translation model. After describing a detailed rule-based syllabification method, this paper presents a translation phrase table optimization based on a frequency threshold and the C-value. In addition, the method integrates local features of Chinese names as well as a two-stage syllabification strategy. The experimental results show that the performance of English-Chinese name transliteration is improved from 63.78% to 67.56% in terms of ACC.
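A minimal sketch of the C-value part of the phrase table optimization, using the common form of the C-value measure over nested phrases; the syllable phrases and counts are hypothetical, and the paper's exact pruning criterion may differ.

```python
# Score phrases by C-value: longer phrases score higher, but occurrences inside
# longer phrases are discounted.
import math

freq = {"a bo": 50, "a bo te": 30, "bo te": 40}   # phrase -> frequency (hypothetical)

def c_value(phrase, freq):
    length = len(phrase.split())
    longer = [p for p in freq if phrase in p and p != phrase]   # phrases nesting it
    if not longer:
        return math.log2(length) * freq[phrase]
    nested = sum(freq[p] for p in longer) / len(longer)
    return math.log2(length) * (freq[phrase] - nested)

for p in freq:
    print(p, round(c_value(p, freq), 2))
```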
  • Review
    YANG Nan,LI Mu
    2016, 30(3): 103-110.
    Long-distance reordering is a major challenge in statistical machine translation. Previous work has shown that pre-reordering is a promising way to tackle this problem. In this work, we extend this line of research and propose a neural network based pre-reordering model, which integrates neural network modeling into a linear ordering framework. The neural network based model can leverage syntactic and semantic information extracted from unlabeled data to predict the word order differences between languages. Experiments on Chinese-English and Japanese-English machine translation tasks show the effectiveness of our approach.
  • Review
    YANG Shuanglong,LV Xueqiang,LI Zhuo,XU Liping
    2016, 30(3): 111-117.
    Chinese patent literature contains abundant domain-specific terms, and automatic recognition of terminology is an important task in information extraction and text mining. In this paper, we propose an approach for the automatic generation of term formation rules together with a novel TermRank algorithm. We first generate a set of term formation rules automatically from a large number of patent titles and then apply those rules to patent texts to obtain term candidates. Finally, the TermRank algorithm decides the final terms. Experimental results on 9,725 Chinese patent documents demonstrate the effectiveness of the proposed approach.
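The abstract does not spell out the TermRank algorithm; the sketch below is a generic PageRank-style iteration over a hypothetical co-occurrence graph of term candidates, shown only to illustrate graph-based ranking of candidates.

```python
# Generic PageRank-style ranking over a term-candidate graph.
def rank(graph, damping=0.85, iterations=50):
    """graph: {candidate: [linked candidates]} with links listed in both directions."""
    nodes = list(graph)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_score = {}
        for n in nodes:
            incoming = sum(score[m] / len(graph[m]) for m in nodes if n in graph[m])
            new_score[n] = (1 - damping) / len(nodes) + damping * incoming
        score = new_score
    return score

candidates = {
    "数据处理": ["数据处理装置", "处理装置"],
    "数据处理装置": ["数据处理", "处理装置"],
    "处理装置": ["数据处理", "数据处理装置"],
}
print(sorted(rank(candidates).items(), key=lambda kv: -kv[1]))
```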
  • Review
    SUN Shuihua,HUANG Degen,NIU Ping
    2016, 30(3): 118-124.
    A term extraction model based on language rules in the TCM acupuncture domain is established. First, a seed set of TCM acupuncture domain terms is iterated a finite number of times to generate the component set. Second, taking the component set as a domain dictionary, the model applies the forward maximum matching algorithm to segment sentences and extract term candidates. Finally, the term candidates are filtered by rules. The F-measures in the open test are 76.96% and 35.59%, with keywords and a traditional Chinese medicine dictionary as the seed set, respectively.
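A minimal sketch of the forward maximum matching step, with the component set acting as the domain dictionary; the tiny dictionary and the example sentence are hypothetical.

```python
# Greedy left-to-right segmentation that always prefers the longest dictionary match.
def forward_max_match(sentence, dictionary, max_len=6):
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + j]
            if j == 1 or piece in dictionary:   # fall back to a single character
                tokens.append(piece)
                i += j
                break
    return tokens

acupoint_dict = {"足三里", "合谷", "针刺"}       # hypothetical component set
print(forward_max_match("针刺足三里与合谷", acupoint_dict))
# -> ['针刺', '足三里', '与', '合谷']
```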
  • Review
    CAO Cong,CAO Cungen,ZANG Liangjun,WANG Shi
    2016, 30(3): 125-132.
    Large-scale commonsense knowledge is indispensable for intelligent machines, and commonsense knowledge acquisition has always been an important research area of artificial intelligence. This paper presents an interactive method to guide contributors to supply event-based commonsense knowledge. The acquisition process is interactive: the machine dynamically generates questions for a contributor, and the contributor provides commonsense knowledge through his or her answers. In addition to prompt information, seven types of questions are presented in a progressive order to guide the knowledge contributors' thinking, which also makes the contributing process more engaging. The results show that the interactive method increases the amount of acquired knowledge by 451.61%, with an accuracy of 92.5%.
  • Review
    TIAN Weidong, YU Yongyong
    2016, 30(3): 133-142.
    Although a Conditional Random Field (CRF) model can automatically tag the focus in a question, some deep relationships among focuses still cannot be mined, which notably impairs focus recognition. In this paper, a focus recognition method based on frequent dependency tree patterns of Chinese questions is proposed. The method mines the probabilities of various dimensional relationships of the focus hidden in a dependency tree corpus to improve recognition accuracy. Its main steps are mining frequent dependency subtrees to generate the corresponding statistical rules, using a CRF for initial focus annotation, and using the frequent dependency subtree statistical rules to correct the initial annotation. The experimental results show that the proposed method improves accuracy by about 3% on average compared with the CRF model.
  • Review
    LI Pei, WENG Wei, LIN Chen
    2016, 30(3): 143-151.
    As microblogs become a major web medium for individuals to share and spread instant information, mining event evolution on microblogs is a practical task. In this paper, we exploit the Minimum-Weight Connected Dominating Set and the Directed Steiner Tree to generate a storyline from microblogs for user input queries. Our framework consists of three stages: 1) construction of a multi-view graph over the relevant microblogs retrieved by Lucene for the user's queries; 2) selection of representative microblogs by finding the Minimum-Weight Connected Dominating Set; and 3) connection of the microblogs by searching for the Directed Steiner Tree. Experiments on real datasets demonstrate the efficiency and effectiveness of the proposed framework.
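The second stage computes a Minimum-Weight Connected Dominating Set; the sketch below shows only a simple greedy weighted dominating-set heuristic over a hypothetical microblog similarity graph, without the connectivity constraint or the Directed Steiner Tree stage.

```python
# Greedily pick representative posts so that every post is either picked or
# adjacent to a picked post, preferring cheap, well-connected posts.
def greedy_dominating_set(graph, weight):
    """graph: {post: set(similar posts)}; weight: {post: selection cost}."""
    uncovered = set(graph)
    chosen = []
    while uncovered:
        best = max(graph, key=lambda n: len((graph[n] | {n}) & uncovered) / weight[n])
        chosen.append(best)
        uncovered -= graph[best] | {best}
    return chosen

posts = {"t1": {"t2", "t3"}, "t2": {"t1"}, "t3": {"t1", "t4"}, "t4": {"t3"}}
weight = {"t1": 1.0, "t2": 2.0, "t3": 1.5, "t4": 2.0}
print(greedy_dominating_set(posts, weight))   # -> ['t1', 't3']
```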
  • Review
    LIN Mingming,QIU Yunfei,SHAO Liangshan
    2016, 30(3): 152-162.
    A fuzzy quantification method based on a three-dimensional coordinate system is presented for microblog sentiment classification. First, we define and divide microblog sentiment into six classes and calculate the fuzzy sentiment. Second, we construct a three-dimensional coordinate system based on the sentiment classes and map the sentences into this three-dimensional space. Finally, we decide the sentiment class of a sentence according to the angle between the sentence vector and each axis. In experiments classifying three authors' microblogs, the F-measure reaches more than 85%, outperforming three classical algorithms.
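A minimal sketch of the angle-based decision step; the paper maps six sentiment classes into a three-dimensional space, while this sketch keeps only three hypothetical axes and an example sentence vector to show the angle computation.

```python
# Assign a sentence to the sentiment axis its fuzzy sentiment vector is closest to.
import math

AXES = {"positive": (1, 0, 0), "negative": (0, 1, 0), "neutral": (0, 0, 1)}

def angle(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.degrees(math.acos(dot / norm))

def classify(sentence_vector):
    """Pick the axis with the smallest angle to the sentence vector."""
    return min(AXES, key=lambda label: angle(sentence_vector, AXES[label]))

print(classify((0.7, 0.1, 0.2)))   # -> 'positive'
```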
  • Review
    ZHANG Han,SHENG Yaqi,LV Chen,JI Donghong
    2016, 30(3): 163-171.
    This paper improves the identification of textual entailment with latent semantic features of short texts. The method trains a reliable latent variable model on sentences and derives sentence similarity features. The short-text latent semantic features, combined with string features such as word overlap, n-gram overlap and cosine similarity, and lexical semantic features such as unlabeled subtree overlap and labeled subtree overlap, are used to identify textual entailment with an SVM. We test on the RTE-8 task, and the results show that the latent semantic features are helpful for recognizing textual entailment.
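A minimal sketch of the surface string features named above (word overlap, n-gram overlap, cosine similarity) for a text/hypothesis pair; whitespace tokenization and the exact feature set are simplifying assumptions, and the latent semantic and subtree features are not shown.

```python
# Compute simple string features for an SVM-based textual entailment classifier.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def features(text, hypothesis):
    t, h = text.split(), hypothesis.split()
    return [overlap(t, h), overlap(ngrams(t, 2), ngrams(h, 2)), cosine(t, h)]

print(features("a dog is running in the park", "a dog runs in the park"))
```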
  • Review
    WANG Xiaochun,LI Sheng,YANG Muyun,ZHAO Tiejun
    2016, 30(3): 172-177.
    Personalized information retrieval tailors the ranking of documents to individual interests, and has long been recognized as a promising way to improve the search experience. To improve personalized retrieval performance, this paper presents a general method that combines long-term and short-term interests to improve the query model. Tested on a large-scale real search log of a commercial search engine, our method captures individual information needs more accurately and significantly outperforms the state-of-the-art method.
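The abstract does not give the combination formula; the sketch below shows one common option, linear interpolation of the query language model with short-term and long-term interest distributions, with all weights and toy distributions as assumptions.

```python
# Mix the query model with short-term (session) and long-term (history) interests.
def combine(query_model, long_term, short_term, alpha=0.6, beta=0.25):
    """P(w) = (1 - alpha - beta) * P_query(w) + alpha * P_short(w) + beta * P_long(w)."""
    words = set(query_model) | set(long_term) | set(short_term)
    return {w: (1 - alpha - beta) * query_model.get(w, 0.0)
               + alpha * short_term.get(w, 0.0)
               + beta * long_term.get(w, 0.0)
            for w in words}

query = {"jaguar": 1.0}
short_term = {"car": 0.7, "jaguar": 0.3}        # clicks in the current session
long_term = {"animal": 0.5, "wildlife": 0.5}    # accumulated search history
print(combine(query, long_term, short_term))
```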
  • Review
    SUN Shuqi, SUN Ke, ZHAO Shiqi, LI Sheng, WANG Haifeng, YANG Muyun
    2016, 30(3): 178-186.
    In emerging entity-oriented search services, an accurate prediction of the relatedness between entities is essential. This paper proposes an approach to computing entity relatedness based on entities' factual knowledge, i.e., subject-property-object (SPO) records. We adopt a two-step estimation based on properties and objects, mapping an entity to a discrete distribution over object words and obtaining the relatedness of two entities by comparing the object words they share. On the related-entity re-ranking problem in entity-oriented search, experimental results show that our approach achieves 80.9% top-5 precision on average, outperforming bag-of-words and query-log co-occurrence based approaches. We also conduct a quantitative analysis of how user demand in different domains affects the relatedness computation.
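A minimal sketch of the SPO-based relatedness idea: map each entity to a distribution over the object words of its facts and compare the shared object words; cosine similarity and the toy triples are assumptions rather than the paper's exact estimation.

```python
# Entity relatedness from shared object words in subject-property-object records.
from collections import Counter
import math

def object_distribution(spo_records, entity):
    """Bag of object words taken from the entity's SPO facts."""
    words = Counter()
    for subject, _prop, obj in spo_records:
        if subject == entity:
            words.update(obj.split())
    return words

def relatedness(spo_records, e1, e2):
    d1, d2 = object_distribution(spo_records, e1), object_distribution(spo_records, e2)
    dot = sum(d1[w] * d2[w] for w in d1)
    norm = math.sqrt(sum(v * v for v in d1.values())) * math.sqrt(sum(v * v for v in d2.values()))
    return dot / norm if norm else 0.0

facts = [("Beijing", "country", "China"), ("Beijing", "type", "capital city"),
         ("Shanghai", "country", "China"), ("Shanghai", "type", "coastal city")]
print(relatedness(facts, "Beijing", "Shanghai"))
```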
  • Review
    LU Xiao,LI Peng,WANG Bin,LI Yingbo,FANG Jing
    2016, 30(3): 187-195.
    In contrast to existing microblog recommendation based on social relationships, this paper analyzes user interaction at the topic level and proposes a new method to measure the strength of this relationship. We infer the topic of the interaction relationship and propose IBCF, an improved microblog recommendation model. Experimental results show that, compared with the currently popular social recommendation methods, the proposed method performs better in terms of MAP and NDCG, generating more reasonable recommendation results.
  • Review
    ZHOU Qiang
    2016, 30(3): 196-203.
    The predicate lexicon is the core resource for deep grammar analysis. In contrast to existing manual construction methods, this paper proposes a new method to generate a predicate lexicon for Combinatory Categorial Grammar (CCG) from multiple resources. The method extracts semantic and syntactic features from HowNet, PKU_GD and large-scale event patterns, generates CCG prototypes, and assigns them to the predicates whose features fully overlap. An expanded predicate lexicon is then generated by merging the results of classification and membership analysis. For the final predicate lexicon of 15 thousand predicates, evaluation on a standard set of 1,000 uniformly distributed predicates independently annotated by multiple annotators shows that its precision reaches 96.3%.
  • Review
    WANG Ting,XU Tiansheng,JI Fujun
    2016, 30(3): 204-212.
    Current research on linked open data (LOD) mainly focuses on the instance level, while the task of finding schema-level links between LOD datasets is ignored. To solve the large-scale Chinese ontology mapping problem in LOD, we propose an ontology mapping architecture based on data fields and sequence alignment. First, based on an improved nuclear field potential function, we reduce the dimensionality of the unaligned large-scale Chinese ontologies. Second, we use a sequence alignment algorithm to compute the similarity between concepts. Compared with other typical similarity computation algorithms, the experimental results show that the proposed method has higher overall performance and usability.
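A minimal sketch of the sequence-alignment step, using a Needleman-Wunsch global alignment score between two concept labels; the scoring parameters, normalization and example labels are assumptions, and the data-field dimension reduction is not shown.

```python
# Global sequence alignment (Needleman-Wunsch) as a concept-label similarity.
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-1):
    m, n = len(s), len(t)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap
    for j in range(1, n + 1):
        score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[m][n]

def concept_similarity(a, b):
    """Alignment score normalized roughly into [0, 1]."""
    return max(needleman_wunsch(a, b), 0) / max(len(a), len(b))

print(concept_similarity("高等学校", "高等院校"))   # -> 0.5
```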
  • Review
    ZHANG Chunju,ZHANG Xueying,WANG Shu,LIAO Jianping,CHEN Xiaodan
    2016, 30(3): 213-222.
    Text has become an important data source of geo-spatial information. Current research on structured geo-spatial information expression focuses on the extraction of spatial information in text, such as place names and spatial relations, while abundant temporal information, event information and spatio-temporal information is ignored. In this paper, an annotation scheme for the spatio-temporal information of events in Chinese text is proposed. First, the linguistic characteristics of the spatio-temporal information of events in Chinese text are analyzed. Then an annotation schema is presented and the annotation specification is described in detail. Finally, GATE (General Architecture for Text Engineering) is introduced as the annotation platform, and a large-scale annotated corpus based on Web data sources is developed and evaluated. This study effectively addresses the current lack of specifications and standard data for the interpretation of events and spatio-temporal information in Chinese text.