2015 Volume 29 Issue 3 Published: 08 May 2015
  

    Language Analysis and Language Resource Construction
  • Language Analysis and Language Resource Construction
    ZHANG Kunli, ZAN Hongying, CHAI Yumei, HAN Yingjie, ZHAO Dan
    2015, 29(3): 1-8.
    Contemporary Chinese function words, each with distinct usages, play complex syntactic roles, so their study is of great significance to Chinese syntactic analysis and semantic understanding. This paper first reviews current research on Chinese function words and lexical knowledge bases. It then describes the three-part ("triune") construction of a knowledge base of modern Chinese function words, comprising a usage dictionary, usage rules, and a usage-annotated corpus. Using the knowledge base completed so far, the automatic recognition of function word usages is investigated, and other potential applications are discussed.
  • Language Analysis and Language Resource Construction
    QIU Likun, JIN Peng, WANG Houfeng
    2015, 29(3): 9-15.
    Treebanks are an important resource for natural language processing. All existing dependency treebanks and phrase structure treebanks can be regarded as single-view treebanks. This paper proposes a schema for building a multi-view Chinese treebank based on dependency grammar. Under this schema, annotators only need to mark the head and the syntactic role of each child node; the phrase structure function and hierarchy information of each phrase can then be inferred, which greatly improves labeling efficiency without losing information. Following this schema, we built the PKU Multi-view Chinese Treebank (PMT) version 1.0, which contains 64,000 sentences and 1.4 million words and supports both a phrase structure grammar view and a dependency grammar view.
  • Language Analysis and Language Resource Construction
    YU Shiwen, ZHU Xuefeng
    2015, 29(3): 16-20.
    Lexical semantics research oriented toward natural language processing should be based on quantitative study of the lexicon. After a brief survey of the main achievements in quantitative study of the Chinese lexicon, this paper proposes a project to build a knowledge base of commonly used words, for which we describe 1) a constructive definition of the list of commonly used words, 2) a quantitative method for measuring the coverage of a given word list over an annotated corpus, and 3) the concept of the “component word”. We also introduce the overall design of the knowledge base and the current progress of the project. The construction of such a knowledge base is expected to contribute to Chinese lexical semantics research and to the development of Chinese information processing.
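The abstract does not spell out the coverage measure, so the following is only a minimal sketch under a stated assumption: coverage is taken to be the share of running tokens in a segmented corpus that belong to the word list. The function name `coverage` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def coverage(word_list, corpus_tokens):
    """Return the fraction of corpus tokens that appear in word_list.

    Assumption: coverage is measured over running tokens (not over
    distinct word types); the paper may define it differently.
    """
    vocabulary = set(word_list)
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for word, n in counts.items() if word in vocabulary)
    return covered / total


# Toy segmented corpus (hypothetical data).
tokens = ["我们", "喜欢", "自然", "语言", "处理", "我们", "喜欢", "研究"]
common_words = ["我们", "喜欢"]
print(coverage(common_words, tokens))
```

Counting over tokens rather than types rewards lists that contain high-frequency words, which seems to match the notion of "commonly used" words, though that reading is an assumption.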
  • Language Analysis and Language Resource Construction
    XUE Hongwu
    2015, 29(3): 21-26.
    Because of its flexible morphological rules, Chinese often allows a disyllabic word to be split by inserting other syntactic constituents of the sentence. To facilitate word segmentation, part-of-speech tagging and semantic computing, this paper systematically discusses two types of disyllabic word split in Mandarin: the split of Separable words and the split of Prototype words. It describes the grammatical motivation for their formation, their properties, and their meanings, indicating that 1) the Prototype word split is caused by pragmatic factors and carries subjective structures and grammatical meanings, whereas 2) the Separable word split is caused by the needs of syntactic and semantic expression and carries objective structures and meanings.
  • Language Analysis and Language Resource Construction
    SONG Zuoyan, ZHAO Qingqing, KANG Shiyong
    2015, 29(3): 27-33.
    The analysis of compound nouns has long been an important topic in linguistics and natural language processing, bearing on the recognition and automatic interpretation of Unknown Words (UWs) and on dictionary compilation. Drawing on Generative Lexicon Theory, a recent semantic theory, this paper builds a lexicon of compound nouns with semantic annotation. In addition to the annotation scheme, the paper demonstrates the lexicon's potential application to the word-formation and semantic analysis of compound nouns through a comparative analysis of compound nouns containing zhi (纸) and shi (石). The analysis reveals that qualia roles, natural types and artifactual types are important semantic information for uncovering patterns and rules in the formation and semantics of compound nouns.
  • Semantic Computing
  • Semantic Computing
    YANG Hua, JI Donghong, XIAO Guozheng
    2015, 29(3): 34-43.
    A semantic field is a semantic system composed of glossemes and the links among them. For a given language, all sub-semantic-fields together form the whole semantic field of that language. Following the concept of the association semantic field, we employ a complex network to represent the Chinese semantic field. Scale-free distributions of node degree, node weight, and edge weight are observed in this network. Some language phenomena unique to the network can be discovered through terms whose node degree, node weight, or edge weight falls in specific ranges. We present some of the phenomena detected, in the expectation that further studies will provide reasonable explanations.
  • Semantic Computing
    XIANG Chuncheng, SUI Zhifang, ZHAN Weidong
    2015, 29(3): 44-51.
    Ontology matching is the key solution to the semantic heterogeneity problem. Focusing on the noun concepts of HowNet and CCD, this paper applies machine learning to identify the initial mapping relationships, discussing feature selection, the division of sample sets, and classifier selection. Then, exploiting the overall structure of the ontology, a similarity propagation algorithm is introduced to adjust the initial mapping globally. Experimental results show that the precision of 1:1 and 1:n mapping relationships reaches 94% and 87.5%, respectively.
  • Semantic Computing
    CHEN Gang, LIU Yang
    2015, 29(3): 52-57.
    Feature description and taxonomic description are two basic knowledge representations widely employed in lexical semantics. However, the transformation between them remains an open issue that has not been thoroughly discussed. In this paper, we apply the notion of an ordering relationship to the feature description and automatically derive a taxonomy running from general to specific concepts, in which previously undefined intermediate concepts are revealed. Experiments on HowNet (2000) show that a semantic taxonomy with well-defined inheritance and full coverage of all concepts can be generated automatically by this approach. Further analysis of the output also points to some underlying defects in the feature description for natural language knowledge engineering.
  • Semantic Computing
    XIONG Jing, ZHI Liping, YUAN Dong
    2015, 29(3): 58-64.
    To bridge the gap between words and syntactic components in current semantic annotation, a semantic annotation method for unstructured text based on ontology and dependency syntax is proposed. Applied at the sentence level, the method employs features including POS, a semantic dictionary, and other linguistic features, and determines the lexical semantic relations from the dependency structure between words. In addition, an evaluation metric combining features such as semantic similarity and semantic richness is designed, which is essentially a confidence measure for the method itself. Experimental results show that the semantic tagging algorithm reaches high accuracy, especially on large-scale corpora.
  • Discourse Annotation and Reasoning
  • Discourse Annotation and Reasoning
    WANG Xun, LI Sujian, WANG Yuxin
    2015, 29(3): 65-70.
    Discourse tagging is fundamental to natural language processing and supports a deep understanding of texts. Many applications, such as automatic summarization and question answering, would benefit greatly from such understanding. Building on existing discourse theories such as Rhetorical Structure Theory and Centering Theory, this paper designs a new discourse tagging system that covers both logical relations and text content to meet the practical needs of real natural language processing tasks.
  • Discourse Annotation and Reasoning
    WU Yunfang, XU Yifeng, WANG Kairan
    2015, 29(3): 71-81.
    Automatic discourse analysis has attracted strong interest in recent years. Compared with the large body of work on English discourse analysis, much less has been done on Chinese discourse parsing. One important reason is that no well-annotated Chinese discourse corpus is publicly available. Within the RST framework, this paper proposes an intra-sentence relationship annotation scheme for Chinese discourse analysis. We consider both the topic and the logic aspects, distinguishing attachment relationships from logic relationships within Chinese sentences. The logic relationships consist of 6 types and 15 subtypes. So far, we have annotated 8,000 sentences from People's Daily news. We checked 1,000 sentences in a double-blind manner to measure inter-annotator agreement, which gives some indication of the difficulty of the task. Based on the annotated data, we present some statistical analysis and demonstrate some challenges for automatic Chinese discourse analysis.
  • Discourse Annotation and Reasoning
    NI Shengjian, JI Donghong
    2015, 29(3): 82-87.
    Recognition of Textual Entailment (RTE) is of substantial significance to most natural language processing tasks. This paper explores schematic explanations of textual entailment (TE), showing through case studies how (image) schemata can justify TE results. Schemata include qualia structures, idealized cognitive models, frames, scripts, and so on, all of which are structures that can be used to represent word meaning. In a broad sense, all these kinds of schemata belong to the category of semantic features and thus have the potential to serve as evidence for TE. Exploring RTE based on schemata and constructing corresponding schema corpora may help to resolve the bottleneck issues in RTE.
  • Discourse Annotation and Reasoning
    YAN Weirong, ZHU Shanshan, HONG Yu, YAO Jianmin, ZHU Qiaoming
    2015, 29(3): 88-99.
    Discourse relation analysis is a natural language understanding task that aims to analyze and process the semantic relations and rhetorical structure of discourse. Implicit discourse relation analysis is an important subtask: automatically detecting the sense of the semantic relation between arguments in the absence of direct cues. Currently, the performance of implicit discourse relation analysis is low, with state-of-the-art accuracy reaching only 40%. The major cause is that existing methods do not analyze arguments within a semantic frame and are limited to local features and correlation analysis of the arguments. This paper proposes a method of implicit discourse relation inference based on frame semantics. The method automatically recognizes the semantic frames of arguments using FrameNet and related identification technology. On this basis, we identify the semantic relation between arguments from the probability distribution of frame semantic relations in large-scale text data. The experimental results show that using only the first level of frame semantics improves the detection performance for implicit discourse relations by up to 5.14%; moreover, the method increases accuracy by 10.68% when the balance of relation categories is taken into account.
  • Sentiment Analysis and Social Computing
  • Sentiment Analysis and Social Computing
    GUI Bin, YANG Xiaoping, ZHU Jianlin, ZHANG Zhongxia, XIAO Wentao
    2015, 29(3): 100-105.
    Microblogs, as a new form of interactive social networking, are rich in people's opinions. Aiming at the identification of sentiment orientation in microblogs, this paper proposes an algorithm based on sense group partition. After introducing the concept of the sense group, we propose an algorithm for sense group partition. Then, taking into account negative words, degree words and punctuation, we establish a formula for sentiment identification based on the relationships between sense groups. The experiments show an accuracy of 80.1%, outperforming both the sentiment lexicon based approach and the SVM based method.
  • Sentiment Analysis and Social Computing
    XU Xinyi, LIU Gongshen
    2015, 29(3): 106-112.
    With the development of the Internet, text orientation identification and text mining in social networks have become hot research issues. In this paper, a text sentiment orientation identification method using texture features is proposed. Feature reduction is conducted using the mutual information between the texture features and the text orientations. Compared with a sentiment orientation classification method based on word frequency, the proposed method improves precision by about 10% on average.
  • Sentiment Analysis and Social Computing
    LIAO Jian, WANG Suge, LI Deyu, ZHANG Peng
    2015, 29(3): 113-120.
    Focusing on the problem of sentiment polarity classification for online reviews, a multi-level sentiment classification method is proposed based on a bag-of-opinion model and a set of linguistic rules. According to the part of speech of each word in a sentence, 12 patterns are designed for extracting feature-opinion pairs, which make it possible to represent the whole text as a series of four-tuples of the form “feature, degree word, opinion word, negation word”. After an estimate of the sentiment priority of each four-tuple is designed, cosine similarity is adopted for a 5-level sentiment polarity classification. Experiments on the car dataset from COAE2012 Task 3 indicate a good result compared with the other runs in COAE.
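The abstract above names cosine similarity as the final classification step but gives no detail on how reviews are vectorized. The sketch below therefore only illustrates the similarity step itself: a review vector is compared with one prototype vector per polarity level, and the closest level is chosen. The vectors, the prototypes, the level names and the function names are all hypothetical assumptions, not the authors' representation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def classify(review_vector, prototypes):
    """Return the label whose prototype is most similar to the review."""
    return max(prototypes, key=lambda label: cosine(review_vector, prototypes[label]))


# Hypothetical 2-dimensional prototypes (positive mass, negative mass)
# for a 5-level polarity scale.
prototypes = {
    "very negative": [0.0, 1.0],
    "negative": [0.25, 0.75],
    "neutral": [0.5, 0.5],
    "positive": [0.75, 0.25],
    "very positive": [1.0, 0.0],
}
print(classify([0.9, 0.1], prototypes))
```

Because cosine similarity ignores vector length, a long review and a short review with the same mix of opinions would receive the same level under this sketch.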
  • Sentiment Analysis and Social Computing
    XU Xueke, TAN Songbo, LIU Yue, CHENG Xueqi, WU Qiong
    2015, 29(3): 121-129.
    This paper addresses the problem of learning aspect-dependent sentiment knowledge. Specifically, a novel topic model, called the Joint Aspect/Opinion Model (JAO), is proposed to detect aspects and aspect-specific opinion words simultaneously in an unsupervised manner. We then propose to infer aspect-dependent sentiment polarity scores for these opinion words based on the hitting times from the words to a handful of positive/negative seed words, by applying Markov random walks over an aspect-specific word relation graph. Experimental results on restaurant review data show the effectiveness of the proposed approaches.
  • Sentiment Analysis and Social Computing
    LI Yaping, CAO Run, TONG Lu, LIANG Xun, NI Zhihao
    2015, 29(3): 130-139.
    With the rapid development of network technology in the Web 2.0 age, the number of social network website users has increased sharply. This paper collects nearly 20 thousand Tencent Microblogging users together with their posts, and analyzes the patterns of user content generation on Tencent Microblogging. From the perspectives of content contribution, user activity over time, and influence, we examine the number of posts, the ratio of original to reposted content, the amount of text content, the weekly and daily patterns of posting, the number of reposts, the repost influence of posts, and posts containing ‘@’. Our analysis yields observations such as the following: users' content contribution follows a “90-10” rule, different types of users have different “Microblogging styles”, and users’ posting behavior exhibits strong daily and weekly patterns.
  • Sentiment Analysis and Social Computing
    GU Zhiyu, QIN Tao, WANG Bing
    2015, 29(3): 140-149.
    CPA (Cost-per-Action) advertising is attracting more and more attention in both industry and research. Sponsored search based on CPA requires predicting the conversion probability of each candidate ad during ad ranking, in order to raise the conversion rate and optimize ad revenue for the search engine. After extracting and analyzing features that may influence ad conversion, we propose a model for ad conversion prediction based on a probabilistic factor graph, which describes the relation between the conversion event and three factors: ad, query, and user. The model is evaluated and compared with a Naive Bayes method on real-world data gathered from a commercial search engine. The experiment demonstrates a good result in ad conversion prediction, and shows that the three factors have different influences.
  • Sentiment Analysis and Social Computing
    HE Long
    2015, 29(3): 150-154.
    Current review spam identification methods focus on feature selection without addressing the imbalance of the data set. This paper presents a product review spam identification method based on the random forest, in which either the same number of samples is repeatedly drawn with replacement from the large class and the small class, or the same weight is assigned to the large class and the small class. Experimental results on an Amazon dataset show that the random forest method outperforms other baseline methods.
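The first balancing strategy described above, drawing equally many samples from each class with replacement, can be sketched as follows. This is an illustration only: the abstract does not state how many samples are drawn per class, so the size of the smaller class is assumed here, and the function name `balanced_bootstrap` and the toy labels are hypothetical. In a random forest, one such draw would be made for each tree.

```python
import random


def balanced_bootstrap(samples, labels, rng):
    """Draw one class-balanced bootstrap sample.

    Each class contributes the same number of items, drawn with
    replacement. Assumption: the per-class draw size equals the size
    of the smallest class.
    """
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    per_class = min(len(items) for items in by_class.values())
    drawn = []
    for label, items in by_class.items():
        drawn.extend((rng.choice(items), label) for _ in range(per_class))
    return drawn


# Toy imbalanced data: 8 genuine reviews, 2 spam reviews (hypothetical).
reviews = list(range(10))
labels = ["genuine"] * 8 + ["spam"] * 2
rng = random.Random(0)  # fixed seed so the draw is repeatable
print(balanced_bootstrap(reviews, labels, rng))
```

Repeating the draw for every tree lets the forest see most of the large class over time while each individual tree is trained on balanced data.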
  • Information Retrieval and Question Answering
  • Information Retrieval and Question Answering
    XU Bo, LIN Hongfei, LIN Yuan, WANG Jian
    2015, 29(3): 155-161.
    Query expansion is an important technique for improving retrieval performance. It adds relevant terms to the original query submitted by the user so that the query expresses the user’s information need more exactly and completely. Learning to rank is a hot machine learning topic in information retrieval that seeks to automatically construct ranking models for determining the degree of relevance between objects. This paper attempts to improve pseudo-relevance feedback by introducing a learning to rank algorithm to re-rank expansion terms. Term features are obtained from the original query terms and the expansion terms, and a new ranked list of expansion terms is learned from them. By adding the expansion term list to the original query, we can retrieve more relevant documents and improve accuracy. Experimental results on the TREC dataset show that incorporating ranking algorithms into query expansion leads to better retrieval performance.
  • Information Retrieval and Question Answering
    WANG Yue, LV Xueqiang, LI Zhuo, SHU Yan
    2015, 29(3): 162-168.
    Person name recognition in search logs has been a focus of log mining and has a direct impact on a search engine’s retrieval efficiency and accuracy. This paper analyzes the drawbacks of name identification methods designed for long texts when they are applied to search logs, and proposes a method to identify Chinese names in search logs. The method uses Conditional Random Fields to extract the internal word probabilities of names from search query logs, and then estimates the credibility of a person name according to the characteristics of the search log. Experimental results on Sogou query logs show that our approach reaches 81.97% accuracy and 85.81% recall on average, yielding an F-measure of 83.79%.
  • Information Retrieval and Question Answering
    XUN Endong, RAO Gaoqi, XIE Jiali, HUANG Zhie
    2015, 29(3): 169-176.
    The lexicon is the most active and time-sensitive subsystem of a language. As a language evolves, diachronic changes in its vocabulary attract the attention of linguists, historians, sociologists and others. We collected large-scale corpora covering a long time span and used natural language processing technology to develop a system for the diachronic retrieval of modern Chinese words. It provides search indexes such as frequency, cumulative sum, and cumulative frequency, to support studies of the semantics, pragmatics and other aspects of a word.
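The abstract lists frequency, cumulative sum and cumulative frequency as search indexes without defining them. The sketch below shows one plausible reading on yearly counts: frequency is the word's count divided by the corpus size for that year, cumulative sum is the running total of the word's counts, and cumulative frequency is the running count divided by the running corpus size. These definitions, the function name `diachronic_stats` and the toy numbers are assumptions, not the system's documented behavior.

```python
def diachronic_stats(word_counts, corpus_sizes):
    """Compute per-year indexes for one word.

    word_counts:  {year: occurrences of the word in that year's corpus}
    corpus_sizes: {year: total tokens in that year's corpus}
    Returns a list of rows ordered by year.
    """
    rows = []
    running_count = 0
    running_size = 0
    for year in sorted(corpus_sizes):
        count = word_counts.get(year, 0)
        size = corpus_sizes[year]
        running_count += count
        running_size += size
        rows.append({
            "year": year,
            "frequency": count / size if size else 0.0,
            "cumulative_sum": running_count,
            "cumulative_frequency": running_count / running_size if running_size else 0.0,
        })
    return rows


# Hypothetical counts for one word across three yearly sub-corpora.
counts = {1950: 2, 1960: 6}
sizes = {1950: 100, 1960: 100, 1970: 200}
for row in diachronic_stats(counts, sizes):
    print(row)
```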
  • Information Retrieval and Question Answering
    WEI BingJie, SHI Liang, WANG Bin
    2015, 29(3): 177-183.
    With the rapid development of microblogs, microblog retrieval has become a hot research topic in recent years. Microblog search differs significantly from traditional text retrieval in two respects. First, microblogs have their own text features, namely short text and the Hashtag as the theme term. Second, microblog search should consider time information as well as textual and semantic similarity. This paper addresses these issues by using clustering to expand text content. The Hashtag is introduced into the clustering and, to guarantee its effect, a method for enriching the Hashtags in a microblog is described. Finally, we use the time information as the document prior, and three models in total are examined in the experiments. Experiments on the TREC Microblog dataset show that our models significantly improve MAP and P@30, by 7.1% and 11.6% respectively.
  • Other Languages in/around China
  • Other Languages in/around China
    CHEN Xiaoying, AI Jinyong, YU Hongzhi
    2015, 29(3): 184-189.
    This paper describes an empirical study of the voice characteristics of Lhasa Tibetan words. Based on annotation of monosyllabic voice in the Lhasa dialect of Tibetan, the acoustic parameters of vowels and consonants are extracted, followed by a statistical analysis of pitch, open quotient and speed quotient. The results show that the voice parameters of different vowels and consonants are affected by the vocal style and the syllable position, and that different vowels and syllable structures affect the open quotient and speed quotient values.
  • Other Languages in/around China
    Yinhuahai, Nasun-urt
    2015, 29(3): 190-195.
    “The Semantic Information Dictionary of Mongolian Noun” (“The Dictionary” hereafter) has taken its basic form since 2009. Its progress is reflected in the expansion of its entries, the increase in its attributes, and its practical application in various systems. This paper introduces the development of the dictionary and discusses, with examples, this new progress and the preliminary applications of “The Dictionary”.
  • Other Languages in/around China
    ZHAO Weina, LI Lin, LIU Huidan, Pubudunzhu, WU Jian
    2015, 29(3): 196-200.
    Trisyllabic verb phrases in Tibetan are flexible and have complex structures. In this paper, an algorithm for the automatic extraction of trisyllabic verb phrases is designed by combining statistical models with linguistic rules. First, candidate trisyllabic verb phrases are retrieved according to verb phrase morphemes. Then, filters based on various statistical or rule-based methods are developed. The efficiency of the method is validated experimentally.
  • Other Languages in/around China
    Miliwan xuehelaiti, LIU Kai, Turgun Ibrahim
    2015, 29(3): 201-206.
    Machine translation from Chinese to Uyghur has substantial real-world applications. Focusing on this insufficiently addressed problem, this paper proposes a novel Chinese-Uyghur translation method in which Uyghur stems and suffixes are used as the basic translation units. Built on a directed graph, this “stem-suffix” language model is shown to be significantly better than previous word-based models.