2015 Volume 29 Issue 4 Published: 10 July 2015
  

  • Syntactic,Semantic Analysis
    FANG Yan, ZHOU Guodong
    2015, 29(4): 1-7.
    Traditional research in Chinese word segmentation focuses on identifying word boundaries, without considering the ambiguity of the boundaries between Chinese words and phrases. In theory, linguists hold to their own views of word boundaries, so no uniform standard exists for Chinese word segmentation; in practice, corpora annotated under differing guidelines cannot deliver satisfactory results across a wide range of applications. In this paper, we present a method based on cascaded CRF models that automatically parses the internal structure of words, deciding word boundaries and internal structures simultaneously with high precision. Compared with traditional word segmentation methods, analyzing the structure of words is more consistent with the fuzzy boundaries between Chinese lexical and syntactic units, which addresses the problem of inconsistent corpus standards and meets different application requirements.
  • Syntactic,Semantic Analysis
    YANG Jincai, XIE Fang, WANG Zhonghua, HU Jinzhu
    2015, 29(4): 8-15.
    Relation words are very important to the study of semantic relationships among clauses in compound sentences. Rule-based relation word identification demands rules that are dynamic and constantly improved. This article investigates how to recognize rule conflicts and resolve them. Compound sentences have four kinds of rules: common rules, even words rules, common sentence pattern rules, and collocation pattern rules. This article gives a formal description of all the rules and the way of storing them, based on which we design a workflow of relation word detection, rule condition detection, and result detection. A way of detecting conflicts is given, along with two ways of resolving them: priority mode and directed acyclic graph mode. With the proposed method, we have imported more than 1067 rules, with a correct rate of 100%.
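    The directed acyclic graph mode named in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the rule names, the `outranks` edges, and the `resolve` helper are all hypothetical. The sketch assumes that an edge from rule A to rule B means A takes precedence over B, and that a conflict among fired rules is settled by picking the one that comes first in a topological order.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical precedence edges: each key outranks every rule in its value set.
outranks = {
    "collocation_pattern": {"sentence_pattern", "common"},
    "sentence_pattern": {"common"},
    "common": set(),
}

def precedence_order(outranks):
    """Return rule names from highest to lowest precedence."""
    # TopologicalSorter expects node -> predecessors, so invert the edges.
    predecessors = {rule: set() for rule in outranks}
    for high, lows in outranks.items():
        for low in lows:
            predecessors.setdefault(low, set()).add(high)
    # static_order() raises graphlib.CycleError if the precedence is circular.
    return list(TopologicalSorter(predecessors).static_order())

def resolve(fired, outranks):
    """Pick the fired rule with the highest precedence."""
    rank = {rule: i for i, rule in enumerate(precedence_order(outranks))}
    return min(fired, key=rank.__getitem__)

winner = resolve({"common", "sentence_pattern"}, outranks)
```

    A circular precedence makes `static_order()` raise `graphlib.CycleError`, which is one simple way such a rule store could flag an inconsistent priority definition.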
  • Survey
    ZOU Bowei, ZHOU Guodong, ZHU Qiaoming
    2015, 29(4): 16-24.
    Negation and speculation expressions occur extensively in natural language. Identifying them and separating them from reliable information is valuable for many natural language processing tasks, such as information extraction, sentiment analysis, and text mining. Since the release of the BioScope corpus in 2008, several large-scale evaluation conferences and workshops have provided platforms for scholars to collect corpora, define tasks, and perform evaluations, and negation and speculation information extraction has gradually become a hot topic in natural language processing. This survey introduces the research background, task definition, and corpora for negation and speculation information extraction. It also reviews and analyzes current research and outlines its development trends.
  • Survey
    YANG Xuerong, HONG Yu, CHEN Yadong, YAO Jianmin, ZHU Qiaoming
    2015, 29(4): 25-32.
    Event relation detection is the task of detecting relations between events in a stream of text. Treating the event as the basic semantic unit, the relation type is determined by analyzing features of the semantic relevance between events. Event relation detection includes event relation identification (identifying whether an event pair is related or not) and event relation type decision (deciding which relation holds between related events, e.g. a causal relation). In this paper, we try to establish a framework for event relation detection in light of the concepts and data resources of discourse analysis, event extraction and scene understanding, covering the task definition, the classification system of event types, corpus acquisition and annotation, evaluation methodology, etc. Finally, we analyze and compare the differences between event relation detection and discourse relation analysis, and present the difficulties and challenges of event relation detection.
  • Sentiment Analysis and Social Computing
    LIN Liyuan, WANG Zhongqing, LI Shoushan, ZHOU Guodong
    2015, 29(4): 33-39.
    Opinion summarization aims to concentrate and refine text data so as to generate a summary of the opinions it expresses. It helps users read and understand the content of opinion text. This study focuses on multi-document opinion summarization, where the main task is to generate a summary from a large number of reviews of the same product. Opinion relevance is an important feature of opinion text and is considered in our opinion summarization method. In addition, users can better understand the objects mentioned in the reviews with the help of high-quality or high-credibility reviews, which is also considered in our method. We further collect and annotate an English multi-document corpus of product reviews. Empirical studies on the corpus demonstrate that incorporating opinion and quality information is effective for multi-document opinion summarization.
  • Sentiment Analysis and Social Computing
    GAO Kai, LI Siyu, RUAN Dongru, LIU Shaobo, ZHOU Erliang, QIAO Shiquan
    2015, 29(4): 40-49.
    Social networks have become an effective platform for mining social and public opinion. This paper proposes a sentiment analysis approach based on sentiment units and opinion targets. The extraction of the sentiment unit and the sentiment evaluation object is based on co-occurrence probability. This paper also calculates the sentiment degree of each sentiment unit. Experimental results validate the feasibility of the approach.
  • Sentiment Analysis and Social Computing
    HAO Zhifeng, DU Shenzhi, CAI Ruichu, WEN Wen
    2015, 29(4): 50-58.
    Owing to the informal words and expressions widely used in microblogs, target recognition for microblog sentiment analysis is difficult, especially when the targets are not clearly mentioned. An improved conditional random fields model is proposed to deal with this issue, treating sentiment target extraction as a sequence-labeling problem. By adding global nodes, contextual information, syntactic rules and an opinion lexicon are taken into account in target extraction. The major contribution of this method is that it can be applied to the texts in which the targets are mentioned in the sequence. Experimental results on Sina microblog data demonstrate that this method outperforms state-of-the-art methods.
  • Sentiment Analysis and Social Computing
    ZHANG Shaowu, YIN Jie, LIN Hongfei, WEI Xianhui
    2015, 29(4): 59-66.
    As an extension of user influence research, microblog user influence mining is becoming a hot research issue. Building on traditional user influence measures, we propose three novel methods for mining microblog user influence in terms of the value of micro-blogging, the proliferation influence of message propagation, and the user activity level. A user influence model including tweet influence, behavior influence, and activity influence is also presented. Finally, we describe the internal relations among the different influence indicators and discuss possible reasons for them.
  • Sentiment Analysis and Social Computing
    DAI Min, ZHU Zhu, LI Shoushan, ZHOU Guodong
    2015, 29(4): 67-73.
    Opinion information extraction (OIE) is an important sub-task in sentiment analysis research. Currently, one pressing issue in Chinese OIE is that Chinese corpora are not readily available. This paper focuses on the annotation framework for Chinese OIE and constructs a Chinese corpus containing rich information. Specifically, in addition to the popular elements, including sentiment orientation, opinion target and opinion keyword, our corpus contains information on opinion target ellipsis, opinions expressed without sentiment words, and sentiment polarity shifting. The statistics show the prevalence and necessity of these special phenomena (e.g., opinion target ellipsis) in Chinese texts.
  • Sentiment Analysis and Social Computing
    MENG Jiana, YU Yuhai, ZHAO Dandan, SUN Shichang
    2015, 29(4): 74-79.
    A drop in accuracy across different domains is common in current sentiment analysis. To solve the problem, this paper presents a knowledge transfer approach that combines feature transfer and instance transfer. First, the proposed approach builds the relevance of domain-dependent features between the source domain and the target domain via a tripartite graph, so that a common semantic space is projected to rebuild the original vector space model. Then it builds the relevance of instances between the source domain and the target domain via a biased Markov model. In this way, the approach transfers sentiment analysis knowledge from the source domain to the target domain. The improved experimental performance confirms the effectiveness of the approach.
  • Sentiment Analysis and Social Computing
    NIU Yun, ZHANG Li, WANG Shihong, WEI Ou
    2015, 29(4): 80-88.
    In this paper, a weakly supervised sentiment analysis approach is proposed. A few words are collected to construct an initial sentiment lexicon. These seed words are used to mine potential sentiment words in the target text. In this process, linguistic features at multiple levels are explored and the role of context is examined. The lexicon is expanded iteratively, and the final version is applied to classify the sentiment of a target document. Compared with the results of previous studies on the same data, this approach achieves the best F-score even though the constructed sentiment lexicon is rather small. The experimental results also show that the approach is robust when applied to texts from different domains.
  • Sentiment Analysis and Social Computing
    LIU Shengjiu, LI Tianrui, ZHU Jie
    2015, 29(4): 89-94.
    Zipfs Law has been applied widely in many fields as an important rule in bibliometrics. Webometrics has received much attention with the accelerated explosion of network information nowadays. We suggest that Zipfs Law may exist in webometrics in the distribution of search result. We select the public word set and conduct experiments on several popular search engines. The experimental results confirm that the numbers of search results roughly conform to Zipfs Law. The Zipfs index of the numbers of search results of Baidu and So is 0.003.
  • Information Extraction and Text Mining
    ZHANG Zhenzhong, SUN Le, HAN Xianpei
    2015, 29(4): 95-102.
    Query session detection is critical for query log analysis and user behavior characterization. It aims at identifying the consecutive queries submitted by a user for the same information need. Traditional query session detection methods are based on lexical comparisons, which often suffer from the vocabulary-mismatch problem (i.e., topically related queries may not share any common words). To resolve the issue, this paper proposes a translation-model-based method for query session detection, which models the relationship between words as a word translation probability. In this way, our method can capture the relatedness between queries even when they do not share any common words. We also propose two approaches for generating training data from web query logs for translation probability estimation. The first is based on the time gap between queries, and the second is based on the clicked URLs of queries. Experimental results show that our method significantly outperforms the baselines.
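    To make the idea of word translation probabilities concrete, the sketch below scores a query against the previous one in the style of IBM Model 1: each word of the later query may be generated by any word of the earlier query. This is an assumed formulation, not the paper's model, and the probability table `t`, the `floor` value, and the example queries are invented for illustration.

```python
def translation_likelihood(prev_query, next_query, t, floor=1e-6):
    """P(next_query | prev_query) under an IBM Model 1 style translation table."""
    likelihood = 1.0
    for target in next_query:
        # Each target word may be generated by any word of the previous query.
        total = sum(t.get((source, target), floor) for source in prev_query)
        likelihood *= total / len(prev_query)
    return likelihood

# Hypothetical translation probabilities t[(source, target)]; not from the paper.
t = {
    ("laptop", "notebook"): 0.5,
    ("cheap", "deals"): 0.3,
}

related = translation_likelihood(["cheap", "laptop"], ["notebook", "deals"], t)
unrelated = translation_likelihood(["weather", "paris"], ["notebook", "deals"], t)
```

    The two related queries share no words, yet score far higher than the unrelated pair, which is the behavior a purely lexical comparison cannot provide.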
  • Information Extraction and Text Mining
    YANG Hua, JI Donghong, CHEN Bo
    2015, 29(4): 103-110.
    Based on a relatively limited number of documents, undirected basic element networks (UBENs), whose nodes are headers or modifiers, are constructed. The connectivity of a UBEN constructed on topic-related documents is investigated, and the influence of stopwords on connectivity is discussed. Furthermore, the connectivity of UBENs constructed on topic-related documents is contrasted with that of UBENs constructed on randomly selected documents. It is pointed out that the connectivity of a UBEN constructed on topic-related documents results, at some level, from information fusion across the topic-related documents, rather than from properties of language alone. This conclusion is of some significance for natural language processing tasks such as automatic summarization and information retrieval.
  • Information Extraction and Text Mining
    ZHANG Yanxiang, PAN Haixia
    2015, 29(4): 111-119.
    Imbalanced data is pervasive in real-world text categorization. Conventional feature selection (FS) methods prefer to choose features from large classes rather than rare classes. This paper proposes a quantitative method to measure this dominance. It then describes a new FS method, the DA method, which is based on category discriminative ability and takes the minimum absolute difference of document probability between classes as its criterion, partly ensuring the fairness of the FS method toward large and rare classes. Experimental results show that the DA method outperforms CHI, IG and DFICF, especially on the macro-averaged F1 measure.
  • Information Extraction and Text Mining
    LI Guohe, YUE Xiang, WU Weijiang, HONG Yunfeng, LIU Zhiyuan, CHEN Yuan
    2015, 29(4): 120-125.
    Feature word selection is a significant step in Chinese text pre-processing. After the segmentation of Chinese texts, a vector model built from feature words to represent Chinese text documents suffers from low accuracy in document classification (or document retrieval) because of the sparseness and high dimensionality of the feature words. On the basis of an analysis of several classical text feature selection methods, a new text feature selection method (DC) is presented, which is based on a modified document frequency. Experiments show that DC performs better than other typical methods in terms of macro-F and micro-F values.
  • Information Extraction and Text Mining
    LIANG Hongshuo, LIU Yunqiao, ZHAO Li
    2015, 29(4): 126-133.
    To avoid the repeated exhaustive search of the data in classical associative classification approaches, a knowledge evolutionary algorithm based on evolutionary epistemology is proposed. First, the data in the data set is encoded. Second, hypothesis knowledge and inaccurate knowledge are obtained by conjecture and refutation operators. Third, the coverage and accuracy of the hypothesis knowledge and inaccurate knowledge are calculated. Then, an extraction operator is used to extract rules from the library of inaccurate knowledge and put them into the hypothesis library. Finally, the knowledge obtained with this method is used to build a classifier. In this way, the dataset can be read into the computer in parts, and the total number of read-in and read-out operations is greatly reduced. The results show that the knowledge evolution algorithm can speed up the calculation while maintaining similar classification accuracy.
  • Information Extraction and Text Mining
    ZHANG Zhilin, ZONG Chengqing
    2015, 29(4): 134-143.
    Microblogs, a new information-sharing platform, now play an important role in people’s daily lives with the rise of Web 2.0, and microblog sentiment analysis has attracted increasing attention in recent years. This paper provides an in-depth analysis of the differences in feature representation and feature selection between traditional sentiment classification and microblog sentiment analysis. To avoid the drawbacks of feature selection in existing methods, we propose three simple but effective approaches to feature representation and selection: the lexicalization hashtag feature, the sentiment word feature, and the probabilistic sentiment lexicon feature. Experimental results show that our proposed methods boost microblog sentiment classification accuracy from 73.17% to 84.17%, significantly outperforming the state-of-the-art method.
  • Language Transcription Processing: Model and Application
    ZHANG Xiaoheng
    2015, 29(4): 144-150.
    A duplicate-encoded character is a character that has been assigned two or more code points in a coding system such as Unicode. When output in distinct codes, the glyphs of a duplicate-encoded character appear the same to human users, while to the computer they are different characters. Such a human-computer inconsistency causes confusion in language information processing, resulting in incomplete information retrieval, inaccurate statistical calculation, and inferior data sorting and categorization. This paper discusses duplicate encoding of Chinese characters in Unicode, MS Office and the WWW, including (a) duplicate encoding arising from new code assignment in the Unihan public area to characters already encoded in the private use area, (b) duplicate encoding caused by compatibility encoding, (c) duplicate encoding introduced by building dedicated lists for CJK strokes and radicals, and (d) duplicate encoding of characters in half-width and full-width forms. Some effective solutions to the problems are also suggested.
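    Cases (b), (c) and (d) can be observed directly with Python's standard `unicodedata` module. The sketch below is an illustration, not one of the solutions proposed in the paper: it takes three pairs of code points that compare as different strings and shows that Unicode NFKC normalization maps each pair to the same character. The `pairs` table and the `same_after_nfkc` helper are names chosen here.

```python
import unicodedata

# (duplicate code point, unified code point)
pairs = {
    # (b) CJK COMPATIBILITY IDEOGRAPH-F900 vs CJK UNIFIED IDEOGRAPH-8C48
    "compatibility ideograph": ("\uF900", "\u8C48"),
    # (c) KANGXI RADICAL ONE vs CJK UNIFIED IDEOGRAPH-4E00
    "Kangxi radical": ("\u2F00", "\u4E00"),
    # (d) FULLWIDTH LATIN CAPITAL LETTER A vs LATIN CAPITAL LETTER A
    "full-width letter": ("\uFF21", "A"),
}

def same_after_nfkc(a, b):
    """True if two strings become identical under NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

for label, (dup, unified) in pairs.items():
    print(label, dup == unified, same_after_nfkc(dup, unified))
```

    Normalization does not help with case (a): private use area code points carry no standard decomposition, so any mapping to Unihan code points has to come from a separate, application-specific table.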
  • Language Transcription Processing: Model and Application
    LIN Xinjian, TANG Xianghong, WANG Jing
    2015, 29(4): 151-158.
    From the perspective of communication encoding, a reversible text watermarking algorithm based on coding and synonym substitution is discussed. The algorithm employs interchangeable synonyms as markers to group the text and generates the watermark by extracting features of each text group. It uses Huffman coding to encode synonyms and error correction coding to encode the position of a synonym in the thesaurus, and then completes watermark embedding in combination with synonym substitution. At the receiving end, the group text features and the Huffman code are used to locate tampered watermarked text, and the error correcting codes are used to restore the original synonyms. Experimental results show that the proposed algorithm improves the robustness and imperceptibility of the watermark. In addition, it can locate tampering and restore the original synonyms.
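    As background for the Huffman coding step, the sketch below builds a prefix-free binary code from symbol frequencies, so that frequent symbols get shorter codewords. It shows the generic textbook construction only. How the paper assigns frequencies to synonyms is not stated in the abstract, so the synonym set and counts here are hypothetical.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a prefix-free binary code from a non-empty symbol -> frequency map."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {sym: "0" for sym in heap[0][2]}
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes with 0 and 1.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Hypothetical synonym set with made-up corpus frequencies.
codes = huffman_codes({"big": 5, "large": 2, "huge": 1, "vast": 1})
```

    Because no codeword is a prefix of another, a bit string built from these codes can be decoded unambiguously, which is the property a watermark decoder relies on.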
  • Language Transcription Processing: Model and Application
    DENG Xiaojian, LI Bin, ZHANG Junsong
    2015, 29(4): 159-165.
    A method for finding the visual center of gravity of a Chinese character is presented in this paper. First, we collect Chinese character samples and extract the visual balance center of each character. Then we mark the visual center of gravity of the sample characters, and finally construct a statistical model relating the visual balance centers of the connected regions to the visual center of gravity of Chinese characters. The proposed method has many potential applications, such as feature extraction and the design and optimization of Chinese characters.
  • Machine Translation
    LUO Ling, CHEN Yidong, SHI Xiaodong, SU Jinsong
    2015, 29(4): 166-174.
    Chinese idioms are frequently used in all kinds of Chinese texts. However, since Chinese idioms are relatively sparse in most training corpora for Chinese-English SMT systems, the translation quality of idioms is not satisfactory. To the best of our knowledge, there is very little research on handling the translation of Chinese idioms. This paper proposes two methods that use paraphrases to improve the translation of Chinese idioms in Chinese-English SMT. In the first method, we paraphrase the Chinese idioms in the test set, while in the second, we paraphrase the Chinese idioms in the training set. The experimental results show that both methods significantly improve the performance of the Chinese-English SMT system.
  • Information Retrieval and Question Answering
    WANG Shuxin, WEI Bingjie, LU Xiao, WANG Bin
    2015, 29(4): 175-182.
    Microblog search has become a hot research problem in information retrieval in recent years. Related work shows that most queries in microblog search are time-sensitive. To address this, many existing methods have been proposed based on different time-sensitivity assumptions, such as “the newer a document is, the more important it is” or “the closer a document is to the peak point, the more important it is”. All these methods have improved retrieval effectiveness to some extent. However, it is hard to reduce the temporal role in microblog search ranking to one straightforward assumption like those above. In this paper, our study of the temporal distributions of relevant documents for different queries shows the complexity of the temporal role in ranking; simple, straightforward assumptions are therefore not accurate. We propose to use the temporal and entity evidence of query-document pairs to train a time-sensitive learning-to-rank (TLTR) model to tackle this problem. For temporal features, both global features of the query and local features of the query-document pair are extracted. Experimental results show that TLTR significantly improves retrieval effectiveness over existing time-aware ranking models on the TREC Microblog Track 2011-2012 data set.
  • Information Retrieval and Question Answering
    GONG Xiaolong, WANG Mingwen, WAN Jianyi, WANG Xiaoqing
    2015, 29(4): 183-191.
    In most existing retrieval models, the relevance between a document and a query is calculated from statistical features such as within-document frequencies, inverse document frequencies, and document lengths. Recent studies show that term position information can improve the precision of query results, but how best to employ this information remains an open issue. This paper proposes to integrate term proximity information into the semantic positional language model (SPLM), with a Dirichlet prior distribution as the smoothing measure for computing proximity. In the experiments, the proposed semantic positional language retrieval model with proximity information performs better than the classical semantic positional language model.
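    For readers unfamiliar with Dirichlet prior smoothing, the sketch below shows its standard document-level form, p(w|d) = (c(w,d) + mu * p(w|C)) / (|d| + mu), where c(w,d) is the term count in the document and p(w|C) is the collection language model. This is the textbook formula, not the positional or proximity-aware variant the paper proposes, and the toy document, collection probabilities and `mu` value are assumptions.

```python
from collections import Counter

def dirichlet_prob(term, doc_tokens, collection_prob, mu=2000.0):
    """Dirichlet-smoothed probability of a term under a document language model."""
    counts = Counter(doc_tokens)
    background = collection_prob.get(term, 0.0)
    return (counts[term] + mu * background) / (len(doc_tokens) + mu)

# Toy data: a three-token document and a hypothetical collection model.
doc = ["a", "b", "a"]
collection = {"a": 0.1, "b": 0.2, "c": 0.7}

p_seen = dirichlet_prob("a", doc, collection, mu=10.0)
p_unseen = dirichlet_prob("c", doc, collection, mu=10.0)
```

    The unseen term "c" still receives non-zero probability from the collection model, and the smoothed probabilities over the vocabulary sum to one.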
  • Information Retrieval and Question Answering
    ZHANG Keliang, LI Weigang, WANG Huilan
    2015, 29(4): 192-198.
    We design and implement an ontology-based QA system for the aviation domain. It classifies questions with the help of a domain-specific ontology and obtains structured semantic information for each question. The question is then converted into a SPARQL query for answer extraction. The experimental results show that the system can understand most of the questions, with a precision of 82.97%.
  • Other Languages in/around China
    Gulnur Arkin, Zulpiye Aman, Dilmurat Tursun
    2015, 29(4): 199-206.
    A statistical analysis is carried out on 93 complete/incomplete vowel-harmonious words chosen from 333 tri-syllabic words in the Uyghur Language Acoustical Database. It focuses on the basic acoustic features of tri-syllabic complete/incomplete harmonious words, including the broadband resonance peak mode, resonance peak value, vowel duration, vowel pitch and sound intensity. The rules and findings are of great importance for adjusting vowel harmony in the synthesis sub-procedure of parametric or waveform-concatenation-based speech synthesis systems.