2014 Volume 28 Issue 2 Published: 10 February 2014
  

  • Language Analysis and Generation
    PENG Weiming, SONG Jihua, YU Shiwen
    2014, 28(2): 1-7.
    This paper compares the Sentence-based Diagram Treebank with existing lexical specifications in terms of word segmentation units and POS tagging, revealing the disjunction between automatic lexical analysis and parsing in current Chinese information processing. It describes the parsing strategy for some special structures in the Diagram Treebank, such as nonce formations and idioms, as well as their linguistic rationale. It also explores the implementation of Chinese word-class theories such as “For All Words, the Word-class Is Based on the Sentence” and “Referentiality” in Chinese information processing.
  • Language Analysis and Generation
    WANG Qian, LUO Senlin, HAN Lei, PAN Limin
    2014, 28(2): 8-16.
    According to modern Chinese semantics, there are four sentential semantic types (single, complex, compound and multiple). Because it captures the overall sentential semantic structure, sentential semantic type recognition is an important step toward full sentential semantic structure parsing. This paper proposes a recognition method for the four semantic types based on predicates and sentential-semantic-type chunks. The method first identifies some single-type sentences by the number of predicates in each sentence. For the remaining sentences, the C4.5 algorithm is applied to obtain the maximum number of sentential-semantic-type chunks of predicates in the sentential semantic structure, and the semantic type of each sentence is then identified by combining this with the top sentence node in the syntactic structure. The experimental data consist of 10221 sentences chosen from the Beijing Forest Studio-Chinese Tag Corpus. The accuracy of sentential semantic type recognition reaches 97.6% in an open test.
  • Language Analysis and Generation
    CHE Tingting, HONG Yu, ZHOU Xiaopei, YAN Weirong, YAO Jianmin, ZHU Qiaoming
    2014, 28(2): 17-27.
    A functional connective is a word feature that directly expresses the internal semantic relations, structural characteristics and contextual development trend of discourse units. Based on functional connectives, this paper proposes a method for predicting implicit discourse relations. First, the method mines functional connectives at the word and phrase level and assigns them to discourse relation categories. Second, it builds a concept model for each type of functional connective to describe the attributes of the arguments it connects, and establishes a mapping between argument concepts and discourse relations. Finally, implicit discourse relations are predicted by using a statistical strategy to recognize the concept model of the arguments and then applying the “concept-relation” mapping. The experimental results show that this method, which constructs concept models based on functional connectives, significantly outperforms the existing classification method based on supervised learning.
  • Language Analysis and Generation
    ZHANG Muyu, QIN Bing, LIU Ting
    2014, 28(2): 28-36.
    Discourse relations are an important part of discourse semantic analysis. This paper analyzes the differences between Chinese and English discourse, then presents in detail the first Chinese discourse relation taxonomy, based on research on English discourse relations. To assess the rationality of the hierarchy, we conduct annotation experiments on Chinese internet news texts and analyze the difficulties that arose during annotation together with their resolutions, laying a foundation for future discourse semantic analysis.
  • Language Analysis and Generation
    WU Zuoyan, WANG Yu
    2014, 28(2): 37-43.
    A new measure based on the Hierarchical Network of Concepts (HNC) theory is put forward to compute semantic similarity in natural language. Based on the coding rules and the map theory embodied in the concept expression forms at the vocabulary-relation level of HNC, the method integrates concept connotation, outward features, classification and symbol combination to calculate semantic similarity. The method is compared with currently popular HowNet-based similarity methods against subjective human judgments. Experiments show that the method performs well and distinguishes differences between words more accurately.
  • Machine Translation
    XIE Jun, LIU Qun
    2014, 28(2): 44-50.
    The dependency-to-string model uses translation rules based on head-dependents relations, each of which consists of a head and all its dependents. This model is good at capturing sentence patterns and phrase patterns in the source language, but fails to capture non-compositional phenomena (such as idioms and collocations) that phrases capture easily. To improve performance, we propose three ways to incorporate syntactic phrases, generalized syntactic phrases and non-syntactic phrases into this model. Experiments show that the model gains up to about 1.0 BLEU point by incorporating these three kinds of phrases.
  • Machine Translation
    LIU Ying, JIANG Wei
    2014, 28(2): 51-55.
    This paper improves HMM-based word alignment by introducing syntactic knowledge. The HMM is combined with English phrase structure tree distance to align Chinese-English words. Experiments show that the improved HMM reduces the word alignment error rate and improves the BLEU score of statistical machine translation.
  • Minority Language Information Processing
    HUA que-cai-rang, LIU Qun, ZHAO Haixing
    2014, 28(2): 56-60.
    This paper describes a discriminative method for Tibetan part-of-speech tagging using a perceptron training model. We focus on how to build feature templates that fit Tibetan lexical features, how to train the discriminative model, and how to perform part-of-speech tagging. The method achieves a precision of 98.26% on a manually created test corpus, which shows that it is a practical solution for Tibetan natural language processing.
  • Minority Language Information Processing
    SUN Meng, HUA que-cai-rang, CAI Zhijie, JIANG Wenbin, LV Yajuan, LIU Qun
    2014, 28(2): 61-65.
    This paper presents a discriminative-model-based approach to Tibetan word segmentation and investigates the influence of word-formation granularity and reranking strategy on segmentation. For word-formation granularity, we compare the basic Tibetan character, the basic Tibetan character-tsheg, and the syllable as the word-formation unit. The experimental results show that using the syllable as the word-formation unit obtains the highest F-measure, 91.21%. With a reranking strategy based on a word lattice and shortest path, we further boost the F-measure to 96.25%.
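The word lattice and shortest-path idea mentioned above can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the abstract does not specify edge weights or features, so this version gives every lattice edge a cost of 1 (fewest words wins), always allows single syllables as fallback edges, and uses placeholder Latin strings in place of Tibetan syllables. The names `shortest_path_segment`, `lexicon` and `max_word_len` are hypothetical.

```python
def shortest_path_segment(syllables, lexicon, max_word_len=4):
    """Segment a syllable sequence by a unit-cost shortest path over a word lattice.

    lexicon is a set of tuples of syllables (assumed known words).
    Single syllables are always allowed so that a path always exists.
    """
    n = len(syllables)
    best = [float("inf")] * (n + 1)  # best[i]: fewest words covering syllables[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word on that path
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = tuple(syllables[start:end])
            if end - start == 1 or word in lexicon:
                if best[start] + 1 < best[end]:
                    best[end] = best[start] + 1
                    back[end] = start
    # Follow the back-pointers from the end to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = back[pos]
        words.append(syllables[start:pos])
        pos = start
    return words[::-1]


segmented = shortest_path_segment(
    ["a", "b", "c", "d"], {("a", "b"), ("b", "c", "d")}
)
```

In a real reranker the unit cost would presumably be replaced by model scores on the lattice edges; the dynamic program itself would stay the same.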
  • Minority Language Information Processing
    Maihefureti, Aishan Wumaier, Maierhaba Aili, Tuergen Yibulayin, ZHANG Jian
    2014, 28(2): 66-71.
    In this paper, we present a spelling check method for the Uyghur language based on a combination of dictionary and statistics. First, we describe a dictionary-based stemming method. Second, we propose an N-gram-based model to judge the suffixes attached to a stem, detecting misspelled unknown words at the same time. Finally, we present a spelling check method based on a hybrid strategy that combines the different methods. This method improves accuracy and reliability.
  • Information Extraction and Text Mining
    HE Linna, YANG Zhihao, LIN Hongfei, LI Yanpeng, TANG Lijuan
    2014, 28(2): 72-77.
    Drug name recognition aims to find drug names in biomedical texts, a technology in high demand given the overwhelming volume of drug research. We adopt a semi-supervised learning method to build a dictionary and then combine the dictionary with the Conditional Random Field (CRF) method to recognize drug name entities. First, we extract a drug name dictionary using template matching, and then use Feature Coupling Generalization (FCG) to filter the dictionary. Finally, we combine the dictionary with the CRF method to recognize drug entities. Our method achieves an F-score of 0.7673 on the drug name recognition corpus.
  • Information Extraction and Text Mining
    LIU Bingyang, LIU Qian, ZHANG Jin, LIU Xinran, CHENG Xueqi
    2014, 28(2): 78-84.
    Extracting new words from web texts is a key problem in information processing, with direct applications in information retrieval, public opinion analysis, dictionary compilation, Chinese word segmentation and other fields. A language-independent method is implemented to quickly extract new words from web texts: it encodes multilingual texts into a uniform binary stream, extracts repeated strings, and calculates adjacency variety and string integrity measures. Two suffix trees with a 4-bit-based structure are used to calculate these statistics in linear time. The method outputs new words and their order for both Chinese and English web texts.
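To make the adjacency variety statistic concrete, here is a small sketch that counts, for every repeated substring, how many distinct characters appear immediately to its left and right. It is a brute-force illustration only: it works on characters rather than the binary stream described above, it does not use suffix trees and is not linear-time, and the function name, the `max_len` and `min_count` parameters and the boundary sentinels are assumptions.

```python
from collections import defaultdict


def adjacency_variety(text, max_len=4, min_count=2):
    """Map each repeated substring to (count, left_variety, right_variety)."""
    count = defaultdict(int)
    left = defaultdict(set)
    right = defaultdict(set)
    n = len(text)
    for i in range(n):
        for k in range(2, max_len + 1):
            if i + k > n:
                break
            s = text[i:i + k]
            count[s] += 1
            # Text boundaries are recorded as sentinel contexts.
            left[s].add(text[i - 1] if i > 0 else "<s>")
            right[s].add(text[i + k] if i + k < n else "</s>")
    return {
        s: (c, len(left[s]), len(right[s]))
        for s, c in count.items()
        if c >= min_count
    }


stats = adjacency_variety("abcxabcyabcz")
```

A candidate such as `"abc"`, which is surrounded by varied contexts on both sides, looks more word-like than `"ab"`, which is always followed by the same character.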
  • Information Extraction and Text Mining
    LIN Liyuan, WANG Zhongqing, LI Shoushan, ZHOU Guodong
    2014, 28(2): 85-90.
    Opinion summarization aims to refine text data so as to generate a summary of the expressed opinions. This study focuses on multi-document opinion summarization, where the main task is to generate a summary for a given set of reviews of the same product. We first collect and annotate a Chinese multi-document corpus of product reviews. Then we propose a novel PageRank framework for generating opinion summaries, which has the advantage of considering both the topic relation and the opinion relation among reviews. Empirical studies on the corpus demonstrate that the proposed method substantially outperforms existing approaches in terms of ROUGE.
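As a rough illustration of ranking reviews with PageRank over two kinds of relations, the sketch below runs standard power-iteration PageRank on a weighted graph. The abstract does not say how the topic and opinion relations are combined, so the linear mix in `combine` (with a hypothetical weight `lam`) is purely an assumption, as are the function names and the toy weight matrix.

```python
def combine(topic, opinion, lam=0.5):
    """Hypothetical linear mix of two relation matrices of the same shape."""
    return [
        [lam * t + (1 - lam) * o for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(topic, opinion)
    ]


def pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a square matrix of non-negative edge weights."""
    n = len(weights)
    scores = [1.0 / n] * n
    out = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            rank = 0.0
            for i in range(n):
                if out[i] > 0:
                    rank += scores[i] * weights[i][j] / out[i]
                else:
                    # A node without outgoing weight spreads its score evenly.
                    rank += scores[i] / n
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


# Toy graph: review 0 is related to reviews 1 and 2, which are unrelated to each other.
scores = pagerank([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
```

The highest-scoring reviews (or sentences) would then be candidates for the summary; how candidates are selected and de-duplicated is outside this sketch.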
  • Information Extraction and Text Mining
    LIU Dandan, PENG Cheng, QIAN Longhua, ZHOU Guodong
    2014, 28(2): 91-99.
    Semantic information plays an important role in semantic relation extraction between named entities. Taking “TongYiCi CiLin” as an example, this paper systematically investigates the effectiveness of lexical semantic information in tree kernel-based Chinese semantic relation extraction, particularly the influence of different levels of semantic information and of polysemy, as well as the redundancy between lexical semantic information and entity type information. Relation extraction experiments on the ACE2005 Chinese corpus show that semantic information significantly improves extraction performance when entity types are unavailable, while with known entity types it also noticeably enhances extraction performance for some relation types. This implies a certain degree of complementarity between “CiLin” semantic information and entity type information in Chinese semantic relation extraction.
  • Information Extraction and Text Mining
    YANG Xuerong, HONG Yu, MA Bin, YAO Jianmin, ZHU Qiaoming
    2014, 28(2): 100-108.
    Event relation recognition is a natural language processing technology that detects relations between events in streams of text. Its key task is to detect latent logical relations between events (deciding whether events hold a logical relation or not) by analyzing the corresponding discourse structure and the semantic features of the events, using techniques of semantic relation recognition and inference. This paper proposes an event relation recognition method based on inference over event terms and entities under the same topic. Compared with the method based on dependency cue inference, the proposed method achieves a 15.34% improvement.
  • Information Extraction and Text Mining
    ZHAO Hongyan, LIU Peng, LI Ru, WANG Zhiqiang
    2014, 28(2): 109-115.
    Recognizing textual entailment is an effective way to handle the fact that natural language can state the same meaning in various ways. Although many textual entailment recognition models have been proposed, recognition accuracy remains unsatisfactory because of the complex factors involved in textual entailment. Treating textual entailment as a binary classification problem, this paper extracts multiple features from lexical information, syntactic dependencies and FrameNet semantic knowledge to train an SVM classifier for textual entailment recognition. Evaluated on the international RTE3 test set, the method achieves 78.1% precision on positive entailments, which is higher than the best RTE3 result.
  • Information Extraction and Text Mining
    LIU Zhenyan, MENG Dan, WANG Weiping, WANG Yong
    2014, 28(2): 116-121.
    Existing feature selection methods are not appropriate for skewed corpora, in which most samples belong to a majority class and far fewer belong to a minority class. The reason is that these methods select features without considering the relative distribution of each class. As a result, most of the features they select come from the majority class, so minority class samples tend to be misclassified. This paper analyzes the characteristics of skewed corpora and finds two important factors that influence feature selection on skewed data: category distribution and category difference. The category distribution factor indicates the difference in category frequency over the whole dataset, and the category difference factor indicates the relative document frequency difference between classes. A new feature selection function called Relative Category Difference (RCD) is then constructed from the two factors. Experimental results show that the new feature selection method outperforms other methods for skewed text categorization.
  • Information Retrieval and Social Computing
    2014, 28(2): 122-128.
    Collaborative Filtering (CF) can satisfy users’ preferences and provide personalized guidance. Although it is a key technique in current Internet recommendation engines, it suffers from severely sparse user ratings. Considering the rich context information in users’ rating histories, this paper uses two kinds of context information to address the sparsity issue: the effect of hierarchical structure on users’ potential preferences, and the dynamic effect of users’ short-term ratings. An integrated model, CICF, is then proposed based on these two features. Experimental results on Yahoo! Music ratings show that CICF significantly improves prediction performance compared with the baseline method. It is also demonstrated that CICF effectively mitigates the rating sparsity issue.
  • Information Retrieval and Social Computing
    LIAN Tao, MA Jun, WANG Shuaiqiang, CUI Chaoran
    2014, 28(2): 129-135.
    Recommender systems are an important tool for overcoming information overload, and the most popular approach is collaborative filtering. This paper presents a mixture model for collaborative filtering named LDA-CF, which combines latent factor models and neighborhood methods. First, we convert the ratings matrix into a collection of pseudo-documents and use the LDA topic model to identify user and item latent factor vectors. Then we compute user-item similarities in the low-dimensional latent factor space. Finally, we employ neighborhood methods to predict unobserved ratings. Experiments on the MovieLens 100k dataset demonstrate that LDA-CF outperforms neighborhood methods on rating prediction in terms of MAE.
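The last two steps, similarity in a latent factor space followed by neighborhood prediction, and the MAE metric can be sketched as follows. This is only an illustration under stated assumptions: the latent vectors are taken as given (no LDA is run here), the sketch uses a user-based neighborhood with cosine similarity and a similarity-weighted average, which may differ from the similarity and prediction formulas actually used in LDA-CF, and all names and toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict(user, item, factors, ratings, k=2):
    """Similarity-weighted average of the k most similar users' ratings for item.

    factors: user -> latent vector (assumed to come from a topic model).
    ratings: user -> {item: rating}.
    Returns None when no neighbor with positive total similarity rated the item.
    """
    neighbors = [
        (cosine(factors[user], factors[other]), ratings[other][item])
        for other in factors
        if other != user and item in ratings.get(other, {})
    ]
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbors[:k]
    total = sum(abs(sim) for sim, _ in top)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in top) / total


def mae(pairs):
    """Mean absolute error over (predicted, actual) pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


factors = {"u": [1.0, 0.0], "a": [1.0, 0.0], "b": [0.0, 1.0]}
ratings = {"a": {"i": 4}, "b": {"i": 2}}
prediction = predict("u", "i", factors, ratings)
```

Here user `a` shares the latent profile of `u` while `b` is orthogonal to it, so the prediction follows the rating given by `a`.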
  • Information Retrieval and Social Computing
    LI Rui, WANG Bin
    2014, 28(2): 136-143.
    In recent years, microblogging has developed impressively, and microblog retrieval has become an important research topic. Microblog texts are short, quickly updated and circulated over a social network, which makes microblog search different from traditional web search. In this paper, we first compare the traditional vector space model, the probabilistic model and the basic language model in microblog search. Then we propose expanding the microblog text with author information to improve retrieval. To address the problems that short documents cause in topic model training, we use the author’s topic model to further extend the microblog content. Tests on a Twitter data set show that the proposed author model improves retrieval effectiveness in the microblog search task.
  • Information Retrieval and Social Computing
    WAN Fei, ZHAO Xi, LIANG Xun, PAN Deng, NI Zhihao
    2014, 28(2): 144-150.
    With the rapid development of the mobile web, the number of mobile search engine users has been growing sharply. Analyzing users’ behavior is of great significance for improving mobile search engines and user satisfaction. This paper takes the log of a mobile search engine for the first week of June 2011 and analyzes mobile search user behavior. From the perspectives of query words, sessions and clicks, we examine the length and frequency of query words, the ratio of question queries and URL queries, the number of queries in a session, the modification of query words, and the click distribution. Furthermore, we compare the results for the mobile search engine with those for Internet search engines. These findings are of substantial significance for the improvement and optimization of mobile search engines.
  • Information Retrieval and Social Computing
    GU Zhiyu, QIN Tao, WANG Bin
    2014, 28(2): 151-158.
    Conversion-based advertising, which evaluates the effectiveness of an advertisement and charges according to the conversions that occur after a user views the advertisement, leverages the unique power of Internet advertising and is becoming the trend for its future development. This paper introduces the scheme of conversion-based advertising, analyzes its industrial application, and summarizes research in this field, including auction mechanisms for CPA advertising, conversion rate estimation and conversion-based ad ranking. Finally, we analyze existing problems and present directions for further study.