2016 Volume 30 Issue 2 Published: 20 April 2016
  

    Language Analysis and Calculation
  • Language Analysis and Calculation
    WU Yongxu, LV Xueqiang, ZHOU Qiang, GUAN Xiaoda
    2016, 30(2): 1-11.
    To address chunk boundary identification and intra-chunk structure analysis, this paper explores a new chunk parsing task based on the Chinese concept compound chunk (CCC) scheme. A detailed comparison with previous base chunk and functional chunk schemes reveals the main parsing difficulties of CCC chunking. The paper then proposes a CCC parsing method based on the “shift-reduce” model. Experiments on a CCC bank automatically extracted from the Tsinghua Chinese Treebank (TCT) show that the method is feasible for parsing some simple CCCs, which facilitates further syntactic and semantic parsing of complex CCCs.
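    As a rough illustration of what a shift-reduce chunker does, the sketch below keeps a stack and a buffer and repeatedly asks a policy for the next action. The policy shown (`toy_policy`, which groups a run of adjacent nouns into one chunk), the POS tags, and the function names are all hypothetical; the abstract does not describe the paper's features or classifier.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, POS tag)


def is_noun(node) -> bool:
    """True for a plain token tagged "n"; chunk nodes hold a list, not a tag."""
    return isinstance(node[1], str) and node[1] == "n"


def toy_policy(stack, buffer):
    """Hypothetical stand-in for a trained action classifier."""
    run = 0
    for node in reversed(stack):  # count nouns on top of the stack
        if not is_noun(node):
            break
        run += 1
    next_is_noun = bool(buffer) and is_noun(buffer[0])
    if run >= 2 and not next_is_noun:
        return ("REDUCE", run, "CCC")  # close the noun run as one chunk
    if buffer:
        return ("SHIFT",)
    return ("STOP",)


def shift_reduce(tokens: List[Token], policy=toy_policy):
    """Apply SHIFT/REDUCE actions until the policy says STOP."""
    stack, buffer = [], list(tokens)
    while True:
        action = policy(stack, buffer)
        if action[0] == "SHIFT":
            stack.append(buffer.pop(0))
        elif action[0] == "REDUCE":
            _, size, label = action
            children = stack[-size:]
            del stack[-size:]
            stack.append((label, children))  # chunk keeps its inner tokens
        else:
            return stack
```

    Because a REDUCE pushes the chunk back onto the stack with its children attached, the same loop yields both the chunk boundary and the intra-chunk content; a real system would replace `toy_policy` with a learned model.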
  • Language Analysis and Calculation
    LI Guochen, LIU Zhanpeng, WANG Ruibo, LI Jihong
    2016, 30(2): 12-17.
    A character-based neural network model for Chinese base-chunk identification is constructed. The model combines a neural network model for the segmentation task with the base-chunk identification model by using the hidden-layer features of segmentation. A sentence-level likelihood function for the base-chunk identification task serves as the optimization objective, and the parameters of the two models are trained in turn. The experimental results show that: 1) the F-score of base-chunk identification with the sentence-level likelihood function is 1.33% higher than that with the character-level likelihood function, and the recall of multi-character chunk identification in particular improves by as much as 4.68%; 2) using the hidden-layer features of the segmentation task raises the final result by 2.17%.
  • Language Analysis and Calculation
    LI Sheng, KONG Fang, ZHOU Guodong
    2016, 30(2): 18-25.
    Automatic discourse processing is considered one of the most challenging NLP tasks and is helpful to many downstream NLP tasks, such as question answering, automatic summarization and natural language generation. The recent release of the large-scale discourse corpus PDTB provides a common platform for discourse researchers. Building on the PDTB corpus, the paper proposes an end-to-end explicit discourse parser based on conditional random fields. The parser consists of three components joined in a sequential pipeline: a connective classifier, an explicit relation classifier and a relation argument extractor. We report the performance of each component and analyze the parser's overall performance in detail from the perspective of error cascading.
  • Language Analysis and Calculation
    PAN Xiao, YU Zhengtao, GUO Jianyi, MAO Cunli, YANG Xiuzhen
    2016, 30(2): 26-31.
    A Chinese expert page disambiguation method based on feature mapping is proposed, tailored to the characteristics of Chinese expert pages. First, a CRF model extracts 12 predefined character attributes from the standard page and the candidate page, and an ME classifier determines their weights. Then the page similarity is calculated to decide whether the candidate page's attributes should be appended. Experiments on NLP and ML expert pages show the effectiveness of the proposed method in disambiguation.
  • Sentiment Analysis and Social Computing
  • Sentiment Analysis and Social Computing
    ZHANG Peng, WANG Suge, LI Deyu
    2016, 30(2): 32-40.
    The rapid development of the Internet has produced a large number of online resources. This multilingual information comes from a diverse global environment. Considering the characteristics of cross-language sentiment identification, this paper proposes a multi-strategy approach to cross-language sentiment analysis. Linguistically consistent samples and a hybrid concept space are used to construct a bilingual cooperative framework and a sentiment feature mixture framework, respectively. The results of the two frameworks are then combined to decide the final sentiment label of each sample. Experiments show that the strategy works well on cross-language sentiment analysis tasks.
  • Sentiment Analysis and Social Computing
    HUANG Wei, FAN Lei
    2016, 30(2): 41-49.
    Recently, sentiment classification has become a hot research topic in natural language processing. This paper focuses on semi-supervised approaches to the problem. In contrast to traditional methods based on co-training, it presents a semi-supervised sentiment classification method built on voting-based ensemble learning. A set of diversified sub-classifiers is constructed by choosing different training sets, feature parameters and classification methods. In each voting round, the samples with the highest confidence are selected to double the size of the training set and then update the model. The method also allows sub-classifiers to share useful attribute sets. It has logarithmic time complexity and can be applied to imbalanced corpora. Experiments show that the method achieves good results in sentiment classification on corpora of different languages, domains and sizes, both balanced and unbalanced.
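    The voting loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sub-classifiers are hypothetical nearest-centroid models that differ only in which feature indices they see, confidence is taken to be the fraction of agreeing votes, and the names `train_centroid` and `vote_self_train` are invented for the example.

```python
from collections import Counter


def train_centroid(samples, feats):
    """Nearest-centroid classifier restricted to the feature indices in `feats`."""
    sums, counts = {}, Counter()
    for x, y in samples:
        acc = sums.setdefault(y, [0.0] * len(feats))
        for i, f in enumerate(feats):
            acc[i] += x[f]
        counts[y] += 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def predict(x):
        return min(
            centroids,
            key=lambda y: sum((x[f] - c) ** 2 for f, c in zip(feats, centroids[y])),
        )

    return predict


def vote_self_train(labeled, unlabeled, feature_sets):
    """Grow the training set by majority vote, at most doubling it per round."""
    labeled, pool, rounds = list(labeled), list(unlabeled), 0
    while pool:
        rounds += 1
        classifiers = [train_centroid(labeled, fs) for fs in feature_sets]
        scored = []
        for idx, x in enumerate(pool):
            votes = Counter(clf(x) for clf in classifiers)
            label, n_votes = votes.most_common(1)[0]
            scored.append((n_votes / len(classifiers), idx, label))
        scored.sort(key=lambda t: -t[0])   # most confident first (stable sort)
        chosen = scored[: len(labeled)]    # take at most len(labeled) new samples
        labeled += [(pool[idx], label) for _, idx, label in chosen]
        taken = {idx for _, idx, _ in chosen}
        pool = [x for idx, x in enumerate(pool) if idx not in taken]
    return labeled, rounds
```

    Since each round can double the labeled set, the number of retraining rounds grows roughly with the logarithm of the pool size, which is one plausible reading of the logarithmic complexity mentioned in the abstract.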
  • Sentiment Analysis and Social Computing
    DU Hui, GUO Yan, FAN Yixing, ZHANG Jin, YU Zhihua, CHENG Xueqi
    2016, 30(2): 50-55.
    The Internet, with its freedom and richness, has become the most important channel of information dissemination. Hot topic mining benefits both government policy making and corporate business strategy adjustment. This paper presents an objective method for calculating topic popularity, based on a causal model built by analyzing the factors that influence popularity. The data required by the algorithm is easy to obtain, and taking panel data into account makes the algorithm more effective. A multi-Gaussian curve is then used to fit the movement of topic popularity, which is useful for popularity prediction.
  • Sentiment Analysis and Social Computing
    LIU Lie, XING Qianli, LIU Yiqun, ZHANG Min, MA Shaoping
    2016, 30(2): 56-63.
    With the popularity of social network sites (SNS) and the massive increase in SNS users, behavior analysis of SNS users is of substantial importance to website maintenance, performance optimization and system upgrades. It is also a very important research area in network knowledge mining and information retrieval. To better understand how SNS users add tags for themselves, this paper analyzes the distribution of user tags based on data from about 2.63 million Weibo users. It investigates the macroscopic distribution characteristics of user tags and the relation between the tag distribution of a user and that of the people the user follows. We find that when Weibo users add tags for themselves, they initially tend to use tags that reflect their own characteristics and later tend to select popular tags out of a herd mentality. We apply these findings to a tag prediction algorithm based on following relationships, and the results show that the correlation analysis offers a useful reference for tag recommendation in social networks.
  • Sentiment Analysis and Social Computing
    LU Xiao, WANG Shuxin, WANG Bin, LU Kai
    2016, 30(2): 64-73.
    Recommender systems based on users' consumption data are increasingly valuable in e-commerce. Within such data, the location information of businesses can effectively reflect a user's personal geographical preferences and is therefore of great significance to recommender systems. Existing work generally uses only users' review data and the distance between locations, which cannot reflect the relationships between different locations, let alone the fact that a user's preferences in different locations should carry different weights. This paper takes the perspective of geographical areas and studies users' preferences within each area, as well as the impact of different area partition methods on recommendation models. We then explore how to incorporate geographical information into recommender systems effectively, including the global effects of locations and users' regional preferences, and propose several recommendation models, such as LGE, LGN and LRSVD. Experimental evaluation on the Yelp dataset demonstrates that the models effectively improve prediction results compared with traditional methods.
  • Sentiment Analysis and Social Computing
    HUANG Junming, SHEN Huawei, CHENG Xueqi
    2016, 30(2): 74-82.
    Understanding the intrinsic mechanism of information propagation on social networks has attracted growing attention, including social network topology analysis and user behavior analysis. Because links in social networks are heterogeneous, only a portion of them contribute significantly to information propagation. The influence backbone of a social network, which consists of those links, may provide deeper insight into information propagation. Focusing on the influence backbone, we analyze the signs of links using social structural balance theory and the roles of nodes using the heterogeneous distribution of out-degrees, so as to identify the roles that links and nodes play in information propagation at the microscopic level. Furthermore, we investigate the network connectivity and information spreading efficiency of the influence backbone, finding that information propagation is more fragile and less effective.
  • Sentiment Analysis and Social Computing
    XU Danqing, LIU Yiqun, ZHANG Min, MA Shaoping
    2016, 30(2): 83-89.
    Based on a large-scale social network dataset, this paper conducts a multi-feature statistical analysis of graph structure and finds that the in-degree, out-degree and number of posts in social networks generally fit a power-law distribution. The “small-world” property gives the strongly connected structure of the social network a “spindle” shape. Furthermore, this paper incorporates users' posting behaviors, browsing behaviors and social community properties into social influence modeling. Experimental results show that the PTIM model, which combines users' behaviors and link relationships, performs stably in identifying the number of fans, authenticated users, the relative influence of user pairs and other indices.
  • Sentiment Analysis and Social Computing
    LI Chao, ZHOU Tao, HUANG Junming, CHENG Xueqi, SHEN Huawei
    2016, 30(2): 90-98.
    The wide use of personalized recommender systems on online shopping websites brings great profits and an enhanced user experience. However, because a user's behaviors are usually scattered across multiple websites, it is difficult to provide accurate recommendations when a recommender system sees only the portion of those behaviors on a single website. We propose a new recommendation algorithm that transfers behaviors across websites to calculate similarities between users on different websites. The algorithm overcomes the sparsity and cold-start problems in recommender systems with a significant accuracy improvement, outperforming traditional algorithms applied to a single website only.
  • Information Retrieval and Question Answering
  • Information Retrieval and Question Answering
    ZHUAN Yue, XIONG Jinhua, MA Hongyuan, CHENG Shuyang, CHENG Xueqi
    2016, 30(2): 99-106.
    Queries in Chinese information retrieval systems often mix Chinese, the Chinese phonetic alphabet (Pinyin) and English. Existing methods cannot handle mixed-language queries and long Chinese queries. To solve these problems, we propose a parallel query correction method for mixed-language queries. The method builds a language model over mixed-language text and constructs a heterogeneous character dictionary tree according to the corresponding edit rules to process the query words. For long Chinese queries, we put forward a two-way parallel spelling correction model. For parallel processing, we introduce the concepts of a reverse character dictionary tree and a reverse language model. The training corpus used in the model is extracted from user query logs, click logs, web links and other information. Experiments show that, compared with single-pass query correction, the parallel query correction method for mixed-language queries increases accuracy by 9%, reduces recall by 3% and, in particular, speeds up processing by 40%.
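    To make the idea of correcting a query against a character dictionary tree concrete, the sketch below searches a trie for entries within a bounded edit distance of the query. It uses the standard Levenshtein distance as a stand-in for the paper's edit rules, which the abstract does not specify; the names `TrieNode`, `build_trie` and `fuzzy_search` and the sample vocabulary are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None  # set when a dictionary entry ends at this node


def build_trie(words):
    """Insert every word character by character (works for any Unicode text)."""
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.word = w
    return root


def fuzzy_search(root, query, max_dist):
    """Return (word, distance) pairs with Levenshtein distance <= max_dist."""
    results = []

    def walk(node, ch, prev_row):
        # Extend the edit-distance table by one row for the character `ch`.
        row = [prev_row[0] + 1]
        for i in range(1, len(query) + 1):
            cost = 0 if query[i - 1] == ch else 1
            row.append(min(row[i - 1] + 1, prev_row[i] + 1, prev_row[i - 1] + cost))
        if node.word is not None and row[-1] <= max_dist:
            results.append((node.word, row[-1]))
        if min(row) <= max_dist:  # prune branches that can no longer match
            for c, child in node.children.items():
                walk(child, c, row)

    first_row = list(range(len(query) + 1))
    for c, child in root.children.items():
        walk(child, c, first_row)
    return sorted(results, key=lambda t: (t[1], t[0]))
```

    A reverse dictionary tree in the sense of the abstract could plausibly be built the same way over reversed strings, so that a long query can be processed from both ends; that reading is an assumption, not something the abstract states.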
  • Information Retrieval and Question Answering
    ZUO Jiali, WANG Mingwen, WU Shuixiu, WAN Jianyi
    2016, 30(2): 107-112.
    Models that exploit the position and proximity information of query terms in documents improve retrieval performance, but with high computational complexity. This paper presents an approximation method that computes the relevance of each sentence to the query, resulting in an information retrieval model that incorporates sentence-level retrieval. Experimental results show that the model performs better than the baseline models.
  • Information Retrieval and Question Answering
    CHEN Zhenhong, YU Xiaoming, LIU Yue, CHENG Xueqi
    2016, 30(2): 113-120.
    Implicit relevance feedback has been used to improve the performance of retrieval systems. In contrast to most recent related work, which focuses on implicit positive feedback, this paper investigates the usefulness of combining implicit positive and negative feedback with a time factor. The implicit negative feedback used in this paper is the unclicked documents that precede a clicked document in the same search session. By estimating the time spent on the title and snippet of an unclicked document, the time factor is introduced to infer the relationship between users' interests and behaviors. A unified time factor model called TIPNF is thus proposed, which uses both implicit positive and negative feedback to improve retrieval performance. Experiments on TREC Session 2011 and 2012 verify the effectiveness and stability of TIPNF.
  • Information Retrieval and Question Answering
    CHENG Shuyang, XIONG Jinhua, GONG Shuai, CHENG Xueqi
    2016, 30(2): 121-127.
    This paper proposes a probabilistic latent semantic indexing model based on the query graph (GPLSI) to learn query features for query clustering. Modeling query-word co-occurrence and query-query co-occurrence, GPLSI simulates the generation of query intent and its representation based on query text, click and session information, and learns the probability distribution of each query over different intents. Experimental results illustrate GPLSI's effectiveness in query similarity measurement and multi-intent query clustering.
  • Information Retrieval and Question Answering
    LIU Yaning, YAN Rui, YAN Hongfei
    2016, 30(2): 128-135.
    Automatic citation recommendation based on citation context is a highly valued research topic. Existing work focuses only on content-based methods. In this paper, we treat citation recommendation as content-based analysis combined with personalization. Using a user's publication and citation history as the user profile, together with a language model, we propose a PCR (personalized citation recommendation) model. Experiments indicate a 71.01% improvement in recall@10 and a 70.23% improvement in MAP compared with the traditional language model.
  • Information Retrieval and Question Answering
    LIU Wei, CHEN Xu, LIANG Yongsheng
    2016, 30(2): 136-141.
    Internet labeling and tags have been used extensively to describe image content on the Web. To understand and utilize these tags for semantic image retrieval, this paper introduces a visual saliency model to emphasize salient information and then extracts visual features to describe the similarity between images. Finally, a novel random walk is proposed to balance the influence of image content and tags. Experiments show the effectiveness and feasibility of the proposed method when applied to image understanding and retrieval.
  • Information Retrieval and Question Answering
    WANG Dongsheng, WANG Shi, WANG Weimin, LIU Liangliang, FU Jianhui
    2016, 30(2): 142-152.
    In a QA system, user queries are usually not isolated but correlated. This paper proposes an ontology and semantic grammar based method for interactive question answering and develops a QA system called OSG-IQAs on top of an existing non-contextual question answering system. We first propose a discourse structure to maintain the semantic information (i.e., the understanding result) of questions, and then use an approach to recognize the specific type of relevancy between the previous question and the follow-up question. We then propose an algorithm that fuses different contextual information (recorded in the discourse structure) into the current follow-up question according to the specific relevancy type. Lastly, the transformed question is resubmitted to the non-contextual question answering system. We evaluate the proposed method on two real contextual QA data sets from two areas of different scales. The results show that the proposed method has better scalability; it achieves overall performance better than a baseline system and almost the same as another comparison system whose contextual phenomena are manually resolved.
  • Information Retrieval and Question Answering
    CHEN Weizheng, YAN Rui, YAN Hongfei, LI Xiaoming
    2016, 30(2): 153-159.
    This paper presents a novel method to enhance graph-based multi-document summarization by incorporating Wikipedia entities. The Wikipedia content of high-frequency entities is extracted and arranged as the background knowledge of the document collection. The PageRank algorithm is then used to rank the sentences in the document collection, and an improved DivRank algorithm is applied to rank the sentences in both the document collection and the background knowledge. Finally, the summary sentences are chosen based on a linear combination of these two ranking results. Experiments on the Document Understanding Conference (DUC) 2005 data show that the proposed method can effectively use Wikipedia knowledge to improve summary quality.
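    The ranking-and-combination step can be sketched as below. This is only an illustration under stated assumptions: sentence similarity is taken to be Jaccard word overlap, PageRank is computed by plain power iteration, and the second ranking (the improved DivRank in the paper) is not implemented here and is simply passed in as a list of scores. The mixing weight `lam` and all function names are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (an assumed measure)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def sentence_graph(sentences):
    """Square similarity matrix with a zero diagonal (no self-loops)."""
    n = len(sentences)
    return [
        [0.0 if i == j else jaccard(sentences[i], sentences[j]) for j in range(n)]
        for i in range(n)
    ]


def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration on a weighted graph; isolated nodes keep only the base score."""
    n = len(weights)
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                scores[i] * weights[i][j] / out_weight[i]
                for i in range(n)
                if out_weight[i] > 0
            )
            for j in range(n)
        ]
    return scores


def combine(first, second, lam=0.5):
    """Linear combination of two score lists of equal length."""
    return [lam * a + (1 - lam) * b for a, b in zip(first, second)]


def top_k(scores, k):
    """Indices of the k highest-scoring sentences."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

    A caller would compute `pagerank(sentence_graph(sentences))`, obtain the second score list from whatever diversity-aware ranker is available, and pass both to `combine` before selecting sentences with `top_k`.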
  • Information Retrieval and Question Answering
    LI Lishuang, GUO Rui, HUANG Degen, ZHOU Huiwei
    2016, 30(2): 160-167.
    As an important branch of biomedical information extraction, Protein-Protein Interaction (PPI) extraction has great research significance. Current PPI research mainly relies on traditional machine learning, which requires large amounts of annotated corpora for training and makes labeling new data costly. This paper applies transfer learning to extract PPI with a small amount of labeled target-domain (in-domain) data, drawing support from annotated source-domain (out-of-domain) data. To avoid the negative transfer caused by large differences between the distributions of the domains, we adjust the weight of each source-domain instance according to its relative distribution. Experiments on the AIMed and IEPA corpora reveal the efficiency of our algorithms.
  • Information Retrieval and Question Answering
    XU Xia, LI Peifeng, ZHU Qiaoming
    2016, 30(2): 168-174.
    Semi-supervised and unsupervised event extraction remains a challenge. Exploiting the nature of the Chinese language, this paper proposes a dual-view bootstrapping approach to extracting event patterns. Starting from a small set of seeds, it applies a cross-filtering method to two views, document relevance and semantic similarity, and extracts new patterns in each iteration. Experimental results show that the system outperforms existing systems.
  • Information Retrieval and Question Answering
    LIN Hailun, XIONG Jinhua, WANG Bo, CHENG Xueqi
    2016, 30(2): 175-181.
    The Deep Web refers to Web database content hidden behind HTML forms, which can be accessed only by submitting forms. Current web page collection technologies, which rely only on hyperlinks, cannot cover these resources effectively. This paper therefore proposes an approach to crawling the deep web based on domain knowledge sampling. First, it creates a domain attribute set using open source directory services and assigns the attributes based on a confidence function. Second, it uses the domain attribute set to select query interfaces and generate assignments. Finally, it uses a greedy algorithm to select the assignment with the highest confidence as a query instance for deep web crawling. Experiments show that the approach can effectively collect deep web resources.
  • Information Retrieval and Question Answering
    YU Jun, GUO Yan, ZHANG Kai, LIU Lin, LIU Yue, YU Xiaoming, CHENG Xueqi
    2016, 30(2): 182-188.
    Structure-oriented web page clustering is one of the most important techniques in web data mining. Previous methods do not give a formal definition of a web page cluster center and must calculate several point-wise similarities to obtain the similarity between a point and a cluster or between two clusters. These methods are much slower than clustering algorithms that use cluster centers, and in particular they cannot meet the needs of large-scale, fast incremental web page clustering. To solve these issues, this paper proposes a fast incremental clustering method, FPC (Fast Page Clustering). First, a new approach to calculating the similarity between two web pages is given, which is 500 times faster than the Simple Tree Matching algorithm. Then a formal representation of the web page cluster center is described, and a Kmeans-like clustering algorithm, MKmeans (Merge-Kmeans), is applied for fast clustering. Moreover, locality-sensitive hashing is used to quickly find the most similar cluster in a large-scale cluster set, improving the efficiency of incremental clustering.
  • Information Retrieval and Question Answering
    CHEN Xiaorong, LIU Zuoguo
    2016, 30(2): 189-195.
    This paper presents a distance-oriented reorganization strategy in which clusters can be reorganized independently of the clustering process. The concept of the Nearest Domain is proposed and Nearest Domain rules are elaborated. A Gauss Weighing Algorithm is then designed to re-weight a text according to its distance from the cluster kernel. Finally, Nearest Domain Weights are used to separate sparse clusters, adjust abnormal texts and combine similar clusters. Clustering experiments show that the reorganization process effectively improves accuracy and recall and makes the results more reasonable by increasing the inner density of clusters.
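    One simple way to weight a text by its distance from a cluster kernel is a Gaussian of that distance, as sketched below. The abstract does not give the paper's actual formula, so the Euclidean distance, the bandwidth parameter `sigma` and the function names are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def gauss_weight(text_vec, kernel_vec, sigma=1.0):
    """Weight in (0, 1]: 1 at the kernel, decaying as the text moves away.

    `sigma` is an assumed bandwidth; a smaller value penalizes distance faster.
    """
    d = euclidean(text_vec, kernel_vec)
    return math.exp(-(d * d) / (2 * sigma * sigma))
```

    Under this sketch a text sitting on the kernel gets weight 1, and texts far from every kernel receive weights near 0, which is the kind of signal that could mark them as abnormal.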
  • Machine Translation
  • Machine Translation
    ZHENG Jianxi, BAI Yu, GUO Cheng, ZHANG Guiping
    2016, 30(2): 196-201.
    Research on Wikipedia Cross-Lingual Link Discovery (CLLD) aims to automatically identify topic-related anchor texts in source-language Wikipedia articles and to recommend a set of relevant target-language links for each anchor text. It involves three key problems: anchor text identification, anchor text translation and target link discovery. To deal with multiple target translations of an anchor text, we propose a context-based translation selection method that uses voting based on pointwise mutual information (PMI). Experiments on translation selection for person names, terminology and abbreviations in Chinese and English Wikipedia articles show that the method achieves good performance.
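    A PMI-based vote over context words might look like the sketch below. It assumes that each context word casts one vote for the candidate translation with which it has the highest PMI, where PMI(c, w) = log(p(c, w) / (p(c) p(w))) is estimated from co-occurrence counts; the abstract does not detail the voting scheme, and the counts, names and parameters here are hypothetical.

```python
import math
from collections import Counter


def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information from raw counts; -inf if never co-occurring."""
    if pair_count == 0:
        return float("-inf")
    return math.log((pair_count * total) / (count_a * count_b))


def select_translation(candidates, context, unigram, cooc, total):
    """Each context word votes for the candidate it is most associated with."""
    votes = Counter()
    for word in context:
        scores = {
            c: pmi(cooc.get((c, word), 0), unigram.get(c, 0), unigram.get(word, 0), total)
            for c in candidates
        }
        best = max(scores, key=scores.get)
        if scores[best] > float("-inf"):  # skip words unrelated to every candidate
            votes[best] += 1
    return votes.most_common(1)[0][0] if votes else None
```

    Voting keeps a single strongly associated context word from dominating the decision, at the cost of ignoring how large each PMI margin is; summing PMI scores would be the obvious alternative.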
  • Other Language in/around China
  • Other Language in/around China
    CAI Zhi-jie, CAI rang-zhuoma
    2016, 30(2): 202-206.
    The properties of Tibetan characters are essential for Tibetan information processing and are of substantial significance in education and scientific research. Because Tibetan characters are written by combining 1-7 components in both horizontal and vertical directions, the properties of Tibetan characters include their structure, length and frequency, as well as the locality features of each character. This paper establishes a vector model of Tibetan characters (VMTT), and a vector model (VMTS) and a sparse-land model (SLM) of Tibetan character strings, and conducts a component feature analysis of Tibetan characters based on these models.
  • Other Language in/around China
    Muhetaer Shadike, Buheliqiguli Wasili, LI Xiao
    2016, 30(2): 207-212.
    In this paper we introduce two word segmentation methods for Uyghur keyword search. They are implemented in MATLAB, and their performance is investigated under the same conditions. Finally, some ideas for optimization are given.
  • Other Language in/around China
    Wangsiriguleng, Wang Chunrong, Siqintu, Arong, Yuxia
    2016, 30(2): 213-216.
    Mongolian is rich in morphological variation, especially in its verbs. Using a given Mongolian verb dictionary, we correct the wrong verb forms that appear at the end of sentences produced by hierarchical phrase-based Chinese-Mongolian machine translation. Experiments show that this method can improve translation quality.