2016 Volume 30 Issue 1 Published: 15 January 2016
  

  • Review
    LIU Bingquan, XU Zhen, LIU Feng, LIU Ming, SUN Chengjie, WANG Xiaolong
    2016, 30(1): 1-8.
    Community-based question answering (CQA) portals have become very popular in recent years, providing a platform for users to share knowledge or seek information and accumulating abundant question-answer (QA) pairs. Much progress has been made in question search, expert finding, and content quality evaluation for CQA. In particular, research on content quality has shifted from answer quality evaluation to answer summarization, which improves answer quality from the perspective of completeness. This paper surveys the motivation and task of answer summarization, reviewing the most relevant approaches and principal techniques.
  • Review
    XI Jiazhen, GUO Yan, LI Qiang, ZHAO Ling, LIU Yue, YU Xiaoming, CHENG Xueqi
    2016, 30(1): 8-16.
    To deal with ever-growing short-content web pages, this paper proposes to first classify web pages into two types: short-content pages and long-content pages. Then, an algorithm for extracting content from short-content pages is designed by combining DOM tree depth and text density.
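The DOM-depth/text-density idea can be sketched as follows; the depth limit and minimum text length here are illustrative assumptions, not the paper's tuned values, and void tags are ignored for simplicity:

```python
from html.parser import HTMLParser

class DensityParser(HTMLParser):
    """Record each text fragment together with its DOM depth."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.blocks = []  # (depth, text) fragments

    def handle_starttag(self, tag, attrs):
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.blocks.append((self.depth, text))

def extract_content(html, max_depth=6, min_len=20):
    """Keep fragments that are shallow enough in the DOM tree and long
    enough to look like body text rather than navigation chrome."""
    parser = DensityParser()
    parser.feed(html)
    return [text for depth, text in parser.blocks
            if depth <= max_depth and len(text) >= min_len]
```

A real extractor would also normalize density per subtree; this sketch only shows the two signals the abstract names.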
  • Review
    LIU Qian, LIU Bingyang, HE Min, WU Dayong, LIU Yue, CHENG Xueqi
    2016, 30(1): 16-24.
    Entity attribute extraction is fundamental to information extraction and knowledge base construction. This paper proposes an approach to open-domain entity attribute extraction from online encyclopedias. The method collects potential attribute phrases by combining web page structure with domain-independent patterns. The acquired attribute patterns are then expanded with synonyms, which in turn helps to obtain sets of synonymous attributes. Experimental results show that the proposed approach boosts the coverage of extracted attributes without losing precision.
  • Review
    GUO Jianyi, CHEN Peng, YU Zhengtao, XIAN Yantuan, MAO Cunli, ZHAO Jun
    2016, 30(1): 24-30.
    This paper proposes a composite kernel approach to Chinese semantic relation extraction. An improved training matrix is designed using the mathematical properties of the radial basis kernel so that vectors are better dispersed in the training matrix; this kernel is then integrated with the polynomial kernel and the convolution tree kernel. The best parameters of the composite kernel function for Chinese semantic relation extraction are found by enumeration. In experiments on tourist-domain texts, the proposed method outperforms both single-kernel methods and a traditional composite kernel.
  • Review
    DU Yufeng, JI Duo, JIANG Lixue, ZHANG Guiping
    2016, 30(1): 30-36.
    This paper proposes a metric for patent similarity based on the Subject-Action-Object (SAO) structure. In contrast to traditional keyword-based approaches, this method captures patent structure and considers the relationships among patents. To extract SAO triples, the paper applies OLLIE, a recent information extraction tool, to the patent field. In addition, it investigates the action element, outlining its structure. Finally, the SAO structure is combined with the VSM model to calculate patent similarity, improving on the pure VSM-based approach.
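A minimal sketch of combining SAO matching with VSM similarity, assuming Jaccard overlap on triples, cosine similarity for the VSM part, and a linear interpolation weight `alpha` (all illustrative; the abstract does not give the paper's exact combination):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters (the VSM part)."""
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def sao_overlap(saos1, saos2):
    """Jaccard overlap of (subject, action, object) triple sets."""
    s1, s2 = set(saos1), set(saos2)
    return len(s1 & s2) / max(len(s1 | s2), 1)

def patent_similarity(tokens1, tokens2, saos1, saos2, alpha=0.5):
    """Hypothetical linear combination of SAO overlap and VSM cosine."""
    return (alpha * sao_overlap(saos1, saos2)
            + (1 - alpha) * cosine(Counter(tokens1), Counter(tokens2)))
```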
  • Review
    LI Haorui, WANG Jian, LIN Hongfei, YANG Zhihao, ZHANG Yijia
    2016, 30(1): 36-43.
    Word sense ambiguity challenges trigger detection in biological event extraction. This paper proposes a hybrid method that combines different learners trained with rich features to handle word sense ambiguity in trigger detection. Specifically, we address trigger detection by assigning an event type to each token, adopting a multi-class SVM classifier and a Random Forest. Experiments on the BioNLP 2009 shared task dataset show that the method achieves good performance.
  • Review
    XU Jiajun, YANG Yang, YAO Tianfang, FU Zhongyang
    2016, 30(1): 43-50.
    As information on the Internet booms, how to analyze and extract useful information from large-scale data becomes a valuable question. As online corpora, forum and microblog corpora bring new difficulties owing to their complex structure and very brief texts. This paper proposes a method that uses LDA to model the corpus and extract topics from it. We then find supporting documents and estimate topic strength, which is projected onto the timeline to reflect topic tendency. Finally, we re-model the corpus within each period to reflect the variation of each topic relative to the overall topics. Experimental results show that the approach is reasonable and effective.
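The topic-strength step can be illustrated as below, assuming topic strength within a time period is the average per-document topic probability in that period (a common choice, not necessarily the paper's exact definition):

```python
from collections import defaultdict

def topic_strength(doc_topics, doc_periods, num_topics):
    """Average topic probability per time period.

    doc_topics: one LDA topic distribution per document
    doc_periods: the time-period id of each document
    Returns {period: [strength of topic 0, topic 1, ...]}.
    """
    sums = defaultdict(lambda: [0.0] * num_topics)
    counts = defaultdict(int)
    for dist, period in zip(doc_topics, doc_periods):
        counts[period] += 1
        for k, p in enumerate(dist):
            sums[period][k] += p
    return {per: [s / counts[per] for s in sums[per]] for per in sums}
```

Plotting these values over consecutive periods gives the timeline tendency the abstract describes.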
  • Review
    GUO Cheng, BAI Yu, ZHENG Jianxi, CAI Dongfeng
    2016, 30(1): 50-56.
    To deal with vagueness and ambiguity in user queries, this paper proposes an unsupervised approach to subtopic mining for user intents. First, the method uses the ATF×PDF model to extract candidate topic words from the search results. Then, it groups latent topics via hierarchical clustering based on HowNet semantic similarity. Finally, it employs the LCS algorithm to generate diversified subtopics. Experimental results show an average score of 0.573 on the D#-nDCG@10 measure.
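The final step relies on the standard longest-common-subsequence algorithm; a sketch over token lists (how the paper turns the LCS into subtopic phrases is not detailed in the abstract):

```python
def lcs(a, b):
    """Longest common subsequence of two token lists, via the classic
    O(len(a)*len(b)) dynamic program plus a backtrack."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    # Backtrack from the bottom-right corner to recover one LCS.
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]
```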
  • Review
    LIAO Yanan, WANG Mingwen, ZUO Jiali, WU Genxiu, GAN Lixin
    2016, 30(1): 56-63.
    Information retrieval can usually be improved by incorporating additional information mined from the retrieval process. To take full advantage of the correlations among existing queries, terms, and documents for query expansion and reconstruction, we propose an information retrieval model based on a multilayer Markov network, constructed from the correlations within the query network, the term network, and the document network. A clique model is further designed to speed up computation. Experiments on standard datasets indicate that our model effectively integrates the three kinds of information for improved retrieval.
  • Review
    ZHU Yadong, GUO Jiafeng, LAN Yanyan, CHENG Xueqi
    2016, 30(1): 63-71.
    In a result cache, either document identifiers (docID cache) or the actual HTML pages (page cache) can be stored to accelerate response. For a fixed memory size, the docID cache achieves a higher hit ratio while the page cache obtains higher response speed. This paper proposes a novel hierarchical result caching scheme based on temporal and spatial locality, in which the result cache is split into two layers: a page cache and a docID cache. In our scheme, the page cache is the first choice for answering queries, followed by the docID cache. In terms of average query response time, the proposed approach outperforms the baseline method by 9% on average, and by up to 11% in the best case. The scheme also includes an adaptive prefetching strategy based on the docID cache, and experiments show that combining the two yields an additional performance improvement. We thus build a complete and effective result caching scheme by capturing the temporal and spatial locality of user search behaviors.
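A toy sketch of the two-layer lookup order, using LRU eviction as an assumed replacement policy (the paper's policy, layer sizes, and prefetching logic are not specified in the abstract):

```python
from collections import OrderedDict

class TwoLayerCache:
    """Page layer is checked first; on a page miss, a docID hit still
    avoids re-running the query (only page rendering is redone)."""
    def __init__(self, page_cap, docid_cap):
        self.page = OrderedDict()    # query -> rendered HTML page
        self.docid = OrderedDict()   # query -> list of document ids
        self.page_cap, self.docid_cap = page_cap, docid_cap

    def _put(self, cache, cap, key, value):
        cache[key] = value
        cache.move_to_end(key)       # mark as most recently used
        if len(cache) > cap:
            cache.popitem(last=False)  # evict least recently used

    def get(self, query):
        if query in self.page:
            self.page.move_to_end(query)
            return ("page_hit", self.page[query])
        if query in self.docid:
            self.docid.move_to_end(query)
            return ("docid_hit", self.docid[query])
        return ("miss", None)

    def put(self, query, page, docids):
        self._put(self.page, self.page_cap, query, page)
        self._put(self.docid, self.docid_cap, query, docids)
```

Because docIDs are far smaller than pages, the docID layer can hold many more queries, which is what lets this split trade hit ratio against response speed.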
  • Review
    HU Yi, LIU Yunfeng, YANG Haisong, ZHANG Xiaopeng, DUAN Jianyong, ZHANG Mei, QIAO Jianxiu
    2016, 30(1): 71-79.
    This paper deals with Chinese query correction in a real-world search engine, where a wrong query usually confuses the engine. We propose a complete approach to correcting Chinese queries, comprising query candidate generation and kernel-based candidate evaluation and ranking. Evaluated on a test set by precision/recall and verified in our search engine by DCG, the approach achieves good results.
  • Review
    HU Yi, LIU Yunfeng, DUAN Jianyong, XIONG Zhanzhi, QIAO Jianxiu, ZHANG Mei
    2016, 30(1): 79-85.
    The time-sensitivity of a query in web search reflects its demand for news pages: a time-related factor is used to balance the other factors when ranking pages to satisfy users' search needs. This paper presents a computing model for query time-sensitivity that models users' search behaviors and media reports separately. The two sub-models are then combined to compute the final time-sensitivity score of a query during search. The score provides the ranker with quantified evidence to boost or reduce the weights of news pages and, further, supports the special news box shown on the result page. The proposed model yields satisfactory performance and effective user feedback in both manual and click-through-rate experiments.
  • Review
    HUANG Shanshan, MA Jun, GUO Lei, WANG Shuaiqiang
    2016, 30(1): 85-93.
    Collaborative filtering (CF) is one of the most popular recommendation techniques in practice. However, data sparsity and cold start remain two challenges in CF applications. Since Linked Data integrates rich and structured features about entities, this paper proposes leveraging Linked Data to improve CF recommendation. Based on matrix factorization (MF), we develop a novel CF model, LinkMF, that incorporates structured Linked Data to mitigate the data sparsity and cold start problems while preserving recommendation accuracy. In particular, we extract features from the Linked Data to construct item profiles, propose new similarity metrics between items, and incorporate the obtained item similarities into the basic MF model to constrain and improve the factorization process. Comprehensive experiments on the MovieLens and YAGO benchmark datasets demonstrate promising results for LinkMF compared with state-of-the-art CF models, especially in data sparsity and cold start scenarios.
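One plausible reading of similarity-constrained factorization is sketched below: a standard MF gradient step plus a term pulling each item's factor vector toward the similarity-weighted mean of its neighbors. The objective, Jaccard similarity, learning rate, and regularization weights are all illustrative assumptions, not LinkMF's published formulation:

```python
import numpy as np

def item_similarity(features_a, features_b):
    """Jaccard similarity over Linked-Data feature sets (illustrative)."""
    a, b = set(features_a), set(features_b)
    return len(a & b) / max(len(a | b), 1)

def linkmf_step(P, Q, R, S, lr=0.01, lam=0.1, beta=0.1):
    """One gradient step of MF with an item-similarity regularizer.

    P: user factors, Q: item factors, R: rating matrix (0 = unobserved),
    S: item-item similarity matrix built from Linked Data features.
    """
    mask = R > 0
    E = mask * (R - P @ Q.T)                  # error on observed entries only
    gradP = -E @ Q + lam * P
    # Pull q_i toward the similarity-weighted mean of its neighbors.
    W = S / np.maximum(S.sum(axis=1, keepdims=True), 1e-12)
    gradQ = -E.T @ P + lam * Q + beta * (Q - W @ Q)
    return P - lr * gradP, Q - lr * gradQ
```

Because `S` is defined by entity features rather than ratings, cold items with no ratings still receive a meaningful constraint, which is the intuition behind using Linked Data here.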
  • Review
    GU Wanrong, DONG Shoubin, ZENG Zhizhao, HE Jinchao, LIU Chong
    2016, 30(1): 93-101.
    News recommendation is a popular research issue, often built on logs of users' behaviors. However, many news sites cannot require users to register before browsing news articles. As a mainstream self-media form, microblogs are rich in individual tweets and retweets. In this paper, we propose a novel personalized news recommendation method based on microblog user profiles, which classifies news items and analyzes microblogs to construct user profiles. Experimental results show that our system achieves better efficiency and practical effect than state-of-the-art algorithms.
  • Review
    LI Guochen, LV Lei, WANG Ruibo, LI Jihong, LI Ru
    2016, 30(1): 101-108.
    This paper presents an approach to automatic semantic role labeling using the lexical resource Tongyici Cilin, in which a CRFs model is built with a series of new features derived from Cilin's encoded information. In addition to word, part-of-speech, and word-position features, the proposed method investigates Cilin features on the Chinese FrameNet (CFN) corpus, developed by Shanxi University to describe semantic knowledge. Experimental results show a significant performance improvement after adding the Cilin features.
  • Review
    CHEN Sufen, ZENG Xueqiang
    2016, 30(1): 108-115.
    For data mining over large-scale and streaming text data, incremental dimension reduction is an essential technique. As a state-of-the-art solution, Candid Covariance-free Incremental Principal Component Analysis (CCIPCA) applies only approximate centering to the input: the current sample is centered, but the historical statistics are not properly updated with the new mean. In this paper, we propose a Centred Incremental Principal Component Analysis (CIPCA) algorithm with an exact historical mean update. Compared with CCIPCA, the proposed method not only correctly centers the current sample but also correctly updates all historical statistics with the current mean. Experiments on streaming text datasets show that CIPCA converges more quickly as data flow in, and the improvement is especially obvious when the data's inherent covariance is not stable.
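The exact running-mean update at the heart of this centering can be written in one line; the sketch below shows only the mean and centering step, not the eigenvector update of the full algorithm:

```python
import numpy as np

def incremental_mean(mean, n, x):
    """Exact running-mean update: fold the (n+1)-th sample x into the
    mean of the first n samples."""
    return mean + (x - mean) / (n + 1)

def center_stream(samples):
    """Center each incoming sample against the exact up-to-date mean,
    without revisiting historical samples."""
    mean, centered = None, []
    for n, x in enumerate(samples):
        mean = x.copy() if mean is None else incremental_mean(mean, n, x)
        centered.append(x - mean)
    return mean, centered
```

The point the abstract makes is that the eigenvector estimate must also be corrected for this moving mean, not just the current sample.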
  • Review
    XU Fan, WANG Mingwen, XIE Xusheng, LI Maoxi, WAN Jianyi
    2016, 30(1): 115-124.
    This paper presents an unsupervised discourse coherence model based on theme-rheme structure theory, in contrast to current supervised entity-based models and discourse relation grid models. Our model describes discourse coherence by calculating the similarity between the themes or rhemes of adjacent sentences, incorporating semantic knowledge such as word stems, hypernyms, hyponyms, synonyms, and paraphrases. This paper also presents a simple and effective coherence model based on counting the discourse relations within a discourse, and integrates it with the theme-rheme model by linear combination. Evaluation on a benchmark English student essay dataset shows the effectiveness of the combined coherence model, which significantly outperforms baselines in the literature.
  • Review
    BAO Feilong, GAO Guanglai, WANG Hongwei
    2016, 30(1): 124-129.
    To improve in-vocabulary performance in the Mongolian speech keyword spotting task, we propose a method that searches on stems, following the characteristics of Mongolian word formation. First, Mongolian speech is decoded into a lattice by a segmentation-based LVCSR system, and the lattice is converted into a confusion network. Keywords are then detected by their stems within the confusion network. Experimental results show that the proposed method outperforms baselines based on the word confusion network.
  • Review
    WANG Chengping
    2016, 30(1): 129-133.
    This paper describes the design of a Yi language corpus database on SQL Server 2008. The system can automatically store Yi corpora in both U (Unicode Yi characters) and Y (YIWIN Yi characters) encodings. A C/S-style access module is also implemented, which enables remote access via a web browser. The reported practice can contribute to similar tasks for other minority languages.
  • Review
    MA Zhiqiang, ZHANG Zeguang, YAN Rui, LIU Limin, FENG Yongxiang, SU Yila
    2016, 30(1): 133-140.
    With the rapid increase of Mongolian texts on the Internet, identifying them before further processing is of practical significance. This paper proposes an average-distance recognition algorithm based on the N-gram model, and an experimental platform is established. Experimental results show that the presented algorithm can identify Mongolian text among Chinese, English, and even mixed-language texts, with an accuracy above 99.5%.
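An N-gram language identifier in this spirit can be sketched with character n-gram profiles and a rank-based distance (a Cavnar-Trenkle-style stand-in for the paper's average-distance measure; the profile size and miss penalty are assumptions):

```python
from collections import Counter

def ngram_profile(text, n=2, top=50):
    """Top character n-grams of a text, ranked by frequency."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place_distance(profile, ref, max_rank=50):
    """Sum of rank differences; n-grams absent from the reference profile
    pay a fixed maximum penalty."""
    return sum(abs(i - ref.index(g)) if g in ref else max_rank
               for i, g in enumerate(profile))

def identify(text, refs):
    """Pick the reference language whose profile is closest to the text."""
    profile = ngram_profile(text)
    return min(refs, key=lambda lang: out_of_place_distance(profile, refs[lang]))
```

For Mongolian versus Chinese or English, character n-grams separate the scripts almost immediately, which is consistent with the very high accuracy reported.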
  • Review
    LUO Yawei, TIAN Shengwei, YU Long, Turgun·Ibrahim, Askar·Hamdulla
    2016, 30(1): 140-148.
    Traditional research on sentiment analysis determines the sentiment of a word, a sentence, or the whole text, ignoring the topics involved in the sentiment expressions. In contrast, this paper proposes a cascaded CRFs model to analyze sentiment at the claim level in Uyghur opinionated text. The first layer extracts the topic word and its corresponding opinion word and determines the scope of the opinionated claim; its result is then passed to the second layer as a key feature for claim-level sentiment analysis. The goal of this fine-grained opinion mining is to build the quadruple <opinionated claim, topic word, opinion word, sentiment>. Experiments show that the precision and recall of sentiment analysis reach 77.41% and 78.51%, respectively, demonstrating the effectiveness of the proposed method on fine-grained sentiment analysis.
  • Review
    WANG Huiyun, YU Long, TIAN Shengwei, Jiamila Wushouer, FENG Guanjun
    2016, 30(1): 148-156.
    The identification of comparative sentences and the extraction of comparative relations are of substantial significance to fine-grained opinion mining. This paper outlines the framework of Uyghur comparative sentence identification and proposes a two-level identification model. A Bidirectional CSR Mining algorithm (Bi-CSR) is designed to mine sequential patterns, and an SVM classifier is then applied to classify a Uyghur sentence as comparative or not. The experimental results demonstrate the effectiveness of the proposed method.
  • Review
    LIU Wei, LI Hecheng
    2016, 30(1): 156-162.
    Since handwritten Uyghur characters vary dramatically in aspect ratio, normalization with a single template cannot effectively increase the differences between characters of different classes. This paper proposes a multi-template normalization algorithm to deal with the shape characteristics of Uyghur characters. In the training stage, character features are extracted under multi-template normalization to train a separate classifier per template. In the recognition stage, the divergence direction of the main strokes selects the best template, and the features of the normalized character are fed to the corresponding classifier. Experimental results show that the multi-template normalization algorithm outperforms single-template baselines in recognition performance.
  • Review
    LOU Xinyan, LIU Yang, YU Xiaohui
    2016, 30(1): 162-170.
    Session identification has attracted much attention since it provides insight into users' behavior patterns. A traffic-data session is the sequence of crossroads a user passes to complete a certain task. In this paper, the timeout method and a statistical language model are utilized to identify sessions. The timeout method captures the effect of the time interval between neighboring crossroads on session identification, while the statistical language model considers the global regularity of crossroad sequences. Extensive experiments indicate that the time factor has a larger influence on session identification than global regularity.
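The timeout method is straightforward to sketch; the 1800-second threshold below is a conventional web-session default, not the paper's tuned value for traffic data:

```python
def split_sessions(events, timeout=1800):
    """Split (timestamp, crossroad) records into sessions: a new session
    starts whenever the gap between consecutive records exceeds `timeout`
    seconds."""
    sessions, current = [], []
    last_t = None
    for t, crossroad in sorted(events):
        if last_t is not None and t - last_t > timeout:
            sessions.append(current)
            current = []
        current.append(crossroad)
        last_t = t
    if current:
        sessions.append(current)
    return sessions
```

The language-model side would then score each candidate crossroad sequence for global plausibility; that part is omitted here.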
  • Review
    YANG Ya, YANG Zhihao, LIN Hongfei, GONG Bendong, WANG Jian
    2016, 30(1): 170-176.
    Biomedical named entity recognition aims to find names of specified types in biomedical texts. Research on entity recognition has grown sharply because it is particularly important for automatically extracting biological entity information from huge amounts of biomedical text. In this paper, we present MBNER (Multiple Biomedical Named Entity Recognizer), a system that recognizes gene (protein), drug, and disease entities in biomedical texts simultaneously. It achieves F-scores of 0.8905, 0.7673, and 0.9012 on the gene (protein), drug, and disease datasets, respectively. In addition, the system visualizes the recognition results for the different entity types.
  • Review
    LI Zongyao, YANG Zhihao, WU Xiaofang, LIN Hongfei, GONG Bendong, WANG Jian
    2016, 30(1): 176-183.
    The biomedical literature is growing at an explosive speed, and much useful information remains undiscovered in it; for instance, researchers can form biomedical hypotheses by mining the literature. However, the popular co-occurrence-based mining solution produces too many useless target concepts. This paper presents a mining method based on semantic resources, using the SemRep system to extract relationships between entities within sentences. By combining semantic types, concept information content, and association rules, we effectively filter the linking and target concepts and rank the target concepts with statistical information. Experimental results demonstrate that our method works well on the classic cases found by Swanson.
  • Review
    GUO Yanwei, WU Yuexin, ZHAO Xin, YAN Hongfei, HUANG Jianxing
    2016, 30(1): 183-190.
    User churn prediction is a research issue in many fields, with current solutions usually built upon classification models. For the rapidly developing domain of online games, churn prediction is not yet well addressed. This paper analyzes user behaviors in the logs of a selected online game, finding significant differences between churned and normal users in game investment, interest in lotteries, and player interaction. It also identifies challenges in online game data processing such as unbalanced data, a huge number of candidate features, and interfering differences, and discusses directions for feature selection as well as the key role of normalization and alignment in feature processing. Experiments show that the selected features are informative.
  • Review
    HAO Xiulan, XU Fangqu, JIANG Yunliang
    2016, 30(1): 190-198.
    An approach is introduced to acquire fake Chinese reviews semi-automatically. It mainly includes a platform to collect fake reviews, a syntactic parser, and a sentiment analysis component. The emphasis is on syntax-based extraction of the sentiment pair <comment object, comment phrase>. Finally, we analyze experimental results and give suggestions for improving the accuracy of sentiment analysis.
  • Review
    HU Hongsi, YAO Tianfang
    2016, 30(1): 198-204.
    We propose a method to extract aligned sentences from a comparable corpus derived from Wikipedia. First, we retrieve the English and Chinese Wikipedia data dumps and reconstruct them into a local Wikipedia corpus database. Second, we extract the lexicon, especially the named entity lexicon, and obtain a bilingual comparable corpus through the alignment mechanism provided by Wikipedia. Third, we analyze the characteristics of the Wikipedia corpus manually and design a series of features. Adopting a taxonomy of alignment/partial alignment/non-alignment, we finally apply an SVM classifier to identify sentence alignment. Experiments on the Wikipedia corpus and a third-party parallel corpus show that the proposed method achieves precisions of 0.82 and 0.92, respectively.
  • Review
    HE Gang, LV Xueqiang, XIAO Shibin, WANG Fan
    2016, 30(1): 204-210.
    Term categorization plays an important role in domain ontology construction and domain vocabulary collection. To deal with misclassified terms in the conceptual knowledge element library of CNKI, this paper proposes a paraphrase-expansion method for term categorization. The approach uses paraphrase expansion together with term-related knowledge obtained via web search to reconstruct the term vectors. The final categorization is decided by the distance between a term vector and the class centroid vectors. The overall precision reaches 73.32%, a nearly 10% relative improvement over the original method without expansion.