2009 Volume 23 Issue 2 Published: 15 April 2009
  

    Review
  • Review
    ZHAO Jun
    2009, 23(2): 3-17.
    Named entities are important meaningful units in texts. The recognition and analysis of named entities is of great significance in Web information extraction, Web content management, knowledge engineering and related fields. Research on named entities includes named entity recognition, disambiguation, coreference resolution, attribute extraction and relation detection. Focusing on named entity recognition, disambiguation and cross-lingual coreference resolution, the paper gives a thorough survey of the state of the art of these tasks, including the challenges, methods, evaluations, performance and open problems. The paper suggests that the performance of current systems for named entity recognition, disambiguation and cross-lingual coreference resolution falls far short of the requirements of large-scale practical applications. In terms of methods and approaches, named entity recognition, disambiguation and cross-lingual coreference resolution should be carried beyond natural language texts and investigated directly on large-scale, redundant, heterogeneous, ill-formed and noisy web pages.
  • Review
    XIN Xiao, FAN Shi-xi, WANG Xuan, WANG Xiao-long
    2009, 23(2): 18-22.
    This paper presents three algorithms for dependency parsing based on Maximum Entropy Models. The Maximum Spanning Tree (MST) algorithm achieves the best result. The goal of MST is to find a maximum spanning tree in a directed graph. Each edge of the directed graph corresponds to a dependency relation of the dependency parser, and the edge weights are obtained from a Maximum Entropy Model. The training and test data sets are the CoNLL 2008 shared task corpora. The system achieves F1 scores of 87.42 and 80.8 on the WSJ and Brown test data respectively, ranking sixth among all the competing teams.
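The idea of parsing as a maximum spanning tree search can be illustrated with a minimal sketch. It assumes the edge weights are already given as a matrix (in the paper they come from a Maximum Entropy Model, which is not reproduced here), and it uses exhaustive search over head assignments rather than an efficient MST algorithm, so it is only practical for very short sentences. The names `is_tree`, `best_dependency_tree` and the example weights are hypothetical.

```python
from itertools import product


def is_tree(heads):
    """heads[i] is the head of word i+1; 0 denotes the artificial root.

    The assignment is a tree if every word reaches the root by
    following head links without revisiting a word.
    """
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False  # cycle: this word never reaches the root
            seen.add(node)
            node = heads[node - 1]
    return True


def best_dependency_tree(scores):
    """Exhaustively find the highest-scoring dependency tree.

    scores[h][d] is the weight of the edge head h -> dependent d,
    where index 0 is the root. Returns (heads, total_weight).
    Assumption: several words may attach to the root.
    """
    n = len(scores) - 1
    # Each word may take any other node (including the root) as its head.
    candidates = [[h for h in range(n + 1) if h != d] for d in range(1, n + 1)]
    best_heads, best_total = None, float("-inf")
    for heads in product(*candidates):
        if not is_tree(heads):
            continue
        total = sum(scores[h][d] for d, h in enumerate(heads, start=1))
        if total > best_total:
            best_heads, best_total = heads, total
    return best_heads, best_total


if __name__ == "__main__":
    # Made-up weights for a three-word sentence (row = head, column = dependent).
    example = [
        [0.0, 1.0, 5.0, 1.0],
        [0.0, 0.0, 2.0, 1.0],
        [0.0, 4.0, 0.0, 3.0],
        [0.0, 3.5, 2.5, 0.0],
    ]
    print(best_dependency_tree(example))
```

The cycle check matters: picking the best head for each word independently can produce a cycle, which is why a spanning tree search is needed instead of a per-word maximum.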
  • Review
    LI Bo, GAO Wen-jun, QIU Xi-peng
    2009, 23(2): 23-27.
    This paper describes a new method for constructing the answer ranking model of a question answering system. The method leverages the knowledge density-based features used in answer ranking, introduces a new feature, the syntactic path, obtained through parsing, and establishes an evaluation function using a support vector machine regression model. Experiments show that the new model, which incorporates the syntactic path feature, achieves substantial improvements.
  • Review
    ZHANG Yu, ZHAO Xin, LIU Ting
    2009, 23(2): 28-33.
    Question understanding is an important part of a question answering system, especially for the Cooperate Question Answering system, in which users describe their questions in detail. This paper proposes an algorithm that combines dictionaries and parsing to exploit these crucial question narrations by extracting valuable keywords. Experiments show that our approach substantially improves the MPP and MAP of the question answering system.
  • Review
    LI Ru, WANG Wen-jing, LIANG Ji-ye, SONG Xiao-xiang, LIU Hai-jing, YOU Li-ping
    2009, 23(2): 34-40.
    Taking advantage of the semantic expression in Chinese FrameNet (CFN), this paper discusses the construction of a domain-specific Chinese FrameNet semantic base using OWL, and validates and analyzes its effectiveness through the design of a question answering system in the transportation domain. In the proposed QA system, questions are first classified by a combination of the TREC categories and the ontology categories. We then propose a CFN-based question analysis strategy aimed at the triple of the question: semantic predicate, semantic subject and semantic object. On the basis of the CFN semantic analysis, the answer is extracted from the tourism ontology base. The approach is implemented with the ontology editor Protégé, and the experiment demonstrates the validity of the method.
  • Review
    SU Sui, LIN Hong-fei, YE Zheng
    2009, 23(2): 41-47.
    Content-based spam filtering is one of the mainstream technologies in use today. After a brief review of the state of the art of content-based spam filtering, this paper proposes a character-based language modeling approach to the spam filtering task. We experimentally compare the performance of this approach with Naive Bayes, SVM and a word-based language modeling approach. Our experimental results show that the character-based language modeling approach achieves high performance and can easily be applied in large-scale online e-mail systems.
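A character-based language model classifier of this kind can be sketched as follows. The abstract does not give the model order, smoothing method or decision rule, so the character trigrams, add-one smoothing and log-likelihood comparison below are assumptions, and the names `CharNgramLM` and `classify` are hypothetical.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character n-gram language model with add-one smoothing (assumed)."""

    def __init__(self, n=3):
        self.n = n
        self.ngrams = Counter()    # counts of context + next character
        self.contexts = Counter()  # counts of the (n-1)-character context
        self.vocab = set()

    def _pad(self, text):
        # Start and end markers so that message boundaries are modeled too.
        return "\x02" * (self.n - 1) + text + "\x03"

    def train(self, texts):
        for text in texts:
            padded = self._pad(text)
            self.vocab.update(padded)
            for i in range(self.n - 1, len(padded)):
                context = padded[i - self.n + 1:i]
                self.ngrams[context + padded[i]] += 1
                self.contexts[context] += 1

    def log_prob(self, text):
        padded = self._pad(text)
        v = len(self.vocab) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for i in range(self.n - 1, len(padded)):
            context = padded[i - self.n + 1:i]
            count = self.ngrams[context + padded[i]]
            total += math.log((count + 1) / (self.contexts[context] + v))
        return total


def classify(text, spam_lm, ham_lm):
    """Label a message by whichever class model assigns it higher likelihood."""
    return "spam" if spam_lm.log_prob(text) > ham_lm.log_prob(text) else "ham"
```

Working on characters avoids word segmentation and tolerates obfuscated spellings. The sketch also ignores class priors, which a real filter would likely include.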
  • Review
    CEN Fang-ming, WANG Ming-wen, WANG Peng-ming, DAI Yu-juan
    2009, 23(2): 48-53.
    Spam is one of the most serious problems to be resolved on the Internet. Recently, several technologies have been proposed and applied to spam filtering, such as the Partial Least Squares (PLS) method. The PLS method can deal with the sparse data, high dimensionality and multicollinearity found in e-mail datasets. However, the latent content relationships among e-mail data are, more often than not, nonlinear. This paper introduces a kernel function into the PLS method to capture such nonlinearity. In experiments on the Enron-Spam dataset, the proposed KPLS model shows superior efficiency compared with the PLSR method.
  • Review
    ZHANG Lei, LI Ya-nan, WANG Bin, LI Peng, JIANG Zai-fan
    2009, 23(2): 54-61.
    A session in web search query logs denotes a sequential series of queries issued by a user searching for certain information during a period of time. Correct session segmentation is fundamental to various kinds of research, such as the analysis of search activities. Because current research on sessions is unsystematic, this paper redefines the concept of a session and carries out several comparative studies. We conclude that (1) the statistical language model is not suitable for session segmentation because of heavy data sparseness and (2) the decision tree method using multiple attributes can obtain very promising results. Evaluated at the session level, the decision tree based method achieves an F-measure of up to 78.6%.
  • Review
    ZHANG Yu, FAN Ji-li, ZHENG Wei, ZOU Bo-wei, LIU Ting
    2009, 23(2): 62-68.
    Personalized information retrieval can grasp users’ retrieval intentions and find personalized results. This paper designs a manual annotation system to generate a corpus for evaluating personalized IR systems, and proposes a user-centered manual annotation strategy for personalized IR evaluation. The evaluation system adopts the evaluation scheme provided by NIST, performs an automatic evaluation against the manually annotated results, and generates quantified and straightforward measurement results.
  • Review
    JI Duo, ZHENG Wei, CAI Dong-feng
    2009, 23(2): 69-76.
    Latent Semantic Indexing (LSI) has been applied in many fields, such as information retrieval, text classification and automatic question answering. Essentially, LSI is a dimensionality reduction method that projects co-occurring terms into the same space. In the semantic space of LSI, term co-occurrences are therefore obtained through the term transfer relation, both within a single document and between different documents. This paper suggests that this term transfer relation introduces some nonexistent term co-occurrences, which reduce the performance of LSI. To eliminate them, the paper adopts document frequency to select features in document sets, and experiments with the Complete-Link clustering algorithm on two public corpora. The experimental results show that the F-measure of clustering increases by 6.5770%, 1.9928% and 3.3614% when the features retained by document frequency are between 10% and 15%.
  • Review
    HUANG Guo-bin, WANG Ming-wen, YE Hao
    2009, 23(2): 77-82.
    There are currently four main approaches to cross-language information retrieval (CLIR): the query translation approach, the document translation approach, the interlingua representation approach and the translation-free approach. After discussing the advantages and disadvantages of these four approaches, this paper proposes a novel translation-free approach based on interlingua semantics. We test our approach on the TREC cross-language corpus and compare it with the monolingual information retrieval model. The results show that our approach offers good performance and robustness.
  • Review
    LUO Xiong-wu, WAN Xiao-jun, YANG Jian-wu, WU Yu-qian
    2009, 23(2): 83-88.
    Organizing web search results into clusters helps users browse through them. Many clustering methods have been used for this purpose, but most do not work well because the generated cluster labels are not readable and informative enough for users to identify the right cluster quickly. In this paper, we focus on generating more readable cluster labels and propose a novel method to address this problem. Based on the ranked list of snippets returned by a web search engine for a given query, we first construct a suffix tree for these snippets. Then we calculate scores for all the phrases in the tree by leveraging their statistical and syntactic information. Finally, we rank the phrases in descending order of their scores and select the top k phrases as the final cluster labels. Given the labels, we form clusters by assigning each snippet to the relevant label. Experimental results show that our method works well for clustering web search results.
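The label-first pipeline described above (collect candidate phrases, score them, keep the top k, then assign snippets) can be sketched roughly as below. This is not the paper's method: plain n-gram enumeration stands in for the suffix tree, and the score (number of supporting snippets times phrase length) together with the stopword filter are placeholder assumptions for the unspecified statistical and syntactic features. All names are hypothetical.

```python
from collections import defaultdict

# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "an", "the", "of", "in", "for", "and", "to", "is", "on"}


def candidate_phrases(snippet, max_len=3):
    """Return word n-grams (up to max_len words) that do not start or
    end with a stopword, as a crude stand-in for syntactic filtering."""
    words = snippet.lower().split()
    phrases = set()
    for i in range(len(words)):
        for j in range(i + 1, min(i + max_len, len(words)) + 1):
            phrase = words[i:j]
            if phrase[0] in STOPWORDS or phrase[-1] in STOPWORDS:
                continue
            phrases.add(" ".join(phrase))
    return phrases


def label_clusters(snippets, query, k=2, max_len=3):
    """Pick the top-k phrases as labels and assign snippets to them.

    Returns a dict mapping each label to the sorted indices of the
    snippets that contain it.
    """
    support = defaultdict(set)  # phrase -> indices of snippets containing it
    for idx, snippet in enumerate(snippets):
        for phrase in candidate_phrases(snippet, max_len):
            if phrase == query.lower():
                continue  # the query occurs everywhere, so it is uninformative
            support[phrase].add(idx)

    def score(phrase):
        # Assumed score: snippet support weighted by phrase length in words.
        return len(support[phrase]) * len(phrase.split())

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(support, key=lambda p: (-score(p), p))
    return {label: sorted(support[label]) for label in ranked[:k]}


if __name__ == "__main__":
    results = [
        "jaguar car dealer in london",
        "new jaguar car review",
        "jaguar cat habitat in the amazon",
        "the jaguar cat is a predator",
    ]
    print(label_clusters(results, query="jaguar", k=2))
```

Because labels are chosen before clusters are formed, every cluster comes with a phrase that actually occurs in its snippets, which is the readability argument the abstract makes.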
  • Review
    CHEN Hai-qiang, CHENG Xue-qi, LIU Yue
    2009, 23(2): 89-94.
    The detection of core members in virtual communities is of great value for many applications, e.g. community mining. To address this issue, this paper first analyzes the distribution of interest similarity among community members, finding that the interest profiles of core members are more similar to each other than those of occasional members. An algorithm is therefore proposed to detect the core members of virtual communities by interest clustering. The algorithm is evaluated on a real-world data set from Douban.com and produces satisfactory results.
  • Review
    YU Hui-jia, LIU Yi-qun, ZHANG Min, MA Shao-ping, RU Li-yun
    2009, 23(2): 95-101.
  • Review
    SONG Rui, LIN Hong-fei, CHANG Fu-yang
    2009, 23(2): 102-107.
    Automatic identification of comparative sentences and extraction of comparative relations contribute to opinion mining and information recommendation. This paper constructs a Chinese Comparative Pattern Database to identify comparative sentences. Moreover, several types of features are chosen to build a Conditional Random Field (CRF) model for comparative relation extraction. Experiments show that the Chinese Comparative Pattern Database contributes to the identification of comparative sentences, and that the proposed feature sets improve the results of comparative relation extraction with the CRF model.
  • Review
    HAN Xian-pei, ZHAO Jun
    2009, 23(2): 108-114.
    Semantic metadata, which provides semantic information about data, plays an important role in document management, fusion and information search. Automatic metadata generation, which involves two fundamental problems, the acquisition of target semantic metadata and the collection of a training corpus, is increasingly in demand in the era of data explosion. The first problem involves expert knowledge and the second needs a great deal of manual work; accordingly, both are critical to a successful system. In this paper, we resolve the two problems based on Wikipedia: extracting the target metadata by analyzing the tables of contents of Wikipedia entries, and building the training corpus by analyzing the structure of Wikipedia entries and assigning their true semantic metadata. The experimental results demonstrate that this approach can resolve the two issues in automatic metadata generation effectively.
  • Review
    ZHONG Mao-sheng, LIU Hui, LIU Lei
    2009, 23(2): 115-122.
    The quantitative study of semantic relations between words is an essential subtask for some natural language processing tasks. Generally, semantic relations between words are of three types: the synonymy relation, the hyponymy relation and the relevance relation. Existing quantitative research on semantic relations between words mostly focuses on how to quantify the synonymy relation (or similarity relation). In this paper, we present a novel approach to quantify the semantic relevance relation between words by constructing a bipartite graph of the lexical relevance relation. Moreover, our approach can measure the semantic relevance between words that do not co-occur in the corpus. The experimental results show that our approach is more feasible than mutual information. For a specific word, our approach can generate a relatively reasonable trend in its semantic relevance to other words.
  • Review
    XIE Yong-kang, ZHOU Ya-qian, HUANG Xuan-jing
    2009, 23(2): 123-129.
    This paper presents a novel method for coreference resolution based on spectral clustering. A maximum entropy model is first used to estimate the coreference probability of mention pairs from extracted features. The probabilities of mention pairs are then used to construct the similarity matrix for spectral clustering. Entities are generated according to the clustering cuts. This method can partition entities from a global view, which effectively improves precision. Experiments on the ACE 2007 dataset show that the ACE Value of this method is 2.5% higher than that of the baseline on the Diagnostic task, and Unweighted Precision is 5.4% higher.