Journal of Chinese Information Processing

Review

Survey on Named Entity Recognition, Disambiguation and Cross-Lingual Coreference Resolution

ZHAO Jun

2009, 23(2): 3-17.

Named entities are important meaningful units in texts. The recognition and analysis of named entities is of great significance in Web information extraction, Web content management, knowledge engineering and related fields. Research on named entities includes named entity recognition, disambiguation, coreference resolution, attribute extraction and relation detection. Focusing on named entity recognition, disambiguation and cross-lingual coreference resolution, the paper gives a thorough survey of the state of the art of these tasks, including the challenges, methods, evaluations, performance and open problems. The paper suggests that the performance of current systems for named entity recognition, disambiguation and cross-lingual coreference resolution is far from meeting the requirements of large-scale practical applications. In terms of methods and approaches, named entity recognition, disambiguation and cross-lingual coreference resolution should be carried beyond natural language texts and investigated directly on large-scale, redundant, heterogeneous, ill-formed and noisy web pages.

Review

Dependency Parsing Based on Maximum Entropy Model

XIN Xiao, FAN Shi-xi, WANG Xuan, WANG Xiao-long

2009, 23(2): 18-22.

This paper presents three algorithms for dependency parsing based on Maximum Entropy Models. The Maximum Spanning Tree (MST) algorithm achieves the best result. The goal of MST is to find a maximum spanning tree in a directed graph. Each edge of the directed graph corresponds to a dependency relation, and the edge weights are obtained from a Maximum Entropy Model. The training and test data sets are the CoNLL 2008 shared task corpora. The system achieves F1 scores of 87.42 and 80.8 on the WSJ and Brown test data respectively, ranking sixth among all the competing teams.

Review

Constructing an Answer Ranking Model Using Semantic Analysis and Statistical Method for Question Answering

LI Bo, GAO Wen-jun, QIU Xi-peng

2009, 23(2): 23-27.

This paper describes a new method for constructing the answer ranking model of a Question Answering system. The method leverages the knowledge density-based features used in answer ranking, introduces a new feature, the syntactic path, obtained through parsing, and establishes an evaluation function using a support vector machine regression model. The experiments show that the new model, which includes the syntactic path feature, achieves substantial improvements.

Review

Research on Question Understanding for Cooperative Question Answering

ZHANG Yu, ZHAO Xin, LIU Ting

2009, 23(2): 28-33.

Question understanding is an important part of a Question Answering system, especially for a Cooperative Question Answering system, in which the questions provided by users are described in detail. This paper proposes an algorithm that combines dictionaries and parsing to exploit these question narrations by extracting valuable keywords. Experiments show that our approach substantially improves the MPP and MAP of the question answering system.

Review

Design of Tourism Question Answering System Based on the Chinese FrameNet

LI Ru, WANG Wen-jing, LIANG Ji-ye, SONG Xiao-xiang, LIU Hai-jing, YOU Li-ping

2009, 23(2): 34-40.

Taking advantage of the semantic expression in Chinese FrameNet (CFN), this paper discusses the construction of a domain-specific Chinese FrameNet semantic base using OWL, and validates and analyzes its effectiveness through the design of a Question Answering system in the transportation domain. In the proposed QA system, the query questions are first classified by a combination of the TREC categories and the ontology categories. We then propose a question analysis strategy based on the CFN that targets the triple of the question: semantic predicate, semantic subject and semantic object. On the basis of the CFN semantic analysis, the answer is extracted from the tourism ontology base. The approach is implemented with the ontology editor Protégé, and the experiment demonstrates the validity of the method.

Review

Character-Based Language Modeling Approach for Spam Filtering

SU Sui, LIN Hong-fei, YE Zheng

2009, 23(2): 41-47.

Content-based spam filtering is one of the mainstream technologies in use so far. After a brief review of the state of the art in content-based spam filtering, this paper proposes a character-based language modeling approach to the spam filtering task that builds on these technologies. We experimentally compare the performance of this approach with Naive Bayes, SVM and a word-based language modeling approach. Our experimental results show that the character-based language modeling approach achieves high performance and can easily be applied in large-scale online e-mail systems.
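
As a rough illustration of the general idea of character-based language modeling for spam filtering, the sketch below trains one character bigram model per class and labels a message by whichever model assigns it the higher log-likelihood. The abstract does not specify the n-gram order, the smoothing method or the decision rule, so the bigram order, the add-one smoothing and all names here (`char_ngrams`, `CharLanguageModel`, `classify`) are assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the character n-grams of text, padded with spaces at both ends."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharLanguageModel:
    """Character n-gram model with add-one smoothing (an illustrative choice)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of the (n-1)-character contexts
        self.alphabet = set()            # characters seen in training

    def train(self, messages):
        for message in messages:
            for gram in char_ngrams(message, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
                self.alphabet.add(gram[-1])

    def log_prob(self, message, vocab_size):
        """Sum of smoothed log P(char | previous n-1 chars) over the message."""
        total = 0.0
        for gram in char_ngrams(message, self.n):
            numerator = self.ngram_counts[gram] + 1
            denominator = self.context_counts[gram[:-1]] + vocab_size
            total += math.log(numerator / denominator)
        return total


def classify(message, spam_model, ham_model):
    """Label a message by the class model with the higher log-likelihood."""
    # Shared vocabulary size; the extra 1 reserves mass for unseen characters.
    vocab_size = len(spam_model.alphabet | ham_model.alphabet) + 1
    spam_score = spam_model.log_prob(message, vocab_size)
    ham_score = ham_model.log_prob(message, vocab_size)
    return "spam" if spam_score > ham_score else "ham"


spam_model = CharLanguageModel()
spam_model.train(["win cash now", "free cash prize win"])
ham_model = CharLanguageModel()
ham_model.train(["meeting at noon", "see you at the meeting"])
print(classify("win free cash", spam_model, ham_model))
```

Working on characters rather than words avoids word segmentation and tokenization, which may be one reason such a model is easy to deploy online; a real system would also weigh class priors and train on far more data than this toy corpus.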

Review

Spam Filtering Based on Kernel Partial Least Squares Classification

CEN Fang-ming, WANG Ming-wen, WANG Peng-ming, DAI Yu-juan

2009, 23(2): 48-53.

Spam is one of the most serious problems to be resolved on the Internet. Recently, several technologies have been proposed and applied to spam filtering, such as the Partial Least Squares (PLS) method. The PLS method can deal with the sparse data, high dimensionality and multicollinearity found in e-mail datasets. However, the latent content relationships among e-mail data are, more often than not, nonlinear. This paper introduces a kernel function into the PLS method to capture such nonlinearity. Compared with the PLSR method, the proposed KPLS model shows superior efficiency in experiments on the Enron-Spam dataset.

Review

Session Segmentation Based on Query Logs of Web Search

ZHANG Lei, LI Ya-nan, WANG Bin, LI Peng, JIANG Zai-fan

2009, 23(2): 54-61.

A session in the query logs of web search denotes a sequential series of queries issued by a user while searching for certain information during a period of time. Correct session segmentation is fundamental to various research tasks, such as the analysis of search activities. Because current research on sessions is unsystematic, this paper redefines the concept of a session and carries out several comparative studies. We conclude that (1) the statistical language model is not suitable for session segmentation because of severe data sparseness and (2) the decision tree method using multiple attributes can obtain very promising results. Evaluated at the session level, the decision tree based method achieves an F-measure of up to 78.6%.

Review

Research on Evaluation of Personalized Information Retrieval Based on Manual Annotation

ZHANG Yu, FAN Ji-li, ZHENG Wei, ZOU Bo-wei, LIU Ting

2009, 23(2): 62-68.

Personalized information retrieval can grasp users' retrieval intentions and find personalized results. This paper designs a manual annotation system to generate the corpus for evaluating personalized IR systems, and then proposes a user-centered manual annotation strategy for personalized IR evaluation. The evaluation system adopts the evaluation scheme provided by NIST, performs an automatic evaluation according to the manually annotated results, and generates quantified and straightforward measurement results.

Review

Research on Feature Optimization in Latent Semantic Indexing

JI Duo, ZHENG Wei, CAI Dong-feng

2009, 23(2): 69-76.

Latent Semantic Indexing (LSI) has been applied to many fields, such as information retrieval, text classification and automatic question answering. Basically, LSI is a dimensionality reduction method that projects term co-occurrences into the same space. In the semantic space of LSI, term co-occurrences are therefore obtained through the term transfer relation both within a single document and between different documents. This paper suggests that this term transfer relation produces some nonexistent term co-occurrences, which reduce the performance of LSI. To eliminate these nonexistent term co-occurrences, this paper adopts document frequency to select features in document sets, and experiments with the Complete-Link clustering algorithm on two public corpora. The experimental results show that the F-measure of clustering increases by 6.5770%, 1.9928% and 3.3614% when 10% to 15% of the features are retained by document frequency.

Review

A Novel Cross Language Information Retrieval Model Based on Interlingua Semantics

HUANG Guo-bin, WANG Ming-wen, YE Hao

2009, 23(2): 77-82.

There are four main approaches to cross-language information retrieval (CLIR) at present: the query translation approach, the document translation approach, the interlingua representation approach and the translation-free approach. After discussing the advantages and disadvantages of these four approaches, this paper proposes a novel translation-free approach based on interlingua semantics. We test our approach on the TREC cross-language corpus and compare it with the monolingual information retrieval model. The results show that our approach delivers good performance and robustness.

Review

Suffix Tree Based Label Generation Method for Web Search Results Clustering

LUO Xiong-wu, WAN Xiao-jun, YANG Jian-wu, WU Yu-qian

2009, 23(2): 83-88.

Organizing web search results into clusters helps users browse through search results. Many clustering methods have been widely used for this purpose, but most of them do not work well because the generated cluster labels are not readable and informative enough for users to identify the right cluster quickly. In this paper, we focus on how to generate more readable cluster labels and propose a novel method to address this problem. Based on the ranked list of snippets returned by a web search engine for a given query, we first construct a suffix tree for these snippets. Then we calculate scores for all the phrases in the tree by leveraging their statistical and syntactic information. Finally, we rank the phrases in descending order of their scores and select the top k phrases as the final cluster labels. With the labels in hand, we form clusters by assigning each snippet to the relevant label. Experimental results show that our method works well for clustering web search results.
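
The label-first pipeline described above (collect candidate phrases, score them, keep the top k as labels, then assign snippets to labels) can be sketched roughly as follows. This is a simplified stand-in rather than the authors' method: it enumerates word n-grams directly instead of building a suffix tree, and the score (number of snippets containing the phrase times phrase length in words) is a hypothetical placeholder for the statistical and syntactic scoring, which the abstract does not detail. All function names are invented for illustration.

```python
from collections import defaultdict


def candidate_phrases(snippet, max_len=3):
    """Return the set of word n-grams (1..max_len words) in a snippet."""
    words = snippet.lower().split()
    phrases = set()
    for size in range(1, max_len + 1):
        for start in range(len(words) - size + 1):
            phrases.add(" ".join(words[start:start + size]))
    return phrases


def rank_labels(snippets, k=2, max_len=3):
    """Score phrases shared by at least two snippets and return the top k."""
    doc_freq = defaultdict(int)
    for snippet in snippets:
        for phrase in candidate_phrases(snippet, max_len):
            doc_freq[phrase] += 1
    # Hypothetical score: snippets containing the phrase * words in the phrase.
    scored = [
        (freq * len(phrase.split()), phrase)
        for phrase, freq in doc_freq.items()
        if freq > 1
    ]
    # Descending score; ties broken alphabetically for a deterministic order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, phrase in scored[:k]]


def assign_snippets(snippets, labels):
    """Map each label to the indices of the snippets that contain it."""
    clusters = {label: [] for label in labels}
    for index, snippet in enumerate(snippets):
        padded = " " + " ".join(snippet.lower().split()) + " "
        for label in labels:
            if " " + label + " " in padded:
                clusters[label].append(index)
    return clusters


snippets = [
    "jaguar car dealer",
    "jaguar car review",
    "jaguar big cat",
    "big cat habitat",
]
labels = rank_labels(snippets, k=2)
clusters = assign_snippets(snippets, labels)
print(labels)
print(clusters)
```

Choosing labels before forming clusters means every cluster has a readable name by construction, which is the property the abstract emphasizes; a suffix tree would find the shared phrases more efficiently than this brute-force enumeration.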

Review

User Interest Based Detection of Core Members in Virtual Communities

CHEN Hai-qiang, CHENG Xue-qi, LIU Yue

2009, 23(2): 89-94.

The detection of core members in virtual communities is of great value for many applications, e.g. community mining. To address this issue, this paper first analyzes the distribution of interest similarity among community members and finds that the interest profiles of core members are more similar to each other than those of occasional members. An algorithm is therefore proposed to detect the core members of virtual communities by interest clustering. The algorithm is evaluated on a real-world data set from Douban.com and produces satisfactory results.

Review

Web Spam Taxonomy via Spam Intention Analysis

YU Hui-jia, LIU Yi-qun, ZHANG Min, MA Shao-ping, RU Li-yun

2009, 23(2): 95-101.

Review

Chinese Comparative Sentences Identification and Comparative Relations Extraction

SONG Rui, LIN Hong-fei, CHANG Fu-yang

2009, 23(2): 102-107.

Automatic comparative sentence identification and comparative relation extraction contribute to opinion mining and information recommendation. This paper constructs a Chinese Comparative Pattern Database to identify comparative sentences. Moreover, several types of features are chosen to build a Conditional Random Field (CRF) model for comparative relation extraction. Experiments show that the Chinese Comparative Pattern Database contributes to the identification of comparative sentences, and that the proposed feature sets improve the results of comparative relation extraction with the CRF model.

Review

Semantic Metadata Generation: A Method Based on Wikipedia

HAN Xian-pei, ZHAO Jun

2009, 23(2): 108-114.

Semantic metadata, which provides semantic information about data, plays an important role in document management, fusion and information search. Automatic metadata generation, which involves the acquisition of target semantic metadata and the collection of a training corpus as its two fundamental problems, is in growing demand in an era of data explosion. The first problem requires expert knowledge and the second requires a great deal of manual work, so both are critical to a successful system. In this paper, we address the two problems using Wikipedia: we extract the target metadata by analyzing the tables of contents of Wikipedia entries, and we build the training corpus by analyzing the structure of each Wikipedia entry and assigning its true semantic metadata. The experimental results demonstrate that this approach can effectively resolve the two issues in automatic metadata generation.

Review

Method of Semantic Relevance Relation Measurement between Words

ZHONG Mao-sheng, LIU Hui, LIU Lei

2009, 23(2): 115-122.

Quantifying the semantic relations between words is an essential subtask for some natural language processing tasks. Generally, semantic relations between words include three types: synonymy, hyponymy and relevance. Existing quantitative research on semantic relations between words mostly focuses on how to quantify the synonymy relation (or similarity relation). In this paper, we present a novel approach to quantifying the semantic relevance relation between words by constructing a bipartite graph of lexical relevance relations. Moreover, our approach can measure the semantic relevance between words that do not co-occur in the corpus. The experimental results show that our approach is more feasible than mutual information. For a specific word, our approach can generate a relatively reasonable trend in its semantic relevance to other words.

Review

A Spectral Clustering Based Coreference Resolution Method

XIE Yong-kang, ZHOU Ya-qian, HUANG Xuan-jing

2009, 23(2): 123-129.

This paper presents a novel coreference resolution method based on spectral clustering. A maximum entropy model is first used to obtain the coreference probability of mention pairs from extracted features. The probabilities of mention pairs are then used to construct the similarity matrix for spectral clustering. Entities are generated according to the clustering cuts. Because the method partitions entities from a global view, it effectively improves precision. Experiments on the ACE 2007 dataset show that the ACE Value of this method is 2.5% higher than that of the baseline on the Diagnostic task, and that Unweighted Precision is 5.4% higher.

2009 Volume 23 Issue 2 Published: 15 April 2009