2011 Volume 25 Issue 5 Published: 17 October 2011
  

  • Review
    WEI Chao, CHEN Fei, XU Danqing, ZHANG Min, LIU Yiqun, MA Shaoping
    2011, 25(5): 3-9.
    The rapid growth of Web data poses a great challenge to both storage and service quality for search engines. The existence of low-quality Web pages, or rather spam pages, increases the cost of crawling, indexing, and storage. This paper presents a measure of Web page quality with four dimensions: authority, content, timeliness, and appearance. Human assessors were recruited to rate sampled pages using this evaluation framework, and the high inter-rater reliability of the ratings shows that the framework is consistent and functional. Finally, Ordinal Logistic Regression analyses were conducted to model the relationship between the four core dimensions and the overall quality of Web pages.
    Key words: information retrieval; web page quality evaluation; Ordinal Logistic Regression
  • Review
    LUO Wenjuan1,2, MA Huifang3, HE Qing1, SHI Zhongzhi1
    2011, 25(5): 9-17.
    It remains a challenge to generate high-quality summaries that concisely describe the original document without loss of information. In this paper, we argue that high-quality summaries should be compact while covering as much of the information in the original document as possible. Guided by this idea, we extract entropy and relevance features to capture the coverage and compactness of summaries, and adopt supervised, regression-based summarization methods to combine the two features. Experiments on single- and multi-document summarization show that effectively leveraging entropy and relevance improves the quality of document summarization.
    Key words: document summarization; sentence feature extraction; entropy; relevance
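The two features in the abstract above can be sketched as follows. This is a minimal Python illustration; the function names and the cosine-style definition of relevance are assumptions for the sake of the example, not the paper's actual formulation.

```python
import math
from collections import Counter

def sentence_entropy(tokens):
    """Shannon entropy of the word distribution inside one sentence:
    higher entropy suggests the sentence carries more varied information."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def sentence_relevance(tokens, doc_tokens):
    """Cosine similarity between a sentence's term-count vector and the
    whole document's, standing in for the paper's relevance feature."""
    s, d = Counter(tokens), Counter(doc_tokens)
    dot = sum(s[w] * d[w] for w in s)
    norm = math.sqrt(sum(v * v for v in s.values())) * \
           math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0
```

A supervised summarizer would then feed both per-sentence values into a regression model trained against reference summaries.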
  • Review
    CAO Xinyu1,2, CAO Cungen1,2
    2011, 25(5): 17-24.
    The acquisition of part-whole relations is an important problem in knowledge acquisition, and the Web has become an important resource for it, with search engines serving as an effective means of mining knowledge from the Web. In this paper, retrieval results containing part-whole relations are called a corpus rich in part-whole relations. Because current search engines do not perform semantics-based retrieval, constructing an effective query to retrieve documents containing part-whole relations from the Web is a challenging issue. This paper presents a novel method of constructing queries for acquiring such a corpus: we combine a search engine with query strings built from context words related to part-whole relations. Compared with manually constructed queries and corpus-based query construction, in terms of both the number of retrieved documents containing part-whole relations and the expected difficulty of extracting the relations from them, the results show that our method is superior.
    Key words: part-whole relation acquisition; corpus acquisition; query formulation
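The query-construction idea can be illustrated with a small sketch. The template strings below are invented examples of context-word patterns for part-whole relations, not the patterns actually used in the paper.

```python
# Illustrative context-word templates for part-whole relations; the actual
# Chinese cue words used in the paper are not reproduced here.
TEMPLATES = [
    '"{whole}" "由" "组成"',               # "{whole} is composed of ..."
    '"{whole}" "包括"',                    # "{whole} includes ..."
    '"{part}" "是" "{whole}" "的一部分"',  # "{part} is a part of {whole}"
]

def build_queries(whole, part=""):
    """Fill the templates to obtain search-engine query strings whose hits
    are likely to form a corpus rich in part-whole relations."""
    queries = []
    for t in TEMPLATES:
        if "{part}" in t and not part:
            continue  # skip part-specific templates when no part is given
        queries.append(t.format(whole=whole, part=part))
    return queries
```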
  • Review
    ZHAO Honggai, XIAO Shibin, WANG Hongjun, LV Xueqiang
    2011, 25(5): 24-30.
    The "N+V" sequence can form phrases of three different structures: the "N+V" nominal modifying structure, the "N+V" verbal modifying structure, and the "N+V" subject-predicate structure. Based on the Sogou log corpus, this paper studies subject-predicate phrases of the "N+V" structure from three aspects, namely the characteristics of each element, syllable characteristics, and syntactic function, with particular attention to the semantics of "V". The paper also carries out an in-depth analysis and verification of the experimental data and proposes a solution to the problem of phrase structure ambiguity, providing an important theoretical basis for improving the retrieval quality of Chinese search engines and for constructing the phrase dictionaries they use.
    Key words: search engine; "N+V" structure; subject-predicate phrase; syntactic function
  • Review
    FANG Qi, LIU Yiqun, ZHANG Min, RU Liyun, MA Shaoping
    2011, 25(5): 30-37.
    Web users can bookmark Web pages and then access them rapidly through browser bookmarks. Research on user behavior based on browser bookmarks is instructive for user personalization, Web page quality evaluation, and large-scale Web page collection construction. We present an analysis of a browser bookmarks dataset covering approximately 270 thousand users in terms of organization structure, collected content, and user interest. To begin with, we propose BBCM (bookmarks browse click model) to analyze the structural features and utilization efficiency of browser bookmarks. Then, by comparison with PageRank, we find that users are inclined to bookmark high-quality pages. Finally, user interest characteristics are shown with the help of ODP (Open Directory Project).
    Key words: browser bookmarks; user behavior analysis; bookmarks browse click model
  • Review
    MA Hongyuan1, 2, WANG Bin1
    2011, 25(5): 37-44.
    Query result caching and prefetching is an effective way to enhance the performance of Web search engines. We present an analysis of query logs originating from a well-known Chinese Web search engine and describe the characteristics of its queries. A query result caching and prefetching approach based on these query characteristics is proposed, comprising predictive models of the number of result pages a query will need and a caching and prefetching algorithm framework. We then use a real large-scale query log covering a period of two months to evaluate the approach against traditional methods and theoretical upper bounds. Experimental results show that the approach achieves a 3.5% to 8.45% improvement over all requests compared with state-of-the-art methods.
    Key words: search engine; performance optimization; query results; caching; prefetching
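The general caching-plus-prefetching idea can be sketched as an LRU cache that, on a miss, also fetches the next result page. This is a generic illustration under assumed interfaces, not the paper's prediction-model-driven algorithm.

```python
from collections import OrderedDict

class ResultCache:
    """LRU cache over (query, page) result entries with simple next-page
    prefetching, anticipating that users often request the next page."""
    def __init__(self, capacity, backend):
        self.capacity = capacity
        self.backend = backend          # callable: (query, page) -> results
        self.store = OrderedDict()

    def _put(self, key, value):
        self.store[key] = value
        self.store.move_to_end(key)
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least recently used

    def get(self, query, page):
        key = (query, page)
        if key in self.store:
            self.store.move_to_end(key)      # cache hit
            return self.store[key]
        value = self.backend(query, page)    # cache miss: fetch this page
        self._put(key, value)
        nxt = (query, page + 1)              # prefetch the next result page
        if nxt not in self.store:
            self._put(nxt, self.backend(query, page + 1))
        return value
```

In the paper's setting, a predictive model would decide how many pages to prefetch per query rather than always fetching exactly one.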
  • Review
    TAN Hongye1,2, ZHENG Jiaheng1, LIANG Jiye1
    2011, 25(5): 44-53.
    Temporal relation recognition is one of the fundamental tasks in the semantic processing of natural language and has become a popular research issue. In this paper, we describe the relevant annotation standards, corpora, and the TempEval evaluations, analyze a range of methods for automatically annotating temporal relations, and discuss future research directions in temporal relation recognition.
    Key words: temporal relation; natural language processing; survey
  • Review
    LIU Haixia, HUANG Degen
    2011, 25(5): 53-60.
    We focus on building a system that labels Chinese functional chunks automatically by detecting chunk boundaries and labeling functional information in sentences whose word segmentation and POS tagging are given as correct. This paper proposes an approach that combines a feature-template optimization strategy with the Conditional Random Field model. On the test set, the precision, recall, and F1-measure for Chinese functional chunks reach 85.84%, 85.07%, and 85.45%, respectively. On this basis, the existing Chinese thesaurus Tongyici Cilin is introduced into the processing module, and its semantic information is added to the feature templates to mitigate the effects of data sparseness and structural ambiguity. With this addition, the three performance indexes increase to 86.21%, 85.31%, and 85.76%, respectively.
    Key words: Chinese functional chunk; Conditional Random Fields (CRFs); semantic information; ambiguous structure
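A typical CRF feature template for this kind of chunking can be sketched as follows. The window and feature names are generic assumptions; the paper's optimized templates and Cilin semantic-class features are not reproduced.

```python
def chunk_features(words, tags, i):
    """Contextual features for token i: the current and neighboring words
    and POS tags, plus a tag bigram -- the raw material a CRF chunker
    weights when predicting BIO-style functional chunk labels."""
    prev_w = words[i - 1] if i > 0 else "<BOS>"
    prev_t = tags[i - 1] if i > 0 else "<BOS>"
    next_w = words[i + 1] if i < len(words) - 1 else "<EOS>"
    next_t = tags[i + 1] if i < len(words) - 1 else "<EOS>"
    return {
        "w0": words[i], "t0": tags[i],
        "w-1": prev_w, "t-1": prev_t,
        "w+1": next_w, "t+1": next_t,
        "t-1/t0": prev_t + "/" + tags[i],
    }
```

Template optimization then amounts to searching over which of these (and wider) feature conjunctions to keep.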
  • Review
    WANG Houfeng
    2011, 25(5): 60-68.
    Abbreviation is a typical kind of language unit occurring widely in natural languages, and it contributes most of the OOVs that cause great difficulty for natural language processing at different levels. This paper first surveys the properties and construction forms of Chinese abbreviations; it then classifies Chinese abbreviation problems into four aspects and sums up the related work on each; finally, it investigates English abbreviation processing and compares abbreviation phenomena in Chinese and English.
    Key words: abbreviation identification; abbreviation expansion; abbreviation prediction; abbreviation mining
  • Review
    LI Feng, YI Mianzhu
    2011, 25(5): 68-75.
    Except for a few theoretical studies, morphological analysis for Russian, an indispensable module in Russian language processing, has not yet yielded any system in China that can be applied in real applications. After summarizing methods of automatic Russian morphological analysis both in China and abroad, this paper conducts an in-depth examination of representative Russian morphological analyzers from Russia, Europe, and America, and puts forward an automatic Russian morphological analysis method that integrates multiple strategies. Experiments demonstrate that this method generates satisfactory results even when applied to specialized domains.
    Key words: natural language processing; Russian language; automatic morphological analysis; algorithm
  • Review
    CHEN Chen1,2, WANG Houfeng1
    2011, 25(5): 75-83.
    Cross-document personal name disambiguation is the process of determining whether identical names occurring in different texts refer to the same person in the real world. With the increasing need for multi-document applications such as multi-document summarization and information fusion, cross-document named entity disambiguation has drawn much attention. This paper employs a social-network-based algorithm for cross-document personal name disambiguation. The method uses spectral clustering, compares the results of different graph partition criteria, and chooses a modularity threshold as the stopping measure for graph partitioning. The experimental datasets are built from the CLP 2010 Chinese personal name disambiguation task, and the results show that the method is promising.
    Key words: computer application technology; personal name disambiguation; social network; spectral clustering; cluster-stopping measure; modularity
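The modularity-based stopping measure can be illustrated with Newman's modularity Q for an undirected graph. This is the textbook definition only, not the paper's full spectral clustering pipeline.

```python
def modularity(edges, communities):
    """Newman modularity Q of an undirected graph (edge list) under a
    node -> community mapping: the fraction of edges inside communities
    minus the fraction expected under a degree-preserving random graph."""
    m = len(edges)
    if m == 0:
        return 0.0
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # observed fraction of intra-community edges
    inner = sum(1 for u, v in edges if communities[u] == communities[v]) / m
    # expected fraction given the community degree sums
    comm_deg = {}
    for node, d in degree.items():
        c = communities[node]
        comm_deg[c] = comm_deg.get(c, 0) + d
    expected = sum((d / (2 * m)) ** 2 for d in comm_deg.values())
    return inner - expected
```

A cluster-stopping rule of the kind described would halt partitioning once further splits stop increasing Q past a chosen threshold.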
  • Review
    ZHU Zhenfang1, LIU Peiyu2, LI Shaohui1, ZHAO Jing1, WANG Qianlong1
    2011, 25(5): 83-89.
    For the nonlinear problem of template generation in Chinese text filtering, the genetic algorithm, which can find globally optimal solutions, is introduced to solve the text filtering problem. A new approach based on set theory is applied to prove its theoretical feasibility, and an adaptive strategy for the genetic operators is proposed for real applications. The theoretical proof and experimental results, covering both text classification and text information filtering with this genetic algorithm, show that the method is feasible and obtains better information filtering results.
    Key words: text filtering; fuzzy theory; genetic algorithm; convergence
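The genetic-algorithm component can be sketched generically as follows. Population size, the selection scheme, and the fixed mutation rate are assumptions for illustration; in particular, the paper's adaptive operator strategy is not reproduced.

```python
import random

def evolve(fitness, length, pop_size=30, generations=60,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Minimal genetic algorithm over bit-string templates: elitism,
    truncation selection, one-point crossover, and bit-flip mutation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        next_pop = pop[:2]                            # elitism: keep best two
        while len(next_pop) < pop_size:
            a, b = rng.sample(pop[:pop_size // 2], 2)  # select from top half
            if rng.random() < crossover_rate:          # one-point crossover
                cut = rng.randrange(1, length)
                a = a[:cut] + b[cut:]
            # bit-flip mutation (bool XORs as 0/1 in Python)
            child = [bit ^ (rng.random() < mutation_rate) for bit in a]
            next_pop.append(child)
        pop = next_pop
    return max(pop, key=fitness)
```

For text filtering, the bit string would encode which template terms are active, and the fitness function would score filtering quality on training documents.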
  • Review
    LIAO Xiangwen, LI Yihong
    2011, 25(5): 89-94.
    Identification of Chinese opinion sentences is an important task in Chinese opinion mining. It aims to identify subjective sentences that express opinions on some topic in a document. Because the opinion strength of a Chinese sentence depends not only on statistics over a sentiment lexicon but also on factors such as syntactic and semantic features, opinion sentences cannot be identified simply by the TF-IDF scores of sentiment words. This paper proposes a new method for identifying Chinese opinion sentences based on an N-gram hyperkernel function. The method introduces syntactic and semantic features to construct the N-gram hyperkernel function and then applies an SVM based on it to identify opinion sentences. Experiments show that the method is effective and outperforms competitive methods based on polynomial, radial-basis, and n-gram kernels.
    Key words: identification of Chinese opinion sentences; N-gram hyperkernel function; opinion mining
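The n-gram kernel ingredient can be sketched as an inner product of character n-gram count vectors, a standard string kernel usable inside an SVM. The hyperkernel combination with syntactic and semantic features is not reproduced here.

```python
from collections import Counter

def ngram_kernel(s, t, n=2):
    """Inner product of the character n-gram count vectors of two strings:
    a similarity that grows with the number of shared n-grams."""
    grams = lambda x: Counter(x[i:i + n] for i in range(len(x) - n + 1))
    gs, gt = grams(s), grams(t)
    return sum(gs[g] * gt[g] for g in gs)
```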
  • Review
    JIANG Wenbin1, WU Jinxing1,2, CHANG Qing1,2, Nasanurtu2, LIU Qun1, ZHAO Lili1,3
    2011, 25(5): 94-101.
    We propose a generative statistical model for Mongolian lexical analysis. The model describes the result of lexical analysis as a directed graph, whose nodes represent stems, affixes, and their tags, and whose edges represent transition or generation relationships between nodes. In this work we adopt three kinds of transition or generation probabilities: a) probabilities of stem-stem transition, affix-affix transition, and stem-affix generation; b) transition or generation probabilities between the corresponding tags; and c) generation probabilities between stems or affixes and their tags. Using a third-level annotated corpus of about 200 000 words as training data, the model achieves a word-level segmentation accuracy of 95.1% and a word-level joint segmentation and tagging accuracy of 93%.
    Key words: Mongolian; lexical analysis; segmentation; POS tagging; stemming; directed graph
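Decoding the best path through such a directed graph is in the spirit of Viterbi search. The toy HMM-style sketch below illustrates the search with made-up states and probabilities, not the paper's stem/affix model.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Dynamic-programming search for the most probable state path,
    maximizing transition times emission probabilities along the sequence."""
    V = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] + math.log(trans_p[p][s]))
            V[t][s] = (V[t - 1][prev] + math.log(trans_p[prev][s])
                       + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]
```

In the paper's lattice the "states" would be candidate stem/affix segmentations with their tags, and the three probability kinds a)-c) would weight the edges.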
  • Review
    YAN Ke, DAI Lirong
    2011, 25(5): 101-109.
    Posterior probability is a promising feature for judging a test taker's pronunciation quality in computer-assisted language learning systems. However, there is an obvious discrepancy between posterior probabilities and evaluators' criteria. This paper introduces a "phone scoring model" that transforms posterior probabilities to deal with this problem. Both linear and nonlinear phone scoring models are investigated; we find that a closed-form solution can be obtained for the linear models, while gradient descent can be used for the nonlinear ones. Experimental results on a live PSC database of 498 speakers indicate that the approach significantly improves system performance: approximately 42% relative gain when posterior probabilities are calculated over the all-phone probability space, and approximately 23%~27% relative gain when they are calculated over optimized probability spaces.
    Key words: pronunciation quality evaluation; phone scoring model; posterior probability; PSC
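For the linear case, the closed-form solution is ordinary least squares. The one-dimensional fit below (score = a * posterior + b) is a minimal stand-in for a linear phone scoring model, with invented variable names.

```python
def fit_linear_score(posteriors, human_scores):
    """Closed-form least-squares fit of score = a * posterior + b,
    mapping posterior probabilities onto evaluators' scores."""
    n = len(posteriors)
    mx = sum(posteriors) / n
    my = sum(human_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(posteriors, human_scores))
    var = sum((x - mx) ** 2 for x in posteriors)
    a = cov / var            # slope from the normal equations
    b = my - a * mx          # intercept through the means
    return a, b
```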
  • Review
    MEN Guangfu1, PAN Chen2, LIU Changqing1
    2011, 25(5): 109-114.
    In recent years, research on Xixia characters has deepened, and a large number of Xixia documents have been published in their original forms at home and abroad; rapid digitization of these documents is therefore of great importance. We first preprocess the documents with smoothing and thinning algorithms; then elastic meshes are applied to each directional pattern, and the probability distribution of pixels within each mesh is computed as the character's features. Finally, lower-dimensional features are extracted by Linear Discriminant Analysis (LDA). An experiment on a total of 9600 samples of 240 categories of Xixia characters with 4-fold cross validation yields a recognition rate of 87.99%.
    Key words: Xixia characters; elastic mesh; direction features; LDA
  • Review
    CUI Rongyi, KIM Sejin
    2011, 25(5): 114-120.
    In this paper, the information contribution of cardinal graphemes to classifying the structures of Korean characters is investigated. First, the concept of structure distance between Korean characters and a method for computing it are proposed to describe the dissimilarity of character structures. Next, an approach to partitioning character structures into equivalence classes and their probability distribution are discussed. Finally, the information distribution over character structures is described by computing the information gain of cardinal graphemes for classifying character structures. Simulation experiments on actual Korean documents show that the c1-v2, c1-v1-c3, and c1-v2-c3 types of characters have prominently high probabilities of occurrence, and that the v1, v2, and c3 types of graphemes make the greatest difference in classifying character structures.
    Key words: Korean character; equivalent class of character structures; structure distance; information gain
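The information-gain computation can be sketched generically. The grapheme-presence test and labels below are placeholders, not the paper's Korean grapheme inventory.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(samples, has_grapheme, labels):
    """Information gain of a grapheme-presence test for predicting the
    structure class of a character: entropy before the split minus the
    weighted entropy of the two resulting groups."""
    n = len(samples)
    gain = entropy(labels)
    for value in (True, False):
        idx = [i for i, s in enumerate(samples) if has_grapheme(s) == value]
        if idx:
            gain -= len(idx) / n * entropy([labels[i] for i in idx])
    return gain
```

A grapheme whose presence perfectly separates two structure classes reaches the maximum gain, matching the paper's notion of graphemes that "make the greatest difference".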
  • Review
    CAI Li, PENG Xingyuan, ZHAO Jun
    2011, 25(5): 120-127.
    With the widespread application of computers and the fast development of computer technology, computer-aided testing and computer-adaptive testing have become a reality. Assisted essay scoring (AES) systems are expected to become the next generation of computer-aided testing tools, yet Chinese AES is still in its infancy: to our knowledge, no Chinese AES system is in wide use. We studied existing research on English AES and extracted the features it describes, but the results were not promising. In this paper, we instead use statistical natural language processing and information retrieval techniques to extract features, and then integrate features such as the distribution of sample test scores and a reviewer's score into a statistical model that we call triple segmented regression. The experiments show that with our AES system, only half the human labor is needed to reach a precision above 97%.
    Key words: assisted essay score; Chinese; topic feature; writing level feature