2009 Volume 23 Issue 3 Published: 15 June 2009
  

  • Review
    LANG Jun, XIN Zhou, QIN Bing, LIU Ting, LI Sheng
    2009, 23(3): 3-10.
    Coreference resolution is an important subtask of information extraction. Recently, statistical machine learning methods have been widely applied to this task with some success. In this paper, we integrate background semantic knowledge, which is currently being introduced into every field of NLP, into the classical pairwise classification framework for coreference resolution. We extract background knowledge from WordNet and Wikipedia, and also exploit semantic role labeling, general pattern knowledge and the context of mentions. In the experiments, a feature selection algorithm is employed to decide the best feature set, on which the maximum entropy model and the SVM model are compared. Experimental results on the ACE dataset show that coreference resolution improves after adding the selected background semantic knowledge.
    Key words computer application; Chinese information processing; coreference resolution; background knowledge; WordNet; Wikipedia
  • Review
    XIE Yongkang, ZHOU Yaqian, HUANG Xuanjing
    2009, 23(3): 10-17.
    This paper presents a novel method for coreference resolution based on spectral clustering. A maximum entropy model is first used to estimate the coreference probability of mention pairs from extracted features. These pairwise probabilities are then used to construct the similarity matrix for spectral clustering, and entities are generated according to the clustering cuts. This method partitions entities from a global view, which effectively improves precision. Experiments on the ACE 2007 dataset show that the ACE Value of this method is 2.5% higher than the baseline on the Diagnostic task, and 5.4% higher in Unweighted Precision.
    Key words computer application; Chinese information processing; coreference resolution; spectral clustering; maximum entropy model
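    The pipeline this abstract describes (pairwise coreference probabilities, a similarity matrix, clustering cuts) can be sketched as follows. This is a minimal illustration using an unnormalized graph Laplacian and a two-way Fiedler cut, not the authors' implementation; the probability matrix is invented.

```python
import numpy as np

def spectral_bipartition(pair_prob):
    """Split mentions into two entity clusters from a symmetric matrix of
    pairwise coreference probabilities (a sketch, not the paper's method)."""
    W = np.asarray(pair_prob, dtype=float)
    d = W.sum(axis=1)
    L = np.diag(d) - W                 # unnormalized graph Laplacian L = D - W
    vals, vecs = np.linalg.eigh(L)     # ascending eigenvalues
    fiedler = vecs[:, 1]               # eigenvector of the 2nd-smallest eigenvalue
    return fiedler >= 0                # sign of the Fiedler vector gives the cut

# Toy data: mentions 0-1 likely coreferent, 2-3 likely coreferent.
P = np.array([[0.0, 0.9, 0.1, 0.1],
              [0.9, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.8],
              [0.1, 0.1, 0.8, 0.0]])
labels = spectral_bipartition(P)
```

    A real system would use a normalized Laplacian and k-means over several eigenvectors to obtain more than two entities.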
  • Review
    WANG Fang, WAN Changxuan,
    2009, 23(3): 17-24.
    Word identification is an elementary preprocessing step for Chinese information retrieval. To capture the semantic integrity and the intention of the user query, this paper analyzes the characteristics of the Chinese integrated word and presents three methods for its automatic detection, combining mutual information, the prefix-and-suffix information of the integrated word, and the confidence of the integrated word. We further design and implement three prototype systems for Chinese integrated word identification based on the proposed overall-confidence, partial-confidence and weighted joint-confidence methods, respectively. Finally, experiments on the 2nd SIGHAN (2005) PKU test corpus show that the systems perform well and can satisfy different performance demands.
    Key words computer application; Chinese information processing; Chinese word segmentation; mutual information; reliability; automatic recognition
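    Of the statistics the abstract combines, mutual information is the most standard; a minimal sketch of pointwise mutual information for a candidate two-character string, with invented probability values:

```python
import math

def mutual_information(p_xy, p_x, p_y):
    """Pointwise mutual information of a candidate two-character string,
    from corpus probabilities (illustrative values, not real corpus counts)."""
    return math.log2(p_xy / (p_x * p_y))

# Characters co-occurring exactly as often as chance predicts give PMI 0;
# stronger-than-chance co-occurrence gives positive PMI.
```

    The prefix/suffix and confidence statistics described in the paper would be computed alongside this and combined per the proposed methods.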
  • Review
    HAN Yan, LIN Yuxi, YAO Jianmin
    2009, 23(3): 24-31.
    This paper proposes an approach to Chinese OOV (out-of-vocabulary word) identification based on extension according to statistics from Web resources. We extend bigram OOV seeds by their left and right (LR) neighbors on the basis of OOV border judgment, which helps to identify OOV words with integrated meaning and without length restriction. Experimental results show that the approach is effective and feasible.
    Key words computer application; Chinese information processing; OOV identification; lr_neighbor; MFLNR; MFRNR; candidate OOV extension
  • Review
    CHEN Zhumin, MA Jun, HAN Xiaohui, LEI Jingsheng
    2009, 23(3): 31-39.
    The performance of the focused crawler is crucial to a vertical search engine. Two computational issues to be addressed in the design of focused crawlers are: (1) how to compute the relevance of the currently visited Web page to a given topic, and (2) how to compute the priorities of unvisited URLs in the queue. For the first issue, this paper describes the calculation of page-to-topic relevance based on the page's topical text blocks and related link blocks. For the second, a novel approach is proposed to prioritize unvisited URLs by hierarchical topic context at four granularities, i.e. site level, page level, block level and link level. Finally, a new focused crawling algorithm is presented. Experiments show that the new algorithm is more effective than three traditional algorithms in terms of precision and information amount without increasing time complexity.
    Key words computer application; Chinese information processing; focused crawling; URLs priority computation; page segmentation; relevance computation
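    The second issue, a queue of unvisited URLs ordered by priority, can be sketched with a max-priority frontier. The relevance scores here are assumed inputs; the paper's actual scoring by hierarchical topic context is not reproduced.

```python
import heapq

class Frontier:
    """Priority queue of unvisited URLs keyed by topical relevance score
    (a sketch of a crawler frontier; the scoring function is assumed)."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        # Each URL is enqueued at most once.
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate: max-priority first

    def pop(self):
        score, url = heapq.heappop(self._heap)
        return url, -score

f = Frontier()
f.push("http://example.com/a", 0.2)
f.push("http://example.com/b", 0.9)
f.push("http://example.com/c", 0.5)
```
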
  • Review
    WANG Yulin, SUN Le, LI Wenbo
    2009, 23(3): 39-45.
    Web query classification is of great significance in improving search engine performance. By analyzing and manually labeling real user query logs, we found four kinds of words, called “VASE” characteristic words, that substantially characterize the query category. We extracted such words and built an inverted index over them for Web query classification. We further propose corresponding Web extension and weighted characteristic word methods to improve the classification results. Experimental results show that precision and recall reach 78.2% and 77.3% respectively, meeting practical requirements.
    Key words computer application; Chinese information processing; Web query classification; “VASE” characteristic words; Web extension; weighted words
  • Review
    DONG Kansheng, FANG Jinyun
    2009, 23(3): 45-51.
    Mobile POI (point of interest) search has become one of the main applications of mobile search. Considering the character input of mobile search and the structural features of POI data, Jianpin (abbreviated pinyin) is used in mobile POI search to improve the user experience. Since word order similarity is the main factor in ranking results, this paper devises an algorithm based on vector distance to compute word order similarity. The algorithm first establishes the Jianpin vector space model, extracts the common part of the two Jianpin vectors, and maps it into position vectors. It then computes the similarity from the distance between the position vectors. Theoretical analysis shows that, compared with the method based on inversion numbers, the proposed algorithm decreases the time complexity from O(nlogn) to O(n) and the space complexity from O(n) to O(1). Experimental results confirm that the proposed algorithm maintains precision while improving efficiency by 16.88%.
    Key words computer application; Chinese information processing; mobile POI search; jianpin search; word order similarity; vector distance
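    A plausible O(n) sketch of the position-vector idea, under stated assumptions: a single pass builds a position index, the L1 distance between position vectors is normalized to [0, 1], and similarity is one minus that distance. The paper's exact model and formula may differ.

```python
def order_similarity(a, b):
    """Word-order similarity of two Jianpin character sequences via
    position vectors (a sketch; not the paper's exact formula)."""
    pos_b = {c: i for i, c in enumerate(b)}  # char -> position in b (last wins), O(n)
    # Common part: positions in a and b of characters shared by both sequences.
    common = [(i, pos_b[c]) for i, c in enumerate(a) if c in pos_b]
    if not common:
        return 0.0
    max_shift = max(len(a), len(b)) - 1 or 1
    # Normalized L1 distance between the two position vectors.
    dist = sum(abs(i - j) for i, j in common) / (len(common) * max_shift)
    return 1.0 - dist
```

    Identical sequences score 1.0; fully reversed common parts score 0.0.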
  • Review
    LI Wenfa, DUAN Miyi, LIU Yue, SUN Chunlai
    2009, 23(3): 51-58.
    Flow classification plays an important role in network security monitoring, QoS (Quality of Service), intrusion detection, etc. The flow classifier is challenged by huge amounts of data with irrelevant and redundant features, which cause unnecessarily heavy training and processing costs as well as poor classification accuracy. For high-dimensional data, feature selection can not only reveal the truly informative subset but also improve accuracy and efficiency. In this paper, we propose a wrapper feature selection algorithm, VFSA-C4.5, aimed at building a lightweight flow classifier by (1) using a VFSA (very fast simulated annealing) strategy to evaluate candidate subsets at random, and (2) using the C4.5 algorithm in the wrapper approach to determine the optimal feature subset. Experiments on several flow datasets indicate that a classifier built with our approach greatly improves computational performance without negative impact on classification accuracy.
    Key words computer application; Chinese information processing; flow classification; feature selection; very fast simulated annealing; decision tree
  • Review
    LIU Peng, ZONG Chengqing
    2009, 23(3): 58-65.
    The phrase-based statistical machine translation model is widely studied and applied in machine translation research. However, the model uses a strategy of exact matching in decoding, which suffers severely from the data sparseness problem, leaving most phrases in the phrase table under-exploited in the translation process. We therefore propose a novel interactive approach to translation based on human-machine cooperation. For an unknown phrase, the system finds similar phrases in the phrase table through fuzzy matching. A classifier is then applied to judge which phrases are capable of improving translation quality. Finally, the phrase that has the same meaning as the unknown phrase is decoded through human-machine interaction. Experimental results on a spoken language corpus show that this approach significantly improves translation quality.
    Key words artificial intelligence; machine translation; spoken language translation; phrase-based statistical machine translation; human machine interaction; fuzzy matching
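    Fuzzy matching over a phrase table is commonly built on edit distance; a minimal Levenshtein sketch follows. The paper's actual similarity function and the classifier it combines are not reproduced here.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, a common basis for
    fuzzy phrase matching (illustrative; not the paper's exact measure)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution / match
        prev = cur
    return prev[-1]
```

    A fuzzy matcher would rank phrase-table entries by this distance (or a length-normalized variant) against the unknown phrase.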
  • Review
    ZHAO Hongmei, LIU Qun, ZHANG Ruiqiang, LV Yajuan, Eiichiro SUMITA, Chooi-Ling GOH
    2009, 23(3): 65-88.
    This paper presents a new guideline for Chinese-English word alignment. Starting from the existing Guidelines for Chinese-English Word Alignment (Linguistic Data Consortium, 2006), we propose a completely different classification for word alignment annotation: genuine links (involving strong links and weak links) and pseudo links. This explicit distinction captures the characteristics of cross-lingual word alignment. The proposed guideline has been successfully applied in a large-scale Chinese-English word alignment task, achieving good intra- and inter-annotator agreement with Kappa coefficients of 0.99, 0.98, 0.93 and 0.96, 0.83, 0.68 for the strong link, weak link and pseudo link respectively. A further experiment shows that the annotated word alignments are useful for an SMT system.
    Key words artificial intelligence; machine translation; annotation guidelines for Chinese-English word alignment; manual word alignment; genuine link; pseudo link; strong link; weak link; alignment and annotation agreement
  • Review
    HUANG Shujian, XI Ning, ZHAO Yinggong, DAI Xinyu, CHEN Jiajun
    2009, 23(3): 88-95.
    AER (Alignment Error Rate) is a widely used alignment quality measure. Recent studies show that the AER score is not well correlated with the BLEU score of the final translation result. In this paper, we analyze the possible reasons for this weak correlation in a phrase-based SMT environment. We also propose a new alignment quality measure, ESAER (Error-Sensitive Alignment Error Rate), which is sensitive to different types of alignment errors. Experimental results show that ESAER achieves a much higher correlation with the BLEU score than AER.
    Key words artificial intelligence; machine translation; SMT; word alignment; evaluation metric; AER; error-sensitive
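    For reference, AER in its standard form (Och and Ney, 2003) compares a predicted alignment $A$ against sure links $S$ and possible links $P$, with $S \subseteq P$:

```latex
\mathrm{AER}(S, P; A) \;=\; 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}
```

    The weak correlation discussed in the paper stems from this measure treating all alignment errors uniformly, which ESAER is designed to address.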
  • Review
    WANG Jinjin, YANG Yun, ZHOU Changle,
    2009, 23(3): 95-103.
    The representation of the metaphor's literal meaning is the premise of deeper semantic representation and further processing of metaphor. In fact, as the input language of the metaphor computation system, the metaphor literal meaning representation language influences the final semantic calculation results. In this paper, we first classify Chinese metaphors into two categories, the un-nested metaphor and the nested metaphor, based on an analysis of the characteristics of Chinese metaphor. We further design the metaphor role dependency representation language to describe the metaphor's literal meaning in terms of its shallow semantics and metaphor information. Finally, experiments show that the proposed method is quite effective in interpreting Chinese metaphor.
    Key words computer application; Chinese information processing; metaphor literal meaning; metaphor computation; un-nested metaphor; nested metaphor; metaphor role dependency representation language
  • Review
    JIA Yuan, LI Aijun, MA Qiuwu, XIONG Ziyu
    2009, 23(3): 103-110.
    Through acoustic and perceptual experiments, this study systematically investigates the phonetic realization and accent distribution of the typical focus-marking construction “[shi[…XP…]]” in Chinese. The experimental results demonstrate that the pitch range of the under-focus position is enlarged significantly and the pitch registers of the following syllables are compressed successively; further, there is an intermediate phrase boundary after the marked focus and an intonation phrase boundary before the second marker shi. Based on these results, the study further discusses the controversial grammatical issue of the relationship between focus and accent, proposing that variances in accent and pitch range cannot be adopted to determine the location of focus, even though accents can be identified by focus and the pitch range of the focus-bearing unit is enlarged.
    Key words computer application; Chinese information processing; shi-marked sentence; focus; accent; relation between focus and accent
  • Review
    SONG Rui, LIN Hongfei
    2009, 23(3): 110-116.
    Proper processing of a document set based on its semantic structure helps produce better multi-document summaries. In this paper, subject-object-predicate triples are first extracted from the document set to construct a document semantic graph. Then edit distance-based clustering and the PageRank algorithm are applied to optimize the graph structure and to assign weights to the vertices and links, respectively. Finally, triples with higher-weighted vertices and links are collected as the summary. Evaluated against extraction-based summarization in terms of the ROUGE score on a set of manually generated summaries, the semantic graph-based summarization achieves greater overlap with the manually created summaries, and the edit distance-based graph structure optimization contributes positively to the summarization quality.
    Key words computer application; Chinese information processing; document semantic graph; edit distance; Page-Rank; ROUGE; Chinese multi-document summarization
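    The vertex-weighting step can be illustrated with a plain power-iteration PageRank; this is a generic sketch over a hypothetical graph, not the paper's code.

```python
def pagerank(adj, d=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [out-neighbors]}
    (illustrative of the vertex-weighting step; not the paper's implementation)."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - d) / n for v in nodes}     # teleportation mass
        for v, outs in adj.items():
            if outs:
                share = rank[v] / len(outs)       # split rank over out-links
                for u in outs:
                    new[u] += d * share
            else:                                 # dangling node: spread evenly
                for u in nodes:
                    new[u] += d * rank[v] / n
        rank = new
    return rank

# Toy semantic graph: 'a' is linked to by both 'b' and 'c'.
ranks = pagerank({'a': ['b'], 'b': ['a', 'c'], 'c': ['a']})
```
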
  • Review
    ZHANG Ruixia, ZHU Guiliang, YANG Guozeng
    2009, 23(3): 116-121.
    A new measure of semantic similarity between Chinese words is put forward on the basis of knowledge graphs. With HowNet (2005) as the semantic knowledge resource and knowledge graphs as the knowledge representation method, this method classifies sememes based on their semantic roles in HowNet, and then measures the semantic similarity between the constructed word graphs by the similarity of the different kinds of sememes. To evaluate the proposed similarity measure, a new model is designed for quantitative evaluation. With the help of this evaluation model, experimental results show that the effectiveness of our semantic measure is 89.1%.
    Key words computer application; Chinese information processing; knowledge graphs; HowNet; semantic similarity
  • Review
    MA Li, JIAO Licheng, BAI Lin, ZHOU Yafu, DONG Luobing
    (. Institute of Intelligent Information Processing, Xidian University, Xi'an, Shaanxi 7007, China;
    . Information Center, Xi'an Institute of Post and Telecommunications, Xi'an, Shaanxi 7006, China;
    . Library, Xidian University, Xi'an, Shaanxi 7007, China)
    2009, 23(3): 121-129.
    In this paper, a new algorithm is proposed for extracting compound keywords from a Chinese document using the small-world network. Using a k-nearest-neighbor coupled graph, a Chinese document is first represented as a network: nodes represent terms, and edges represent the co-occurrence of terms. Then two variables, the clustering coefficient increment and the average path length increment, are introduced to measure a term's importance and to generate the candidate keyword set. Considering factors such as the parts of speech of any two terms co-occurring in a sentence and the neighborhood between any two terms of the candidate set, related words in the candidate set are combined into compound keywords. Experimental results show that the algorithm is effective and accurate in comparison with manual keyword extraction from the same documents. The semantic representation of a document by compound keywords is far clearer than that of a single-keyword set, facilitating better comprehension of the document.
    Key words computer application; Chinese information processing; small world network; term network graph; average shortest path length increment; average clustering coefficient increment; compound keywords
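    One of the two small-world quantities the abstract names can be sketched as follows: the local clustering coefficient of a node in an undirected term graph. This is illustrative only; the paper works with increments of this quantity when a term is removed from the network.

```python
def clustering_coefficient(adj, v):
    """Local clustering coefficient of node v in an undirected graph given
    as {node: set(neighbors)}: the fraction of neighbor pairs that are
    themselves connected (illustrative; not the paper's code)."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count edges among v's neighbors (each unordered pair once).
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))

# Triangle: every neighbor pair is connected -> coefficient 1.0.
triangle = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b'}}
# Path a-b-c: b's neighbors are not connected -> coefficient 0.0.
path = {'a': {'b'}, 'b': {'a', 'c'}, 'c': {'b'}}
```
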