2009 Volume 23 Issue 1 Published: 16 February 2009
  

    Review
  • Review
    ZHUANG Cheng-long, QIAN Long-hua, ZHOU Guo-dong
    2009, 23(1): 3.
    This paper describes an improved tree kernel-based approach to entity semantic relation extraction, in which performance is improved by incorporating entity-related semantic information into the structured representation of relation instances and by pruning redundant information. Starting from the Shortest Path-enclosed Tree of a relation instance, entity-related semantic information, such as entity types, subtypes, and mention types, is first uniformly appended. Noun-phrase modifiers and redundant information in coordinate structures are then removed, while the possessive structure is additionally included. With the resulting representation of the relation instance, experiments on the ACE RDC 2004 benchmark corpus show that the method significantly improves performance, achieving F-measures of 79.1% and 71.9% on relation detection and top-level relation extraction, respectively.
  • Review
    CHENG Yue, CHEN Xiao-he
    2009, 23(1): 9.
    A new method for recognizing Chinese verb-object collocations is proposed, based on the conditional random fields (CRFs) model. The CRF-based model is examined with verb subcategorization features, context features, and combinations of the two. The experiments are carried out under two different Chinese word segmentation and part-of-speech tagging schemes, with part-of-speech filtering rules applied to optimize the results. The best performance is an F-score of 87.40% on the Tsinghua Chinese Treebank and 74.70% under the segmentation and part-of-speech tagging scheme of Peking University. These results show that the CRF model is effective for automatically recognizing Chinese verb-object collocations.
  • Review
    HUANG De-gen, YU Jing
    2009, 23(1): 16.
    This paper proposes a distributed strategy for Chinese text chunking based on Conditional Random Fields (CRFs) and an error-driven technique. First, eleven types of Chinese chunks are divided into groups, and a separate CRF model is built for each group. Then the error-driven technique is applied to the CRF chunking results for further correction. Finally, a method is described for resolving conflicting chunks according to F-measure values. The experimental results show that this approach is effective, outperforming the single CRF-based approach, the distributed method, and other hybrid approaches in the open test, reaching 94.90%, 91.00%, and 92.91% in recall, precision, and F-measure, respectively.
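As a quick consistency check on the chunking figures, the reported F-measure agrees with the balanced harmonic mean of the reported precision and recall. The sketch below assumes the abstract uses the standard F1 definition (beta = 1), which the abstract does not state; `f_measure` is an illustrative helper name, not something from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Figures reported in the abstract, written as fractions.
reported_recall = 0.9490
reported_precision = 0.9100

f1 = f_measure(reported_precision, reported_recall)
print(f"F1 = {f1:.2%}")  # compare with the 92.91% quoted in the abstract
```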
  • Review
    WANG Hai-dong, HU Nai-quan, KONG Fang, ZHOU Guo-dong
    2009, 23(1): 23.
    This paper proposes a machine learning-based approach to coreference resolution, with a special focus on the semantic role labeling information of the anaphor and the antecedent candidate. We first extend the baseline system with semantic role features acquired from the ASSERT system. We then analyze the integration of semantic role features with detailed pronoun type knowledge; the analysis suggests that incorporating semantic role information about the anaphor and its antecedent candidates benefits coreference resolution, especially for pronouns. Evaluation on the ACE-2003 NWIRE benchmark corpus shows that systems with proper handling of semantic role information achieve significant improvements of 3.4% in recall and 1.8% in F-measure.
  • Review
    MO De-min, LIU Yao-jun
    2009, 23(1): 30.
    This paper proposes a modified Wu-Manber algorithm for multiple-pattern matching, based on the ε-free subsuffix. The algorithm reduces the amount of string matching by grouping patterns that share a common subsuffix. Experiments on documents provided by Sogou indicate that the proposed algorithm significantly improves string matching efficiency compared with the original Wu-Manber algorithm and its modified version.
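For orientation, here is a minimal sketch of the baseline Wu-Manber scheme that the abstract takes as its starting point. It does not reproduce the paper's ε-free subsuffix modification, which the abstract does not specify. The function name `wu_manber_search`, the block size of 2, and the plain dictionary used in place of a hashed table are all illustrative assumptions.

```python
from collections import defaultdict


def wu_manber_search(text, patterns, block=2):
    """Return sorted (start, pattern) pairs for every occurrence of each pattern."""
    if not patterns:
        return []
    m = min(len(p) for p in patterns)  # only the first m characters drive the tables
    if m < block:
        raise ValueError("every pattern must be at least `block` characters long")

    default_shift = m - block + 1      # safe jump when a block occurs in no pattern prefix
    shift = {}                         # block -> distance from its rightmost position to the prefix end
    buckets = defaultdict(list)        # last block of each m-prefix -> candidate patterns
    for p in patterns:
        for end in range(block, m + 1):
            blk = p[end - block:end]
            shift[blk] = min(shift.get(blk, default_shift), m - end)
        buckets[p[m - block:m]].append(p)

    matches = []
    pos = m                            # exclusive end index of the current window
    while pos <= len(text):
        blk = text[pos - block:pos]
        step = shift.get(blk, default_shift)
        if step == 0:
            # The window ends with the final block of some prefix: verify candidates.
            start = pos - m
            for p in buckets.get(blk, ()):
                if text.startswith(p, start):
                    matches.append((start, p))
            step = 1
        pos += step
    return sorted(matches)


print(wu_manber_search("say hello to the world, help!", ["hello", "world", "help"]))
```

The `buckets` table already groups patterns whose length-m prefixes end in the same block, so only those candidates are verified at each zero-shift position. Grouping by a shared subsuffix, as the abstract describes, can be read as a refinement of that idea.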
  • Review
    CAI Zhi-jie
    2009, 23(1): 35.
    In Tibetan information processing, the word is the fundamental unit for parsing, sentence comprehension, automatic summarization, automatic classification, machine translation, and so on. Tibetan word segmentation is therefore essential. Based on an analysis of abbreviated words in Tibetan, this article proposes a new restoration method for identifying abbreviated words during Tibetan word segmentation. The basic idea is to restore an abbreviated Tibetan word to its original form by applying restoration rules. The method has been applied in a research project of the National Language Committee, reaching an accuracy of 99.83% in a test on an 850,000-byte Tibetan corpus.
  • Review
    WANG Chen, SONG Guo-long, WU Hong-lin, ZHANG Li, LIU Shao-ming
    2009, 23(1): 38.
    Phrase translation extraction is one of the key techniques in Example-Based Machine Translation (EBMT), and its accuracy directly influences the performance of an EBMT system. This paper proposes a phrase translation extraction method based on sequence intersection, in which a sentence is treated as a word sequence. In a sentence-aligned Chinese-Japanese bilingual corpus, the source sentences containing the phrase are first retrieved. The pairwise intersections of the corresponding target sentences are then taken as the phrase translation. This approach obtains high-quality phrase translations by mining the bilingual corpus, avoiding preprocessing steps such as word alignment, parsing, and the use of a dictionary. The experiments show that the method achieves over 80% accuracy for the acquired phrase translations.
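The intersection idea can be sketched as follows. The abstract does not define the intersection operator, so this example assumes it is the longest contiguous run of tokens shared by two target sentences, and it assumes a simple majority vote over all pairs. The function names and the romanized toy sentences are hypothetical.

```python
from collections import Counter
from itertools import combinations


def longest_common_run(a, b):
    """Longest contiguous token run shared by sequences a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return tuple(a[best_end - best_len:best_end])


def extract_translation(target_sentences):
    """Vote over pairwise intersections and return the most frequent non-empty one."""
    votes = Counter()
    for s, t in combinations(target_sentences, 2):
        run = longest_common_run(s, t)
        if run:
            votes[run] += 1
    return votes.most_common(1)[0][0] if votes else ()


# Toy target sides of sentence pairs whose source side contains the query phrase.
targets = [
    ["watashi", "wa", "kikai", "honyaku", "o", "kenkyuu", "suru"],
    ["kikai", "honyaku", "wa", "muzukashii"],
    ["kare", "no", "kikai", "honyaku", "shisutemu"],
]
print(extract_translation(targets))
```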
  • Review
    PANG Wei, XU Bo
    2009, 23(1): 44.
    This paper proposes a novel framework for Chinese-English name back-transliteration that combines multiple models using weighted finite-state transducers (WFSTs). Two grapheme-based models and two phoneme-based models form the core of the framework. Combining these models within the unified WFST framework yields a system for Chinese-English name back-transliteration. Compared with single-model systems, the advantage of this method lies in combining information from different models and making maximal use of the available data. Our experiments show that the proposed framework reduces the error rate by 7.14% compared with the single-model system.
  • Review
    YANG Pan, ZHANG Jian, LI Miao, Wudabala, XUE Yan
    2009, 23(1): 50.
    This paper presents an approach to morphological processing in Chinese-Mongolian statistical machine translation that attempts to resolve the problems of word form selection and word reordering in translation generation. From the original Chinese-Mongolian parallel corpus, which is morphologically analyzed and POS tagged, two corpora are derived for the morphological experiments. The statistical models, including the language model, the translation model, and the generation model, are then established. The issue of decoding expansion is also discussed. Finally, we analyze two experiments based on different morphological processing methods: a morpheme model experiment and a factored method experiment. The results show that the BLEU scores of both morphological processing methods exceed those of the baseline system, indicating that our method partially solves the problems of word form selection and word ordering.
  • Review
    CHEN Xiang, LIN Hong-fei
    2009, 23(1): 58.
    Sentence alignment is an essential step in bilingual corpus processing, and aligning the sentences of bilingual biomedical abstracts is the first step in constructing a biomedical bilingual lexicon. This paper describes a sentence alignment method that uses maximum weight matching on a bipartite graph. After combining sentence length and sentence location information, anchor information is employed to calculate paragraph similarity and sentence similarity in bilingual biomedical abstracts. The experimental results demonstrate the effectiveness of the method.
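The matching step can be illustrated with a small exact solver. This is a sketch under assumptions: the similarity weights are made-up numbers rather than the paper's length, location, and anchor scores, and the bitmask dynamic program stands in for whatever matching algorithm the authors used. It is practical only for abstracts with a handful of sentences; `max_weight_matching` is an illustrative name.

```python
from functools import lru_cache


def max_weight_matching(weights):
    """Exact maximum-weight one-to-one matching between rows (source) and columns (target).

    weights[i][j] is the similarity of source sentence i and target sentence j.
    Returns (total_weight, pairs); a sentence may stay unmatched.
    """
    n_rows = len(weights)
    n_cols = len(weights[0]) if weights else 0

    @lru_cache(maxsize=None)
    def best(row, used):
        # `used` is a bitmask of target sentences already taken by earlier rows.
        if row == n_rows:
            return 0.0, ()
        best_score, best_pairs = best(row + 1, used)  # option: leave this row unmatched
        for col in range(n_cols):
            if not used & (1 << col):
                score, pairs = best(row + 1, used | (1 << col))
                score += weights[row][col]
                if score > best_score:
                    best_score, best_pairs = score, ((row, col),) + pairs
        return best_score, best_pairs

    return best(0, 0)


# A case where picking the single best edge first (0 -> 0) would be suboptimal.
similarity = [
    [0.60, 0.50],
    [0.55, 0.00],
]
total, alignment = max_weight_matching(similarity)
print(total, alignment)
```

Maximizing the total weight rather than choosing the best edge greedily lets one strong but misleading similarity score be overruled by the rest of the alignment.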
  • Review
    LUO Zhun-chen, WANG Ting
    2009, 23(1): 63.
    Keyword extraction plays an important role in information retrieval, automatic summarization, text clustering, and text classification. A significant portion of the keywords extracted are actually key phrases or words not yet recorded in a lexicon, which makes keyword extraction more difficult. This paper argues that keyword extraction can be treated as two problems: extracting key words and extracting key phrases. A keyword extraction algorithm based on separate models is proposed, with different features developed for the two problems, so as to improve the accuracy of keywords extracted from Chinese documents. The experimental results show that the proposed algorithm performs better than traditional keyword extraction algorithms.
  • Review
    FU Jian-bo, WANG Ming-wen, LUO Yuan-sheng, ZHANG Hua-wei
    2009, 23(1): 71.
    Document re-ranking is an effective way to meet users' demand for high-precision information retrieval. This paper presents a document re-ranking model based on document cliques, which are extracted from a document Markov network constructed from the corpus. Incorporating document clique information into re-ranking is shown to be effective, yielding better precision than the BM25 model on the adi, cacm, med, cisi, and cran datasets.
  • Review
    XU Jun, ZHENG Jia-qian, YAO Jing, NIU Jun-yu
    2009, 23(1): 79.
    Spam filtering shares several characteristics with stream data processing: high volume, unbounded growth, and dynamic change. Traditional spam filtering methods use static feature selection, which cannot reflect the fact that the features of stream data change over time. This paper proposes a spam filtering method, based on the characteristics of time-ordered streams, that adjusts the effective features used for filtering in real time. Experimental results on the TREC spam track corpus show that the method optimizes the time and space cost of filtering while keeping the accuracy of the spam filter high.
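One way to picture feature selection that follows a changing stream is to let older evidence fade. The sketch below is not the paper's method, which the abstract does not detail. It assumes exponentially decayed per-token counts and a simple spam-versus-ham skew score; the class name, the `decay` and `top_k` parameters, and the toy tokens are all hypothetical.

```python
from collections import defaultdict


class DecayingFeatureSelector:
    """Keeps exponentially decayed ham/spam counts per token and ranks tokens by skew."""

    def __init__(self, decay=0.99, top_k=3):
        self.decay = decay
        self.top_k = top_k
        self.counts = defaultdict(lambda: [0.0, 0.0])  # token -> [ham weight, spam weight]

    def update(self, tokens, is_spam):
        # Age every stored count, then credit the tokens of the new message.
        # (Aging all tokens per message is simple but costly; a real system
        # would likely decay lazily using timestamps.)
        for pair in self.counts.values():
            pair[0] *= self.decay
            pair[1] *= self.decay
        for tok in set(tokens):
            self.counts[tok][1 if is_spam else 0] += 1.0

    def active_features(self):
        # Score = absolute gap between decayed spam weight and decayed ham weight.
        def skew(tok):
            ham, spam = self.counts[tok]
            return abs(spam - ham)

        return sorted(self.counts, key=skew, reverse=True)[: self.top_k]


selector = DecayingFeatureSelector(decay=0.5, top_k=1)
for _ in range(2):
    selector.update(["viagra"], is_spam=True)   # older spam vocabulary
for _ in range(2):
    selector.update(["crypto"], is_spam=True)   # newer spam vocabulary
print(selector.active_features())
```

With static counts the two tokens would tie at two occurrences each; with decay, the more recent token ranks first.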
  • Review
    DONG Yan-ju, CAI Dong-feng, BAI Yu
    2009, 23(1): 86.
    Answer selection, a crucial step in a question answering system, chooses the best answer from the candidates. The research issues include the criteria, strategies, methods, and evaluation of answer selection. This paper first describes the main answer selection criteria and analyzes the relationship between the criteria and question answering evaluation. It then groups the answer selection strategies into redundancy-based, similarity-based, and reasoning-based strategies, presenting the algorithms and characteristics of each. The evaluation measures for answer selection and the Answer Validation Exercise are also introduced. Finally, the paper discusses the major problems in answer selection and the prospects for future research.
  • Review
    WANG Chao, LI Nan, LI Xin-li, LIANG Xun
    2009, 23(1): 95.
    Financial information on the Internet is increasingly important to the stock market. Faced with countless news items, most of which are unstructured text, we introduce news sentiment as an additional factor in modeling the price volatility of the financial market. The results show a correlation between information sentiment and asset price volatility, suggesting a new way to predict the stock market efficiently.
  • Review
    ZHANG Kai-xu, SUN Mao-song
    2009, 23(1): 100.
    This paper presents an approach to the computer generation of Chinese couplets. After dividing the rules of couplet composition into hard rules and soft rules, the paper points out that the soft rules consist of character correspondence and context correspondence. A probabilistic graphical model based on the soft rules is proposed for couplet generation, with parameters estimated by the EM (Expectation-Maximization) algorithm. The decoding of the model integrates the hard rules as heuristics. The experimental results demonstrate that the candidate characters produced by this model are better than those produced by frequency alone. The model can even learn parameters from a data set that contains some couplets of poor quality. The couplet generation program implemented with this approach achieves acceptable performance.
  • Review
    LI Yong-ping
    2009, 23(1): 106.
    Many coding methods have been proposed for Chinese character input. Almost all of them need to store the spell-coded Chinese string and to select among multiple codes. This paper introduces a reverse retrieval method for spell-coded Chinese characters that requires no pre-storage of the spelling code. The method resolves the difficulty of selecting among multiple codes that results from converting Chinese strings to spellings, as well as the ambiguity that multiple codes cause between users and designers. We further analyze all Chinese words with multiple spellings in the GBK library and create an initial-letter code library. Finally, we implement this reverse retrieval technique for spell-coded Chinese characters.
  • Review
    NI Chong-jia, LIU Wen-ju, XU Bo
    2009, 23(1): 112.
    Large vocabulary continuous speech recognition (LVCSR) technology has developed quickly and achieved broad application in recent years. Many large companies have strengthened their speech recognition research, and various commercial systems have appeared on the market. This paper reviews recent research progress in LVCSR and describes the main frameworks and designs of current Mandarin Chinese LVCSR systems. The key issues and principles in LVCSR are analyzed in detail. The prospects and research trends for LVCSR at home and abroad are also discussed.
  • Review
    CHEN Jing, MU Zhi-chun
    2009, 23(1): 124.
    Research on Chinese character cognition is an important aspect of cognitive science and computer science, especially artificial intelligence. Based on the traits of Chinese characters, this paper proposes a Chinese character font cognition model built on a self-organizing neural network and adaptive resonance theory (ART). The model attempts to simulate the development of Chinese character cognition so as to reveal some essential rules in human learning of Chinese characters. After training and testing, the simulation results suggest that the model can account for some empirical results in the development of Chinese character cognition.