2008 Volume 22 Issue 2 Published: 15 April 2008
  

  • Review
    LI De-yi, XIAO Li-ping,
    2008, 22(2): 3-9.
    Pattern recognition, knowledge engineering, and robotics have made significant progress in the 50-year history of artificial intelligence; however, AI is still far from human intelligence. To meet the requirements of data mining, machine learning, and knowledge discovery, this paper discusses in detail three important directions for AI research and development in the network age: cognitive physics, intelligence with uncertainty, and networked intelligence.
  • Review
    MI Hai-tao, XIONG De-yi, LIU Qun
    2008, 22(2): 10-17.
    External resources can be used effectively to improve parsing accuracy. In this paper, we introduce an external Chinese lexical analysis system into parsing and propose a general transformation method to integrate them. Transformation-Based Error-Driven Learning and Conditional Random Fields are used to solve the problem of transformation between two different standards of segmentation and POS tagging. We also propose a parsing model that effectively combines the head-driven parsing model and the structural context parsing model. Experimental results show that our new integrated parsing model achieves an F1 score of 82.5% on the Penn Chinese Treebank Version 1.0, higher than state-of-the-art parsers.
  • Review
    YU Zhong-hua, CHEN Rong, HU Jun-feng, CHEN Yuan
    2008, 22(2): 18-23.
    Information extraction from biomedical literature is very useful for exploiting the achievements of the biomedical field and promoting further progress in biology and medicine. Aiming at biomedical abbreviation analysis and understanding, this paper proposes an approach for disambiguating biomedical abbreviations based on K-nearest neighbors (K-NN) with weighted voting. In the approach, labeled samples are generated automatically based on the hypothesis of "one sense per discourse", and the words describing the topic of a discourse are chosen as the features for abbreviation disambiguation. The classification model used in the approach is K-NN with weighted voting. Experimental results on a testing set containing 177 762 Medline abstracts show that the proposed approach obtains higher precision than others in related work. The experiments also prove that, compared with traditional K-NN, K-NN with weighted voting achieves not only higher precision but also better stability in the abbreviation disambiguation task.
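The weighted-voting variant of K-NN described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the sparse bag-of-words vectors and the toy long-form labels are invented for the example, and the similarity measure (cosine) is an assumption.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts of feature -> weight)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_weighted_vote(query, labeled, k=3):
    """Classify `query` by letting each of the k nearest neighbors cast a vote
    weighted by its similarity to the query, instead of one equal vote
    as in traditional K-NN."""
    neighbors = sorted(labeled, key=lambda ex: cosine(query, ex[0]), reverse=True)[:k]
    votes = defaultdict(float)
    for vec, label in neighbors:
        votes[label] += cosine(query, vec)
    return max(votes, key=votes.get)
```

The similarity-weighted vote makes the decision less sensitive to an unlucky choice of k, which is one plausible source of the stability gain the abstract reports.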
  • Review
    PANG Ning, YANG Er-hong
    2008, 22(2): 24-27,54.
    Coreference is a common phenomenon in news reports about paroxysmal events, and coreference resolution is essential for information extraction. In this paper, we present an approach to coreference resolution in Chinese news reports about paroxysmal events based on the maximum entropy model. With this approach, we can extract the pronouns, nouns, and noun phrases that refer to the same entity in a news report. The training corpus contains 200 000 Chinese characters and the testing corpus 100 000. Eight kinds of features are chosen for the maximum entropy model according to the characteristics of the problem. Experimental results show that the approach achieves an F-measure of 64.6%.
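The pairwise setup behind such a system can be sketched as follows: each (anaphor, candidate antecedent) mention pair becomes a feature vector that a maximum entropy classifier would then score. The four features below are illustrative stand-ins, not the paper's eight, and the mention representation is invented for the example.

```python
def pair_features(ana, cand):
    """Features for one (anaphor, candidate antecedent) mention pair.
    `ana` and `cand` are dicts with keys: text, pos, sent_idx."""
    return {
        "string_match": ana["text"] == cand["text"],   # exact surface match
        "is_pronoun": ana["pos"] == "PN",              # anaphor is a pronoun
        "same_sentence": ana["sent_idx"] == cand["sent_idx"],
        "sent_distance": ana["sent_idx"] - cand["sent_idx"],
    }

def candidate_pairs(mentions):
    """Pair every mention with each earlier mention; a trained classifier
    would decide which pairs are coreferent."""
    for i, ana in enumerate(mentions):
        for cand in mentions[:i]:
            yield ana, cand, pair_features(ana, cand)
```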
  • Review
    XIONG De-yi, LIU Qun, LIN Shou-xun
    2008, 22(2): 28-39.
    This paper presents an overview of recent syntax-based statistical machine translation (SMT). According to the grammars on which the translation models are based, we classify syntax-based SMT into two categories: formally syntax-based SMT and linguistically syntax-based SMT. For each category, we discuss the representative work, including model design, training, and decoding. We also compare the different models. Finally, we point out the problems in designing syntax models and give a prediction of the future development of syntax-based SMT.
  • Review
    HUANG Jin, LV Ya-juan, LIU Qun,
    2008, 22(2): 40-46.
    Parallel corpora are an indispensable resource for translation model training in statistical machine translation (SMT) systems. Instead of collecting more and more parallel training corpora, this paper aims to improve the performance of an SMT system by exploiting the full potential of the existing parallel corpora. We propose an approach to select and optimize the training corpus using information retrieval methods. First, sentences similar to the test text are selected to form a small, adapted training set. This yields comparable or even better performance with only a subset of the total data and lower hardware requirements. Second, we add the selected subset to the entire corpus to optimize the data distribution and obtain a better result. Experiments show that this method can effectively improve the performance of the SMT system.
  • Review
    BAI Shun
    2008, 22(2): 47-54.
    This paper describes the implementation of a Japanese-Mongolian machine translation system for verb phrases. In Japanese derivational grammar there is no concept of conjugation; a word is analyzed into stems and suffixes. After translating the Japanese stems and suffixes into Mongolian stems and suffixes, Mongolian phonetic rules are applied to generate the verbal phrases. We tested the system on 403 verb phrases from 30 Japanese reports and achieved an accuracy of 95.78%.
  • Review
    ZHANG Jian-feng, ZHANG Qi, WU Li-de, HUANG Xuan-jing
    2008, 22(2): 55-59,86.
    This paper presents a novel method to extract the subjective relationship between opinion-bearing terms and opinion targets. The method extracts pairs of opinion-bearing terms and opinion targets as a candidate set, and then employs the maximum entropy model to combine lexical, part-of-speech, semantic, and positional features derived from the text. Our method incorporates relation extraction into opinion mining and, to some extent, solves the problems of coreference and omission of opinion targets. Experiments show that the F-measure of our method is 15% higher than that of the baseline, which takes the nearest opinion target as the real target. The experiments also found that intensifiers can improve the performance of subjective relation extraction.
  • Review
    ZHU Lei, JIANG Jie, ZHENG Rong, XU Bo,
    2008, 22(2): 60-63.
    Speaker retrieval has recently emerged as an important task due to the rapidly growing volume of audio archives. This paper presents a novel approach to accelerate speaker recognition. The approach combines the state-of-the-art speaker recognition system (the GMM-UBM system) with indexing and simulation, and it can greatly accelerate the search with little reduction in accuracy. Specifically, a two-pass search strategy is proposed: first, we compute the Euclidean distance between two indexes to find candidates, and then we use simulation to find the best target. Experimental results show that our approach effectively improves the speed of the process with little degradation in accuracy.
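The two-pass strategy can be illustrated with a short sketch: a cheap Euclidean distance over fixed-length index vectors prunes the archive to a few candidates, and only those candidates are ranked with the expensive full score. The index vectors and the `fine_score` callback below are invented placeholders; in the paper the index and the second-pass "simulation" are both derived from the GMM-UBM models.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length index vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def two_pass_search(query_index, archive, fine_score, n_candidates=2):
    """First pass: keep the n_candidates archive entries whose index vectors
    are closest to the query. Second pass: rank only those candidates with
    the expensive fine_score and return the best speaker id.
    `archive` is a list of (speaker_id, index_vector) tuples."""
    candidates = sorted(archive, key=lambda e: euclidean(query_index, e[1]))[:n_candidates]
    return max(candidates, key=lambda e: fine_score(e[0]))[0]
```

Because the costly scorer runs on `n_candidates` entries instead of the whole archive, the speed-up grows with archive size, at the price of missing a target the coarse index filters out.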
  • Review
    CHEN Yao-dong, WANG Ting, CHEN Huo-wang
    2008, 22(2): 70-75.
    Semantic analysis is one of the fundamental and key problems in the research of content-based text mining. Most supervised machine learning methods perform poorly when trained on limited tagged data. This paper investigates a novel semi-supervised learning algorithm, the Transductive Support Vector Machine (TSVM), for shallow semantic parsing. An optimization strategy for selecting training instances, based on active learning, is integrated with the TSVM. Experimental results show that the method integrating the TSVM with this optimization strategy outperforms supervised methods for shallow semantic parsing on small tagged data sets.
  • Review
    WANG Yong, LIU Yi-qun, ZHANG Min, MA Shao-ping, RU Li-yun
    2008, 22(2): 76-80.
    The activeness of a web page varies during its lifetime. Some pages are valuable only in a specific period and then become obsolete. Analyzing web page lifetime from the users' perspective is important for enhancing the performance of web crawlers and search engines, and for improving the efficiency of web advertising. With page view data collected by a proxy server, we were able to perform large-scale analysis of web page lifetime. A model is given to describe user interest evolution, based on an experiment conducted with the page view data of more than 36 000 000 web pages over two months. The model is a foundation for better understanding how the web is organized and operates.
  • Review
    FU Yu-peng, ZHANG Min, MA Shao-ping
    2008, 22(2): 81-86.
    Enterprise search has become increasingly important in research as information technology develops. Discussion search in enterprise email collections is a frequently faced problem: enterprise corporations hold large volumes of emails containing valuable information, so retrieving the required data from those emails effectively is important. In this paper, based on the structural features of emails and a study of their semantic topology, we introduce an email-feature-based retrieval model. In the TREC 2006 discussion search task, our model achieved the best performance among all participants.
  • Review
    GUO Rui, SONG Ji-hua, LIAO Min
    2008, 22(2): 87-91,105.
    With the prosperity and development of corpus linguistics, research on example-based machine translation (EBMT) has a flourishing prospect. In this area, two problems must be solved: 1) constructing a large-scale parallel corpus with high accuracy and speed; 2) searching the huge set of aligned examples for the sentence most similar to the input sentence. This paper addresses EBMT between ancient and modern Chinese. First, a new translation model is built which takes sentence length, character information, and punctuation into account at the same time. Then, a new approach for aligning bilingual sentences automatically is proposed based on a genetic algorithm and dynamic programming. Finally, a new similarity method is given based on the information entropy of Chinese characters. Experimental results show that our methods achieve good performance.
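The dynamic-programming part of such an aligner can be sketched as a classic length-based DP over 1-1, 1-2, and 2-1 sentence beads. The cost function below (absolute difference of scaled lengths) is a simplified stand-in for the paper's combined length, character, and punctuation score, and the bead inventory is an assumption.

```python
def align_by_length(src_lens, tgt_lens, ratio=1.0):
    """Length-based DP sentence alignment with 1-1, 1-2, and 2-1 beads.
    The cost of a bead is |sum(src lengths) * ratio - sum(tgt lengths)|,
    where `ratio` rescales source lengths (e.g. ancient vs modern Chinese).
    Returns the minimum-cost bead sequence as (src_count, tgt_count) pairs."""
    INF = float("inf")
    n, m = len(src_lens), len(tgt_lens)
    best = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    beads = [(1, 1), (1, 2), (2, 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj in beads:
                if i + di <= n and j + dj <= m:
                    cost = abs(sum(src_lens[i:i + di]) * ratio
                               - sum(tgt_lens[j:j + dj]))
                    if best[i][j] + cost < best[i + di][j + dj]:
                        best[i + di][j + dj] = best[i][j] + cost
                        back[i + di][j + dj] = (di, dj)
    # Trace back the chosen bead sequence from the final cell.
    result, i, j = [], n, m
    while i or j:
        di, dj = back[i][j]
        result.append((di, dj))
        i, j = i - di, j - dj
    return result[::-1]
```

The paper's genetic algorithm could then search over the cost parameters (such as `ratio`) rather than over alignments directly, with this DP scoring each candidate setting.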
  • Review
    FANG Gao-lin, YU Hao, MENG Yao, ZOU Gang
    2008, 22(2): 92-98.
    As one of the important research topics, computer-aided Chinese learning is attracting more and more interest in the natural language processing community. A computer-aided reading and learning system that takes the character as its unit of analysis is proposed in this paper to provide reading and learning assistance for Chinese learners. The system first employs character-based Chinese morphological analysis to segment Chinese texts into words, and then presents a method based on the structural information of constituent characters for new word finding. For unknown words unregistered in the dictionary (such as technical terms, proper nouns, and fixed phrases), a method based on semantic prediction and feedback learning is proposed to mine their native translations from the Web. For frequent words, real-time translation display is implemented via the Chinese-English (Chinese-Japanese) dictionary database, and users can also obtain typical examples of a word's usage through a word usage retrieval module. The key technologies in this system include morphological analysis based on character information, word segmentation based on the structural information of constituent characters, and translation acquisition of new words based on semantic prediction and feedback learning. The character analysis unit is the core of all the methods proposed in the whole system. Experiments show that our system performs well in every aspect.
  • Review
    YAN Zhi-jie, HU Yu, WANG Ren-hua
    2008, 22(2): 99-105.
  • Review
    SUN Cheng-li , LIU Gang, GUO Jun
    2008, 22(2): 106-109,128.
    A Minimum Classification Error (MCE) criterion based sub-word weighting parameter estimation algorithm is proposed, in which the sub-word weighting parameters are derived by MCE training. An investigation of the contribution of different sub-words to the word-level confidence measure shows that Finals significantly outperform Initials, with more reliable and stable confidence performance, and that Finals have more discriminative power than Initials. Experiments on a keyword spotting system with 130 keywords show that the system with different sub-word weighting contributions achieves a relative Equal Error Rate (EER) reduction of 3.05% compared with the equal-weighting case.
  • Review
    GUO Qing, Nobuyuki Katae, YU Hao, Hitoshi Iwamida
    2008, 22(2): 110-115.
    The Fujitsu Mandarin TTS system is a state-of-the-art, unit-selection based concatenative speech synthesis system. This paper describes the current status of the system, especially the aspects related to prosody generation. The decision-tree-based duration prediction method and the statistical pitch contour prediction method are described in detail. Finally, the prosody evaluation results and the system evaluation results are presented.
  • Review
    SHAO Yan-qiu, SUI Zhi-fang, HAN Ji-qing,WU Yun-fang
    2008, 22(2): 116-123.
    A prosodic hierarchy divides texts into prosodic chunks for better speaking and understanding. Currently, many shallow features, such as part of speech and word length, are used to predict the prosodic hierarchy, but these features are not powerful enough for predicting some prosodic units, such as prosodic phrases. In fact, syntactic structure is closely related to prosodic structure; the two influence and restrict each other. In this paper, based on dependency grammar, some deep features related to the prosodic hierarchy are extracted. Compared with the shallow features, deep features such as inner-arc span and inner-arc type are more effective for predicting middle-level units such as prosodic phrases: the F-score increases by about 11%.
  • Review
    WANG Shi-jin, ZHENG Rong, XU Bo
    2008, 22(2): 124-128.
    This paper presents an automatic language identification (LID) system based on the lattice-based PPRLM method. As an extension of the original PPRLM, the lattice-based method uses a lattice to generate the acoustic hypothesis space, which contains more information than the 1-best phoneme sequence of the original PPRLM. Evaluations on broadcast speech in real environments show that lattice-based PPRLM improves the accuracy rate by 6%. The results are also comparable with those of other approaches across different languages when a four-hour training set is given for each language.