2004 Volume 18 Issue 3 Published: 15 June 2004
  

  • WANG Jin,CHEN En-hong,ZHANG Zhen-ya,WANG Xu-fa
    2004, 18(3): 2-9,61.
    As online information grows and user needs become more demanding, retrieval within a single language is no longer sufficient, and Cross-Language Information Retrieval (CLIR) is attracting increasing attention. A core problem of CLIR is how to overcome the barrier between languages. This paper proposes Onto-CLIR, a novel semantics-based CLIR model. Building on traditional information retrieval techniques, the model uses an ontology to describe the relevant domain knowledge in different languages, so that the semantic loss and distortion that arise when translating between the query language and the retrieval language can be resolved. In this way the model can follow the user's query intention and return the expected results. To validate the approach, we ran experiments that retrieve Chinese sports news from the Sina website using English queries. The results show that the ontology-based CLIR approach increases both recall and precision by more than 10 percent, which indicates that it is effective in improving retrieval performance.
  • WEI Yong-peng,CHEN Qun-xiu
    2004, 18(3): 10-17.
    Our Japanese-Chinese machine translation system has been ported from DOS to Windows. While enlarging its resources and developing it further, we found several remaining inconveniences: dictionary management is cumbersome, the development tools are complex to use, the translation interface is unfriendly, and log maintenance is deficient. To solve these problems, we designed a multi-language-oriented machine translation support environment subsystem. The subsystem manages dictionaries, integrates the development tools, controls translation, and maintains system logs. It is Unicode-compatible, so it can be developed into a multi-language and Internet-oriented system. The translation components and tools are invoked through DLL (Dynamic Link Library) technology, which preserves the ability to transplant them into other systems.
  • ZHOU Qian,ZHAO Ming-sheng,HU min
    2004, 18(3): 18-24.
    This paper introduces and compares eight feature selection methods for text categorization. Two of the eight are proposed here: Multi-Class Odds Ratio (MC-OR), a variant of the Odds Ratio that is often used in binary classification, and a new feature selection method based on Class-Discriminating Words (CDW). Combined with the classic VSM classifier based on cosine similarity and with the Naïve Bayes classifier, the methods are trained and tested on two text collections with different class distributions. The results indicate that MC-OR and CDW give the best feature selection performance.
  • CUI Huan,CAI Dong-feng,MIAO Xue-lei
    2004, 18(3): 25-32.
    A question answering (QA) system gives users a precise answer to a question posed in natural language. Currently, most QA systems extract answers from a large-scale corpus that serves as the knowledge base. The abundant resources of the web, however, provide another ideal knowledge source. Research shows that using the web as the information source can give good performance for simple, factoid questions. This paper presents an answer extraction method based on computing the sentence similarity between the question and each candidate answer sentence. We also developed a web-based Chinese QA system that uses only the "text snippets" returned by a web search engine as the data source for answer extraction. The experimental results indicate that the system performs relatively well on questions of the types PERSON, TIME and NUMBER; the MRR (mean reciprocal rank) over all questions is 0.51.
  • LI De-hua,LIU Gen-hui
    2004, 18(3): 33-39.
    One of the key issues in Natural Language Understanding (NLU) is which meaning of a polysemous word or an ambiguous sentence should be chosen. Dealing with this problem requires disambiguation not only of words but also of sentences and discourse. Earlier studies of context have been confined to modern linguistics and have not been applied to NLU. Our research aims to construct a theory of context-based NLU. Because context is one of the most important factors in pragmatics, we first discuss the meaning of context in modern linguistics, then put forward a formal definition of context for computational linguistics, and give examples of context-based Chinese language understanding. We conclude that a new branch of computational linguistics, computational pragmatics, will play an important role in the field.
  • LI Ying,CHI Yu-huan
    2004, 18(3): 40-47.
    Reflecting on studies in traditional semantics and classical philosophy, this paper presents a new view of antithesis. Its main points are that the law of the unity of opposites proposed by the philosopher Hegel cannot be applied directly to the processing of concepts, and that some re-categorization of antithesis is necessary. HNC Theory, founded by Prof. Huang Zeng-yang, proposes that antithesis be classified into two types: Hegel Antithesis and Non-Hegel Antithesis. The former includes four sub-types and the latter twelve. The paper illustrates the symbols that HNC Theory assigns to each type and their meanings, and discusses possible applications in Chinese information processing.
  • ZHANG Rui-xia,ZHANG Lei
    2004, 18(3): 48-54.
    This paper puts forward a model for Chinese baseNP analysis based on knowledge graphs. The model uses knowledge graphs for knowledge representation and HowNet as the semantic knowledge resource, and it follows a strategy that relies primarily on semantic information and secondarily on syntactic information. It first creates a word graph for every substantive in the Chinese baseNP and then merges the word graphs into a chunk graph, which captures both the structural and the semantic information of the baseNP. The model therefore analyses not only the syntactic structure of the Chinese baseNP but also the semantic relations among its structural components, and it expresses those relations as a knowledge graph. The experimental results given at the end show that the model is effective for Chinese baseNP analysis.
  • LIU Tao,YE Zhen-xing,CAI Lian-hong
    2004, 18(3): 55-61.
    Targeting handheld devices with small memory, we employ the k-medoids algorithm, with pitch contour as the feature, to reduce the size of the speech corpus of the current Chinese TTS system. Objective evaluation and statistical analysis show that the similarity of samples within a cluster and the dissimilarity of samples in different clusters can both be guaranteed. Based on an analysis of the probable units in a Mandarin TTS system, the new corpus is built from hybrid units, which combine semi-syllable units (initials and finals) with conventional syllable units. After the sample set of each kind of hybrid unit is reduced by the clustering algorithm, the embedded Chinese TTS system is implemented on the PDA platform.
  • LUO Jun,OU Zhi-jian,WANG Zuo-ying
    2004, 18(3): 62-66.
    Speaker adaptation, especially the widely used MAP and MLLR techniques, has received growing attention in recent speech recognition research. These techniques adjust the codebook quickly when only a limited amount of training data is available, and they require the original model to be speaker independent. This article introduces a Speaker Adaptive Training (SAT) method integrated with MLLR, which regards each speaker's codebook as a linear transformation of the speaker-independent codebook and trains the speaker-independent codebook on that basis. Because speaker-related information is extracted in this way, the trained codebook is more "speaker independent" and should therefore perform better in speaker adaptation.
  • CHEN Qiang,LV Jun-yang,XIA De-shen
    2004, 18(3): 67-73.
    The segmentation of handwritten Chinese amount strings directly affects recognition accuracy. This paper adopts a two-stage approach consisting of coarse and fine segmentation, which increases segmentation accuracy while keeping segmentation fast. For characters whose vertical projections overlap but which do not touch each other, we confine them in a window and separate them with a curve obtained by connecting the midpoints in sequence; this method is simpler and more accurate than other methods. For characters connected by only one stroke, we first find candidate segmentation points on the candidate stroke in the thinned character image and then choose the best point according to the concise rules presented in this paper, which increases the accuracy of the coarse segmentation. The method has been applied to the segmentation of amounts on Chinese bank checks with good results.
  • TAO Jian-hua,KANG Yong-guo
    2004, 18(3): 74-81.
    The traditional source-filter model has an obvious limitation in pitch modification for speech synthesis because it does not handle spectrum distortion. To solve this problem, the paper compares in detail the spectral features of the voice source across F0 ranges and timbres, and builds a Multi-Source (MS) based acoustic model for generating speech with various prosodies and timbres by classifying and reconstructing the voice source into different types. The model improves the quality of synthesized speech even under strong changes in speaking mood. It is important for future research on personalized and embedded speech synthesis systems.
  • CHEN Gao-peng,HU Yu,WANG Ren-hua
    2004, 18(3): 82-86.
    Supported by experiments and data analysis, this paper improves and implements the pitch target model, which treats the syllable as the basic linguistic unit. The F0 contour of a syllable is the result of the interaction between a hidden target and its environment. A usable model is built automatically by data mining. The paper proposes that the target of a syllable is independent of speaking rate but is affected by the linguistic environment. The actual pitch approximates the target under the influence of the preceding and following F0. The paper covers how to hypothesize a reasonable target, how to extract the parameters automatically, and how to train the model by machine learning. Prediction and resynthesis of complete utterances are achieved successfully. The test results show an RMSE of 22 Hz and a correlation of 0.72.
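
The last abstract above reports its result as an RMSE in Hz and a correlation. As a minimal sketch of how such figures are usually computed, assuming two equal-length lists of predicted and observed F0 values and assuming the correlation is Pearson's (the abstract does not say which), the following may help. The function names and sample values are hypothetical and are not taken from the paper.

```python
import math


def rmse(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(predicted, observed):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(predicted) != len(observed) or len(predicted) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_o = sum(observed) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(predicted, observed))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_p == 0 or var_o == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_o)


# Made-up F0 values in Hz, for illustration only.
predicted_f0 = [200.0, 210.0, 190.0, 220.0]
observed_f0 = [205.0, 205.0, 195.0, 215.0]

rmse_hz = rmse(predicted_f0, observed_f0)
correlation = pearson(predicted_f0, observed_f0)
print(f"RMSE = {rmse_hz:.2f} Hz, correlation = {correlation:.3f}")
```

RMSE is expressed in the same unit as the data (Hz here), while the correlation is unitless, which is why the two are commonly reported together for F0 prediction.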