2003 Volume 17 Issue 3 Published: 16 June 2003
  

  • MENG Yao,LI Sheng,ZHAO Tie-jun,CAO Hai-long
    2003, 17(3): 2-9.
    Choosing the statistical model is the key problem in statistical parsing, since the model lies at the core of an NLP parser. This paper investigates four primary statistical parsing models, namely PCFG, the history-based model, the cascaded parsing model, and the head-driven parsing model, and compares their performance on a 10,000-sentence Chinese treebank. An analysis based on the experimental results is presented. This comparative study can be exploited to build a practical and effective Chinese parser.
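As context for the models compared above: a PCFG scores a parse by the product of its rule probabilities, and the best parse can be found with the CKY algorithm. A minimal sketch over a toy grammar in Chomsky normal form (all rules, words, and probabilities are invented for illustration, not taken from the paper):

```python
from collections import defaultdict

# Toy PCFG in Chomsky normal form; rules and probabilities are invented.
# Binary rules: (parent, left, right) -> prob; lexical rules: (tag, word) -> prob.
binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
lexical = {("NP", "she"): 0.6, ("NP", "fish"): 0.4, ("V", "eats"): 1.0}

def cky(words):
    """Return the probability of the best parse rooted in S."""
    n = len(words)
    # best[(i, j, A)] = probability of the best A spanning words[i:j]
    best = defaultdict(float)
    for i, w in enumerate(words):
        for (tag, word), p in lexical.items():
            if word == w:
                best[(i, i + 1, tag)] = max(best[(i, i + 1, tag)], p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split point
                for (a, b, c), p in binary.items():
                    cand = p * best[(i, k, b)] * best[(k, j, c)]
                    if cand > best[(i, j, a)]:
                        best[(i, j, a)] = cand
    return best[(0, n, "S")]

prob = cky(["she", "eats", "fish"])  # 1.0 * 0.6 * (1.0 * 1.0 * 0.4) = 0.24
```

The other three models in the comparison extend this basic scheme with richer conditioning context (history, cascaded levels, or lexical heads).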
  • LUO Sheng-fen,SUN Mao-song
    2003, 17(3): 10-15.
    Word extraction is one of the important tasks in text information processing. A conventional scheme for word extraction is to estimate the soundness of a candidate character string being a word by the internal associative strength among the characters involved. In this paper, the authors first test the individual performance of nine widely adopted statistical measures of this kind in Chinese word extraction, then explore the possibility of improving performance by properly combining these measures. A genetic algorithm is employed to automatically adjust the combination weights. Experiments on two-character Chinese word extraction show that mutual information is the most powerful of these measures, achieving an F-measure of 54.77%, while the effect of combination is not significant, achieving an F-measure of only 55.47%. This suggests that these measures do not complement each other well, and that the simplest and most effective approach to Chinese word extraction is to use mutual information directly.
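For reference, the mutual information measure found most effective above can be sketched as pointwise mutual information between the two characters of a candidate string; the corpus string below is invented for illustration:

```python
import math
from collections import Counter

def mutual_information(text):
    """Score each two-character string in `text` by pointwise mutual
    information: PMI(xy) = log2( p(xy) / (p(x) * p(y)) ).
    Higher PMI suggests the pair is more strongly associated,
    i.e. more likely to be a word."""
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())
    scores = {}
    for bg, f in bigrams.items():
        p_xy = f / n_bigrams
        p_x = chars[bg[0]] / n_chars
        p_y = chars[bg[1]] / n_chars
        scores[bg] = math.log2(p_xy / (p_x * p_y))
    return scores

scores = mutual_information("我们我们他们我们")
# "我们" always co-occurs, so it scores higher than the chance
# pairing "们我" formed across word boundaries
```

In practice the counts come from a large corpus, and candidates above a tuned PMI threshold are proposed as words.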
  • CHEN Yu-zhong,LI Bao-li,YU Shi-wen
    2003, 17(3): 16-21,66.
    Word segmentation for Tibetan has not yet been well studied. This paper reports a Tibetan word segmentation system that we designed and implemented. Several aspects of the system are explained, including the system architecture, knowledge bases, segmentation strategy, and algorithms. In preliminary experiments, the system demonstrates high accuracy and domain independence.
  • SUN Xue-gang,CHEN Qun-xiu,MA Liang
    2003, 17(3): 22-27.
    With the ceaseless growth and rapid change of information on the Web, it has become difficult to manage vast amounts of electronic data through traditional manual methods. Web clustering can automatically classify documents and help us discover new information. Considering the complexity of Web documents, we propose a method of feature re-selection and document re-clustering that achieves good Web clustering performance.
  • ZHANG Min,MA Qing,MA Shao-ping
    2003, 17(3): 28-34.
    In this paper, we introduce a self-organizing Chinese semantic map, then study and propose six different feature encoding approaches, which are crucial to the performance of a SOM. The approaches are based on set theory, algebra, and probability theory respectively. We conclude from the evaluation results that the method combining the frequency density approach and the TFIDF approach performs best, with 94.4% precision and 90.7% recall on semantic mapping, and that vector-space-oriented approaches are not suitable for the task. Analyses of the results are also given. Comparative experiments show that the best approach in this paper outperforms the conventional hierarchical clustering technique, and is much better than multivariate statistical analyses such as principal component analysis for dimension-reduction-based feature encoding.
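The TFIDF component of the best-performing encoding can be sketched as follows; the toy documents are invented for illustration, and the paper's exact weighting and the frequency density component may differ:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Encode each document (a list of terms) as a term -> weight map,
    with weight = term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter()                    # in how many documents each term occurs
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: f * math.log(n / df[t]) for t, f in tf.items()})
    return vectors

vecs = tfidf_vectors([["语言", "模型"],
                      ["语言", "聚类"],
                      ["语言", "聚类", "聚类"]])
# a term appearing in every document gets weight 0 (log(N/N) = 0),
# so only discriminative terms survive into the SOM input encoding
```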
  • CHEN Bo-xing,DU Li-min
    2003, 17(3): 35-41.
    Machine translation units extracted from bilingual corpora can cover natural language text more fully. This paper describes an algorithm for obtaining machine translation units by learning, from two bilingual sentence pairs, the Similarity and Difference parts that are not all high-frequency function words, and aligning those parts using a translation lexicon and a dynamic programming approach. A bilingual chunk similarity score filter and a part-of-speech similarity score filter are then used to test whether the meaning and syntax of the source part of a machine translation unit correspond to its target part; finally, a begin-and-end stopword filter is applied to check whether the collocations of the machine translation units are correct. We obtain 86% precision and 61.34% recall. This algorithm provides a new practical approach to obtaining machine translation units.
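One simple way to picture extracting the shared "Similarity" part of two sentences is a longest common subsequence over tokens, computed by dynamic programming; this is only a rough stand-in for the paper's algorithm, and the example sentences are invented:

```python
def similarity_parts(a, b):
    """Longest common subsequence of two token lists: the shared tokens
    form the 'Similarity' part; everything outside it forms the
    'Difference' parts."""
    m, n = len(a), len(b)
    # L[i][j] = LCS length of a[:i] and b[:j]
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            L[i + 1][j + 1] = (L[i][j] + 1 if a[i] == b[j]
                               else max(L[i][j + 1], L[i + 1][j]))
    # backtrack to recover the shared tokens
    out, i, j = [], m, n
    while i and j:
        if a[i - 1] == b[j - 1] and L[i][j] == L[i - 1][j - 1] + 1:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

shared = similarity_parts("I like green tea".split(),
                          "I like black tea".split())
# → ['I', 'like', 'tea']; 'green'/'black' are the Difference parts
```

The paper additionally aligns the source- and target-side parts with a translation lexicon before applying its filters.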
  • CAO Jian-fen
    2003, 17(3): 42-47.
    Automatic prosody generation is essential in a Chinese TTS system. The main task is to segment the Chinese character sequence into proper speech units and organize them into a prosodic hierarchy. This goal can be achieved by inserting breaks of different strengths and assigning certain degrees of stress within a sentence. This paper discusses how to predict the location and strength of breaks and stresses based on syntactic and lexical information extracted from text analysis. Attention is paid to the following aspects: (1) a brief description of Chinese prosody; (2) text analysis; (3) predictive-tree building; (4) break index prediction; (5) stress index prediction.
  • LIU Yu-yu,WU Ji,WANG Zuo-ying
    2003, 17(3): 48-53.
    How well the HMM observation density describes the actual distribution has a great impact on recognition performance. To compare triphone models under different observation densities for Chinese, three models and their respective training and recognition algorithms are constructed. By comparing these three models from different aspects, a conclusion is drawn that can serve as a basis for the future selection of triphone observation densities for Chinese.
  • WU Bing-ya,ZHOU Chang-le,WU Jie-min
    2003, 17(3): 54-59.
    This paper derives the tonal template of Mandarin discourse by judging the emotional color and colloquial style of every word in the input discourse, using a kind of phraseological attribute grammar and the relevant combination algorithm. The speed and scale of the synthesized speech are adjusted according to the basic pitch and duration values of the syllables corresponding to the tonal template; the naturalness and fluency of the machine-synthesized speech are thereby enhanced, i.e., a quality improvement is realized.
  • ZHANG Xiao-heng
    2003, 17(3): 60-66.
    In Chinese character input, the form-based coding method is an indispensable complement to the Pinyin-based method. The former is preferable in cases where high-speed input is needed, where a large character set is required, where single-character words or words missing from normal dictionaries are abundant, and where unfamiliar or rarely used characters appear more frequently. This paper introduces ZYQ, a stroke-group-based Chinese character input method developed under the guidance of being Correct (with respect to the norms of language education and language application), Easy (with respect to user friendliness and convenience), and Complete (with respect to the Chinese character set covered). ZYQ has a key-selection rate of 16.4%, while the maximum and average code lengths are 5 and 4.315 respectively.