2005 Volume 19 Issue 6 Published: 15 December 2005
  

  • Review
    HU Ri-le, ZONG Cheng-qing, XU Bo
    2005, 19(6): 3-8.
    In this paper, we propose a new approach that automatically acquires translation templates from unannotated bilingual spoken-language corpora. The approach is unsupervised, statistical, and data-driven. It relies on two basic algorithms: a grammar induction algorithm and an alignment algorithm based on Bracketing Transduction Grammar. First, semantic groups and phrasal structure groups are extracted from both the source language and the target language. Second, the alignment algorithm based on Bracketing Transduction Grammar aligns the phrasal structure groups. The aligned phrasal structure groups are then post-processed into translation templates. Preliminary experimental results show that the algorithm is effective and practical.
  • Review
    CAI Zang-tai, HUA Guan-jia
    2005, 19(6): 9-14.
    A machine translation system (MTS) is a typical natural language processing system, and language technology is its core technology. Applied MTSs commonly adopt rule-based translation over a restricted language as their main method. Drawing on research practice from an 863 project, the Banzhida Chinese-Tibetan document machine translation system, this paper discusses a principle that combines word information with syntax rules. It also proposes a verb-centered dichotomy method for syntactic analysis. Within the scope of the restricted language, the paper thus provides a useful method for creating highly adaptable machine translation rules and for effectively improving the efficiency of syntactic analysis in the MTS.
  • Review
    QIN Bing, LIU Ting, LI Sheng
    2005, 19(6): 15-22,58.
    Multi-document summarization is a natural language processing technology that extracts important information from multiple texts on the same topic according to a given compression ratio. It has become a new research focus as the amount of information on the Internet grows. This paper introduces the background of multi-document summarization, analyzes its relationship with other natural language processing technologies and the state of the art, and presents its key technologies and research methods. Finally, the future of multi-document summarization is forecast.
  • Review
    ZHAO Shi-qi, ZHANG Yu, LIU Ting, CHEN Yi-heng, HUANG Yong-guang, LI Sheng
    2005, 19(6): 23-29.
    Feature selection is one of the key problems in text categorization. The chief obstacles to feature selection are noise and sparseness. This paper presents a novel feature selection method based on class feature domains. First, we use the combined feature selection method [1] to remove noisy features from the original feature space and extract candidate features: low-frequency words are first removed with the Document Frequency method, and candidate features are then selected with the Mutual Information method. Then we construct a class feature domain for each class and overcome the sparseness of the training data by merging and strengthening the candidate features that appear in the class feature domains. Experiments show that our method outperforms several traditional feature selection methods and markedly improves the performance of text categorization systems.
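The candidate-extraction stage described above (a Document Frequency filter followed by Mutual Information ranking) might look roughly like the following sketch. It is not the authors' implementation: the function name `select_features`, the thresholds `min_df` and `top_k`, the use of pointwise mutual information maximized over classes, and the toy corpus are all assumptions made for illustration, and the class feature domain stage is not covered.

```python
import math
from collections import Counter, defaultdict

def select_features(docs, labels, min_df=2, top_k=3):
    """Hypothetical two-stage selection: drop rare terms by document
    frequency, then rank the survivors by their best per-class
    pointwise mutual information."""
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    # Stage 1: document frequency of every term; keep terms seen in >= min_df docs.
    df = Counter(t for s in doc_sets for t in s)
    candidates = [t for t, c in df.items() if c >= min_df]

    class_count = Counter(labels)
    joint = defaultdict(int)  # (term, class) -> number of docs containing both
    for s, y in zip(doc_sets, labels):
        for t in s:
            joint[(t, y)] += 1

    def mi(t, y):
        # log( P(t, y) / (P(t) * P(y)) ), estimated from document counts
        a = joint[(t, y)]
        if a == 0:
            return float("-inf")
        return math.log(a * n / (df[t] * class_count[y]))

    # Stage 2: score each candidate by its maximum MI over the classes.
    scores = {t: max(mi(t, y) for y in class_count) for t in candidates}
    ranked = sorted(candidates, key=lambda t: (-scores[t], t))
    return ranked[:top_k]

# Toy corpus (invented for illustration only).
docs = [["goal", "match", "team"], ["match", "team", "coach"],
        ["stock", "market", "team"], ["stock", "price", "market"]]
labels = ["sport", "sport", "finance", "finance"]
print(select_features(docs, labels))
```

In this toy corpus, "team" survives the frequency filter but ranks last because it occurs in both classes, which is the behaviour the Mutual Information stage is meant to provide.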
  • Review
    FU Jian-lian, CHEN Qun-xiu
    2005, 19(6): 30-37.
    With the development of networks, the volume of electronic text is growing rapidly. Because automatic abstracting is superior to manual abstracting in speed, convenience, efficiency, and objectivity, it has wide applications and has become a hot research topic. Topic partitioning is a significant problem in text structuring within automatic abstracting systems. This paper builds a paragraph-based vector space model of the whole article and then proposes an algorithm for multi-topic text partitioning based on the similarity of sequential paragraphs. The algorithm solves the problem of discourse structure analysis in multi-topic articles and gives abstracts of multi-topic texts more general content and a more balanced structure. A closed test shows that the precision of topic partitioning reaches 92.2% for multi-topic texts and 99.1% for single-topic texts.
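As a rough illustration of partitioning by sequential paragraph similarity, the sketch below represents each paragraph as a term-count vector and starts a new topic segment whenever the cosine similarity between adjacent paragraphs drops below a threshold. The abstract does not specify the weighting, the similarity measure, or the boundary rule, so the raw counts, the cosine measure, the threshold of 0.2, and the names `cosine` and `partition` are all assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def partition(paragraphs, threshold=0.2):
    """Group paragraph indices into topic segments; a boundary is placed
    wherever adjacent paragraphs are less similar than `threshold`."""
    if not paragraphs:
        return []
    vectors = [Counter(p.lower().split()) for p in paragraphs]
    segments = [[0]]
    for i in range(1, len(vectors)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            segments.append([i])    # similarity dropped: start a new topic
        else:
            segments[-1].append(i)  # same topic continues
    return segments

# Toy paragraphs (invented for illustration only).
paragraphs = ["the cat sat on the mat",
              "the cat chased the mouse",
              "stock prices fell sharply today",
              "stock markets fell again"]
print(partition(paragraphs))
```

For Chinese text, the whitespace split would need to be replaced by a word segmenter; that step is omitted here.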
  • Review
    WANG Jun
    2005, 19(6): 38-45.
    The application of thesauri in digital libraries is seriously constrained by the manual nature of current thesaurus maintenance, which cannot keep up with the rapid evolution of knowledge. This paper proposes a statistical method for extracting new terms from the titles in metadata records and placing them in the thesaurus. The placement is based on the subject indexing encoded in the metadata records. An experiment was conducted on the Chinese Classification and Thesaurus and a corpus of 5,000 bibliographic records in the computing domain. The successful result demonstrates that the proposed techniques are effective and can be applied to large corpora and to corpora in foreign languages.
  • Review
    ZOU Juan, ZHOU Jing-ye, DENG Cheng, GAO Nan-sha
    2005, 19(6): 46-51.
    This paper proposes a new method for handling synonyms in feature word extraction for text categorization. Taking full account of how synonyms differ across texts of different types, the method automatically calculates, during training, the membership degrees of feature words in their common synonymous concept, so that synonymous concepts can be defined with rough sets. The synonymous concepts are then used to extract feature values from texts. In addition, polysemy is handled using rough sets. The algorithms of the system are presented in the paper. Comparative tests show that the method effectively improves the accuracy of text categorization and that the system is more automatic and more portable.
  • Review
    LIU Dong-ming, YANG Er-hong, FANG Ying
    2005, 19(6): 52-58.
    Taking full advantage of the computability of concepts in HowNet, this paper recasts word sense tagging in Chinese-English parallel corpora as a similarity calculation between the concept combinations of the aligned sentences in the two languages. Dynamic programming is used to reduce the time complexity of the algorithm. Existing word sense tagging methods for parallel corpora use only the context of the single ambiguous word and the alignment information, whereas this method takes the context of all words in the aligned sentences into account together. It can therefore address the problem from the viewpoint of the whole sentence and achieve satisfactory results.
  • Review
    YOU Li-ping, FAN Kai-tai, LIU Kai-ying
    2005, 19(6): 59-65.
    The purpose of this review is to provide a reference for research on the semantic representation of Chinese sentences. First, we review the theoretical basis and representation methods of three currently popular models for the semantic representation of Chinese sentences: Word Dependency (WD), based on Dependency Grammar; Conceptual Dependency (CD), based on Conceptual Dependency Theory; and Kernel Dependency (KD), based on Frame Semantics. We then compare their features for semantic representation in more detail. The results are as follows. (1) WD is relatively easy to implement, but its capability is quite limited; CD is quite capable, but it is hard to implement. Both have serious problems. KD, however, refers to both words and concepts and could be the best suited to the semantic representation of Chinese sentences. (2) Much more comprehensive work remains to be done to realize the models, such as syntactic parsing, lexicography, and standardization.
  • Review
    LIU Yun-feng, QI Huan, Xiang’en Hu, Zhiqiang Cai
    2005, 19(6): 66-71.
    Since the first paper on Latent Semantic Analysis (LSA) was published, LSA has been applied in many fields, such as information retrieval, text classification, and automatic question answering. One important factor affecting the quality of LSA is the weighting scheme applied to the term-document matrix. In this paper, we first summarize the traditional, well-studied weighting methods, including local weighting and global weighting. We then point out some inadequacies of the original methods, modify them, and present the concept of global weighting of documents. In the last part of the paper, we conduct an experiment to compare the results of LSA under different types of weighting, using a new measure for evaluating LSA results that we call the self-indexing matrix. The experimental results confirm that the modified weighting method improves retrieval efficiency.
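To make the distinction between local and global weighting concrete, here is a minimal sketch of one traditional combination, log local weighting with entropy global weighting, applied to a small term-document count matrix. It illustrates only the conventional scheme that such surveys usually cover, not the modified method or the document-level global weighting proposed in the paper; the function name `log_entropy` and the toy matrix are assumptions.

```python
import math

def log_entropy(matrix):
    """Weight a term-document count matrix (rows = terms, columns = documents).

    Local weight:  log(1 + tf_ij)
    Global weight: 1 + sum_j(p_ij * log p_ij) / log(n_docs),
                   where p_ij = tf_ij / (total count of term i).
    """
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        gf = sum(row)  # total frequency of the term across all documents
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / gf
                entropy += p * math.log(p)
        # Terms spread evenly over all documents get a global weight near 0;
        # terms concentrated in one document get a global weight of 1.
        g = 1.0 + entropy / math.log(n_docs) if n_docs > 1 else 1.0
        weighted.append([g * math.log(1 + tf) for tf in row])
    return weighted

# Toy counts (invented): term 0 occurs once in every document,
# term 1 occurs four times in the first document only.
counts = [[1, 1, 1, 1],
          [4, 0, 0, 0]]
for row in log_entropy(counts):
    print([round(w, 3) for w in row])
```

In an LSA pipeline the weighted matrix would then be passed to a truncated singular value decomposition; that step is left out to keep the sketch dependency-free.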
  • Review
    YIN Xu-cheng, JIANG Shi-sheng, HAN Zhi, LIU Chang-ping
    2005, 19(6): 72-79.
    Financial document analysis and recognition has recently become a hot research topic, and form classification is one of its fundamental parts. In this article, we introduce a hierarchical method for classifying financial documents using a binary decision tree. First, forms are classified by elastic matching of the form structure shape. Second, OCR is performed on the document titles. Third, the document color is re-confirmed. As a result, the range of candidate document types becomes progressively tighter. Finally, the document type is decided by a linear combination of the first two classifiers. Financial document recognition systems based on this form classification method have been successfully and widely deployed.
  • Review
    HU Wei-xiang, DONG Hong-hui, TAO Jian-hua, HUANG Tai-yi
    2005, 19(6): 80-85.
    Constrained by the prosodic hierarchy and disturbed by tone and intonation, automatic stress detection in Chinese speech is a hard task. In this paper, aiming at automatic stress perception in normal Mandarin read speech, we study several acoustic measurements based on F0, duration, and intensity, and propose a novel model for calculating the stress of each syllable. Using a classification tree structure, the model effectively combines the constraints of tone context and prosodic hierarchy. The results show that the pitch top line, pitch range, and duration are important cues for stress perception. The model detects 80% of the accented syllables in the corpus.
  • Review
    FU Yue-wen, DU Li-min
    2005, 19(6): 86-93.
    Under a decoding strategy in which stack decoding rescores the word trellis to generate the final output, this paper uses a decision tree to combine multiple predictors and classify each recognized output word as correct or incorrect. Thirteen predictors are constructed, including word posterior probability, word length, and the word posterior probabilities of neighboring words. The optimal combination of predictors and the best decision tree for correct/incorrect classification of output words are found by testing different combinations of predictors and choosing appropriate tree parameters. The experimental results show that combining local word posterior probabilities (LWPPs) with some of the other predictors constructed in this paper, mainly word length and the LWPPs of neighboring words, significantly improves classification performance and gives better time consumption and quality than the corresponding results from an n-best list. Compared with the baseline system, the classification error rate improves by 41.4%. The results also show that the posterior probabilities of neighboring words proposed in this paper are among the relatively important predictors.
  • Review
    GAO Ding-guo, GONG Yu-chang
    2005, 19(6): 94-99.
    Arranging Tibetan components on a keyboard is a crucial step in designing a Tibetan input code. Because there are more Tibetan components than available keys, the better solution is to merge several components onto one key, but this produces duplicate codes. In this paper we use an optimal design method based on graph theory and probability to obtain the maximum independent sets of coding components and to reduce code duplication to the absolute minimum. The algorithm for finding the maximum independent sets of Tibetan coding components and the method for extracting conflicting coding components are presented. Then, following the principles of engineering psychology, we arrange 169 Tibetan characters on a standard keyboard. The resulting layout allows efficient input of Tibetan characters and also of Tibetanized Sanskrit.
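One way to read the graph formulation is that each coding component is a vertex, an edge joins two components that would produce duplicate codes if they shared a key, and a maximum independent set is then a largest group of components that can be merged safely. The sketch below finds such a set by exhaustive search. This reading, the function name `maximum_independent_set`, and the placeholder component names are assumptions; the paper's own algorithm and its probabilistic part are not reproduced here.

```python
from itertools import combinations

def maximum_independent_set(vertices, conflicts):
    """Return a largest subset of `vertices` containing no conflicting pair.

    Exhaustive search from the largest subsets down, so it is only
    practical for small graphs (the problem is NP-hard in general).
    """
    conflict_pairs = {frozenset(e) for e in conflicts}
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if all(frozenset(pair) not in conflict_pairs
                   for pair in combinations(subset, 2)):
                return list(subset)
    return []

# Placeholder components "a".."d" (not real Tibetan components); an edge
# means the two components must not be placed on the same key.
components = ["a", "b", "c", "d"]
conflicts = [("a", "b"), ("b", "c"), ("c", "d")]
print(maximum_independent_set(components, conflicts))
```

A full layout procedure would presumably repeat this step, removing each chosen set and assigning it to a key, but that loop is a guess and is not shown.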
  • Review
    LI Liang-yan, HE Zhong-shi, YI Yong
    2005, 19(6): 100-106.
    Literary language processing deserves due attention in current Natural Language Processing (NLP) research. Because poetry fully exhibits features of literary language such as vividness, sensibility, and individuality, it is an appropriate starting point. Stylistic analysis is thus an important and challenging task in literary language processing. This paper examines poetic language as its research object and, against the background of NLP techniques, proposes and justifies a poetry stylistic analysis technique based on term connection. Furthermore, the corresponding algorithm is proposed, and questionnaire surveys are used to evaluate poetry stylistics. Both theory and experiments confirm that commonality outweighs individuality in poetry stylistic analysis, and therefore that the technique based on term connection is valid for evaluating poetry stylistics.