2005 Volume 19 Issue 5 Published: 15 October 2005
  

  • Review
    WANG Bin, PAN Wen-feng
    2005, 19(5): 3-12.
    The volume of junk email on the Internet has grown tremendously in the past few years and is causing serious problems. Content-based filtering is one of the mainstream technologies used so far. This paper provides an overview of the state of the art in this research field, including benchmark corpora, evaluation methods and filtering approaches. Many filtering approaches, including Ripper, Decision Trees, Rough Sets, Rocchio, Boosting, Bayes, kNN, SVM and Winnow, are discussed and compared. The experimental results show that some approaches, such as Boosting, Flexible Bayes, SVM and Winnow, can achieve very good results on research corpora. However, much more work is needed for practical use.
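As an illustration of one of the content-based approaches named in the abstract above, the following is a minimal Winnow-style sketch, not the implementation evaluated in the paper. The class name `WinnowFilter`, the whitespace tokenizer, the factor `alpha=2.0`, and the averaging of weights over the words in a message (so that one fixed threshold works for messages of any length) are all assumptions made for this example.

```python
from collections import defaultdict


class WinnowFilter:
    """Winnow-style spam filter over binary word features (illustrative)."""

    def __init__(self, alpha=2.0, threshold=1.0):
        self.alpha = alpha          # promotion/demotion factor (assumed value)
        self.threshold = threshold  # decision threshold on the mean weight
        self.weights = defaultdict(lambda: 1.0)  # every word starts at 1.0

    @staticmethod
    def features(text):
        # Hypothetical tokenizer: lower-cased, whitespace-separated words.
        return set(text.lower().split())

    def score(self, text):
        words = self.features(text)
        if not words:
            return 0.0
        # Mean weight of the words present in the message.
        return sum(self.weights[w] for w in words) / len(words)

    def is_spam(self, text):
        return self.score(text) > self.threshold

    def learn(self, text, spam):
        # Mistake-driven update: weights change only on a wrong prediction.
        if self.is_spam(text) == spam:
            return
        factor = self.alpha if spam else 1.0 / self.alpha
        for w in self.features(text):
            self.weights[w] *= factor  # multiplicative promotion or demotion


def train(filter_, labelled, epochs=5):
    """Run several passes over (text, is_spam) pairs."""
    for _ in range(epochs):
        for text, spam in labelled:
            filter_.learn(text, spam)
```

The multiplicative, mistake-driven update is what distinguishes Winnow from additive learners such as the perceptron; a practical filter would also need real tokenization and feature selection.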
  • Review
    DAI Liu-ling, HUANG He-yan, CHEN Zhao-xiong
    2005, 19(5): 13-17,25.
    This paper proposes an on-line incremental learning algorithm based on RBF SVMs for the text categorization problem. By exploiting the locality of RBF kernels, our algorithm updates the current SVM using a subset of possible support vector candidates, both in a certain neighborhood of the newly arriving document and in a possible band. The size of the subset is decided adaptively and efficiently by using a ξα generalization error estimator on all the available training samples to qualitatively estimate the generalization error rate. We also use an evolutionary factor of generalization ability to make the resulting SVMs adaptive in classification precision and to guarantee their generalization ability. Comparative experiments on the real-life TREC-5 corpus show that our algorithm can remarkably accelerate incremental learning while retaining classification precision.
  • Review
    GUO Li, ZHANG Ji, TAN Jian-long
    2005, 19(5): 18-25.
    We propose a text vector space model (VSM) based on suffix trees and implement a text categorization system on the model. With the support of the suffix tree, the model performs fast matching, obtains the vector representation of a text, and avoids complex computation such as word segmentation or feature extraction. In addition, the model guarantees that changes to the training set affect the classification result in real time. Experiments and analysis of the algorithm show that the time complexity of text preprocessing in our system is O(N), which is much better than that of word segmentation methods. Moreover, because word segmentation and feature extraction are avoided, the categorization process is independent of the specific language.
  • Review
    XUE Yong-zeng, YANG Mu-yun, ZHAO Tie-jun, HAN Xi-wu, QI Hao-liang
    2005, 19(5): 26-32.
    This paper presents a sports-domain-oriented method of sentence skeleton translation for the effective translation of sports texts, especially long sentences. The method uses templates to represent the skeleton of a sentence and comprises three procedures: skeleton parsing, template transfer and sentence skeleton generation. Templates are carefully designed and acquired according to the linguistic features of the sports domain. In generation, a hybrid strategy incorporates the full translation of phrases into the translation of the sentence skeleton, using rules and templates respectively. Translation functions are also induced to deal with inflection. Experimental results show that sentence skeleton translation is able to grasp the key information of a sentence and has better intelligibility than full translation, as well as satisfactory fidelity. Sentence skeleton translation is therefore an efficient method for sports domain translation.
  • Review
    ZHANG Yan, KASHIOKA Hideki
    2005, 19(5): 33-38,60.
    In this paper, we present a new approach to aligning Chinese-English sentences in parallel texts. The approach is mainly statistical, specifically length-based alignment, and simultaneously considers lexical information from a bilingual lexicon. A punctuation-based approach is used as post-processing for alignment. This extended approach not only avoids further complicated Chinese processing, such as segmentation and part-of-speech tagging, but also uses some Chinese key words in the statistical approach to improve the accuracy of sentence alignment. The bilingual corpus used in this paper is the LDC parallel text of Hong Kong newspapers. A dynamic programming algorithm is then used to perform the alignment. Compared with the length-based approach and the lexical approach, our approach improves alignment accuracy, and the experimental results are satisfactory.
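The length-based, dynamic-programming core mentioned in the abstract above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: it uses no bilingual lexicon and no punctuation post-processing, and the function names `mismatch` and `align`, the bead penalties, and the `ratio` parameter (expected target-to-source length ratio) are hypothetical choices for this example.

```python
def mismatch(len_src, len_tgt, ratio):
    """Relative length mismatch in [0, 1] after scaling the source length."""
    expected = len_src * ratio
    return abs(expected - len_tgt) / max(expected + len_tgt, 1)


def align(src, tgt, ratio=1.0):
    """Align two sentence lists with dynamic programming on lengths only.

    Returns a list of (source_indices, target_indices) beads.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    # (source sentences, target sentences, fixed penalty); assumed values.
    beads = [(1, 1, 0.0), (1, 0, 3.0), (0, 1, 3.0), (2, 1, 1.0), (1, 2, 1.0)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                a, b = i + di, j + dj
                if a > n or b > m:
                    continue
                len_src = sum(len(s) for s in src[i:a])
                len_tgt = sum(len(t) for t in tgt[j:b])
                c = cost[i][j] + penalty + mismatch(len_src, len_tgt, ratio)
                if c < cost[a][b]:
                    cost[a][b] = c
                    back[a][b] = (i, j)
    # Walk the back-pointers from the end to recover the best path.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

Lengths are measured in characters here; for Chinese-English text, `ratio` would need to be estimated from data, and lexical cues such as those described in the abstract would be added to the cost.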
  • Review
    YUAN Yu-lin
    2005, 19(5): 39-45.
    This paper demonstrates how to establish the matching relation between the event template of an information extraction (IE) system and the argument structure of the related verbs, based on an analysis of the succession texts in the test data of [2] for the IE system InfoX. It first divides the succession verbs into six classes (appoint, hold, remove, resign, dispatch, transfer) according to their syntactic and semantic features. It then describes the argument structure of these six classes of verbs, especially the thematic roles of the arguments and their syntactic arrangement. Finally, it establishes the matching relation between the elements of the succession event template and the argument roles of these six classes of succession verbs, reveals the orientation function of the related verbs in the screening and merging of texts, and illustrates the possibility of developing a verb-driven approach to IE.
  • Review
    SHI Hai-hu, XING Xuan-yu, LI Dong-mei
    2005, 19(5): 46-53.
    With the development of artificial intelligence, research on intelligent man-machine interview technology has become a hot topic, and knowledge representation is one of the most difficult problems in the man-machine interview domain. Among all types of knowledge representation, the framework representation has been widely used for its applicability and its support for summarizing, structuring and reasoning. Building on previous framework representations, we present an extended semantic framework representation method that can handle knowledge processing, common-sense reasoning and sentence construction in the special domain of basketball-themed interviews. This type of knowledge representation can satisfy the requirements of a restricted man-machine interview system.
  • Review
    HUA Sha-bao, Dabhurbayar
    2005, 19(5): 54-60.
    Boundary determination of Mongolian BaseNPs is an exploratory task based on POS-tagged Mongolian corpora. Determining the inner structure of a BaseNP is very helpful for BaseNP boundary recognition. The inner structure of a BaseNP can be analyzed based on different features, among which POS tagging information is the most important. Using POS tagging information as the core feature, together with other determinative conditions, we construct a rule set for Mongolian BaseNP recognition, which will be a necessary resource for BaseNP recognition.
  • Review
    RUI Jian-wu, WU Jian, SUN Yu-fang
    2005, 19(5): 61-68.
    Because Tibetan text is written from left to right and Tibetan consonants can also be stacked vertically, it is difficult to display a string of Tibetan text correctly and legibly. No operating system can fully support Tibetan at present. Based on the Tibetan character set defined in ISO/IEC 10646 and on Tibetan orthography, issues in implementing a Tibetan operating system are discussed. These involve the character set, encoding scheme, storage format, input and presentation of Tibetan text. Rendering Tibetan vertical stacks is considered the bottleneck. To address this issue, a Tibetan text engine based on OpenType font technology is presented, which receives Tibetan syllables and partitions them into vertical stacks. Tests with Qt show that it is feasible to implement a Tibetan operating system based on ISO/IEC 10646.
  • Review
    JIA Juan, CHEN Kun-qiu, ZHOU Dong-hao
    2005, 19(5): 69-77.
    Detecting the reading order of text layouts with regions excluded by images is a key problem in document image understanding (DIU) and text typesetting. Especially in Chinese and other oriental languages, text regions in which words wrap to the next line when they meet a graphic boundary make the reading order vary. A new layout model, which uses a new page object called PMRegion, is defined. Based on an ordered tree, an algorithm for reading order detection after top-down page decomposition for constructing layout objects is presented. The model and algorithm are shown to be effective in a special typesetting system and are also helpful for further work on DIU.
  • Review
    JIN Jian-ming, DING Xiao-qing, PENG Liang-rui, WANG Hua
    2005, 19(5): 78-85.
    Uyghur, spoken in the Xinjiang Uyghur Autonomous Region of China, is written in Arabic script. Because the script is cursive and has other distinctive characteristics, text segmentation and recognition are very difficult. In this paper, a method that combines horizontal projection with connected component analysis, based on connected component classification, is proposed for text line segmentation and word segmentation of Uyghur texts. The baseline position of each word is then estimated. All candidate character segmentation points are found by calculating the distance between the word contour and the baseline. Finally, over-segmented characters are merged according to rules. Experiments show that the character segmentation accuracy reaches 99%.
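The horizontal projection step named in the abstract above can be illustrated with a short sketch. It covers only projection-based line splitting on a binary image, not the connected component analysis or baseline estimation the paper combines it with; the function name `segment_lines`, the list-of-lists image format (1 for ink, 0 for background), and the `min_ink` parameter are assumptions for this example.

```python
def segment_lines(image, min_ink=1):
    """Split a binary image into text lines by horizontal projection.

    image: list of rows, each a list of 0/1 pixels (1 means ink).
    Returns (start_row, end_row) pairs with end_row exclusive.
    """
    # Horizontal projection profile: amount of ink in each pixel row.
    profile = [sum(row) for row in image]
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                  # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y))   # leaving a text line
            start = None
    if start is not None:
        lines.append((start, len(profile)))  # line runs to the bottom edge
    return lines
```

Projection alone fails when neighbouring lines touch or overlap vertically, which is one reason to combine it with connected component analysis as the abstract describes.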
  • Review
    YAO Yan-dong, WU Jian, SUN Yu-fang, Husele
    2005, 19(5): 86-91.
    When Mongolian script is processed on computers based on the ISO/IEC 10646 and Unicode standards, a bottleneck arises because the presentation variants of characters have no definite code points. This is why many software systems for processing Mongolian script have been developed redundantly and are incompatible with each other. This paper describes methods of implementing Mongolian script processing in compliance with the Unicode standard. First, we analyze the characteristics of Mongolian script and point out the difficulties of processing it. Then OpenType font technology, which can satisfy the requirements of Mongolian language processing, is introduced. Finally, we describe the workflow of the Mongolian Layout Engine.
  • Review
    LI Pei-feng, ZHU Qiao-ming, QIAN Pei-de
    2005, 19(5): 92-98.
    Chinese characters are represented on computers by various code pages, and this situation has existed for a long time. In order to use all kinds of Chinese code pages, including GB2312, GBK, GB18030, BIG-5, HKSCS and ISO 10646/Unicode, at the same time, automatic recognition of Chinese code pages is required. The Chinese screen real-time paraphrase engine is the key technology for building many kinds of online dictionaries, teaching software and so on. This paper describes the system architecture of the Chinese screen real-time paraphrase engine, which is based on automatic recognition of Chinese code pages and automatic capture of words from the screen. It also describes the design of the data dictionary and the key technology of the engine. In a sample online dictionary that uses this engine, the code page recognition rate for short strings reaches 99% on test documents containing about five million Chinese characters.
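A first filtering stage for code page recognition, as discussed in the abstract above, can be sketched with Python's standard codecs. This is only an assumption about how such a stage might look, not the paper's method: it lists the encodings under which a byte string decodes without error, and the function name `candidate_encodings` and the candidate list are hypothetical. Many byte strings are valid in several Chinese code pages at once, so a real recognizer would still need a statistical ranking step on top of this.

```python
# Python codec names for the code pages listed in the abstract (plus UTF-8).
CANDIDATES = ("utf-8", "gb2312", "gbk", "gb18030", "big5", "big5hkscs")


def candidate_encodings(data, candidates=CANDIDATES):
    """Return the encodings under which `data` decodes without error."""
    valid = []
    for name in candidates:
        try:
            data.decode(name)  # strict decoding raises on invalid bytes
        except UnicodeDecodeError:
            continue
        valid.append(name)
    return valid


if __name__ == "__main__":
    sample = "\u4e2d\u6587".encode("gbk")  # the two characters of "Chinese"
    print(candidate_encodings(sample))
```

Strict decoding cheaply rules out impossible code pages (for example, GBK bytes are usually not valid UTF-8), which narrows the set a frequency-based model has to rank.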
  • Review
    REN Jun-ling, GUO Jun
    2005, 19(5): 99-106.
    HCL2000 is one of the most influential databases of handwritten Chinese characters. To support research on the natural features of handwritten Chinese characters, the files of the database are organized by writer. However, this form of file organization is not always the most effective for other research, such as research on pattern selection. For this reason, a new model of character database is developed. Based on the new model and HCL2000, a newly edited version of HCL2000, named HCL2004, is developed by reorganizing and revising the samples. Two experiments are then carried out. The first focuses on the effect of the number of training samples and shows the relation between the number of training samples and system performance. For 3755 character classes, 300 training samples per character are needed to achieve optimal system performance. The second experiment concerns the selection of training and testing samples.