2007 Volume 21 Issue 3 Published: 15 June 2007
  

  • Review
    ZHANG Bo
    2007, 21(3): 3-7.
    In this paper we discuss computational models of natural language processing. Several kinds of models have been proposed, including analytical, statistical, and hybrid models, each with its own characteristics and limitations. Treating natural language processing as an ill-posed problem, we discuss where its essential hardness lies, what challenges we confront, and what measures can be adopted to overcome these difficulties.
  • Review
    HUANG Chang-ning, ZHAO Hai
    2007, 21(3): 8-19.
    During the last decade, especially since the First International Chinese Word Segmentation Bakeoff held in July 2003, the study of automatic Chinese word segmentation has greatly advanced. The improvements can be summarized as follows: (1) in a computational sense, Chinese words in real text are now well defined by "segmentation guidelines + lexicon + segmented corpus"; (2) practical results show that statistical segmentation systems outperform handcrafted rule-based systems; (3) evaluation on the Bakeoff data shows that the accuracy drop caused by out-of-vocabulary (OOV) words is at least five times greater than that caused by segmentation ambiguities; (4) the better the OOV recognition, the higher the overall accuracy of the segmentation system, and statistical systems using character-based tagging outperform any word-based system.
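    The character-based tagging view mentioned in point (4) can be pictured with a minimal sketch (an illustration, not the systems evaluated in the paper): segmentation is recast as labeling every character with a B/M/E/S tag, so any sequence labeler can be applied and OOV words fall out of the tag sequence naturally.

```python
# Minimal sketch of character-based tagging for segmentation:
# each character gets one of B (begin), M (middle), E (end), S (single).

def words_to_tags(words):
    """Convert a segmented sentence (list of words) into per-character BMES tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def tags_to_words(chars, tags):
    """Recover words from characters and their predicted BMES tags."""
    words, buf = [], ""
    for ch, t in zip(chars, tags):
        buf += ch
        if t in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:                      # tolerate a sequence ending in B/M
        words.append(buf)
    return words

words = ["我们", "在", "北京大学", "学习"]
tags = words_to_tags(words)                     # ['B','E','S','B','M','M','E','B','E']
print(tags_to_words("".join(words), tags))      # ['我们', '在', '北京大学', '学习']
```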
  • Review
    2007, 21(3): 21-27.
    Chunk parsing is an important technique in natural language processing, and its basis is a suitable and efficient chunking scheme. In this paper, we propose a new topology-based base chunk scheme for Chinese. By introducing lexical cohesion relationships to determine three basic topological structures, we form a better set of principles for analyzing the content cohesion of a base chunk and build an efficient bridge between its syntactic form and semantic meaning. Based on this chunk scheme, we can greatly simplify the procedure for automatically extracting base-chunk-annotated corpora and the corresponding lexical cohesion knowledge from TCT, a large-scale Chinese syntactically annotated corpus. This work lays a good foundation for further development of Chinese base chunk parsers and lexical cohesion knowledge acquisition tools.
  • Review
    FU Lei, LIU Qun
    2007, 21(3): 28-33.
    Recently, discriminative re-ranking has been applied to many NLP (Natural Language Processing) tasks, such as parsing, POS tagging, and machine translation, and performs very well. We take SMT as an example to explain in detail how to re-rank translation candidates using the simplex algorithm, and report results on the NIST 2002 (development) and NIST 2005 (test) Chinese-to-English sets. Our experiments show that re-ranking yields significant improvements in BLEU: a 1.26% absolute increase on the development set and a 1.16% absolute increase on the test set.
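    The re-ranking idea can be illustrated roughly as follows (a sketch with invented feature names and weights, not the authors' system): each candidate translation carries a feature vector, a linear model scores it, and the weights would be tuned, for instance with the simplex method, to maximize BLEU on the development set.

```python
# Sketch of n-best re-ranking: pick the candidate with the highest weighted
# feature score.  Feature names and weights are illustrative only; in the paper
# the weights would be optimized (e.g. by the simplex method) against BLEU.

def rerank(nbest, weights):
    """nbest: list of (translation, feature_dict); return the best translation."""
    def score(feats):
        return sum(weights.get(name, 0.0) * value for name, value in feats.items())
    return max(nbest, key=lambda cand: score(cand[1]))[0]

nbest = [
    ("he goes to school",   {"lm": -12.3, "tm": -4.1, "length": 4}),
    ("he go to the school", {"lm": -15.0, "tm": -3.8, "length": 5}),
]
weights = {"lm": 1.0, "tm": 0.7, "length": 0.1}   # hypothetical tuned weights
print(rerank(nbest, weights))                     # -> "he goes to school"
```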
  • Review
    ZHANG Gui-ping, YAO Tian-shun, YIN Bao-sheng, CAI Dong-feng, SONG Yan
    2007, 21(3): 34-39.
    A bilingual corpus is one of the most important parts of a translation memory system. Extracting more association examples that meet users' current needs from a bilingual corpus of limited scale is the main focus of translation memory research. This paper first analyzes the limits of current example search methods. Based on the knowledge representation of the bilingual corpus, it then proposes a multi-strategy association example extraction mechanism, which combines tree matching, sentence edit-distance calculation, phrase chunk matching, lexical semantic generalization, and optimization based on extended information (e.g., sentence source, the domain it belongs to, and usage frequency). Experimental results indicate that the method effectively improves the recall and quality of association examples and their usefulness to users.
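    Sentence edit-distance calculation, one of the strategies listed above, can be sketched as word-level Levenshtein distance (an illustrative sketch, not the authors' implementation):

```python
# Word-level Levenshtein distance, usable for ranking translation-memory
# examples by how closely they match the input sentence.

def edit_distance(a, b):
    """a, b: lists of tokens; returns the minimum number of edits."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            cur = min(dp[j] + 1,            # deletion
                      dp[j - 1] + 1,        # insertion
                      prev + (x != y))      # substitution (free if tokens match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

print(edit_distance("我 想 订 一 张 机票".split(), "我 要 订 两 张 机票".split()))  # 2
```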
  • Review
    WANG Jin, CHEN Qun-xiu
    2007, 21(3): 40-46.
    In this paper we present a method for grouping adjectives according to their corpus distribution, based on the Machine Tractable Dictionary of Contemporary Chinese Predicate Adjectives. We describe how our system extracts three kinds of information for each adjective (modified nouns, synonyms, and antonyms) and exploits this knowledge to compute a similarity measure between two adjectives with the help of literal similarity and the route weight from one adjective to another, which to some extent alleviates the data sparseness problem. We also show how a clustering algorithm can use these similarities to produce groups of adjectives, and we present the results produced by our system for a sample set of adjectives.
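    One way to picture the similarity computation (a hypothetical sketch with invented weights; the paper's actual measure also uses literal similarity and route weights) is to compare the noun, synonym, and antonym sets gathered for each adjective:

```python
# Sketch: similarity between two adjectives from the overlap of the nouns they
# modify plus their synonym/antonym lists (Jaccard overlap as a stand-in for
# the paper's combined measure; the 0.6/0.3/0.1 weights are made up).

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def adjective_similarity(a, b):
    """a, b: dicts with 'nouns', 'synonyms', 'antonyms' sets."""
    sim = 0.6 * jaccard(a["nouns"], b["nouns"]) \
        + 0.3 * jaccard(a["synonyms"], b["synonyms"]) \
        + 0.1 * jaccard(a["antonyms"], b["antonyms"])   # shared antonyms also hint at closeness
    return sim

big  = {"nouns": {"房子", "城市"}, "synonyms": {"巨大"}, "antonyms": {"小"}}
huge = {"nouns": {"城市", "工程"}, "synonyms": {"大"},   "antonyms": {"小"}}
print(round(adjective_similarity(big, huge), 3))        # 0.3
```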
  • Review
    HONG Yu, ZHANG Yu, LIU Ting, ZHENG Wei, GONG Cheng, LI Sheng
    2007, 21(3): 47-53.
    This paper adopts an adaptive learning algorithm based on hierarchical clustering to update the user profile. The algorithm continuously abstracts the centroids of an optimal class of information from the system's feedback stream, which effectively shields the learning process from the large amount of feedback noise produced by distorted thresholds and the sparseness of initial information, and approximately imitates manual feedback so as to improve the intelligence of the adaptive learning mechanism.
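    The profile-update idea might be pictured as follows (a rough sketch with a simple greedy clustering and an invented threshold, not the paper's algorithm): cluster the feedback vectors, then take the centroid of the largest cluster as the new profile direction.

```python
# Rough sketch of centroid-based profile updating from feedback vectors.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = math.sqrt(sum(a * a for a in u)), math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def update_profile(feedback, threshold=0.8):
    """Greedy single-link clustering, then centroid of the largest cluster."""
    clusters = []
    for v in feedback:
        for c in clusters:
            if any(cosine(v, m) >= threshold for m in c):
                c.append(v)
                break
        else:
            clusters.append([v])
    return centroid(max(clusters, key=len))

feedback = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [0.0, 0.1, 1.0]]
print(update_profile(feedback))     # the noisy third vector is ignored
```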
  • Review
    ZHENG Hai-qing, LIN Chen, NIU Jun-yu
    2007, 21(3): 54-60.
    Automatic text categorization has become a very important research area. In many applications, the training set contains only a positive document set of limited size and a large portion of unlabeled data, and the numbers of positive and negative documents are also unbalanced. This kind of text categorization task therefore differs from traditional ones, whose training sets contain both labeled positive and labeled negative samples, and traditional classification methods cannot be applied directly. This paper proposes a closeness-based method for this semi-supervised text categorization problem. It first extracts a reliable negative set from the unlabeled set, and then uses a closeness-based algorithm to enlarge the initially extracted reliable negative set to a proper size. A classifier is then constructed from the labeled positive set and the extracted negative set. The method improves classifier performance without any outside resources for feature selection, so it can be applied to many semi-supervised text categorization tasks in different domains. Experiments on the TREC 2005 Genomics track data show that the algorithm performs well on this kind of task.
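    The first stage of this positive/unlabeled setup can be sketched as below (an illustrative centroid-closeness criterion with an invented cutoff, not the exact extraction rule from the paper): unlabeled documents far from the positive centroid are taken as reliable negatives, and the enlargement and classifier-training steps would follow.

```python
# Sketch of reliable-negative extraction for PU (positive/unlabeled) learning.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)); nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vecs):
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def extract_reliable_negatives(positives, unlabeled, cutoff=0.2):
    """Unlabeled docs whose similarity to the positive centroid falls below
    `cutoff` (hypothetical threshold) are treated as reliable negatives."""
    c = centroid(positives)
    return [u for u in unlabeled if cosine(u, c) < cutoff]

P = [[1.0, 0.2, 0.0], [0.9, 0.3, 0.1]]
U = [[0.8, 0.2, 0.0], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]]
print(extract_reliable_negatives(P, U))   # keeps the two dissimilar vectors
```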
  • Review
    CHEN Zhi-xiong, CHEN Jian, MIN Hua-qing
    2007, 21(3): 61-68.
    Associative classification, which uses association rules mined from the training set to predict the class label of a new data object, has recently been reported to achieve higher accuracy than traditional classification approaches such as C4.5. Existing methods based on the support-confidence framework select only frequent literals to construct classification rules, ignoring the literals' classificatory effect. In this paper, a novel associative classification algorithm named ACIG is proposed, which integrates information gain and FoilGain when selecting rule literals from Chinese text, in order to improve the quality of the literals. Our experimental results show that ACIG outperforms another associative classification approach (CPAR) in accuracy.
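    Information gain for a candidate literal (a word's presence or absence) can be computed as below; this is a generic sketch of the score, and it does not reproduce how ACIG combines it with FoilGain.

```python
# Information gain of a binary literal with respect to the class label --
# the kind of score used to prefer discriminative literals over merely frequent ones.
import math

def entropy(labels):
    if not labels:
        return 0.0
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def information_gain(has_word, labels):
    """has_word[i]: whether document i contains the literal; labels[i]: its class."""
    n = len(labels)
    with_w  = [l for h, l in zip(has_word, labels) if h]
    without = [l for h, l in zip(has_word, labels) if not h]
    cond = (len(with_w) / n) * entropy(with_w) + (len(without) / n) * entropy(without)
    return entropy(labels) - cond

labels   = ["sports", "sports", "finance", "finance"]
has_word = [True, True, False, False]          # the literal appears only in sports docs
print(information_gain(has_word, labels))      # 1.0 bit: perfectly discriminative
```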
  • Review
    SHI Jing, HU Ming, DAI Guo-zhong
    2007, 21(3): 69-75.
    This paper performs topic spotting of segments based on text segmentation using small-world structure. The main topic of the whole text is generalized and the skeleton of the text emerges. We show that the term co-occurrence graph of a text is highly clustered and has a short path length, which indicates that texts exhibit small-world structure. Clusters in the small-world structure are detected, and the density of each cluster is computed to find segment boundaries. Topic words are extracted from the clusters of the graph; with the help of background word clustering and topic word association, words that do not appear explicitly in the analyzed text can also be included to express the topics, so that the meaning behind the words can be uncovered. Although small-world structure has been applied in much research, analyzing texts through small-world characteristics is a new task. Experiments show that the results are considerably better than those of other methods, providing valuable pre-processing for subsequent text reasoning.
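    The small-world claim rests on two graph statistics: a high clustering coefficient and a short average path length. A minimal sketch of the clustering coefficient on a term co-occurrence graph (illustrative toy graph, not the paper's data):

```python
# Clustering coefficient of a term co-occurrence graph: for each node, the
# fraction of its neighbour pairs that are themselves connected.  High values
# together with short path lengths are the "small world" signature.

def clustering_coefficient(graph):
    """graph: dict node -> set of neighbours (undirected)."""
    coeffs = []
    for node, nbrs in graph.items():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in graph[u])
        coeffs.append(2 * links / (k * (k - 1)))
    return sum(coeffs) / len(coeffs) if coeffs else 0.0

cooccur = {
    "market": {"stock", "price", "trade"},
    "stock":  {"market", "price"},
    "price":  {"market", "stock", "trade"},
    "trade":  {"market", "price"},
}
print(clustering_coefficient(cooccur))   # ~0.83: tightly clustered
```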
  • Review
    WANG Yu, WANG Zheng-ou, BAI Shi
    2007, 21(3): 76-82.
    Nearest neighbor classification assumes locally constant class-conditional probabilities, an assumption that becomes invalid in high-dimensional feature spaces. When a KNN classifier is used in such spaces, severe bias can be introduced unless the feature weights are amended. In this paper, initial weights of text features are first acquired by a sensitivity method and a second dimension reduction is performed. The training samples are then divided into groups according to sample similarity and the initial weights using an SS-tree, and k0 approximate nearest neighbors of the unknown sample are retrieved from the SS-tree. The weights are recomputed from these k0 approximate nearest neighbors using chi-square distance, and the k nearest neighbors are finally selected with the new weights. The method takes little time yet achieves better text categorization accuracy.
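    The final weighted k-NN step can be pictured roughly as follows (a sketch of a chi-square-style weighted distance with invented weights, not the paper's exact formula or the SS-tree search):

```python
# Weighted chi-square-style distance between two term-frequency vectors,
# followed by a plain k-NN vote.  The weights stand in for the feature weights
# the paper re-estimates from the k0 approximate neighbours.

def chi2_distance(x, y, w):
    return sum(wi * (xi - yi) ** 2 / (xi + yi)
               for xi, yi, wi in zip(x, y, w) if xi + yi > 0)

def knn_classify(query, samples, weights, k=3):
    """samples: list of (vector, label); returns the majority label of the k nearest."""
    ranked = sorted(samples, key=lambda s: chi2_distance(query, s[0], weights))[:k]
    labels = [lab for _, lab in ranked]
    return max(set(labels), key=labels.count)

samples = [([3, 0, 1], "sports"), ([2, 1, 0], "sports"), ([0, 4, 2], "finance")]
weights = [1.0, 0.5, 0.8]            # hypothetical per-feature weights
print(knn_classify([3, 1, 1], samples, weights, k=1))   # -> "sports"
```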
  • Review
    XIA Yun-qing, Kam-Fai Wong, ZHANG Pu
    2007, 21(3): 83-91.
    Network chat language has become ubiquitous, due largely to the rapid proliferation of Internet applications. Online chat now plays an important role in human communication, which in turn makes chat language popular. Processing network chat language is important but difficult; the challenges mainly come from the anomalous and dynamic nature of this new text genre. Two distinctive features of Chinese network chat language are investigated and analyzed in this paper, and methods to address them are proposed. We first develop a source-channel model to convert chat language into standard language. Unfortunately, this method relies too heavily on a chat language corpus, leaving it poor at handling the dynamic nature of the genre. We therefore introduce a phonetic mapping model, constructed from a standard language corpus, into the source-channel model. Our experiments show that the extended method effectively addresses the dynamic issue.
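    The source-channel formulation picks the standard term S maximizing P(S) P(C|S) for an observed chat term C; the phonetic extension estimates the channel through pinyin correspondences. A minimal sketch with made-up probability tables (not the paper's estimates):

```python
# Source-channel view of chat-term normalization: choose the standard word S
# maximizing P(S) * P(C | S).  The numbers below are toy values; the paper's
# extension estimates the channel through phonetic (pinyin) mappings.

def normalize(chat_word, lm_prob, channel_prob):
    candidates = [s for (c, s) in channel_prob if c == chat_word]
    return max(candidates,
               key=lambda s: lm_prob.get(s, 1e-9) * channel_prob[(chat_word, s)])

lm_prob      = {"版主": 0.004, "斑竹": 0.0001}                 # P(S) from a standard corpus
channel_prob = {("斑竹", "版主"): 0.6, ("斑竹", "斑竹"): 0.4}   # P(C|S), e.g. via shared pinyin
print(normalize("斑竹", lm_prob, channel_prob))                # -> 版主 ("moderator")
```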
  • Review
    LI Bin, CHEN Xiao-he
    2007, 21(3): 92-98.
    Word segmentation (WS) is a fundamental task in Chinese information processing. To overcome the difficulties traditional methods face when processing texts in restricted domains, a novel method is proposed. It requires no lexicon or training corpus and can adapt to various texts and different WS standards. It lets the user take part in the WS procedure and add language knowledge to the system. Using an optimized suffix array algorithm, word candidates are recursively extracted from the text and then judged and edited by the user; a lexicon of the text is thus obtained and applied to segment the text. Experiments on four different texts show that without the user's judgement the F-score of the system reaches 72%, and that it can be raised by 12% with a modest amount of work by the user. As the user's workload increases, the system achieves better results.
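    The candidate extraction step can be pictured as finding repeated substrings; the following is a naive stand-in for the optimized suffix-array algorithm the paper uses (illustration only, and far less efficient):

```python
# Naive repeated-substring extraction: substrings that recur often in the text
# become word candidates for the user to confirm or reject.
from collections import Counter

def word_candidates(text, max_len=4, min_freq=3):
    counts = Counter(text[i:i + n]
                     for n in range(2, max_len + 1)
                     for i in range(len(text) - n + 1))
    return [(s, c) for s, c in counts.most_common() if c >= min_freq]

text = "计算机科学是研究计算机的科学，计算机科学发展很快。"
print(word_candidates(text))   # "计算机", "科学", ... surface as candidates
```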
  • Review
    LI Feng, LI Fang
    2007, 21(3): 99-105.
    A basic approach for measuring the semantic similarity/distance between words and concepts is to use a lexical taxonomy such as WordNet. HowNet is a Chinese semantic dictionary containing abundant semantic information and ontological knowledge, but with a quite different construction and architecture. In this paper, we present a new approach that uses HowNet and draws on ideas from information theory. We propose that the more semantic information a sememe carries, the more powerful it is in describing concepts. We then divide the sememes describing a concept into two sets: a directly describing part and an indirectly describing part. Our experiments demonstrate that this method improves the performance of measuring semantic similarity between Chinese words.
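    The intuition that a sememe carrying more information should weigh more in the similarity can be sketched with an information-content weight (a generic IC formulation with invented frequencies, not necessarily the paper's exact formula):

```python
# Information-content weighting of sememes: rarer sememes (across HowNet
# concept definitions) carry more information and so contribute more to
# similarity.  Frequencies below are invented for illustration.
import math

sememe_freq = {"human|人": 900, "occupation|职位": 120, "study|学": 60}
total = sum(sememe_freq.values())

def ic(sememe):
    return -math.log(sememe_freq[sememe] / total)

def concept_similarity(sememes_a, sememes_b):
    shared = set(sememes_a) & set(sememes_b)
    union  = set(sememes_a) | set(sememes_b)
    return sum(ic(s) for s in shared) / sum(ic(s) for s in union)

teacher = ["human|人", "occupation|职位", "study|学"]
student = ["human|人", "study|学"]
print(round(concept_similarity(teacher, student), 3))   # ~0.58
```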
  • Review
    REN Jun-ling
    2007, 21(3): 106-110.
    In the training process, some patterns are indispensable because they describe the characteristics of the class, while other patterns are dispensable; sometimes the system performance even worsens with them. It is therefore necessary to select training patterns and find a more representative pattern subset. In this paper, a definition of boundary patterns based on generalized confidence is given, and a new pattern selection algorithm is built on this definition. Experiments on the offline handwritten Chinese character database HCL2004 show that the pattern subset selected by the algorithm contains fewer patterns than the original set while the system performance based on the subset improves, which validates the definition and the algorithm.
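    A rough sketch of keeping only boundary (low-confidence) training patterns, with a simple margin between class centroids standing in for the paper's generalized confidence:

```python
# Pattern selection sketch: keep training samples whose confidence (margin
# between the two closest class centroids) is small, i.e. the boundary
# patterns; discard the easy ones.  The margin here is a stand-in for the
# generalized confidence defined in the paper.
import math

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_boundary_patterns(samples, centroids, margin=0.5):
    """samples: list of (vector, label); centroids: dict label -> vector."""
    kept = []
    for x, y in samples:
        d = sorted(dist(x, c) for c in centroids.values())
        confidence = d[1] - d[0]           # small margin => near the boundary
        if confidence < margin:
            kept.append((x, y))
    return kept

centroids = {"A": [0.0, 0.0], "B": [2.0, 0.0]}
samples   = [([0.1, 0.0], "A"), ([1.1, 0.0], "A"), ([1.9, 0.1], "B")]
print(select_boundary_patterns(samples, centroids))   # only the ambiguous sample remains
```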
  • Review
    JIE Meng-en, WU Jian, JIA Yan-min, LV Yuan-hua
    2007, 21(3): 111-116.
    Non-BMP characters in the Unicode Standard are mostly used in the study of ancient books (e.g., CJK Ext-B) or in representing minority scripts (e.g., Tibetan Ext-B). Their users are thus rare, and much software, including office software, fails to support them. Taking OpenOffice.org as the basis, this paper first analyzes its current state of support for non-BMP characters. Several key questions that must be considered when supporting non-BMP characters are then discussed, and reasonable solutions to each are provided. Finally, some CJK and Tibetan samples are given to show the effect of the enhanced OpenOffice.org.
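    Whether a character lies outside the BMP, and therefore needs a UTF-16 surrogate pair (the case much software mishandles), is a one-line check; a small sketch:

```python
# Non-BMP check: code points above U+FFFF (e.g. CJK Ext-B) need two UTF-16
# code units, which is exactly where naive UTF-16 based software breaks.

def is_non_bmp(ch):
    return ord(ch) > 0xFFFF

for ch in ["中", "\U00020000"]:          # U+20000 is the first CJK Ext-B character
    units = len(ch.encode("utf-16-le")) // 2
    print(f"U+{ord(ch):04X}  non-BMP={is_non_bmp(ch)}  UTF-16 code units={units}")
```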
  • Review
    LIU Wei, ZHU Ning-bo, HE Hao-zhi, LI De-xin, SUN Fa-jun
    2007, 21(3): 117-121.
    The directional feature is considered suitable for handwritten Chinese character recognition and has been widely used as one of the main feature extraction methods. The meshing method is one of the key factors of the mesh-based directional feature. According to the stroke distribution characteristics and topological correlations of Chinese characters, we present a new method based on an elastic mesh and related fuzzy features, extracting a more stable feature vector that carries more information. Experiments on handwritten legal amounts from Chinese bank checks show that the method is more effective than other mesh-based directional features, with a recognition rate of up to 97.64%.
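    The elastic mesh idea, placing grid lines so that each band holds an equal share of the character's stroke pixels rather than equal geometric spacing, can be sketched as follows (illustrative, with made-up pixel counts; not the paper's exact feature):

```python
# Elastic mesh sketch: choose horizontal grid boundaries so that each band
# holds roughly the same number of foreground (stroke) pixels, instead of
# slicing the image into equal-height bands.

def elastic_boundaries(row_pixel_counts, bands=4):
    total = sum(row_pixel_counts)
    bounds, acc = [], 0
    target = total / bands
    for row, count in enumerate(row_pixel_counts):
        acc += count
        if acc >= target * (len(bounds) + 1) and len(bounds) < bands - 1:
            bounds.append(row + 1)          # band boundary after this row
    return bounds

# Hypothetical per-row stroke-pixel counts of a 16-row character image.
rows = [0, 2, 8, 12, 12, 10, 4, 2, 2, 4, 10, 12, 12, 8, 2, 0]
print(elastic_boundaries(rows, bands=4))    # boundaries adapt to stroke density
```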
  • Review
    ZHU Wei-bin
    2007, 21(3): 122-128.
    Aiming to predict and realize Chinese accent in a unit-selection based speech synthesis system, a data-driven method is used to build an accent-supported prosody module. First, with the help of an accent-index detector optimized with perceptual annotations, a speech corpus is automatically annotated with accent indices. A prosody prediction module supporting accent is then trained on this corpus. With the new prosody prediction module in place, the speech synthesis system can synthesize speech with various levels of accent. Experimental results confirm the accuracy of the automatically detected accents, the validity of the prosody predictor, and the system's ability to realize accent in synthesized speech.