2008 Volume 22 Issue 5 Published: 15 October 2008
  

    Review
  • Review
    ZHANG Gui-ping, CAI Dong-feng
    2008, 22(5): 3-11.
    Drawing on a review of, and reflection on, the difficult development of Machine Translation (MT), this paper presents a new approach that integrates Knowledge Management (KM) with MT and centers on a user model. In July 2008 in Beijing, the technology was evaluated by the Identification Committee organized by the Chinese Information Processing Society, and the committee agreed and announced that “GE-Soft has successfully accomplished their work, the Cooperative Translation Platform based on Knowledge Management and Intelligent Control Technology, which is developed from their national 863 project of the Integration of Machine Translation and Knowledge Management. The research has achieved the international leading level by using KM technology to implement human-computer mutual cooperative translation”. This paper describes the thinking and method, design and realization, analysis and application, and progress and prospects of the platform.
  • Review
    SU Xin-chun
    2008, 22(5): 12-21.
    A Thesaurus of Modern Chinese (TMC) continues the tradition of concept classification established by the Synonym Dictionary, reflecting the conceptual relations of society as a whole and of human cognition. It contains more than 80 000 high-frequency modern Chinese words and builds a five-level semantic classification system with 9 classes at the first level, 62 at the second, 518 at the third, 2 076 at the fourth and 12 613 at the fifth. This semantic classification emphasizes the governing function of the upper semantic levels over the subordinate levels, the coverage of the upper levels by the subordinate levels, and the complementary function between neighboring levels.
  • Review
    WANG Meng, YU Shi-wen, DUAN Hui-ming, SUN Wei-wei
    2008, 22(5): 22-29,38.
    This paper presents preliminary research on the probabilistic grammatical characteristics of nouns in contemporary Chinese, based on the POS-tagged corpus of the People's Daily. The grammatical characteristics that capture the relationship between numerals, classifiers and nouns are discussed first. The concept of “Distribution Degree” is proposed to analyze the “Numeral-Noun” structure quantitatively. The distribution of the classifiers that can collocate with a given noun is also investigated. Finally, the experimental results are compared with the original attribute values in the Grammatical Knowledge-base of Contemporary Chinese, and the correctness of the dictionary is verified.
  • Review
    HUANG Xiao-jiang, WAN Xiao-jun, YANG Jian-wu, XIAO Jian-guo
    2008, 22(5): 30-38.
    Comparison is a common kind of expression, and extracting comparative relations between objects is a novel and substantial research topic. Identifying comparative sentences in natural language is an important step in extracting comparative relations. To our knowledge, there has been no research on identifying Chinese comparative sentences automatically. This paper first defines the problem of Chinese comparative sentence identification and then proposes using an SVM to classify a Chinese sentence as either “comparative” or not. Various linguistic and statistical features are explored, such as keywords and sequential patterns. Experimental results demonstrate the effectiveness of the sequential patterns: the classifier with sequential patterns significantly outperforms the traditional term-based classifier. We also empirically investigate the important factors that affect classification performance.
  • Review
    YANG Yong, LI Yan-cui, ZHOU Guo-dong, ZHU Qiao-ming
    2008, 22(5): 39-44.
    Anaphora resolution plays an important role in natural language processing and involves the recognition of named entities, nominal phrases, pronoun anaphora and so on. This paper presents a machine learning approach to anaphora resolution with a special focus on the distance information between the anaphor and the antecedent candidate. Traditionally, the distance between the anaphor and the candidate is used only as a feature in machine learning approaches, without considering its contribution to antecedent candidate generation. In this paper, distance information is explored in detail, either by incorporating it as a feature in the learning algorithm (such as the maximum entropy model and the SVM model) or by applying it as a hard constraint in antecedent candidate generation. Evaluation on the MUC-6 benchmark corpus shows that proper handling of distance information can substantially improve performance; our system achieves an F1-measure of 68.7, which outperforms other similar systems.
  • Review
    JIA Yu-xiang, HUANG De-zhi, LIU Wu, YU Shi-wen
    2008, 22(5): 45-50,55.
    Chinese text normalization is the process of transforming non-Chinese character strings into their corresponding Chinese character strings in order to determine their pronunciations. The difficulties of this work lie mainly in two aspects: the large number of non-Chinese character strings in various formats, and their high degree of ambiguity. This paper develops an effective taxonomy of non-Chinese character strings using the concept of Non-Standard Words (NSWs). A three-layer normalization model is then proposed, comprising NSW detection, NSW disambiguation and standard word generation. In the NSW disambiguation stage, a machine learning method is employed to overcome the shortcomings of the rule-based method. Experimental results show that this approach achieves high performance and adapts well to new domains. The accuracy in the open test is 98.64%.
  • Review
    TAO Mei, Wushour Silamu, Nasirjan Tursun
    2008, 22(5): 56-59.
    Uyghur is an agglutinative language belonging to the Turkic branch of the Altaic languages. Based on an analysis of the characteristics of Uyghur, this paper designs the framework of a Uyghur speech recognition system. It investigates methods for selecting the best recognition unit for Uyghur speech and proposes establishing a context-dependent model based on decision-tree clustering. It further adopts the Gaussian mixture distribution (GMD) as the observation probability to optimize the HMM for better recognition performance. Finally, comparative experiments are presented and the conclusions are summarized.
  • Review
    DU Wei, CHEN Qun-xiu
    2008, 22(5): 60-66.
    Multi-strategy machine translation (MT) is one direction for machine translation systems. This paper introduces studies on several key technologies in a multi-strategy Chinese-Japanese machine translation system. The system consists of four subsystems: the Chinese analysis system, which uses lexical analysis, syntactic analysis and semantic role labeling; the translation memory (TM) MT, which uses double-index technology; the example-based MT (EBMT), which uses syntax tree segments as translation templates; and the valence-based MT, which uses valence models and partition analysis. The test results show that 1) the TM system is efficient, 2) the EBMT system achieves 99% translation accuracy in a closed test of 1 559 Chinese sentences and 85% accuracy in an open test of 1 500 sentences, and 3) the valence-based system achieves 89% accuracy over all 3 059 sentences.
  • Review
    YE Sha-ni, LV Ya-juan, HUANG Yun, LIU Qun
    2008, 22(5): 67-73.
    Parallel sentences are valuable resources for machine translation, but they are not readily available in the necessary quantities and are often limited in domain. This paper builds a system that automatically obtains high-quality parallel sentences from the Web. The system puts forward a method for measuring the similarity of URLs in bilingual websites and also improves parallel sentence extraction technology. Experimental results show that the system achieves a recall of 93% and a precision of 96% when collecting parallel sentences from the test set. In addition, this paper carries out preliminary research on collecting parallel sentences from bilingual contrast web pages.
  • Review
    YUAN Xiao-feng, QIU Xi-peng, WU Li-de, HUANG Xuan-jing
    2008, 22(5): 74-79.
    This paper presents a List Question Answering method based on a phrase-retrieval model and an answer-ranking model. The retrieval model utilizes phrases as query words and the answer ranking model scores the candidate answer mainly through the distance between the candidate answer and other contextual words. The two models jointly offer an effective way to find more answers and better answers in the list question answering task. The experiment shows that our phrase retrieval model outperforms other retrieval models and our answer ranking model improves the F score significantly.
  • Review
    MAI Fan-jin, YE Dong-hai, SHI Hui
    2008, 22(5): 80-83.
    The accuracy of Chinese word segmentation is crucial to Chinese spam filtering. After analyzing statistical and rule-based spam filtering techniques, this paper designs a spam filtering model based on semantic understanding, which combines research on semantic processing with a spam filtering algorithm. It also proposes an improved word segmentation algorithm that increases the efficiency and accuracy of word segmentation and the ability to identify out-of-vocabulary (OOV) words. The experimental data indicate that the semantics-based spam filtering model resolves, to a certain extent, the problem of “split” words encountered in spam filtering and the problem of OOV words arising in word segmentation.
  • Review
    JIANG Min, XIAO Shi-bin, WANG Hong-wei, SHI Shui-cai
    2008, 22(5): 84-89.
    The HowNet-based word similarity computation of Liu Qun is a representative method for computing word similarity. However, some words with contrasting or contradictory meanings are found to receive similarity scores as high as those of true synonyms. To resolve this defect for word polarity analysis, this paper confines the value of word similarity to the range [-1, +1] and enhances the word similarity computation of Liu's paper by employing the depth information of sememes together with the antonym and definition information of the sememes. The method performs well in a word polarity recognition experiment, achieving 99.07% accuracy and 99.11% recall.
  • Review
    WANG Hong-xian, ZHOU Qiang, WU Xiao-jun
    2008, 22(5): 90-96.
    Semantic relationships between words and concepts are very common and complex in natural language. To integrate different lexical resources effectively and construct a computable Chinese lexical resource, we propose a method for automatically constructing a lexical semantic relationship graph and apply it to HowNet. As a knowledge system, HowNet records each concept as an entry, and the semantic relationships are hidden between the entries. To extract the relationships between the concepts in HowNet, we first restructure the concept entries into concept trees, and then extract the semantic relationships from the concept trees and construct a lexical semantic relationship graph. In total we obtain 589 984 relations of 88 different kinds, with rich connections between the nodes in the graph. This work provides a solid foundation for computing the content of real text based on lexical semantic relationships.
  • Review
    ZHENG Feng-qiang, LIN Lei, LIU Bing-quan, SUN Cheng-jie
    2008, 22(5): 97-101.
    Named entity recognition is a foundational problem in natural language processing and is of substantial significance to deep language processing. This work adopts the maximum entropy model for named entity recognition and proposes two HowNet-based improvement strategies to enhance the generalization of the model. The first strategy adds the HowNet sememes of concepts to the maximum entropy model as features. The second uses HowNet to calculate the similarity between word features in the maximum entropy model. Experiments on the China Daily corpus show that the first strategy improves named entity recognition performance significantly, while the second improves it only marginally.
  • Review
    HUANG Rui-hong, SUN Le, FENG Yuan-yong, HUANG Yun-ping
    2008, 22(5): 102-108.
    Entity relation extraction is an important research field in information extraction. This paper explores the effectiveness of two kernel-based methods, the convolution tree kernel and the shortest path dependency kernel, for Chinese relation extraction on the ACE 2007 corpus. For the convolution kernel, the influence of different parse tree spans on relation extraction performance is studied. Experiments with composite kernels, which combine the convolution kernel with feature-based kernels, are then conducted to investigate the complementary effects between the tree kernel and flat kernels. Finally, we improve the shortest path dependency kernel by replacing the strict same-length requirement with a search for the longest common subsequence between two shortest dependency paths. Experiments show that kernel-based methods are effective for Chinese relation extraction as well.
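    The longest-common-subsequence idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes a dependency path is simply a list of string tokens; the function name `lcs_length` and the example paths are hypothetical, and this is a standard dynamic-programming LCS rather than the authors' kernel.

```python
def lcs_length(path_a, path_b):
    """Length of the longest common subsequence of two token lists."""
    rows, cols = len(path_a) + 1, len(path_b) + 1
    # table[i][j] holds the LCS length of path_a[:i] and path_b[:j]
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            if path_a[i - 1] == path_b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


# Hypothetical dependency paths of different lengths (illustration only).
path_1 = ["company", "nsubj", "acquired", "dobj", "startup"]
path_2 = ["firm", "nsubj", "acquired", "prep", "dobj", "startup"]
shared = lcs_length(path_1, path_2)
```

    A strict same-length requirement would treat these two paths as incomparable because their lengths differ, whereas the subsequence comparison still credits the elements they share.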
  • Review
    DONG Fang
    2008, 22(5): 109-113,120.
    The Shui script is an ancient ethnic and religious script currently used in the south of Guizhou province in China. Because the font style of the Shui script is complex, it is difficult to decompose Shui characters into components or code units according to Chinese character coding theory. Phoneme-based coding is even more difficult because Shui characters are hard to pronounce. This paper puts forward a coding of the Shui script based on the class-attribute method. Each Shui character is coded in four digits in total, with the first digit representing its class (a formal Shui character or a variant), the second digit giving the attribute of its reference content, and the last two digits indicating the position of the character within the corresponding content attribute. Based on the proposed coding scheme, the paper finally presents the idea of a visual input method for Shui characters.
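    The four-digit layout described in this abstract can be sketched as a small parser. The abstract does not say which digit values stand for a formal character, a variant, or a particular content attribute, so the fields below are kept as raw numbers; the names `ShuiCode` and `parse_shui_code` and the sample code "1203" are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShuiCode:
    class_digit: int      # first digit: class (formal character or variant)
    attribute_digit: int  # second digit: attribute of the reference content
    position: int         # last two digits: position within that attribute


def parse_shui_code(code: str) -> ShuiCode:
    """Split a four-digit code string into its three fields."""
    if len(code) != 4 or not code.isdigit():
        raise ValueError("expected exactly four decimal digits")
    return ShuiCode(int(code[0]), int(code[1]), int(code[2:]))


# Hypothetical code value, used only to show the field layout.
example = parse_shui_code("1203")
```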
  • Review
    WU Xiao-chun, WU Xian, LI Pei-feng, ZHU Qiao-ming
    2008, 22(5): 114-120.
    With the development of handheld devices, mobile phones are becoming more and more important in people’s daily lives. However, compared with the rich applications available on mobile phones, research on their input methods has received relatively little attention. Targeting the limited memory and low CPU speed of mobile phones, this paper puts forward a sentence-level search algorithm. It first introduces a reverse-order file structure. It then designs a sentence-level search algorithm of width N, which has been implemented successfully on the S60 platform. Finally, the paper tests the accuracy of the algorithm, and the results show that it performs well.
  • Review
    CAI Jing-zhe, CUI Rong-yi
    2008, 22(5): 121-128.
    This paper studies the intrinsic ambiguity of the linear reconstruction of Korean characters and proposes a scheme for eliminating such ambiguities. First, formal description methods for the structure of Korean characters are investigated, illustrating the basic combination rules of Korean characters and establishing a corresponding deterministic finite-state automaton. Mathematical descriptions of the linear reconstruction of Korean characters are then presented. The necessary and sufficient conditions for ambiguity in reconstructing Korean characters are proved, and the nature of the degree of ambiguity and the probability of ambiguity are analyzed. Finally, a disambiguation approach is suggested for character reconstruction, and a cardinal-grapheme-based on-line Korean character string input algorithm is proposed. The results of simulation experiments show the reliability and validity of the proposed method.
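    As background for the combination rules mentioned in this abstract, the following minimal sketch shows the standard Unicode arithmetic for composing a precomposed Hangul syllable from an initial consonant, a vowel and an optional final consonant. It is not the paper's automaton or disambiguation scheme; the function names are hypothetical, and the indices follow the Unicode ordering of 19 initials, 21 vowels and 28 finals (index 0 meaning no final).

```python
HANGUL_BASE = 0xAC00  # code point of the first precomposed syllable
NUM_INITIALS = 19
NUM_VOWELS = 21
NUM_FINALS = 28       # includes index 0 for "no final consonant"


def compose(initial: int, vowel: int, final: int = 0) -> str:
    """Build one Hangul syllable from Unicode jamo indices."""
    if not (0 <= initial < NUM_INITIALS
            and 0 <= vowel < NUM_VOWELS
            and 0 <= final < NUM_FINALS):
        raise ValueError("jamo index out of range")
    return chr(HANGUL_BASE + (initial * NUM_VOWELS + vowel) * NUM_FINALS + final)


def decompose(syllable: str) -> tuple:
    """Recover (initial, vowel, final) indices from one Hangul syllable."""
    offset = ord(syllable) - HANGUL_BASE
    if not 0 <= offset < NUM_INITIALS * NUM_VOWELS * NUM_FINALS:
        raise ValueError("not a precomposed Hangul syllable")
    initial, rest = divmod(offset, NUM_VOWELS * NUM_FINALS)
    vowel, final = divmod(rest, NUM_FINALS)
    return initial, vowel, final


han = compose(18, 0, 4)  # initial index 18, vowel index 0, final index 4
```

    Because the final consonant is optional, a consonant that follows a vowel in a linear letter sequence may close the current syllable or open the next one, which is the kind of reconstruction ambiguity the abstract refers to.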