2008 Volume 22 Issue 4 Published: 15 August 2008
  

  • Review
    HE Liang, DAI Xin-yu, ZHOU Jun-sheng, CHEN Jia-jun
    2008, 22(4): 3-9.
    After an analysis of Dan Bikel’s parser, which is based on a head-driven statistical model, this paper presents several improvements to this distinctive parser for Chinese parsing. First, a separate N-best POS-tagging module is added to enhance morphological processing. Second, an independent BaseNP identification module is integrated as another preprocessing module to reduce the complexity of Chinese parsing. According to the characteristics of Chinese, several extended definitions of BaseNP are introduced, demonstrating that a suitable definition of BaseNP can help improve the performance of Chinese parsing. Finally, experiments are conducted on the refined Chinese statistical parser, and the results indicate that both the efficiency and the accuracy of Chinese parsing are improved significantly.
  • Review
    QIAO Wei, SUN Mao-song
    2008, 22(4): 10-18.
    Overlapping ambiguity is a major type of ambiguity in Chinese word segmentation. The performance of existing word segmentation systems in resolving this ambiguity is still unsatisfactory, especially on domain-specific texts. In contrast to the quite detailed statistical observations on overlapping ambiguities in general-purpose corpora, similar observations on domain-specific corpora have not been reported in the literature. Using a medium-sized general-purpose Chinese wordlist, a general-purpose corpus of over 900 million characters, and two domain-specific corpora totaling 140 million characters and covering 55 domains, this paper studies the statistical properties of high-frequency overlapping ambiguities from several perspectives, with overlapping ambiguity strings from the general corpus examined in the domain corpora and vice versa. It is believed that the findings of this paper will benefit word segmentation disambiguation, in particular for domain-specific texts.
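To make the notion concrete: an overlapping ambiguity string is a span that admits two crossing segmentations under a wordlist, so the shared characters can attach either left or right. A minimal sketch in Python, with a hypothetical toy wordlist that is not the paper's data:

```python
# A span s is an overlapping ambiguity string if it splits as x|y|z with
# xy and yz both in the wordlist and the overlap y non-empty, e.g.
# "结合成" segments as "结合|成" or "结|合成".
WORDS = {"结合", "合成", "结", "成"}  # illustrative toy wordlist

def is_overlapping_ambiguity(s, words):
    """True if some prefix-word and suffix-word of s cross in the middle."""
    n = len(s)
    # first word is s[:j], second word is s[i:], with 0 < i < j < n
    return any(s[:j] in words and s[i:] in words
               for i in range(1, n) for j in range(i + 1, n))

print(is_overlapping_ambiguity("结合成", WORDS))  # → True
print(is_overlapping_ambiguity("结合", WORDS))    # → False (no crossing split)
```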
  • Review
    YUE Ming
    2008, 22(4): 19-23,42.
    This paper reports a rhetorical structure annotation project on a Chinese news commentary corpus, Caijingpinglun (CJPL), for the purpose of natural language processing. The elementary discourse unit (EDU) in this project is defined as a string between two selected punctuation marks, and altogether 47 Chinese rhetorical relations are defined to mark nuclearity according to classic Rhetorical Structure Theory (RST). A 60-page annotation manual with detailed rules for EDU segmentation, EDU combination, and relation and scheme tagging protocols has been composed. Analysis of the first manually annotated set of 97 texts shows that RST transfers well to Chinese, and the data obtained from this project may be further exploited in Chinese discourse processing.
  • Review
    LIU Yao, DUAN Hui-ming, WANG Hui-lin, ZHOU Yang, WANG Zhen-guo, LI Hong-zhan
    2008, 22(4): 24-30.
    A domain corpus is essential to natural language processing for domain documents, especially for content and intention analysis. Against this research background, this paper first elaborates the necessity and significance of natural language processing for domain documents. After analyzing the characteristics of domain corpora, it probes into the design strategy and principles of domain corpus construction, and also investigates part-of-speech tagging in the corpus. Finally, a human-aided processing system for domain corpora is developed, providing theoretical guidance and technical support for domain corpus construction.
  • Review
    HUANG Jian-nian, HOU Han-qing
    2008, 22(4): 31-38.
    The collation of ancient books on agriculture has aroused the attention of the research community, but automatic sentence segmentation and punctuation models for these books have been less studied. This article probes into this issue and summarizes several patterns for sentence splitting and punctuation of ancient books on agriculture. It proposes that sentences be initially segmented by syntax words (such as empty words, conjunctions, and modal words) and synonym indication words. Then antonyms, citations of other books, time sequence, quantifiers, pleonasms, and verb-object structures are employed for further sentence segmentation and punctuation filling. The comparative sentence also supplies an auxiliary means for judging complex sentences and punctuating clauses. Finally, agricultural terms and a stop-punctuation list are applied to improve the readability of these books after the punctuation marks are inserted. In experiments, the average precision of the punctuation model reaches 48% and 35% respectively, which shows the feasibility and potential of the proposed method.
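The first-pass segmentation on function words can be sketched as follows; the particle list and the example text are illustrative assumptions, not the article's actual rules:

```python
# Tentatively break after classical sentence-final particles; the later
# passes described above (antonyms, quantifiers, cited-book indications,
# verb-object structure) would then refine these rough breaks.
FINAL_PARTICLES = set("也矣乎焉哉")  # illustrative particle list

def rough_segment(text):
    """Split text into tentative sentences after each final particle."""
    out, buf = [], []
    for ch in text:
        buf.append(ch)
        if ch in FINAL_PARTICLES:
            out.append("".join(buf))
            buf = []
    if buf:
        out.append("".join(buf))
    return out

print(rough_segment("不违农时谷不可胜食也数罟不入洿池鱼鳖不可胜食也"))
```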
  • Review
    MAI Fan-jin, WANG Ting
    2008, 22(4): 39-42.
    Language plays an important part in human thinking. Relativity, similarity, and contrariety are three modes of associative thinking. At present, relativity and similarity are well studied, while the contrary degree has received less attention. This paper introduces negative values into similarity calculation and puts forward the concept of the contrary degree as well as a computational model for it. The feasibility and validity of the proposed conceptual and computational models are demonstrated with a simulated test.
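One way to read the core idea: if similarity is allowed to take negative values, a pair's contrary degree can be taken as the magnitude of a negative similarity. A hedged sketch, where the cosine measure and the signed toy feature vectors are illustrative assumptions rather than the paper's actual model:

```python
import math

def cosine(a, b):
    """Cosine similarity in [-1, 1]; negative values indicate opposition."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def contrary_degree(a, b):
    """Read the contrary degree off a negative similarity; 0 otherwise."""
    s = cosine(a, b)
    return -s if s < 0 else 0.0

# Hypothetical signed feature vectors: antonyms point in opposite directions.
hot, cold, warm = [1.0, 0.8], [-1.0, -0.7], [0.6, 0.9]
print(contrary_degree(hot, cold))   # close to 1: strongly contrary
print(contrary_degree(hot, warm))   # 0.0: not contrary
```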
  • Review
    Mireguli Aili, Mijiti Abulimiti, Aisikaer Aimudula
    2008, 22(4): 43-47.
    This paper discusses the particular phenomenon of vowel weakening in Uyghur and proposes an algorithm to identify Uyghur vowel weakening based on an analysis of word structures, syllable structures, and the combination rules of stems plus suffixes in the Uyghur language. To identify vowel weakening, the algorithm first locates the vowel-weakening property according to the stem database, and then determines whether the stem is correctly associated with a suffix by the phonetic harmony criterion. The algorithm is readily applicable to text retrieval, word frequency calculation, and spelling checking. The experimental results show that the algorithm is feasible and effective.
  • Review
    DU Jin-hua, WEI Wei, XU Bo
    2008, 22(4): 48-54.
    Based on several popular methods of statistical machine translation system combination, an improved multiple-system combination framework is proposed. This framework integrates Minimum Bayes-Risk (MBR) decoding and multi-feature Confusion Network (CN) decoding with the following steps: (1) MBR decoding is used to select the hypothesis with minimum risk as an alignment reference from the N-best results produced by the component translation systems; (2) a CN is constructed by aligning the other hypotheses with this reference. Based on a log-linear model, the CN introduces more knowledge sources into the selection of the optimal path. Compared with the best single system without combination, the proposed framework achieves a 2.19% improvement in BLEU score. In addition, we present a modified Translation Edit Rate (TER), the GIZA-TER metric, for CN alignment, which facilitates more effective phrase re-ordering. Significance tests demonstrate the validity of the proposed methods.
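Step (1) can be sketched as a minimal MBR selection over the pooled N-best list, assuming a uniform posterior and using a hypothetical unigram-overlap gain in place of the BLEU/TER-based loss the framework actually uses:

```python
from collections import Counter

def overlap_gain(hyp, ref):
    """Fraction of hyp's tokens also found in ref (counts clipped)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    matched = sum(min(c, r[w]) for w, c in h.items())
    return matched / max(1, sum(h.values()))

def mbr_select(hypotheses):
    """Pick the hypothesis with maximum expected gain (minimum expected
    loss) against all other hypotheses, i.e. the MBR candidate under a
    uniform posterior over the pooled N-best list."""
    def expected_gain(h):
        return sum(overlap_gain(other, h) for other in hypotheses if other is not h)
    return max(hypotheses, key=expected_gain)

nbest = ["the cat sat on the mat",
         "the cat sat on a mat",
         "a dog sat on the mat"]
print(mbr_select(nbest))  # → "the cat sat on a mat" (closest to all others)
```

The selected hypothesis then serves as the backbone against which the remaining hypotheses are aligned to build the confusion network.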
  • Review
    CHEN Huai-xing, YIN Cun-yan, CHEN Jia-jun
    2008, 22(4): 55-60.
    Identifying the translingual equivalence of named entities is essential to multilingual natural language processing. Several approaches to named entity translation, such as bilingual dictionary lookup, word/sub-word translation, and transliteration, have been explored in past years. Another promising approach is to extract named entity translingual equivalents automatically from a parallel corpus, which usually requires the named entities to be annotated, manually or automatically, in both languages. In this paper, we propose a new approach that extracts named entity equivalents from a parallel corpus with only source-language annotation and the result of HMM alignment. The experiment is carried out on a Chinese-English parallel corpus, with Chinese as the source language and English as the target language. The results show that the new approach extracts named entity pairs with relatively high precision, even when the word alignment is only partially correct.
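The projection step at the heart of this approach can be sketched as follows, with a plain set of index pairs standing in for the HMM alignment; the example sentence and spans are illustrative:

```python
def project_entity(ne_span, alignment, tgt_tokens):
    """Project a source-side NE span onto the target side.

    ne_span: (start, end) token indices on the source side, end exclusive.
    alignment: set of (src_index, tgt_index) pairs (stand-in for HMM output).
    Returns the minimal target span covering all aligned words, or None.
    """
    tgt_idx = sorted(t for s, t in alignment if ne_span[0] <= s < ne_span[1])
    if not tgt_idx:
        return None
    return " ".join(tgt_tokens[tgt_idx[0]:tgt_idx[-1] + 1])

# Source (Chinese): 总理(0) 周(1) 恩来(2) 访问(3) 北京(4); NE span = (1, 3)
tgt = "premier Zhou Enlai visited Beijing".split()
align = {(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)}
print(project_entity((1, 3), align, tgt))  # → "Zhou Enlai"
```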
  • Review
    WANG Ming-wen, TAO Hong-liang, XIONG Xiao-yong
    2008, 22(4): 61-65,74.
    Collaborative filtering is widely applied in e-commerce recommendation systems. However, data sparsity lowers the accuracy of prediction and results in poor recommendations. To address this problem, a novel collaborative filtering algorithm based on an iterative bidirectional clustering method is presented. Starting from initial user clusters and item clusters, it adjusts the two groups of clusters toward a stable state by cross iteration, so that within-cluster distances become much smaller while between-cluster distances grow larger. Experiments illustrate that the adjusted clusters facilitate a more accurate neighbor search, offering an effective solution to data sparsity and better recommendation quality.
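The cross iteration can be sketched as follows: users are re-clustered against profiles aggregated over the current item clusters, then items against profiles over the current user clusters, until the assignments stabilize. The tiny rating matrix and the initial assignments below are illustrative assumptions, not the paper's procedure:

```python
def co_cluster(R, u, v, iters=5):
    """Cross-iterate user clusters u and item clusters v on rating matrix R."""
    n, m = len(R), len(R[0])
    ku, kv = max(u) + 1, max(v) + 1
    def mean(xs):
        return sum(xs) / len(xs)
    for _ in range(iters):
        # profile each user by mean rating within each item cluster,
        # then reassign users to the nearest user-cluster centroid
        pu = [[mean([R[i][j] for j in range(m) if v[j] == c]) for c in range(kv)]
              for i in range(n)]
        cu = [[mean([pu[i][c] for i in range(n) if u[i] == g]) for c in range(kv)]
              for g in range(ku)]
        u = [min(range(ku), key=lambda g: sum((pu[i][c] - cu[g][c]) ** 2
                                              for c in range(kv)))
             for i in range(n)]
        # symmetric step: re-cluster items against user-cluster profiles
        pv = [[mean([R[i][j] for i in range(n) if u[i] == g]) for g in range(ku)]
              for j in range(m)]
        cv = [[mean([pv[j][g] for j in range(m) if v[j] == c]) for g in range(ku)]
              for c in range(kv)]
        v = [min(range(kv), key=lambda c: sum((pv[j][g] - cv[c][g]) ** 2
                                              for g in range(ku)))
             for j in range(m)]
    return u, v

# Block-structured toy ratings: users 0-1 like items 0-1, users 2-3 like 2-3.
R = [[5, 5, 1, 1], [5, 4, 1, 2], [1, 1, 5, 5], [2, 1, 4, 5]]
u, v = co_cluster(R, [0, 1, 1, 0], [0, 0, 1, 1])
print(u, v)  # → [0, 0, 1, 1] [0, 0, 1, 1]
```

Note the sketch omits the empty-cluster guard a real implementation would need.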
  • Review
    DING Fan, WANG Bin, BAI Shuo, LIU Yi-xuan, LI Ya-nan
    2008, 22(4): 66-74.
    To relax the term independence assumption, term dependency has been introduced and has improved retrieval precision dramatically. There are two kinds of term dependency: one defined by term proximity, and the other by syntactic dependencies. In this paper, we conduct a comparative study re-examining these two kinds of term dependency within the dependence language model framework and present a smoothing-based dependence language model. We study the effectiveness of syntactic dependencies in query representation and document representation respectively. The experimental results on TREC collections show that: 1) syntactic dependencies yield better results than term proximity in document representation; 2) in query representation, the concept-based part of the syntactic dependencies is more effective than the rest.
  • Review
    ZHANG Sen, WANG Bin
    2008, 22(4): 75-82.
    An increasing number of studies have focused on the classification of web queries in recent years. This article surveys research on automatic query classification by query intention. It presents the background of query classification, its key techniques, the classification algorithms, and the evaluation methods. It then outlines the problems and challenges in query intention classification: the lack of an authoritative evaluation method, inadequate performance comparisons on large-scale datasets, the acquisition of accurate query features, and issues in the completeness and objectivity of category systems.
  • Review
    GUO Yu-sheng, TAN Nu-tao, HUANG Lei, LIU Chang-ping
    2008, 22(4): 83-87.
    To extract mathematical expressions (MEs) from scanned Chinese documents, an ME identification method based on Chinese character recognition and ME symbol recognition is proposed. First, Chinese blocks are removed by a decision tree using features from the Chinese character recognition result, the ME symbol recognition result, and the characters' geometric information. Then, embedded MEs are extracted from non-Chinese character blocks based on ME semantic information, syntax information, and the script relations between adjacent blocks. Finally, isolated MEs without Chinese blocks are identified from embedded ME symbols by a Gaussian Mixture Model. The experiments were carried out on a dataset of 148 document images containing 3,690 MEs, and the results show that the proposed method reaches 91.19% ME identification accuracy.
  • Review
    PAN Yi-qian, WEI Si, WANG Ren-hua
    2008, 22(4): 88-93.
    Tone evaluation of continuous Chinese speech is a key aspect of Mandarin Chinese pronunciation testing. Taking advantage of the close correlation between the prosodic framework and the modified tonal curve, this paper presents a Multi-Space Distribution Hidden Markov Model (MSD-HMM) built on the prosodic word for tone evaluation. The experimental results show that the proposed Mandarin Chinese pronunciation evaluation system improves its performance on tonal syllables from 82.0% to 84.6% for standard continuous Chinese speech. For non-standard Mandarin speech, the correlation between computer scores and expert scores achieves over a 3.0% absolute improvement compared with the baseline system without the tone pronunciation test.
  • Review
    CHEN Si-bao, HU Yu, WANG Ren-hua
    2008, 22(4): 94-99.
    Heteroscedastic linear discriminant analysis (HLDA) is widely applied in speech recognition for its ability to de-correlate features. To overcome its instability on high-dimensional features and the small-sample problem of insufficient training data, this paper proposes a structure-specific HLDA method that transforms the feature matrix. The method first adopts two-dimensional linear discriminant analysis (2DLDA) to compress the features in the matrix, and then applies one-dimensional HLDA. It is shown that this two-dimensional feature transformation is in fact a structure-constrained one-dimensional feature transformation. Experiments show that the proposed structure-specific HLDA achieves a 12.39% word error rate (WER) reduction on the RM database and a 4.43% phone error rate (PER) reduction on the TIMIT database compared with traditional HLDA.
  • Review
    Gulijiamali Maimaitiaili, Aisikaer Aimudula
    2008, 22(4): 100-104.
    The system first establishes a small speech corpus of single phonemes and double phonemes (diphones), segmented from recorded words in selected Uyghur sentences. It then designs a unit selection algorithm and employs a parameter adjustment algorithm to adjust parameters of the speech signal such as duration, pitch frequency, and short-term energy. Finally, it applies a time-domain smoothing algorithm to the adjusted speech parameters at the concatenation points so as to enhance the naturalness of the synthesized speech. The whole system is developed in C#, and the experimental results prove the feasibility of the proposed scheme and technology. The system has the advantages of a small speech corpus and relatively high intelligibility and naturalness of the synthesized speech.
  • Review
    Ngodrup
    2008, 22(4): 105-108.
    Developers of Tibetan software need to know how Tibetan is encoded in the UCS when processing Tibetan data. Understanding the UCS Tibetan encoding system must come before recognizing UCS Tibetan data, whether one is designing Tibetan websites, processing Tibetan text, developing Tibetan application software, or designing OpenType or AAT fonts. To facilitate this understanding, this article explains in detail the organizational structure and design methods of the UCS Tibetan encoding system, so that OpenType can be applied to display complex Tibetan documents.
  • Review
    HUANG He-ming, ZHAO Chen-xing
    2008, 22(4): 109-113.
    DUCET (Default Unicode Collation Element Table) is the international standard table for character collation. This paper proposes a DUCET-based Tibetan sorting algorithm. It first expands two-dimensional Tibetan script into a one-dimensional string of Tibetan letters, then looks up the collation code of each Tibetan letter in DUCET. Finally, by comparing any two distinct collation code strings, covering both Tibetan and non-Tibetan scripts, a correct DUCET-based Tibetan ordering is achieved.
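Once a stacked syllable has been flattened into a letter sequence, sorting reduces to comparing per-letter collation weights. A minimal sketch with a hypothetical single-level weight table standing in for DUCET (real DUCET entries carry multi-level weights), using romanized letter names for readability:

```python
# Toy stand-in for DUCET: each letter maps to one collation weight.
WEIGHTS = {"ka": 1, "kha": 2, "ga": 3, "nga": 4}

def collation_key(letters):
    """Map a flattened letter sequence to its weight tuple; tuples then
    compare lexicographically, giving the collation order."""
    return tuple(WEIGHTS[l] for l in letters)

words = [["ga", "ka"], ["ka", "kha"], ["ka", "ga"]]
print(sorted(words, key=collation_key))
# → [['ka', 'kha'], ['ka', 'ga'], ['ga', 'ka']]
```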
  • Review
    LANG Feng-zhen, Turgun IBRAYIM
    2008, 22(4): 114-118.
    In order to promote knowledge of the folk customs of Xinjiang to the world and to realize a seamless connection between the digital museums of different universities, this paper proposes a scheme for constructing a Web-service-based digital folk-custom museum in English, Chinese, and Uyghur. After a brief introduction to the idea of a digital folk-custom museum, the paper presents its overall framework and discusses the key techniques of Web services and ASP.NET. Finally, it describes the implementation of the digital museum with Web service techniques on ASP.NET, offering quick and convenient client browsing and transparent information services in English, Chinese, and Uyghur.
  • Review
    GAO Ding-guo, Ngodrup
    2008, 22(4): 119-122.
    The “Tibetan coded character sets for information interchange-Basic set” has laid the foundation for Tibetan information processing. However, the set lacks some features essential to Tibetan construction and is not free from ambiguities in Tibetan character coding. This paper suggests that three head letters be added to the basic set to indicate the specific domains of different coded characters and avoid the ambiguity. In addition, the paper proposes a method of "identification of Tibetan coded characters" to eliminate the meaning differences among characters. Finally, some other problems of character features are mentioned along with corresponding explanations.
  • Review
    GU Shao-tong, MA Xiao-hu, YANG Yi-ming
    2008, 22(4): 123-128.
    This paper proposes and implements a simple and effective Jiaguwen (oracle bone script) input method. Based on an analysis of the topological structure of Jiaguwen characters and consideration of both their glyphs and pronunciations, glyph coding and pronunciation coding schemes are first presented. Accordingly, the implementation provides two ways to input Jiaguwen: (1) a glyph coding input method; (2) a Hanzi-spelling coding input method.