2009 Volume 23 Issue 4 Published: 17 August 2009
  

  • Review
    LI Jihong,YANG Xingli, WANG Ruibo, ZHANG Na, LI Guochen
    2009, 23(4): 3-10.
    This paper constructs a set of heuristic rules for six types of questions in a Chinese QARC system: time, human, location, number, entity and description. Each rule is assigned a weight optimized by an orthogonal array, and each candidate answer sentence is scored over the corresponding rules (see the sketch below). The experiment on CRCC v1.1 (the Chinese reading comprehension corpus built by Shanxi University) produces 83.09% HumSent accuracy. Compared with the ME-based method, which achieves 81.13% HumSent accuracy in the same training and testing environment, the proposed approach is about 2% higher.
    Key words: computer application; Chinese information processing; reading comprehension; question answering; heuristic rules; orthogonal array
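    A minimal sketch of the rule-weighting idea above, assuming per-question-type rule sets. The rule names, weights and whitespace tokenization are illustrative stand-ins, not the paper's actual rules (which operate on Chinese text), and the orthogonal-array tuning of the weights is not shown.

        def score_sentence(sentence, qtype, rules, weights):
            """Sum the weights of all heuristic rules that fire on the sentence."""
            return sum(weights[name] for name, rule in rules[qtype].items() if rule(sentence))

        def best_answer(sentences, qtype, rules, weights):
            """Return the candidate answer sentence with the highest weighted score."""
            return max(sentences, key=lambda s: score_sentence(s, qtype, rules, weights))

        # Illustrative rules for a "time" question; real rules inspect parsed Chinese text.
        rules = {"time": {
            "has_year": lambda s: any(t.isdigit() and len(t) == 4 for t in s.split()),
            "has_date_word": lambda s: any(w in s for w in ("January", "Monday")),
        }}
        weights = {"has_year": 0.7, "has_date_word": 0.3}  # would be tuned by the orthogonal array
        print(best_answer(["It happened.", "It happened in 1998 ."], "time", rules, weights))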
  • Review
    HUANG Xian,ZHANG Keliang
    2009, 23(4): 10-16.
    Anaphora has always been a focus of linguistic research, and anaphora resolution is of utmost importance to natural language processing (NLP). This paper introduces theoretical research on zero anaphora (ZA) in Chinese from four aspects: syntax, pragmatics, discourse analysis and cognitive linguistics. The paper also summarizes how zero anaphors are used and distributed in different languages and various styles of writing. In terms of natural language processing, substantial research has been done on Chinese ZA, such as ZA-resolution models based on Centering Theory, HNC-based analysis of ZA with its chunk-sharing model, and DRT-based efforts. The paper concludes by suggesting that NLP experts should pay more attention to theoretical research in linguistics, while linguists engaged in this field should orient their research toward the formalization of natural languages.
    Key words: computer application; Chinese information processing; zero anaphora; linguistics; natural language processing
  • Review
    WANG Lijie, CHE Wanxiang, LIU Ting
    2009, 23(4): 16-22.
    SVMTool is a simple, flexible and effective sequential-tagger generator based on Support Vector Machines, capable of handling a large number of linguistic features. In this paper, SVMTool is applied to the Chinese POS tagging task and improves the accuracy by 2.07% over a baseline system based on the Hidden Markov Model. To further improve the accuracy on unknown words, we introduce features of Chinese characters and words, such as the radicals of Chinese characters and reduplicated words, and give a theoretical analysis of their feasibility (see the sketch below). Experiments indicate that these features improve the accuracy on unknown words by 1.16% and reduce the error rate by 7.40%.
    Key words: computer application; Chinese information processing; part-of-speech tagging; SVMTool; unknown word; radicals of Chinese characters
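    A minimal sketch of the kind of character-level features described above (radicals and reduplication), assuming a radical lookup table; the tiny RADICALS dictionary and the feature names are illustrative, and the exact feature templates fed to SVMTool are not given in the abstract.

        RADICALS = {"河": "氵", "湖": "氵", "想": "心"}  # tiny illustrative radical lookup

        def char_features(word):
            feats = {}
            # Radical of each character helps guess the POS of unknown words.
            for i, ch in enumerate(word):
                if ch in RADICALS:
                    feats[f"radical_{i}"] = RADICALS[ch]
            # Reduplication patterns such as AA or AABB often mark verbs/adjectives.
            if len(word) == 2 and word[0] == word[1]:
                feats["redup_AA"] = True
            if len(word) == 4 and word[:2] == word[0] * 2 and word[2:] == word[2] * 2:
                feats["redup_AABB"] = True
            return feats

        print(char_features("高高兴兴"))  # -> {'redup_AABB': True}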
  • Review
    MA Xu, XU Weiran, GUO Jun, HU Rile
    (. Peking University Health Science Center, Beijing 0008, China;
    . School of Information and Communication Engineering, Beijing University of Posts and Telecommunications,
    Beijing 00876, China; . Nokia Research Center(China), Beijing 000, China)
    2009, 23(4): 22-27.
    With the popularity of short messages, smart SMS tools are urgently demanded by users, operators and government departments. However, there is no open standard SMS corpus, an indispensable resource for algorithm research, system development, performance testing, etc., owing to technological, copyright, privacy and other constraints. SMS-2008, an annotated Chinese SMS corpus, takes the lead in establishing a multi-purpose Chinese text-message corpus, which includes the original corpus, a privacy-tagged corpus, a content-tagged corpus and an errors-tagged corpus. The corpus can be applied in research on SMS language, SMS classification, privacy-protection algorithms and automatic correction systems.
    Key words: computer application; Chinese information processing; Chinese short message; tagged corpus
  • Review
    XU Yongdong, WANG Yadong, LIU Yang, WANG Wei, QUAN Guangri
    2009, 23(4): 27-34.
    Sentence ordering is a key issue in multi-document automatic summarization, influencing the fluency and readability of the summary, and temporal information processing is the bottleneck that determines the quality of the ordering algorithm. Traditional ordering methods ignore this factor because temporal information processing is very difficult, and as a result they cannot achieve stable, high-quality orderings. To address this issue, this paper proposes an algorithm for Chinese text temporal information extraction, semantic computation and temporal reasoning. Then, based on the majority ordering strategy and the computation of sentence similarity, we propose a sentence ordering algorithm based on temporal information (see the sketch below). Experiments show that this algorithm outperforms the classical majority ordering algorithm and the chronological ordering algorithm.
    Key words: computer application; Chinese information processing; multi-document automatic summarization; sentence ordering; Chinese temporal information processing
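    A minimal sketch of an ordering rule in the spirit of the abstract: extracted timestamps dominate, and a majority-ordering preference (votes across source documents) breaks ties. The paper's temporal extraction, semantic computation and similarity measures are assumed away behind the two callback arguments.

        import functools

        def order_sentences(sentences, get_time, prefer_before):
            """get_time(s) -> comparable timestamp or None;
            prefer_before[(a, b)] -> votes that a preceded b in the source documents."""
            def cmp(a, b):
                ta, tb = get_time(a), get_time(b)
                if ta is not None and tb is not None and ta != tb:
                    return -1 if ta < tb else 1          # temporal order dominates
                votes = prefer_before.get((a, b), 0) - prefer_before.get((b, a), 0)
                return -1 if votes > 0 else (1 if votes < 0 else 0)  # majority ordering
            return sorted(sentences, key=functools.cmp_to_key(cmp))

        print(order_sentences(["later event", "earlier event"],
                              get_time={"later event": 2, "earlier event": 1}.get,
                              prefer_before={}))  # -> ['earlier event', 'later event']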
  • Review
    MENG Xiaoliang, HOU Min
    2009, 23(4): 34-40.
    As a common discourse phenomenon, discourse markers have become an important subject in discourse analysis, yet owing to varying research perspectives there remain substantial differences in how they are understood and classified. From the perspective of style, this paper proposes the concept of a “style degree” for discourse markers, hypothesizing that they bear certain stylistic features. The distribution of sampled discourse markers in corpora of different styles shows obvious distinctions, and a Rocchio classifier based on these markers classifies texts with a precision of 82.9% (see the sketch below). It is concluded that the stylistic features of discourse markers are valuable in text classification.
    Key words: computer application; Chinese information processing; discourse marker; stylistic feature; style degree; similarity; classification of texts
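    A minimal sketch of Rocchio classification over discourse-marker frequency vectors, since the abstract names the Rocchio method; the marker counts, style labels and cosine similarity here are illustrative assumptions.

        import numpy as np

        def train_rocchio(X, y):
            """Centroid of the marker-frequency vectors of each style class."""
            return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

        def classify(x, centroids):
            """Assign the style whose centroid is most cosine-similar to x."""
            def cos(a, b):
                return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)
            return max(centroids, key=lambda c: cos(x, centroids[c]))

        # Each column counts one discourse marker; each row is a document.
        X = np.array([[5., 0.], [4., 1.], [0., 6.], [1., 5.]])
        y = np.array(["spoken", "spoken", "written", "written"])
        print(classify(np.array([4., 1.]), train_rocchio(X, y)))  # -> "spoken"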
  • Review
    LAI Maosheng, QU Peng
    2009, 23(4): 40-48.
    Queries are Web users’ primary means of expressing their information needs in searching, and the related terms provided by systems are a useful tool for refining queries. This paper focuses on queries and related terms, describing and analyzing them from the perspective of user behavior. Log mining is used to give descriptive statistics on query words; qualitative categorization then divides the query words into primary and auxiliary keywords, and the result of this analysis is compared with that of a questionnaire survey. The main findings are as follows: users rely heavily on auxiliary keywords; the content of primary keywords is relatively concentrated; queries are short and the query syntax is simple. From both the questionnaire and the controlled experiment, we find that users recognize related terms readily but use them little. The study provides empirical results for understanding users’ language use, as well as data for search engines to refine their indexes.
    Key words: computer application; Chinese information processing; Chinese search engines; information behavior; language utilization; log mining; questionnaire survey; controlled experiment
  • Review
    TIAN Baoming, DAI Xinyu, CHEN Jiajun
    2009, 23(4): 48-55.
    The term-based Vector Space Model (VSM) is the traditional approach to representing documents, but it neglects the relations between terms. To capture these relations, latent-topic-based document representations such as LDA (Latent Dirichlet Allocation) have attracted much attention recently; however, a purely latent-topic-based representation may lose information carried by the terms. In this paper, we use a modified random forests method to combine the term-based and the LDA-topic-based document representations: a random forest is constructed separately for each representation, and the final classification is decided by a voting scheme (see the sketch below). Experimental results on several standard datasets show that, compared with methods using only one set of text features, our method efficiently combines the two representations and improves text categorization performance.
    Key words: computer application; Chinese information processing; text categorization; VSM; latent Dirichlet allocation; ensemble classification; random forests
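    A minimal sketch of the two-view ensemble described above: one forest on term (VSM) features, one on LDA topic features, combined by a soft vote. scikit-learn's stock RandomForestClassifier stands in for the paper's modified random forests.

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier

        def train_two_view(X_terms, X_topics, y):
            """Fit one forest per document representation (same labels y)."""
            rf_terms = RandomForestClassifier(n_estimators=100).fit(X_terms, y)
            rf_topics = RandomForestClassifier(n_estimators=100).fit(X_topics, y)
            return rf_terms, rf_topics

        def predict_vote(rf_terms, rf_topics, X_terms, X_topics):
            # Soft vote: average the class-probability estimates of the two forests.
            p = rf_terms.predict_proba(X_terms) + rf_topics.predict_proba(X_topics)
            return rf_terms.classes_[np.argmax(p, axis=1)]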
  • Review
    YU Zhenshan, HUANG Liusheng, CHEN Zhili, LI Lingjun, YANG Wei, ZHAO Xinxin
    2009, 23(4): 55-63.
    Text steganography is a method of concealing secrets in texts. Unlike cryptography, which encrypts plain text into meaningless strings, text steganography generates innocuous stego-texts that arouse little suspicion. However, compared with other types of multimedia such as image and video, text is not a well-developed carrier for information hiding because of its low redundancy and the consequently low embedding ratio. A novel text steganography algorithm using Ci-poetry of the Song Dynasty is proposed in this paper, and a system composed of the encoder, the decoder, the lexicon and the tune template is realized. Secret messages are embedded into stego-Cis of a tune with the proper number of lines, words, sentence patterns, rhythm and rhyme (see the sketch below). The system reaches a 16% embedding ratio while ensuring linguistic robustness. This is, to the best of our knowledge, the first text steganography algorithm making use of a special type of literature.
    Key words: computer application; Chinese information processing; information hiding; text steganography; embedding ratio; linguistic security; Ci-poetry of the Song Dynasty; tune
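    A minimal sketch of template-based text steganography in the spirit of the abstract: at each slot of a tune template, the secret bits select one word from a list of interchangeable candidates. The toy English lexicon and template are stand-ins for the paper's Ci-poetry resources and rhyme/rhythm constraints.

        def embed(bits, template):
            """template: list of candidate-word lists; a slot with 2^k candidates
            consumes k bits of the secret message."""
            out, i = [], 0
            for candidates in template:
                k = max(len(candidates).bit_length() - 1, 0)  # bits this slot carries
                out.append(candidates[int(bits[i:i + k] or "0", 2)])
                i += k
            return " ".join(out)

        def extract(stego_words, template):
            bits = ""
            for word, candidates in zip(stego_words, template):
                k = max(len(candidates).bit_length() - 1, 0)
                bits += format(candidates.index(word), f"0{k}b") if k else ""
            return bits

        template = [["moon", "frost", "mist", "rain"], ["cold", "pale"]]
        stego = embed("101", template)                    # -> "mist pale"
        assert extract(stego.split(), template) == "101"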
  • Review
    WANG Shi, CAO Cungen
    2009, 23(4): 63-71.
    WordNet is an important English lexical semantic knowledge base. This paper presents WNCT, a method for automatically translating WordNet synsets into Chinese. WNCT first uses dictionaries and term-translation tools to translate the senses of the English words in WordNet into Chinese, and then treats the selection of the correct translation for each word in a synset as a classification problem (see the sketch below). The classification model is trained on 12 features extracted from the uniqueness of a translation, the translation intersections within and between concepts, the construction rules for Chinese phrases, and PMI-based translation relevance. Experimental results show that WNCT achieves 85.21% coverage and 81.37% accuracy for the Chinese translation of the synsets in WordNet 3.0.
    Key words: artificial intelligence; machine translation; WordNet translation; word translation; translation disambiguation; Chinese lexical knowledge base; Chinese information processing
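    A minimal sketch of treating translation selection as classification, as the abstract describes: each candidate Chinese translation gets a feature vector and a trained model accepts or rejects it. The three placeholder fields only gesture at the paper's 12 features, and the model is any pre-trained scikit-learn-style classifier; all names here are assumptions.

        def features(cand):
            # Three of the twelve cue types the paper names, as placeholder fields:
            return [cand["is_unique_translation"],   # sense has a single dictionary translation
                    cand["intersection_score"],      # translation shared within/between concepts
                    cand["pmi_relevance"]]           # PMI-based translation relevance

        def select_translations(candidates, model, threshold=0.5):
            """Keep the candidate Chinese translations the trained classifier accepts."""
            return [c["chinese"] for c in candidates
                    if model.predict_proba([features(c)])[0, 1] > threshold]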
  • Review
    LIU Changqing
    2009, 23(4): 71-77.
    Research on Xixia characters has advanced rapidly in recent years, and a large number of Xixia documents have been published in their original forms at home and abroad, so fast digitization of these documents is of great importance. Based on the level set technique, we first smooth the document images, and then extract the contours of Xixia characters by level set evolution (see the sketch below). The level set evolution function is discretized in space by a fourth-order symmetric compact finite difference scheme, and the narrow-band algorithm and global optimization methods are adopted in the computation. Experiments show the method to be effective and applicable to extracting relatively accurate contours of Xixia characters.
    Key words: artificial intelligence; pattern recognition; Xixia character information processing; level set method; Xixia characters; contour extraction; compact difference
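    A minimal sketch of level-set contour evolution with a naive explicit finite-difference update; this is an assumption-laden stand-in for, not a reproduction of, the paper's fourth-order compact scheme and narrow-band optimization.

        import numpy as np

        def evolve(phi, F, h, dt=0.01, steps=30):
            """Explicit update for phi_t + F*|grad phi| = 0; the character contour
            is the zero level set of phi (negative inside, positive outside)."""
            for _ in range(steps):
                gy, gx = np.gradient(phi, h)
                phi = phi - dt * F * np.sqrt(gx**2 + gy**2)
            return phi

        # Toy example: shrink a circular contour. Real use would initialize phi
        # around a smoothed Xixia character image and derive F from image gradients.
        h = 2.0 / 63                                # grid spacing of the 64x64 domain
        y, x = np.mgrid[-1:1:64j, -1:1:64j]
        phi = evolve(np.sqrt(x**2 + y**2) - 0.5, F=-1.0, h=h)  # F < 0 shrinks the contour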
  • Review
    FU Qiang, SONG Yan, DAI Lirong
    2009, 23(4): 77-82.
    In a language identification system, performance is substantially affected by session variability, including speaker variability, channel variability, etc. In this paper, factor analysis is introduced to estimate the session variability subspace. According to the characteristics of the language identification task, the statistical model construction algorithm is discussed, and both model-domain and feature-domain compensation methods are proposed (see the sketch below). On the NIST LRE 2007 30s test corpus, experimental results show the advantage of the proposed method, with a relative reduction in equal error rate (EER) of about 36.5% compared with the baseline GMM-UBM system.
    Key words: computer application; Chinese information processing; language identification; GMM model; factor analysis
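    A rough illustration of feature-domain compensation in the factor-analysis style described above: a session offset confined to a learned low-rank subspace U is estimated and subtracted. Estimating U itself (typically by EM over many training sessions) and the model-domain variant are outside this sketch; all shapes are assumptions.

        import numpy as np

        def compensate(features, U):
            """features: (T, D) frames of one utterance; U: (D, R) session subspace.
            Estimate the session factor h by least squares and remove U @ h per frame."""
            m = features.mean(axis=0)                      # utterance-level offset
            h, *_ = np.linalg.lstsq(U, m, rcond=None)      # session factor estimate
            return features - U @ h                        # broadcast over all frames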
  • Review
    NI Chongjia, LIU Wenju, XU Bo
    2009, 23(4): 82-88.
    Analyzing and modeling the information structure and prosodic structure of a sentence or discourse is key to improving the naturalness of speech synthesis and reducing the error rate of speech recognition. Based on a large speech corpus with prosodic structure labels (ASCCD), this paper presents statistics on duration and pitch characteristics. The first finding is that a prosodic boundary obviously lengthens syllable duration, and that different tones and accents lengthen it to different degrees. The second is that the break duration at prosodic boundaries, especially minor prosodic boundaries, is even more distinctive. F0 reset regularly occurs between prosodic phrases; the F0 bottom line always declines, the F0 top line declines after an accent, and at accent positions the pitch range is large and the top line is high.
    Key words: computer application; Chinese information processing; major prosodic phrase (MAP); minor prosodic phrase (MIP); duration; pitch
  • Review
    LI Yanping, TANG Zhenmin, ZHANG Yan, DING Hui
    2009, 23(4): 88-95.
    This paper presents a new discriminative feature based on adaptive frequency warping. Based on a discriminative analysis of the frequency components and their quantified results, the new feature is extracted by non-uniform sub-band filters designed according to adaptive frequency warping in different frequency bands (see the sketch below). Furthermore, to overcome the mismatch between training and testing speech in noisy environments, we apply pre-enhancement before feature extraction. A series of controlled experiments shows that the proposed feature is insensitive to speech content and thus more discriminative and robust than the conventional Mel-frequency cepstral coefficients. The experimental results demonstrate that combining pre-enhancement with the proposed feature leads to a noticeable improvement in speaker recognition rate and robustness.
    Key words: computer application; Chinese information processing; speaker identification; adaptive frequency warping; discriminative feature; robustness
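    A minimal sketch of a non-uniform triangular filter bank whose band edges follow an arbitrary warping function; the paper's adaptive, discrimination-driven warping is replaced here by a generic warp/unwarp pair (the Mel scale below is only a stand-in example).

        import numpy as np

        def filterbank(n_filters, n_fft, sr, warp, unwarp):
            """Triangular filters whose centers are uniform on the warped axis."""
            pts = unwarp(np.linspace(warp(0.0), warp(sr / 2.0), n_filters + 2))
            bins = np.floor((n_fft + 1) * pts / sr).astype(int)
            fb = np.zeros((n_filters, n_fft // 2 + 1))
            for i in range(n_filters):
                l, c, r = bins[i], bins[i + 1], bins[i + 2]
                fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
                fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
            return fb

        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)    # stand-in warp (Mel)
        imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        fb = filterbank(24, 512, 16000, mel, imel)            # applied to power spectra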
  • Review
    WAN Jiping, XIAO Yunpeng, YE Weiping
    2009, 23(4): 95-103.
    Detecting and correcting pronunciation errors is very important in pronunciation learning. Automatic Pronunciation Error Detection (APED), the technique of detecting pronunciation errors in the speech stream, is one of the main research issues in Computer Assisted Pronunciation Training (CAPT). This paper reviews the literature, introducing three APED methods in detail: APED based on automatic speech recognition (ASR), APED based on pronunciation error networks, and the acoustic-phonetic approach. It also summarizes the applications of APED in CAPT and the automatic pronunciation evaluation technologies for Mandarin. Finally, the paper gives some analysis and suggestions for research on automatic pronunciation error detection.
    Key words: computer application; Chinese information processing; automatic pronunciation error detection; computer assisted language learning; computer assisted pronunciation training; pronunciation evaluation; automatic speech recognition
  • Review
    Hankiz Ilahun, Zulfiya Aman, Askar Hamdulla
    2009, 23(4): 103-107.
    To improve the naturalness of speech synthesis, this paper investigates the acoustic features of 63 monosyllabic words with consonant clusters from the “Uyghur voice acoustic parameters database”, recorded by one male and one female speaker. We focus on the combination rules and statistics of consonant clusters in Uyghur monosyllabic words. From the viewpoint of language typology, monosyllabic words containing consonant clusters in modern Uyghur show a fixed acoustic pattern: the first consonant is shorter but stronger in intensity than the second. In contrast, the combination of consonants is not fixed, because the composition of consonant clusters remains open.
    Key words: computer application; Chinese information processing; Uyghur language; consonant cluster; acoustic analysis; acoustic parameters
  • Review
    CAI Rangjia
    2009, 23(4): 107-113.
    For automatic segmentation and POS tagging, this paper proposes a Tibetan word category system and an annotation scheme after careful analysis of a large Tibetan corpus. According to the practical demands of Tibetan corpus processing, Tibetan words are first divided into several main categories according to whether they are content words or function words, and then several fine-grained sub-categories are further proposed. The framework has proved valid for processing a Tibetan corpus of 10 million characters.
    Key words: computer application; Chinese information processing; corpus; Tibetan phrases; category; tag set
  • Review
    TASHI Gyal, ZHU Jie
    2009, 23(4): 113-118.
    Automatic word segmentation is essential to Tibetan information processing and a key technology in the intelligent Tibetan information processing area. Standards for word classes and word segmentation are a prerequisite for this task, so this paper first classifies Tibetan words according to the requirements of Tibetan information processing, and then provides a systematic and applicable word segmentation scheme.
    Key words: computer application; Chinese information processing; segmentation scheme; Tibetan; information processing
  • Review
    ZHANG Qing, HUANG Heming, ZHANG Dengyi
    2009, 23(4): 118-124.
    At present, publishing systems such as Bei Da Fang Zheng and Hua Guang are widely used in the printing industry for issuing Tibetan publications in domestic minority areas. Because these systems use different encodings, valuable Tibetan electronic resources cannot be exchanged and shared. This paper proposes a solution for converting Tibetan codes from the various systems into the international standard, and realizes a conversion system from the Hua Guang Windows encoding of Tibetan to the ISO/IEC 10646 encoding, using a sub-table-and-group hashing strategy (see the sketch below).
    Key words: computer application; Chinese information processing; Tibetan; character encoding standard; code conversion; encoding sort; query
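    A minimal sketch of table-driven code conversion from a legacy Tibetan encoding to ISO/IEC 10646. The mapping entries below are fabricated placeholders, not actual Hua Guang code points; a real table covers the whole legacy code page, which the paper additionally splits into hashed sub-tables and groups for speed.

        LEGACY_TO_UCS = {
            0xA1: "\u0F40",   # placeholder: legacy byte -> TIBETAN LETTER KA
            0xA2: "\u0F41",   # placeholder: legacy byte -> TIBETAN LETTER KHA
        }

        def convert(legacy_bytes):
            """Map each legacy code unit through the table; pass unknowns through."""
            return "".join(LEGACY_TO_UCS.get(b, chr(b)) for b in legacy_bytes)

        print(convert(bytes([0xA1, 0xA2])))   # -> "ཀཁ"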
  • Review
    TIAN Shengwei, Turgun Ibrahim, YU Long
    2009, 23(4): 124-129.
    Efficient retrieval of candidate translation examples from a large-scale translation example base is a fundamental issue in the study of EBMT. This paper proposes a Uyghur hash function designed according to the distribution of Uyghur words and characters which, under the equiprobability assumption, achieves an average search length of 1.59. To resolve collisions in the hash table, a new mechanism named the second optimal tree for synonyms is established according to the frequency of the colliding Uyghur words (see the sketch below). Experiments show that the proposed approach improves performance by 27.5% and 21.8% over the sequential-chain and binary search approaches, respectively.
    Key words: computer application; Chinese information processing; EBMT; hash; average search length; second optimal tree
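    A minimal sketch of a chained hash table for word lookup plus the average search length (ASL) statistic quoted above. Python's built-in string hash and plain chains stand in for the paper's Uyghur-specific hash function and its frequency-ordered "second optimal tree" for colliding words.

        def build(words, size):
            table = [[] for _ in range(size)]
            for w in words:
                table[hash(w) % size].append(w)     # colliding words share a chain
            return table

        def average_search_length(table):
            """Mean comparisons over all stored words: the i-th word in a chain
            costs i + 1 comparisons to find."""
            total = n = 0
            for chain in table:
                for i, _ in enumerate(chain):
                    total += i + 1
                    n += 1
            return total / n if n else 0.0

        table = build(["alma", "kitab", "su", "yol", "at"], size=8)
        print(average_search_length(table))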