2010 Volume 24 Issue 4 Published: 16 August 2010
  

    Review
  • Review
    WU Tong, ZHOU Yaqian, HUANG Xuanjing, WU Lide
    2010, 24(4): 3-11.
    This paper proposes a generic algorithm for the Time Expression Recognition (TER) task based on regular expressions. The algorithm generates rules based on the “Basic Time Unit”, which improves recall. It then prunes the rule collection with an error-driven method, reducing the “noise” picked up from the training corpus and leading to high precision. Together, these two features yield significantly better performance than the baseline system, with an F-score of up to 89.9% on the ACE07 Chinese Corpus. In addition, the proposed algorithm has good adaptability and scalability for broader application.
    Key words: computer application; Chinese information processing; time expression recognition; basic time unit; Timex2; error-driven; regular expression
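
As a rough illustration of the rule-based idea in this abstract, the sketch below joins a few hand-written "basic time unit" patterns into one regular expression and applies it to text. The `BASIC_UNITS` table, the three unit patterns, and the `find_time_expressions` helper are assumptions made for illustration; the paper's rule generation and error-driven pruning are not reproduced here.

```python
import re

# Hypothetical "basic time units": unit name -> regular expression.
BASIC_UNITS = {
    "year": r"\d{4}年",
    "month": r"\d{1,2}月",
    "day": r"\d{1,2}[日号]",
}

# One combined rule: a run of one or more basic time units.
TIME_RULE = re.compile("(?:" + "|".join(BASIC_UNITS.values()) + ")+")


def find_time_expressions(text):
    """Return every substring of text matched by the combined rule."""
    return [match.group(0) for match in TIME_RULE.finditer(text)]


found = find_time_expressions("会议于2010年8月16日举行,8月底结束。")
print(found)
```

Composing rules from small units lets one pattern cover both full dates and partial ones, which is one plausible reading of why unit-based rules help recall.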
  • Review
    HUANG Chen1,2, QIAN Longhua1, ZHOU Guodong1, ZHU Qiaoming1
    2010, 24(4): 11-18.
    This paper proposes a convolution tree kernel-based approach for unsupervised Chinese entity relation extraction. The method first represents potential relation instances as shortest path-enclosed trees, then computes similarities between them using a convolution tree kernel, and finally groups them into clusters through hierarchical clustering algorithms. Evaluation on the ACE RDC 2005 benchmark corpus shows that the approach achieves an F-measure of up to 60.1 on the task of unsupervised Chinese entity relation extraction, suggesting that the method is promising.
    Key words: computer application; Chinese information processing; entity relation extraction; unsupervised learning; convolution tree kernel
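
The final clustering step described above can be sketched as follows, assuming the kernel similarities have already been computed. The similarity values in `SIM` are made up, and average-link merging with a stopping `threshold` is an illustrative choice, not necessarily the clustering variant used in the paper.

```python
def average_similarity(sim, cluster_a, cluster_b):
    """Average pairwise similarity between members of two clusters."""
    total = sum(sim[i][j] for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerative_clusters(sim, threshold):
    """Merge the two most similar clusters until none reach threshold."""
    clusters = [[i] for i in range(len(sim))]
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = average_similarity(sim, clusters[a], clusters[b])
                if score > best_score:
                    best_score, best_pair = score, (a, b)
        if best_score < threshold:
            break
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Made-up kernel similarities for four relation instances.
SIM = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
clusters = agglomerative_clusters(SIM, threshold=0.5)
print(clusters)
```

Because the procedure only needs pairwise similarities, any kernel that scores two relation instances could be plugged in without changing the clustering code.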
  • Review
    REN Hui, LIN Hongfei, YANG Zhihao
    2010, 24(4): 18-25.
    Overlapping ambiguity strings (OAS) are one of the difficulties in automatic Chinese word segmentation. This paper treats the resolution of OAS as a classification task and solves it with a maximum entropy model that integrates character features. To overcome data sparseness in maximum entropy modeling, the paper introduces inequality smoothing and Gaussian smoothing. We compare Gaussian smoothing, inequality smoothing and frequency discounting on the four datasets of the Second International Chinese Word Segmentation Bakeoff, showing that Gaussian smoothing and inequality smoothing are much better than the discount method, while inequality smoothing also integrates feature selection seamlessly into parameter estimation and yields a significantly compressed model. On the four datasets, the disambiguation precision of the proposed method reaches 96.27%, 96.83%, 96.56% and 96.52% respectively, with a relative improvement of 5.87%, 5.64%, 5.00% and 5.00% from the rich features and a relative improvement of 5.87%, 5.64%, 5.00% and 5.00% from the smoothing technology. Meanwhile, inequality smoothing compresses the classification models by 38.7, 19.9, 44.6 and 9.7.
    Key words: computer application; Chinese information processing; word segmentation; overlapping ambiguity strings; character feature; maximum entropy model; smoothing technology
  • Review
    LIN Chen, LI Bicheng, ZHOU Jie
    2010, 24(4): 25-32.
    Persons are an important object of comment in Netnews oral reviews, so identifying person names is essential to sentiment analysis of such reviews. This paper presents an efficient method for identifying person names based on the text features of Netnews oral reviews. The method first evaluates the reliability of a word as part of a person object, using multi-frequency as the discriminating clue. Second, windows are set up according to the clues, and an improved frequent pattern mining algorithm is applied to obtain the candidates. Finally, the results are optimized in a series of steps. The experimental results show that the method can efficiently identify the full person names commented on in Netnews oral reviews.
    Key words: computer application; Chinese information processing; public opinion on the Internet; oral reviews; person names; frequent pattern mining
  • Review
    TAO Xianjun1, WU Xiaojun2, WANG Xiaodong1, ZHENG Fang2
    2010, 24(4): 39-44.
    In applications of natural language processing, especially the processing of spoken or web text, errors in word spelling and/or sentence structure are common in the text to be processed. This paper describes a robust parsing algorithm based on the chart parsing method. It identifies mistakes in the strings left unrecognized by domain vocabulary-based word segmentation and corrects them according to the terminal information extracted from the current active arcs and the rule set. The experimental results show that, with error detection and correction by homonymous matching of pinyin syllables, the algorithm improves the acceptance rate by 14.78% at the cost of a 9.363% increase in the average number of loops, compared with the robust parsing method of Yan.
    Key words: computer application; Chinese information processing; chart parsing methods; robustness; error detection
  • Review
    QIAO Jianmin, ZHANG Yangsen
    2010, 24(4): 44-52.
    Word sense disambiguation (WSD) is an important issue with wide application in natural language processing. The consistency of word sense tagging directly affects the quality of a corpus and, in turn, its applications. Owing to the complexity and flexibility of language and the defects of the algorithms, current WSD models cannot tag word senses perfectly; that is, WSD results are prone to errors and inconsistencies. Manual checking, on the other hand, is costly in time and investment. Based on a survey of the “People's Daily” corpus, sentence similarity computation and “Hownet”, this paper presents a method for checking the consistency of word sense tagging in the “People's Daily” corpus. The experimental results show the feasibility of the method.
    Key words: computer application; Chinese information processing; WSD; word sense tagging consistency; hownet; corpus; sentence similarity computation
  • Review
    JIN Yanan1,2, LI Ruixuan1, WEN Kunmei1, GU Xiwu1, LU Zhengding1, DUAN Dongsheng1
    2010, 24(4): 52-63.
    Social annotation, as a new form of resource management and organization, has become a popular service on the Internet and enterprise networks. It serves four purposes: marking up, classification, resource detection and semantic features, which help users find what they want. Hence, it is natural to use social annotation in information retrieval. This paper first introduces the concept, objects and methods of social annotation, then surveys classification, community detection and semantic search with social annotation. Finally, it discusses the challenges and future work of social annotation research.
    Key words: computer application; Chinese information processing; social annotation; information retrieval; community detection; auto-annotation; classification
  • Review
    SONG Le1,2, HE Tingting1,2, WANG Qian1,2, WEN Bin1,2
    2010, 24(4): 63-68.
    This paper presents a new method for computing the semantic orientation of Chinese words based on How-net. It uses the subjective sememes of How-net to calculate a new similarity, called polarity similarity, and determines the semantic orientation of a word from its polarity value. Experimental results show that the proposed method identifies word semantic orientation effectively, performing best in the subjective word polarity analysis subtasks of the First Chinese Opinion Analysis Evaluation (COAE2008).
    Key words: computer application; Chinese information processing; subjective sememe; polarity similarity; polarity value
  • Review
    CHEN Xiang, LIN Hongfei, YANG Zhihao
    2010, 24(4): 68-74.
    A bilingual lexicon of biomedical terms plays an important role in biomedical cross-language information retrieval. Sentence alignment is the first step in building a bilingual lexicon. This paper applies the Gaussian mixture model and transfer learning to align sentences. The basic idea is to treat sentence alignment as a classification task, which can be solved by Gaussian mixture model classifiers based on the anchor information included in medical literature abstracts. At the same time, the sentence alignment model is built by combining biomedical literature abstracts with New Concept English corpora, applying transfer learning to train the length features and transfer them to the model. The experiments show that this improves the performance of the sentence alignment model.
    Key words: computer application; Chinese information processing; sentence alignment; Gaussian mixture model; transfer learning; anchor information
  • Review
    LI Maoxi, ZONG Chengqing
    2010, 24(4): 74-85.
    This paper presents a survey of system combination for machine translation (MT). According to the level at which the outputs of different machine translation systems are combined, we classify the approaches to system combination into three types: sentence-level combination, phrase-level combination, and word-level combination. The representative work for each type is discussed, including the methods exploited, the confidences estimated, and the decoding algorithms, as well as the monolingual sentence alignment approaches that are used to build the confusion network in the word-level system combination method. Finally, we discuss the three combination approaches and compare them with each other. The future development prospects of MT system combination are also discussed.
    Key words: artificial intelligence; machine translation; system combination; minimum Bayes-risk decoding; confusion network decoding; word alignment
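
To make the word-level combination idea concrete, here is a heavily simplified sketch that picks the majority word in each slot of a confusion network. It assumes the system outputs are already aligned slot by slot, with a placeholder token marking an empty slot; the `vote_confusion_network` helper, the `<eps>` token, and the sample hypotheses are all hypothetical, and real systems weight the votes with confidences and language model scores.

```python
from collections import Counter


def vote_confusion_network(aligned_hypotheses, empty="<eps>"):
    """Pick the majority word in each slot of pre-aligned hypotheses."""
    output = []
    for slot in zip(*aligned_hypotheses):
        word, _count = Counter(slot).most_common(1)[0]
        if word != empty:  # an empty slot contributes no word
            output.append(word)
    return output


# Made-up outputs of three MT systems, aligned to four slots.
HYPS = [
    ["the", "cat", "sat", "<eps>"],
    ["the", "cat", "sits", "down"],
    ["a", "cat", "sat", "down"],
]
combined = vote_confusion_network(HYPS)
print(combined)
```

The hard part in practice is the monolingual alignment that produces the slots, which this sketch takes as given.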
  • Review
    WANG Peng, HU Yu, DAI Lirong, LIU Qingfeng
    2010, 24(4): 85-91.
    Mandarin is a tonal language, and tone information plays a key role in Mandarin speech recognition. Within the framework of the HMM (Hidden Markov Model), how to use tone information effectively is an important and open research issue. State-of-the-art Mandarin speech recognition systems apply tone information in two ways. One is the Embedded Tone Model, in which tone-related features are appended to spectral features to form augmented acoustic feature vectors for training the HMM. The other is the Explicit Tone Model, in which tone modeling is separated from syllable modeling and the tone model is applied to optimize the existing decoding network. This paper presents a way to combine the two methods for isolated word recognition in Mandarin. First, we obtain the N-best items with an Embedded Tone Model based on a two-stream model rather than the conventional single-stream model. Then an Explicit Tone Model based on a left-dependent tonal model is established to re-score the N-best items. The proposed method achieves over 5.0% absolute improvement on average across all test sets, and up to 5.36% absolute improvement on the NoiseCar test set, compared with the traditional model without tone information.
    Key words: computer application; Chinese information processing; Mandarin speech recognition; tone information; tone model; two-stream model
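
The two-pass scheme in this abstract, a first pass that produces N-best items followed by re-scoring with a separate tone model, can be sketched as below. The linear combination with `tone_weight`, the `rescore_nbest` helper, and all scores are assumptions for illustration; the abstract does not specify how the two scores are combined.

```python
def rescore_nbest(nbest, tone_score, tone_weight=0.5):
    """Re-rank first-pass hypotheses by adding a weighted tone score.

    nbest: list of (word, first_pass_log_score) pairs.
    tone_score: function mapping a word to a tone-model log score.
    """
    rescored = [
        (word, score + tone_weight * tone_score(word))
        for word, score in nbest
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Made-up scores: the first pass prefers "mai3", the tone model "mai4".
NBEST = [("mai3", -10.0), ("mai4", -10.5)]
TONE_SCORES = {"mai3": -4.0, "mai4": -1.0}

ranked = rescore_nbest(NBEST, TONE_SCORES.get, tone_weight=1.0)
print(ranked[0][0])
```

With `tone_weight=0.0` the first-pass ranking is returned unchanged, which makes the weight a convenient knob for checking how much the tone model contributes.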
  • Review
    TANG Lin1, HUANG Jianzhong1, YIN Junxun2
    2010, 24(4): 91-96.
    This paper applies several kinds of knowledge to syllable separation: prior information from the standard text of the speech in the Mandarin proficiency test, the duration of initials in Mandarin speech, which is stable at normal speaking speed, the proportion of initial durations to final durations in a speaker's speech, and so on. A two-level syllable segmentation algorithm is proposed that uses the accumulated energies of three wavelets reconstructed from the wavelet transform. The experimental results demonstrate that the accuracy of syllable separation reaches at least 98.3%.
    Key words: computer application; Chinese information processing; syllable segmentation; speech signal processing; Mandarin proficiency test
  • Review
    MEI Xiao1, XIONG Ziyu2
    2010, 24(4): 96-104.
    Based on a large-scale phonetic database, this paper establishes a predictive model to investigate the factors that affect changes in rhyme duration and the relationship between variation type and prosodic structure in continuous speech. Preliminary results demonstrate that prosodic structure has little effect on the durational changes of initials, while it triggers significant changes in the duration of rhymes. Specifically, the effect can be summarized as follows: 1) the lengthening of the rhyme in the final syllable is closely related to prosodic structure, i.e., rhyme duration is lengthened at major prosodic phrase boundaries and intonational phrase boundaries, but not within a prosodic word or at a prosodic word boundary; 2) rhyme duration shows no consistent behavior at minor prosodic phrase boundaries, which deserves further research.
    Key words: computer application; Chinese information processing; initial duration; rhythm duration; prosodic structure
  • Review
    WANG Lu1,2, ZHAO Xinru1, XIE Zan2, YAN Zhiyu2, TAN Junhua1, XIAO Yunpeng2, LI Qiao2, ZHANG Xuebo2, YE Weiping2
    2010, 24(4): 104-111.
    This paper analyzes data from the National Putonghua Proficiency Test (NPPT), revealing that the proficiency of test takers is influenced by their dialect background and professional background. Test takers majoring in literature achieve higher proficiency in the NPPT. Further, the phonemes most likely to be mispronounced are collected through a study of error-prone syllables. The study suggests that good Putonghua pronunciation requires emphasis on the retroflex fricatives, alveolar fricatives, alveolo-palatal fricatives, nasal vowels and the falling-rising tone. In addition, an analysis of inter-rater agreement shows a strong correlation between the subject scores, and thus the final NPPT scores are relatively objective.
    Key words: computer application; Chinese information processing; National Putonghua Proficiency Test; dialect background; academic background; syllable error-prone; correlation of the score
  • Review
    Muheyat Niyazibek1, Kunsaule Talp2
    2010, 24(4): 111-114.
    In the era of networked information, Kazakh information processing, as a subfield of Chinese information processing, is gradually attracting the attention of researchers. This paper introduces the state of the art of Kazakh information processing, along with some key concepts and basic issues in this research. It also presents the challenges in Kazakh information processing and discusses its prospects.
    Key words: computer application; Chinese information processing; Kazakh language; information processing; key concepts
  • Review
    Alifu Kuerban1, Wumaierjiang Kuerban2, Nijat Abdurusul1
    2010, 24(4): 114-119.
    This paper presents a preliminary discussion of, and attempt at, a frame semantics description system for the Uyghur language, together with the tree structure of Uyghur frame semantics documents. According to the goals of Uyghur Framenet and the characteristics of the frame semantic network, the paper focuses on storing Uyghur frame semantics in a database and designs the conceptual model of Uyghur Framenet. It provides a feasible technical roadmap for the construction of Uyghur Framenet.
    Key words: computer application; Chinese information processing; Uyghur language; frame semantics; conceptual model; entity relation
  • Review
    Wushour Silamu1, CAO Jinmei2, ZHU Xuelian3, CHEN Shaohong4
    2010, 24(4): 119-123.
    To address the difficulties that the Uighur, Kazakh and Kirghiz languages pose for library catalog systems, this paper presents a unified strategy based on Unicode 5.0 and the UTF-8 encoding, handling the character set, database, application server and client through hierarchical management and layered implementation. The study summarizes the technologies for realizing a digital library system involving the Uighur, Kazakh and Kirghiz languages, as well as the programming issues with both Mandarin Chinese and minority languages. This research is of practical significance for the subsequent development of digital library systems involving minority languages.
    Key words: computer application; Chinese information processing; Uighur; Kazakh; Kirghiz; database; library
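
A minimal sketch of the "one encoding at every layer" idea follows, assuming the application normalizes text and encodes it as UTF-8 before storage, then decodes it on the way back to the client. The helper names and the choice of NFC normalization are assumptions, not details taken from the paper; the sample title is an arbitrary mix of Arabic-script and Chinese characters.

```python
import unicodedata


def to_stored_bytes(text):
    """Normalize text to NFC, then encode it as UTF-8 for storage."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


def from_stored_bytes(data):
    """Decode UTF-8 bytes read back from storage."""
    return data.decode("utf-8")


# Sample catalog title: five Arabic-script letters, a space, and two
# Chinese characters, written as escapes so the source stays ASCII.
title = "\u0643\u0649\u062a\u0627\u0628 \u56fe\u4e66"

stored = to_stored_bytes(title)
restored = from_stored_bytes(stored)
print(len(title), len(stored), restored == title)
```

Keeping a single encoding from the database through the application server to the client avoids the per-language code page conversions that tend to corrupt mixed-script records.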
  • Review
    Muhammat1, Wushour Silamu2
    2010, 24(4): 123-129.
    Xinjiang is a multi-ethnic autonomous region, and the main languages used in the region are Chinese and Uighur. In view of this, the paper proposes the overall framework of a Uighur-Chinese bilingual remote teaching system for bilingual teaching in Xinjiang. Based on an analysis of several excellent domestic remote teaching systems, the paper designs the system architecture and function modules, as well as the technical solution for displaying and switching between Uighur and Chinese information.
    Key words: computer application; Chinese information processing; bilingual teaching; remote teaching system; bilingualization; Unicode standard