2010 Volume 24 Issue 2 Published: 15 April 2010
  

  • Review
    YUAN Yulin1,WANG Minghua2
    2010, 24(2): 3-14.
    This article first presents an inference model consisting of a knowledge base of entailment patterns together with a set of inference rules and their probability estimates, which approximates the textual entailment relationship and predicts whether entailment holds for a given text-hypothesis pair. It then introduces methods for learning the inference rules, the entailment patterns, and their probabilities, including learning from a single corpus, from parallel or comparable corpora, or from the web. Finally, it describes recognition models based on lexical probability, e.g. lexical entailment probability models and lexical reference matching models, as well as syntax- and semantics-driven models, e.g. models that match dependency tree nodes or predicate-argument structures between a given text-hypothesis pair.
    Key words: computer application; Chinese information processing; textual entailment; inference model; entailment pattern; recognizing models; lexical probability; syntax and semantics
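    As an illustration of the lexical entailment probability models mentioned above, the following sketch scores a text-hypothesis pair as the product, over hypothesis words, of the best lexical entailment probability offered by any text word. The word pairs, probabilities, and threshold are invented for the example, not taken from the paper.

    ```python
    # Hypothetical lexical entailment probabilities P(h entailed by t);
    # values here are illustrative only.
    LEX_PROB = {
        ("purchased", "bought"): 0.9,
        ("vehicle", "car"): 0.8,
    }

    def entailment_prob(text_words, hyp_words):
        """Approximate P(T entails H) as a product of per-word maxima."""
        p = 1.0
        for h in hyp_words:
            best = max(
                (1.0 if t == h else LEX_PROB.get((t, h), 0.0) for t in text_words),
                default=0.0,
            )
            p *= best
        return p

    text = ["john", "purchased", "a", "vehicle"]
    hyp = ["john", "bought", "a", "car"]
    p = entailment_prob(text, hyp)   # 1.0 * 0.9 * 1.0 * 0.8 = 0.72
    holds = p > 0.5                  # decision threshold is illustrative
    ```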
  • Review
    Chu-Ren Huang1,2, Shu-Kai Hsieh3, Jia-Fei Hong4,
    Yun-Zhu Chen1, I-Li Su1, Yong-Xiang Chen5, Sheng-Wei Huang1
    2010, 24(2): 14-24.
    The design criterion of Chinese WordNet (CWN) is to build a complete and robust knowledge system that also embodies a precise expression of semantic relations. Such a precise expression of Chinese sense division and semantic relations must be grounded in linguistic theory, especially lexical semantics. All word sense examples and lexical semantic relations in CWN are attested with corpus data. Our methodology first analyzes the language data and then re-examines the accuracy of the analysis by sense-tagging the corpus with the analyzed results. For formal representation and computational application, a complete and robust knowledge system needs the formal integrity of an ontology; the Suggested Upper Merged Ontology (SUMO) is adopted for this purpose.
    Key words: computer application; Chinese information processing; Chinese WordNet; global WordNet grid; ontology; multi-language processing; cross-lingual integration
  • Review
    TANG Xuri, CHEN Xiaohe, XU Chao, LI Bin
    2010, 24(2): 24-33.
    The paper presents a system for recognizing Chinese location names at the discourse level. The system employs three modules in sequence: a CRFs-based module for simple location name recognition, a discourse-based module for identifying relationships between the simple location names, and a CRFs-based module for complex location name recognition. The module for simple location name recognition takes raw text as input and models both the internal structure of basic location names and the information of neighboring characters. The discourse-based module employs toponymhood calculation and discourse-based location name relations for recognition. The complex location name recognition module is also based on CRFs but operates on the result of simple toponym recognition. Experiments show that the system achieves F-scores of 92.87% and 89.76% in closed and open tests, respectively.
    Key words: computer application; Chinese information processing; discourse-based location name relation; conditional random fields; toponymhood calculation
  • Review
    PENG Weiminɡ,SONG Jihua
    2010, 24(2): 33-39.
    The paper analyzes certain issues in constructing a domain ontology, especially the position of the "instance". Accordingly, a historical domain ontology project on Zizhi Tongjian is proposed, together with the corresponding construction method. Adopting a pattern-driven, bottom-up strategy, the Ontology of Pre-Qin History is implemented and evaluated through SPARQL querying and TouchGraph visualization. This engineering practice may offer useful guidance to practitioners constructing domain ontologies.
    Key words: computer application; Chinese information processing; domain ontology; construction method; ontology engineering
  • Review
    SHI Min, LI Bin, CHEN Xiaohe
    2010, 24(2): 39-46.
    This paper explores the intersection of NLP and ancient Chinese, particularly the pre-Qin documents. The text of "Zuo Zhuan" is first analyzed after manual segmentation and POS tagging. Then the Conditional Random Fields (CRF) model is adopted for word segmentation (WS), POS tagging (PT), and a unified process of WS and PT, respectively. The precision and recall of the unified approach are much higher than those of independent WS and PT in the open test, with an F-score of 94.60% in WS and 89.65% in PT. This method is suitable for the study of ancient Chinese vocabulary and corpus construction, and can be applied to supplement manual tagging.
    Key words: computer application; Chinese information processing; pre-Qin Chinese; word segmentation; POS tagging; Zuo Zhuan; conditional random fields model
  • Review
    TANG Qin,LIN Hongfei
    2010, 24(2): 46-52.
    In addition to the word features of a character's name, we can recognize a character's gender from the differences in the words used when a man or a woman is described in the text. Based on the different descriptions of men and women in various aspects, we obtain a large number of words with significant gender differences: gender bias feature words and gender bias personal appellations. The experiment shows that gender bias feature words describe different gender roles better than gender bias personal appellations. Moreover, combining gender bias feature words with gender bias personal appellations and the word features of a character's name outperforms using the person name features alone.
    Key words: computer application; Chinese information processing; gender bias feature words; gender bias personal appellations; gender recognition
  • Review
    WU Kui1, ZHOU Xianzhong2, WANG Jianyu1, ZHAO Jiabao2
    2010, 24(2): 52-58.
    Traditional algorithms for semantic similarity computation fall into two categories: distance-based and information-based methods. The former ignores objective statistics, while the latter suffers from insufficient domain data. In this paper, a new method for similarity computation based on Bayesian estimation is proposed. First, the concept emergence probability is assumed to be a random variable with a prior Beta distribution. Second, its prior parameters are set from the distance-based similarity algorithm, and the posterior is calculated by Bayesian estimation. The resulting semantic similarity, obtained through the information-based method, thereby integrates subjective experience with objective statistics. Finally, the proposed method is implemented and validated on WordNet, showing a slightly higher correlation with human judgments.
    Key words: computer application; Chinese information processing; ontology; semantic similarity; Bayesian estimation; Beta distribution
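    The Beta-prior idea described above can be sketched as a conjugate Beta-Binomial update. The mapping from the distance-based similarity score to the prior parameters, and the fixed prior strength, are assumptions made for illustration, not the paper's exact formulation.

    ```python
    # Sketch: a concept's emergence probability p ~ Beta(a, b); the prior is
    # seeded from a distance-based similarity score, then updated with corpus
    # counts, and the posterior mean would feed an information-based measure.

    def beta_prior_from_distance(sim, strength=10.0):
        # Hypothetical mapping: treat the distance-based similarity in [0, 1]
        # as the prior mean, with a fixed equivalent sample size (strength).
        a = sim * strength
        b = (1.0 - sim) * strength
        return a, b

    def posterior_mean(a, b, occurrences, trials):
        # Conjugate Beta-Binomial update: E[p | data] = (a + k) / (a + b + n).
        return (a + occurrences) / (a + b + trials)

    a, b = beta_prior_from_distance(0.6)      # prior mean 0.6 -> Beta(6, 4)
    p_hat = posterior_mean(a, b, 30, 100)     # observed 30 occurrences in 100
    ```

    The posterior mean here is (6 + 30) / (10 + 100), i.e. the corpus counts pull the estimate away from the distance-based prior as evidence accumulates.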
  • Review
    SU Chong, CHEN Qingcai, WANG Xiaolong, MENG Xianjun
    2010, 24(2): 58-68.
    Most existing web page clustering algorithms operate on the short and uneven snippets of web pages, which often causes poor clustering performance (e.g., the STC and Lingo algorithms). On the other hand, the classical clustering algorithms for full web pages are too complex to provide good cluster labels, in addition to being incapable of online clustering (for example, the k-means algorithm). To address these problems, this paper presents an online web page clustering algorithm based on maximal frequent itemsets (MFIC). First, the maximal frequent itemsets are mined; then the web pages are clustered based on shared frequent itemsets; finally, clusters are labeled with the frequent items. Experimental results show that MFIC effectively reduces clustering time, improves clustering accuracy by 15%, and generates understandable labels.
    Key words: computer application; Chinese information processing; search engine; Web page clustering; frequent itemset
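    A toy illustration of the clustering idea above: documents are term sets, a miner finds the maximal frequent itemsets, and the documents sharing one form a cluster labeled by its items. The brute-force miner is exponential and purely illustrative; it does not reproduce the paper's MFIC algorithm.

    ```python
    from itertools import combinations

    def maximal_frequent_itemsets(docs, min_support):
        """Brute-force miner: all frequent itemsets, then keep the maximal ones."""
        items = sorted({w for d in docs for w in d})
        frequent = []
        for k in range(1, len(items) + 1):
            level = [set(c) for c in combinations(items, k)
                     if sum(1 for d in docs if set(c) <= d) >= min_support]
            if not level:
                break
            frequent.extend(level)
        # maximal = not strictly contained in any other frequent itemset
        return [s for s in frequent if not any(s < t for t in frequent)]

    def cluster(docs, min_support=2):
        """Group documents by shared maximal frequent itemset; label with items."""
        clusters = {}
        for mfi in maximal_frequent_itemsets(docs, min_support):
            label = " ".join(sorted(mfi))
            clusters[label] = [i for i, d in enumerate(docs) if mfi <= d]
        return clusters

    docs = [{"web", "page", "cluster"}, {"web", "page"}, {"search", "engine"}]
    result = cluster(docs)   # {"page web": [0, 1]}
    ```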
  • Review
    WANG Yun, LI Bicheng, LIN Chen
    2010, 24(2): 68-76.
    Web forums contain a wealth of information resources, and making full use of them relies on web forum data extraction technology. This paper addresses what data should be extracted from web forums and how to extract it, using a proposed method based on the similarity of page layout. The method effectively avoids the drawbacks of current approaches, namely low automation or low accuracy. It first recognizes the topic block by exploiting the special layout of web forum pages, then extracts data from the topic block using rules. Experimental results show that the method performs well in adjustability, precision, and recall.
    Key words: computer application; Chinese information processing; Web forum; data extraction; similarity
  • Review
    QI Haoliang1, CHENG Xiaolong1, YANG Muyun2, HE Xiaoning3, LI Sheng2, LEI Guohua1
    2010, 24(2): 76-84.
    We designed and implemented a high-performance Chinese spam filter. An online filtering mode is adopted to defend against the evolution of spam emails. A logistic regression model serves as the filtering model; byte-level N-grams are used to extract email features; and the filter is trained with the TONE (Train On or Near Error) method. The performance of our filter is evaluated on Chinese spam corpora: it outperforms the best system in the TREC 06 spam filtering track, achieves a 1-ROCA score of 0.0000% on the SEWM07 immediate feedback task, and ranks first in all SEWM08 online learning tasks.
    Key words: computer application; Chinese information processing; Chinese spam filtering; online learning; logistic regression model; byte N-gram; TONE
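    The pipeline described above (byte-level n-gram features, a logistic regression score, TONE updates only on errors or near-threshold messages) might be sketched as follows. The learning rate, margin, and n-gram size are illustrative guesses, not the authors' settings.

    ```python
    import math

    class OnlineFilter:
        def __init__(self, n=4, lr=0.1, margin=0.2):
            self.n, self.lr, self.margin = n, lr, margin
            self.w = {}   # sparse weights over byte n-grams

        def features(self, msg: bytes):
            # byte-level n-grams of the raw message
            return {msg[i:i + self.n] for i in range(len(msg) - self.n + 1)}

        def score(self, msg):
            # logistic regression: P(spam) = sigmoid(sum of feature weights)
            z = sum(self.w.get(f, 0.0) for f in self.features(msg))
            return 1.0 / (1.0 + math.exp(-z))

        def train(self, msg, is_spam):
            p = self.score(msg)
            # TONE: update only on misclassification or near-threshold scores
            if (p > 0.5) != is_spam or abs(p - 0.5) < self.margin:
                g = self.lr * ((1.0 if is_spam else 0.0) - p)
                for f in self.features(msg):
                    self.w[f] = self.w.get(f, 0.0) + g

    f = OnlineFilter()
    for _ in range(20):   # simulate an online feedback stream
        f.train(b"cheap pills buy now", True)
        f.train(b"meeting agenda attached", False)
    ```

    After a few online updates the filter separates the two message families, and because of the TONE rule it stops adjusting weights once a message is classified correctly with enough margin.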
  • Review
    PAN Tuoyu1,2, ZHU Zhenmin1,2, TENG Ji1,2, YE Jian1, ZENG Qingfeng1
    2010, 24(2): 84-91.
    With the dramatic increase of information available on the Internet, providing users with personalized service is clearly a trend. In this paper, by building a generalized service model based on ontology, items are classified into service sub-categories and the probability distribution of users' interests is calculated. Combining content filtering with item-based collaborative filtering, a new ontology-based hybrid personalized recommendation model (OHR) is put forward. The experimental results show that OHR provides better recommendation results than traditional collaborative filtering algorithms, as well as a better ability to discover users' new interests.
    Key words: computer application; Chinese information processing; ontology; hybrid personalized recommendations; item-based collaborative filtering; probabilistic model
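    A minimal sketch of the item-based collaborative filtering component combined in the hybrid model: item-item cosine similarity over user rating vectors, with a similarity-weighted prediction. The data and function names are hypothetical, not from the paper's OHR model.

    ```python
    import math

    def cosine(a, b):
        """Cosine similarity between two sparse item rating vectors."""
        users = set(a) & set(b)
        num = sum(a[u] * b[u] for u in users)
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def predict(user_ratings, item_vectors, target):
        """Similarity-weighted average of the user's ratings on rated items."""
        num = den = 0.0
        for item, r in user_ratings.items():
            s = cosine(item_vectors[item], item_vectors[target])
            num += s * r
            den += abs(s)
        return num / den if den else 0.0

    # toy data: items rated by users u1/u2; items "a" and "b" are identical
    item_vectors = {"a": {"u1": 5, "u2": 3}, "b": {"u1": 5, "u2": 3},
                    "c": {"u1": 1}}
    score = predict({"a": 4}, item_vectors, "b")   # 4.0, since sim(a, b) = 1
    ```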
  • Review
    ZHU Conghui1, ZHAO Tiejun1, HAN xiwu2, ZHENG Dequan1
    2010, 24(2): 91-96.
    Verb subcategorization (SCF) is a concise classification based on the syntactic behaviors of verbs, composed of a verb and several arguments. It has recently attracted substantial research for single languages, e.g. English and Chinese, whereas cross-lingual subcategorization demands more systematic efforts. We present a novel method to obtain SCF argument correspondences between Chinese and English based on active learning. The method can find new relations from bilingual parallel sentence pairs almost without any prior linguistic knowledge. We also integrated these relations into a statistical machine translation (SMT) system, and experimental results show that the SMT system combined with the bilingual argument relationships achieves a significant improvement, which indicates the validity of the automatically obtained argument correspondences.
    Key words: artificial intelligence; machine translation; verb subcategorization; cross-lingual argument correspondence; automatic acquisition; statistical machine translation
  • Review
    HE Jing1,2, ZHOU Ming2, JIANG Long2
    2010, 24(2): 96-104.
    Automatic poetry generation is considered difficult. In this paper, we propose a novel statistical approach for the automatic generation of traditional Chinese metrical poetry from a few user-supplied keywords. A template-based model automatically generates the first sentence of the poem; a phrase-based statistical machine translation model then generates the subsequent sentences one by one. With our interactive model, the user can select the best sentence from the system's N-best output at each step. The approach has been evaluated on the generation of quatrains of 5- and 7-character lines. Evaluation metrics for single lines as well as for the whole generated poem suggest that this method is very promising.
    Key words: artificial intelligence; machine translation; statistical machine translation; poem generation; poem evaluation
  • Review
    DU Jun, DAI Lirong, WANG Renhua
    2010, 24(2): 104-110.
    In this paper, we propose a new feature normalization approach for robust speech recognition. We observe that the shape of speech feature distributions changes in noisy environments compared with the clean condition. Accordingly, Cepstral Shape Normalization (CSN), which normalizes the shape of feature distributions by exploiting an exponential factor, is performed. The method proves effective in noisy environments, especially at low SNRs. Experimental results show that the proposed method yields relative word error rate reductions of 38% and 25% on the Aurora2 and Aurora3 databases, respectively, compared with conventional Mean and Variance Normalization (MVN). CSN also consistently outperforms other traditional methods such as Histogram EQualization (HEQ) and Higher Order Cepstral Moment Normalization (HOCMN).
    Key words: computer application; Chinese information processing; robust speech recognition; shape normalization
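    One way to picture the exponential-factor idea behind CSN (the details here are assumptions for illustration, not the paper's exact formulation): after mean and variance normalization, a sign-preserving power transform reshapes the feature distribution.

    ```python
    import math

    def mvn(frames):
        """Conventional mean and variance normalization over a feature track."""
        n = len(frames)
        mean = sum(frames) / n
        var = sum((x - mean) ** 2 for x in frames) / n
        std = math.sqrt(var) or 1.0
        return [(x - mean) / std for x in frames]

    def csn(frames, alpha=0.75):
        # Sign-preserving exponential factor applied after MVN; the value
        # of alpha is hypothetical, chosen only to show the reshaping effect.
        return [math.copysign(abs(x) ** alpha, x) for x in mvn(frames)]

    normalized = csn([1.0, 2.0, 3.0, 4.0])
    ```

    For alpha < 1 the transform compresses the tails of the distribution relative to MVN alone, which is the kind of shape change the abstract attributes to noisy conditions.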
  • Review
    ZHANG Feng1, HUANG Chao2, DAI Lirong1
    2010, 24(2): 110-116.
    Current automatic mispronunciation detection systems are mostly based on the automatic speech recognition (ASR) framework with statistical models. This paper presents methods to improve syllable-level mispronunciation detection for Mandarin Chinese in two aspects: introducing speaker adaptive training (SAT) and selective maximum likelihood linear regression (SMLLR) to obtain a better acoustic model, and proposing a speaker normalization backend to cope with the limited information and the different rating levels for different pronunciation proficiencies. Experiments on a database of 8 000 syllables pronounced by 40 speakers of varied pronunciation proficiency indicate the promising effect of these strategies, improving the precision from 45.8% to 53.6% at 30% recall, and from 64.6% to 79.9% at 10% recall.
    Key words: computer application; Chinese information processing; automatic mispronunciation detection; speaker adaptive training (SAT); selective maximum likelihood linear regression (SMLLR); speaker normalization
  • Review
    GU Shaotong1,2,3
    2010, 24(2): 116-122.
    A character image restoration method for Jiagu rubbings based on adaptive thresholding and fractal geometry is proposed in this paper. The paper analyzes the characteristics of the image noise and the character edges on Jiagu rubbings. First, we estimate the adaptive threshold by means of a Bayes risk function and remove the noise regions. Then we calculate the fractal dimension of the character edges on Jiagu rubbings by statistical means. Finally, we transform the character edges so as to smooth them in the Jiagu rubbings image. The experimental results show that the proposed method smooths the character edges of Jiagu rubbings significantly.
    Key words: computer application; Chinese information processing; Jiagu rubbings; adaptive threshold; fractal geometry; fractal dimension; compression transformation; character image restoration
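    The fractal dimension of a character edge can be estimated, for instance, by box counting over a binary edge image. This is a generic illustration of the concept, not the authors' statistical method.

    ```python
    import math

    def box_count(img, s):
        """Number of s-by-s boxes that contain at least one edge pixel."""
        h, w = len(img), len(img[0])
        count = 0
        for by in range(0, h, s):
            for bx in range(0, w, s):
                if any(img[y][x]
                       for y in range(by, min(by + s, h))
                       for x in range(bx, min(bx + s, w))):
                    count += 1
        return count

    def fractal_dimension(img, sizes=(1, 2, 4)):
        # least-squares slope of log N(s) versus log(1/s)
        xs = [math.log(1.0 / s) for s in sizes]
        ys = [math.log(box_count(img, s)) for s in sizes]
        n = len(sizes)
        mx, my = sum(xs) / n, sum(ys) / n
        return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
               sum((x - mx) ** 2 for x in xs)

    # a straight horizontal edge: its box-counting dimension is 1
    line_img = [[1] * 8] + [[0] * 8 for _ in range(7)]
    ```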
  • Review
    WANG Kunlun1,ZHANG Guanhong1, Turghunjan Abdukirim 2
    2010, 24(2): 122-129.
    As a Turkic Altaic language, Uighur has a unique word-building system in which its eight vowels play very important roles in speech recognition and synthesis, especially in the selection of the recognition base unit. Focusing on the acoustic frequency characteristics and formant frequency parameters of Uighur vowels, this paper adopts the theory and methods of experimental phonetics to reveal the vowels' formant frequency distribution rules on a Uighur synthetic speech database (office environment). The accuracy of the formant frequency distribution parameters of Uighur's eight vowels is further validated in a speech recognition test. The experiment confirms that, once vowel harmony phenomena are removed, Uighur vowels are more distinguishable in the audio frequency range, which benefits correct speech transmission and reception.
    Key words: computer application; Chinese information processing; speech recognition; acoustic frequency characteristics; formant frequency; vowel; Uighur