2005 Volume 19 Issue 4 Published: 15 August 2005
  

  • ZHANG Xiao-fei, CHEN Zhao-xiong, HUANG He-yan, HU Chun-ling
    2005, 19(4): 2-10.
    Example-based machine translation (EBMT) is currently difficult to apply at large scale because of its low translation coverage. This paper proposes an algorithm for generalized matching of translation examples to improve the translation coverage of EBMT: the candidate translation examples are generalized in real time, controlled and guided by the input sentence to be translated. The algorithm not only meets the speed requirements of real-time document translation but can also use new language knowledge added and revised by users during translation. As a result, higher translation coverage and translation quality are obtained overall. Experiments on 160,000 pairs of translation examples yield a translation coverage of 75%, confirming the algorithm's effectiveness.
  • CHEN Hao, HE Ting-ting, JI Dong-hong
    2005, 19(4): 11-17.
    Unsupervised word sense disambiguation (WSD) avoids high manual labeling costs and can be adapted to large-scale tasks, so it has extensive applications in many fields. This paper presents an unsupervised approach that constructs context vectors from second-order context, clusters them with k-means, and disambiguates by calculating similarity. The experiments are based on term extraction; in open tests on 8 ambiguous words, the method achieves average accuracies of 82.62% and 80.87%.
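The abstract above gives no implementation details, so the following is only a minimal sketch of what a second-order context vector and a similarity comparison could look like. The window size, the toy corpus, and the function names `cooccurrence`, `second_order`, and `cosine` are assumptions made for illustration; the k-means clustering step is omitted.

```python
import math
from collections import Counter, defaultdict

def cooccurrence(sentences, window=2):
    """First-order vectors: for each word, counts of words within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def second_order(context_words, first_order):
    """Second-order vector: sum of the first-order vectors of the context words."""
    total = Counter()
    for word in context_words:
        total.update(first_order.get(word, {}))
    return total

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy corpus (hypothetical): "bank" appears in a river sense and a money sense.
corpus = [
    ["bank", "river", "water"],
    ["bank", "money", "loan"],
    ["river", "water", "flow"],
    ["money", "loan", "interest"],
]
first = cooccurrence(corpus)
river_ctx = second_order(["river", "water"], first)
money_ctx = second_order(["money", "loan"], first)
flow_ctx = second_order(["water", "flow"], first)
```

In this toy setting, `cosine(river_ctx, flow_ctx)` is larger than `cosine(river_ctx, money_ctx)`, which is the kind of signal a clustering step could then exploit.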
  • GAN Jun-wei, HUANG De-gen
    2005, 19(4): 18-24.
    This paper proposes a hybrid algorithm for identifying Chinese prepositional phrases. The algorithm consists of two steps. First, it automatically extracts reliable frames according to frame templates, which consist of the prepositions and the right borders of prepositional phrases, and then identifies part of the prepositional phrases using these frames. Second, it integrates a statistical model based on part-of-speech with rules to identify the prepositional phrases that were not handled in the first step. A cross-validation test on a manually annotated corpus containing 7323 prepositional phrases shows promising results: the precision is 87.48% and the recall is 87.27%.
  • LI Guo-chen, LUO Yun-fei
    2005, 19(4): 25-31.
    Anaphora is a common phenomenon studied in natural language processing (NLP), and anaphora resolution plays an important role in text information processing. As discourse processing develops, anaphora resolution becomes more important than ever. Based on the features of Chinese personal pronouns, this paper presents a corpus-based approach that adopts a decision tree algorithm combined with a preference selection strategy. The method takes into account all kinds of anaphoric features and the interactions among them. The experiment demonstrates that the method achieves the desired result.
  • WANG Jian-hui, WANG Lei, HU Yun-fa
    2005, 19(4): 32-39.
    To identify statistical dependency relations between words efficiently and accurately, this paper addresses some shortcomings of existing algorithms. It makes full use of the distributional characteristics of words; distinguishes collocation, coordinate, and affiliation relations between words and identifies each with a different strategy; presents a new string-matching model and a new model of dependency strength between words; constructs a dependency tree; prunes the constructed tree; and identifies some latent dependency relations. Experiments confirm that the new algorithm identifies dependency relations between words very accurately.
  • YUAN Yu-Lin
    2005, 19(4): 40-46.
    This paper demonstrates how to use knowledge of logic and discourse structure to constrain template matching in information extraction (IE) and to recover information items that are missing or expressed by pronouns or deixis. It first explains what knowledge of argument-structure-based logic structure and discourse structure is. It then illustrates how negation and aspect operators can change the event type of a sentence and the matching relation between the sentence and the related event template. It also shows how the embedding and nominalization of argument structure can change the syntactic position of some arguments and the related information items. Finally, it discusses how to use knowledge of discourse structure to recover information items that are missing or expressed by pronouns or deixis.
  • HUANG Yong-wen, HE Zhong-shi
    2005, 19(4): 47-52.
    Smoothing techniques are mainly used to solve the sparse data problem in statistical language models. Existing smoothing techniques handle data sparseness effectively but do not analyze whether the resulting frequency distribution of events is reasonable. This paper presents a new smoothing technique for the bigram model based on mutual information. The model parameters, that is, the bigram probabilities, are discounted or compensated according to the mutual information, and the soundness of this adjustment is shown by minimizing the perplexity. The experimental results show that this technique outperforms the commonly used Katz smoothing technique.
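The abstract above does not state how the bigram probabilities are discounted or compensated, so that step is not reproduced here. As a hedged illustration of the quantity it relies on, the sketch below computes pointwise mutual information (PMI) for observed bigrams from maximum-likelihood counts; the function name `bigram_pmi` and the toy token sequence are assumptions.

```python
import math
from collections import Counter

def bigram_pmi(tokens):
    """Pointwise mutual information, in bits, for every bigram observed in `tokens`.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), using relative frequencies.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    pmi = {}
    for (w1, w2), count in bigrams.items():
        p_joint = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        pmi[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))
    return pmi

# Hypothetical toy sequence; a real language model would use a large corpus.
scores = bigram_pmi("a b a b c".split())
```

A positive score means the pair occurs together more often than independence would predict; unseen bigrams get no entry, which is exactly the sparse-data case a smoothing method has to handle.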
  • CHEN Xiao-Yun, HU Yun-Fa
    2005, 19(4): 53-60.
    Recently, categorization methods based on association rules have received much attention. In general, association classification offers higher accuracy and better performance. However, classification accuracy drops rapidly when the distribution of feature words in the training set is uneven. This paper therefore proposes a text categorization algorithm, Weighted Association Rules Categorization (WARC). In this method, rule intensity is defined according to the number of misclassified training samples. Each strong rule is multiplied by a factor less than 1 to reduce its weight, while each weak rule is multiplied by a factor greater than 1 to increase its weight. The results show that adjusting rule weights in this way can markedly improve the accuracy of association classification algorithms.
  • WAN Zhong-ying, WANG Ming-wen, LIAO Hai-bo
    2005, 19(4): 61-68.
    With the rapid growth of the World Wide Web (WWW), there is an increasing need to provide Web users with automated classifiers for Web page classification and categorization. This paper proposes a new Web page classification algorithm based on projection pursuit to improve accuracy. The best projection direction is first sought using a genetic algorithm, and each Web document (represented by an n-dimensional vector) is projected onto a one-dimensional space. The Web document is then classified using the classical k-nearest neighbor (KNN) algorithm. This method can overcome the curse of dimensionality. Experimental results show that the proposed algorithm is feasible and effective.
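The two-stage idea in the abstract above (project to one dimension, then classify with KNN) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the projection direction is taken as given (the genetic-algorithm search is omitted), and the names `project` and `knn_1d` as well as the toy data are hypothetical.

```python
from collections import Counter

def project(vector, direction):
    """Project an n-dimensional vector onto a direction (dot product)."""
    return sum(v * d for v, d in zip(vector, direction))

def knn_1d(train, query, direction, k=3):
    """Classify `query` by majority vote among its k nearest neighbors in the 1-D projection.

    `train` is a list of (vector, label) pairs.
    """
    q = project(query, direction)
    ranked = sorted(train, key=lambda item: abs(project(item[0], direction) - q))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-D document vectors and an assumed projection direction.
train = [
    ([1.0, 0.0], "A"), ([0.9, 0.1], "A"), ([0.8, 0.2], "A"),
    ([0.0, 1.0], "B"), ([0.1, 0.9], "B"), ([0.2, 0.8], "B"),
]
direction = [1.0, -1.0]
label_a = knn_1d(train, [0.85, 0.10], direction)
label_b = knn_1d(train, [0.05, 0.90], direction)
```

Because distances are computed on a single projected coordinate, the neighbor search no longer depends on the original dimensionality; the quality of the result rests entirely on how well the chosen direction separates the classes.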
  • YUAN Fu-yong, CHU Pei-pei
    2005, 19(4): 69-72,78.
    In information retrieval systems based on the vector space model (VSM), the TF-IDF scheme is widely used to characterize documents. However, for documents with hyperlink structures, such as Web pages, a technique is needed that represents page contents more accurately by using the contents of hyperlinked neighboring pages. This paper analyzes the VSM to find the reason for its low precision and proposes an approach that uses the contents of hyperlinked neighboring pages. The experimental results show that the algorithm is effective: precision improves by 10%.
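The abstract above does not say how neighbor contents are combined with a page's own vector. The sketch below shows one simple possibility, assuming plain TF-IDF weights and a fixed mixing weight `alpha`; the function names `tfidf` and `with_neighbors`, the link map, and the toy pages are all hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF vectors for `docs`, a mapping of page name to token list.

    Weight = raw term frequency * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
    return {
        name: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in Counter(tokens).items()}
        for name, tokens in docs.items()
    }

def with_neighbors(vectors, links, alpha=0.5):
    """Add each linked neighbor's vector, scaled by `alpha`, to the page's own vector."""
    enriched = {}
    for name, vector in vectors.items():
        combined = dict(vector)
        for neighbor in links.get(name, []):
            for term, weight in vectors[neighbor].items():
                combined[term] = combined.get(term, 0.0) + alpha * weight
        enriched[name] = combined
    return enriched

# Hypothetical pages; p1 links to p2.
pages = {
    "p1": ["nlp", "parsing"],
    "p2": ["nlp", "translation"],
    "p3": ["image", "matching"],
}
vectors = tfidf(pages)
enriched = with_neighbors(vectors, {"p1": ["p2"]})
```

After enrichment, `p1` carries weight for "translation" even though the term never appears on the page itself, while the unlinked `p3` is unchanged.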
  • LIU Shuan, MENG Qing-chun
    2005, 19(4): 73-78.
    An improved algorithm combining the genetic algorithm and the BP (back-propagation) algorithm is used for license plate image matching. It combines the strengths of both: the genetic algorithm first optimizes the learning process of the BP network, the network is then trained accurately with the BP algorithm, and the matching result is obtained from the resulting weights and thresholds. Experimental results show that the improved image matching algorithm better meets matching requirements. Its matching rate reaches 92%, whereas the traditional BP algorithm reaches only 79%. The improved algorithm is also faster at matching than the other algorithms, and it shows marked improvement in accuracy, convergence, and matching speed.
  • SUN Quan-sen, JIN Zhong, HENG Pheng-ann, XIA De-shen
    2005, 19(4): 79-84,89.
    A new method of combined feature extraction, based on the idea of feature fusion, is proposed in this paper. The theory of canonical correlation analysis (CCA) is generalized with pattern classification in mind, and a framework of generalized canonical correlation analysis (GCCA) for pattern recognition is established. In this framework, the generalized projective vectors of the two groups of feature vectors are first solved under the generalized canonical correlation discriminant criterion to compose the transformation matrix. Then a new feature fusion strategy is used to fuse two existing features of handwritten Chinese characters, and the correlative feature matrix of the same pattern sample is extracted. The extracted correlative feature matrix captures the essential features of handwritten Chinese characters. With a generic classifier, good experimental results are obtained: the recognition rate is far higher than that of the PCA and FLDA methods using a single feature.
  • XIAO Shu-cai, OU Zhi-jian, WANG Zuo-ying
    2005, 19(4): 85-89.
    This paper introduces a speaker clustering algorithm for speech recognition and its effect on the recognition system. Its usage, the features used, the distance measure, and the procedure of the algorithm are described. To evaluate the effectiveness of the algorithm, two kinds of experiments are conducted: one calculates the clustering accuracy directly, and the other compares the word error rate (WER) of the recognition system with and without the speaker clustering algorithm. The experiments show that the sentence clustering accuracy reaches 85.69% when the GLR distance measure is used. In the recognition experiment, system performance improves considerably: the word error rate is very close to that of a system that uses known speaker information for speaker adaptation.
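The abstract above names the GLR (generalized likelihood ratio) distance but gives no formula or feature description. As a rough sketch only, the code below computes a GLR distance between two segments under the simplifying assumption that each segment is a sequence of one-dimensional features modeled by a single Gaussian with maximum-likelihood parameters; real systems typically use multidimensional acoustic features. The function name `glr_distance` and the sample values are hypothetical.

```python
import math
from statistics import pvariance

def glr_distance(x, y):
    """GLR distance between two 1-D segments, each modeled by one Gaussian.

    Equals log L(x) + log L(y) - log L(x and y pooled) at the maximum-likelihood
    parameters, which reduces to a difference of log variances.
    Each segment needs non-zero variance.
    """
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    return (
        0.5 * (n1 + n2) * math.log(pvariance(pooled))
        - 0.5 * n1 * math.log(pvariance(x))
        - 0.5 * n2 * math.log(pvariance(y))
    )

# Hypothetical feature values for three short segments.
seg_a = [1.0, 2.0, 3.0]
seg_b = [1.0, 2.0, 3.0]
seg_c = [11.0, 12.0, 13.0]
d_same = glr_distance(seg_a, seg_b)
d_diff = glr_distance(seg_a, seg_c)
```

Segments with the same statistics give a distance near zero, while segments with different means give a larger value, so a clustering procedure can merge the closest pairs first.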
  • XU Wei-qun, XU Bo, HUANG Tai-yi
    2005, 19(4): 90-97.
    This paper investigates how to identify utterance topics in spontaneous spoken dialogues based on shallow semantic analysis. First, the topic of an utterance is defined as the salient semantic entity on which its speaker focuses his/her attention. We then discuss two features of such a topic (topic as a discourse construct, and topic continuity) and the relationship between utterance topic and (extended) sentence type. Based on these, an algorithm for identifying utterance topics is established and evaluated on a spoken dialogue corpus. The results show an accuracy of 61.1% to 87.6%, depending on the source of the extended sentence type and the accuracy definition used.
  • XIE Qian, WU Jian, SUN Yu-fang
    2005, 19(4): 98-105.
    Ethnic language support in Linux should be based on the internationalization (I18N) mechanism. After summarizing the hierarchical structure of the Linux I18N framework, this paper analyzes several crucial issues related to the X Window core system. The modifications needed to add ethnic language support to the X Window core system are systematically enumerated, using the addition of Tibetan support as an example. The implementation in the related project is evaluated. Drawing also on research into other ethnic language systems, the limitations are analyzed in detail and directions for future work are proposed.