2007 Volume 21 Issue 4 Published: 15 August 2007
  

  • Review
    DONG Zhen-dong, DONG Qiang, HAO Chang-ling
    2007, 21(4): 3-9.
    It has been more than 8 years since the first version of HowNet was released, and many people at home and abroad are already familiar with it. It is therefore high time to discuss its theoretical issues. The paper elaborates on the following theoretical points: (1) HowNet’s understanding of knowledge, (2) the acquisition and representation of knowledge, (3) the biaxial theory of event classification, (4) semantic roles, and (5) HowNet’s knowledge database mark-up language. The paper also presents HowNet’s powerful capability for computing meaning and the latest achievements in its development. HowNet will become the infrastructure of some human language technologies, such as natural language search.
  • Review
    YUAN Yu-lin
    2007, 21(4): 10-20.
    This paper first discusses three levels in the granularity hierarchy of semantic roles and introduces their application in several natural language processing (NLP) systems. It then introduces three approaches to treating semantic roles in three semantic resources for NLP: (i) frame elements in situation-based semantic frames, i.e., the semantic roles in FrameNet of the University of California, Berkeley; (ii) numbered prototypical arguments specific to individual verbs, i.e., the semantic roles in PropBank of the University of Pennsylvania; (iii) thematic roles of arguments specific to individual predicates (verbs or adjectives), i.e., the semantic roles in NetBank of Peking University. Finally, these three semantic resources (i.e., annotated corpora) are compared in terms of their goals, methodologies, annotation contents and composition.
  • Review
    LUO Qiang, XI Jian-qing
    2007, 21(4): 21-26.
    In this paper, we propose an SVM-combined generative statistical model for Chinese dependency analysis, which trains an SVM classifier on the erroneous results produced by the generative statistical model. To further improve the precision of dependency analysis, two measures are taken: first, a dynamic programming algorithm that extends the range of the search for the best local solution is used to estimate the error rate of the generative model; second, a ranging factor is introduced to make the solutions adapt to the practical situation. Together, these measures allow the new method to greatly reduce the number of negative support vectors in training without sacrificing classification ability. Comparative experiments on the HIT Chinese Treebank corpus show that the new method outperforms current Chinese dependency analysis methods, with precision reaching 86.4%.
  • Review
    LI Jia, ZHU Ming, LIU Chen, YANG Zheng-qiu
    2007, 21(4): 27-33.
    Ontology heterogeneity is a problem that must be resolved in implementing the Semantic Web, and ontology mapping is one of the solutions. Chinese resources are an important part of information networks, and ontology mapping between Chinese and other languages plays an important role in ontology sharing, reuse and cooperation. However, there is little research on mapping ontologies expressed in Chinese; most work focuses on ontologies in English. This paper therefore considers the element level of Chinese ontology mapping and proposes an algorithm that uses the similarity between concepts to obtain mappings between ontologies. We also developed a mapping system named ELOMC (Element Level Ontology Matching for Chinese). ELOMC uses HowNet as its thesaurus and combines several techniques to compute the similarity of words.
  • Review
    LIU Hua
    2007, 21(4): 34-41.
    Improvement in text categorization depends not on the algorithm of the classification model but on the fundamental element: integrated and independent features for text representation. Key phrases are phrases that have a strong text representation function and can characterize text content such as subject and type. With stable structure, integrated meaning and statistical significance, key phrases can overcome the limitations of VSM (Vector Space Model) and NB (Naive Bayes), are suitable as features for text representation, and help improve the performance of text categorization. Drawing on linguistics, cognitive psychology and computational linguistics, we investigated the theoretical basis of the advantages of key phrases, defined key phrases, and acquired them by extracting keywords labeled by specialists in web pages. The experiment showed that key phrases are better suited than words as features for text representation: Micro F1 increased by 3.1 percent for parent categories and by 15 percent for sub-categories.
  • Review
    SHI Shui-cai, CHENG Tao, WANG Xia, LV Xue-qiang
    2007, 21(4): 42-47.
    Webpage-advertisement matching is the core technology of content-based online advertising, and this paper presents a semantic approach with the goal of matching webpages and advertisements accurately. First, thematic information is extracted from a webpage, and thematic words are computed. The thematic words are then extended by looking up their similar words, upper words (hypernyms), lower words (hyponyms) and related words, and finally the advertisements with the highest matching rate are chosen. The method was implemented and tested, and the results show that the proposed algorithm is promising.
  • Review
    WANG Xiao-leng, WANG Bin
    2007, 21(4): 48-54.
    Webpage classification can be regarded as a text classification problem in a noisy environment. This paper is an exploratory study in this field. We re-examine three classifiers that have almost the same performance in traditional text classification: a Bayes classifier based on an N-gram model (NGBayes), the Naive Bayes classifier (NBayes) and the k-Nearest Neighbor classifier (kNN). Two corpora are used for this study: CCT2002-v1.1 (Corp_1), provided by the Chinese Web Information Retrieval Forum, and another Chinese webpage corpus (Corp_2) that we collected ourselves. The conclusion that these classifiers have comparable performance under non-noisy conditions is validated. The experimental results show that NGBayes greatly outperforms NBayes and kNN in a noisy environment, so NGBayes is the least sensitive to noisy information among these classifiers. A deeper analysis explains why NGBayes performs better. The conclusion is thus drawn that NGBayes is a noise-robust Chinese webpage classification method.
  • Review
    HE Ting-ting, DAI Wen-hua, JIAO Cui-zhen
    2007, 21(4): 55-60.
    The K-Means clustering algorithm is sensitive to the choice of the initial cluster centers and easily falls into a local optimum. To avoid this flaw, we propose a Hybrid Parallel Genetic Algorithm. In this method, the document set is represented in the Vector Space Model, initial cluster centers are chosen at random from the document vectors to form chromosomes, and the efficiency of the K-Means algorithm is combined with the global optimization ability of the Parallel Genetic Algorithm. Through inheritance and mutation within each community, and through parallel evolution and mating between communities, the method provides higher efficiency and precision for text clustering. Experiments indicate that the Hybrid Parallel Genetic Algorithm has higher accuracy and better global optimization ability than other text clustering methods such as the K-Means algorithm and the Genetic Algorithm.
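The sensitivity to initial centers mentioned in this abstract can be seen on a toy data set. The following is a minimal sketch of plain K-Means only, not the paper's hybrid genetic method; the function name `kmeans`, the four sample points and both sets of starting centers are assumptions made for illustration.

```python
import math

def kmeans(points, centers, iters=100):
    """Plain K-Means from given initial centers; returns (centers, SSE)."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    # Sum of squared distances from each point to its nearest center.
    sse = sum(min(math.dist(p, c) for c in centers) ** 2 for p in points)
    return centers, sse

# Hypothetical data: corners of a wide rectangle, i.e. two natural clusters.
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

_, sse_good = kmeans(points, [(0.0, 0.0), (4.0, 0.0)])  # left/right split
_, sse_bad = kmeans(points, [(2.0, 0.0), (2.0, 1.0)])   # stuck in top/bottom split
print(sse_good, sse_bad)
```

Both runs converge, but the second start settles on a much worse partition, which is the kind of local optimum a global search over initial centers is meant to avoid.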
  • Review
    HUANG Xiang-lin, GAO Yun, YANG Li-fang, WANG Peng-peng
    2007, 21(4): 61-64.
    A keyword-based Chinese document image retrieval method is proposed, which retrieves Chinese characters directly from Chinese character images without OCR (Optical Character Recognition). First, Chinese character images are segmented from the Chinese document image. Then the feature data of Chinese strokes are extracted from the character images. Finally, the similarity of the Chinese character images is measured by the weighted modified Hausdorff distance between their feature data. The retrieval method is robust to character size and font. The experimental results show good performance.
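For readers unfamiliar with the distance named in this abstract, here is a minimal sketch of the standard (unweighted) modified Hausdorff distance between two point sets. The abstract does not specify the weighting scheme or the stroke features, so the weights are omitted and the sample points are hypothetical.

```python
import math

def directed_mhd(a, b):
    """Mean, over points in a, of the distance to the nearest point in b."""
    return sum(min(math.dist(p, q) for q in b) for p in a) / len(a)

def modified_hausdorff(a, b):
    """Modified Hausdorff distance: the larger of the two directed averages."""
    return max(directed_mhd(a, b), directed_mhd(b, a))

# Hypothetical feature points: two horizontal strokes one unit apart.
stroke_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
stroke_b = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]

print(modified_hausdorff(stroke_a, stroke_a))  # identical sets give 0.0
print(modified_hausdorff(stroke_a, stroke_b))  # parallel strokes give 1.0
```

Averaging the nearest-neighbor distances, rather than taking their maximum as in the classical Hausdorff distance, makes the measure less sensitive to a single outlying point.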
  • Review
    HOU Hong-xu, LIU Qun, Nasun Urt
    2007, 21(4): 65-72.
    We present and implement an example-based Chinese-Mongolian machine translation method. The method consists of several parts, including example searching, segment splitting, matching and recombining, and it is based on word alignment. It uses word alignment information for segment matching, computes the similarity from the number of matching words and the length, and selects the best example. Word alignment information is also used to determine how segments are recombined and to generate the translation result.
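As a rough illustration of scoring examples by matching words and length, the sketch below uses a Dice-style ratio. The exact formula is not given in the abstract, so the scoring function, the names `example_similarity` and `best_example`, and the sample sentences are all assumptions; the word-alignment step is not modeled.

```python
def example_similarity(input_words, example_words):
    """Dice-style score: shared words, normalised by the two sentence lengths."""
    if not input_words or not example_words:
        return 0.0
    remaining = list(example_words)
    matches = 0
    for word in input_words:
        if word in remaining:
            remaining.remove(word)  # each example word can match only once
            matches += 1
    return 2 * matches / (len(input_words) + len(example_words))

def best_example(input_words, examples):
    """Return the stored example with the highest similarity to the input."""
    return max(examples, key=lambda ex: example_similarity(input_words, ex))

# Hypothetical, already-segmented Chinese source sentences.
query = ["我", "喜欢", "读", "书"]
examples = [
    ["我", "喜欢", "看", "电影"],
    ["他", "喜欢", "读", "书"],
    ["今天", "天气", "很", "好"],
]
chosen = best_example(query, examples)
print(chosen, example_similarity(query, chosen))
```

Here the second example shares three of four words with the query and is selected with a score of 0.75.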
  • WANG Xiao-rui, DING Peng, LIANG Jia-en, XU Bo
    2007, 21(4): 73-79.
    The purpose of language model (LM) adaptation is to reduce the linguistic mismatches between the training corpus and recognition tasks. This paper proposes a new non-iterative new-word extraction approach for Chinese and a novel open-vocabulary Chinese LM. To reduce lexicon mismatch, topic and style mismatch, and n-gram distribution mismatch, we also present a unified LM adaptation framework that combines our non-iterative new-word extraction approach, the novel open-vocabulary Chinese LM, a perplexity-based corpus selection approach and an n-gram distribution adaptation module. The recognition accuracy of named entity words is also analyzed as an effect of LM adaptation. Experiments showed about 10% relative character error rate reduction and a 4% (absolute) increase in the recognition accuracy of named entity words.
  • DONG Jing, SUN Le, FENG Yuan-yong, HUANG Rui-hong
    2007, 21(4): 80-91.
    Entity relation extraction is one of the important research fields in information extraction. This paper presents a novel method that divides entity relations into two categories: embedding relations and non-embedding relations. After some simple experiments, we found that some syntactic features have clearly different effects on the identification of the two kinds of relations. Two different sets of syntactic features are therefore proposed for extracting the two categories. Experiments show that the new method achieves improved performance on the ACE2007 corpus for the Chinese entity relation extraction task.
  • ZHANG Zhi-wei, KONG Fan-rang, LIU Wei-lai, LONG Qian, LIU Yong-bin
    2007, 21(4): 86.
    Extraction of mathematical expressions is the first step of mathematical expression recognition. A new approach for separating both isolated and embedded expressions in printed Chinese technical documents is presented. After the features of text lines are extracted, ANFIS is used to classify the text lines into two classes: lines of text and lines of isolated expressions. For embedded expressions, fuzzy clustering and a dynamic programming algorithm are applied to extract Chinese characters, Chinese punctuation and English letters in sequence. Finally, heuristic rules are used to merge mathematical symbols into expressions. Experiments show that the proposed methods have high accuracy.
  • LIU Qing-sheng, WEI Si, HU Yu, GUO Wu, WANG Ren-hua
    2007, 21(4): 92-96.
    With the popularization of Putonghua, the demand for computer-aided testing and learning is growing ever stronger. In this paper, we propose an improved phone-based quality assessment algorithm for Putonghua pronunciation. Linguistic knowledge of Putonghua pronunciation is effectively introduced into the calculation of the HMM-based log posterior probability. Compared with the traditional method, the proposed one not only reduces the computational complexity but also improves the correlation between machine and human scores from 0.704 to 0.795 on a database of 303 persons recorded on site at the Putonghua Shuiping Ceshi.
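The abstract reports a correlation between machine and human scores without naming the coefficient. Assuming it is the usual Pearson correlation, a minimal sketch of how such a figure is computed follows; the scores below are hypothetical and are not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

# Hypothetical scores for five speakers.
human_scores = [92.0, 85.5, 78.0, 88.0, 70.5]
machine_scores = [90.0, 83.0, 80.0, 91.0, 72.0]

r = pearson(human_scores, machine_scores)
print(round(r, 3))
```

A value closer to 1 means the automatic scores rank and space the speakers more like the human raters do.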
  • ZHANG Sen, HUA Shao-he
    2007, 21(4): 97-104.
    The automatic transcription, annotation and retrieval of broadcast news require automatic speech recognition, natural language processing and information retrieval technologies. The state of the art in automatic annotation and retrieval of broadcast news is discussed and the related key techniques are analyzed. A multi-level automatic annotation framework for Mandarin broadcast news and a retrieval method based on that framework are then presented. The annotation attributes at the document level, utterance level and word level are investigated, and a recursive method for multi-level annotation is proposed. Furthermore, the speech recognition engine and audio stream media segmentation problems, which are closely related to the speech annotation problem, are investigated. The proposed approaches were applied to 102 hours of Mandarin broadcast news for annotation and retrieval, and the experimental results were satisfactory.
  • ZHANG Wei, SUN Le, FENG Yuan-yong, LI Wen-bo, HUANG Rui-hong
    2007, 21(4): 105-110.
    The Chinese input method is one of the key challenges in Chinese information processing. With the rapid increase in the number of Chinese web users, the efficiency of Chinese input methods has become more and more important. Based on observations of long-term dependencies in sentences, we implemented a collocation-based pinyin input system using collocations extracted from a large-scale corpus. This system is able to capture long-term word collocations. The idea is further introduced into the personalization module of our pinyin system to help the user input Chinese more efficiently. The experimental results show that the methods proposed in this paper are promising.
  • JIANG Di
    2007, 21(4): 111.
    This paper describes Tibetan object clauses governed by verbs of narration. The types of verbs include narrating verbs, cognitive verbs, verbs of thought, inquiring verbs and other semantically related verbs. In terms of its own sentence structure, an object clause may be a complete sentence with a subject, a predicate, and sentence-final and modal markers, or perhaps just a single verb. The object clause has its own predicate and ought to be nominalized by adding nominalizing markers. The special marker "zer" and its variants in pronunciation and written form come from the nominalization of the verb "zer". Object clauses involve complex relationships and layers similar to English direct and indirect speech. When the subject of an object clause is omitted, the agent of the clause is identified through function words of aspect and modality and through context. The clauses may be declarative, interrogative, imperative or exclamative, and they therefore take different types of mood words. It is worth pointing out that some nominalizing markers are absent in object clauses, which causes trouble for syntactic algorithms and remains to be discussed.
  • SU Guo-ping, MIAO Cheng, XIA Guo-ping
    2007, 21(4): 116-121.
    This paper studies the auto-transfiguration mechanism for the characters of several minority languages, including Uighur, Kazakh and Khalkhas. The characteristics of these minority languages are analyzed first. The structure and processing steps of the latest OpenType font technology are then briefly introduced. Based on those characteristics and technologies, a general auto-transfiguration engine was designed, and by analyzing the connection types of the letters, a database of auto-transfiguration rules was constructed. According to those rules, the binding of type labels was realized. By integrating the OpenType font interpretation engine with the type labels, the replacement and displacement of characters were implemented. Finally, this automatic type selection engine was implemented in the Uighur-Kazakstan-Khalkhas version of the Evermore office software. Application testing shows that the engine fully meets the transfiguration needs of the letters of these minority languages.
  • CHEN Zhuang
    2007, 21(4): 122.
    Standardization is the basis of the industrialization of technology. Chinese information processing technology is unique in the world, and China is now playing a leading role in this area. China has made remarkable achievements in the international standardization of Chinese character coding through the activities of ISO/IEC JTC1/SC2 since the 1980s. This paper introduces the working scope, working mechanism and structure of ISO/IEC JTC1/SC2, and describes how China participates in the activities of ISO/IEC JTC1/SC2 and its subgroups. It briefly introduces the international standard ISO/IEC 10646, together with China’s achievements, current work and future plans concerning ISO/IEC 10646. It also discusses China’s status in ISO/IEC JTC1/SC2 and its subgroups and the significance of China’s participation. Finally, the author offers some suggestions to those interested in China’s future work in ISO/IEC JTC1/SC2.