2007 Volume 21 Issue 2 Published: 16 April 2007
  

  • Review
    WU Yun-fang, JIN Peng, GUO Tao
    2007, 21(2): 1-8.
    This paper presents a simple but effective feature-based approach to Chinese word sense disambiguation (WSD) using the distributional features available from the Grammatical Knowledge-base of Contemporary Chinese. The test data is the sense-tagged corpus of People's Daily. A Naïve Bayes (NB) classifier is also tried as a comparable statistical method. The feature-based approach achieves a precision of 90%, which is comparable to the NB classifier. The striking advantages of the feature-based approach are that 1) it is not influenced by the data size, and 2) it can disambiguate some specific words with a precision of 100%. The features appropriate for different parts of speech in Chinese WSD are also discussed. This paper demonstrates that sense features described in the lexicon are worth including in WSD.
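    As a rough illustration of the statistical baseline compared against in this paper, the sketch below implements a bag-of-context Naïve Bayes sense classifier with Laplace smoothing. The toy senses and context words are invented; the paper's actual features come from the Grammatical Knowledge-base rather than raw context words.

    ```python
    from collections import Counter, defaultdict
    import math

    class NaiveBayesWSD:
        """Naive Bayes word sense disambiguation over bag-of-context features."""

        def __init__(self, alpha=1.0):
            self.alpha = alpha                    # Laplace smoothing constant
            self.sense_counts = Counter()         # P(sense) numerators
            self.feat_counts = defaultdict(Counter)  # P(word | sense) numerators
            self.vocab = set()

        def train(self, examples):
            # examples: list of (context_words, sense) pairs
            for context, sense in examples:
                self.sense_counts[sense] += 1
                for w in context:
                    self.feat_counts[sense][w] += 1
                    self.vocab.add(w)

        def classify(self, context):
            total = sum(self.sense_counts.values())
            best, best_lp = None, float("-inf")
            for sense, n in self.sense_counts.items():
                lp = math.log(n / total)          # log prior
                denom = sum(self.feat_counts[sense].values()) + self.alpha * len(self.vocab)
                for w in context:
                    num = self.feat_counts[sense][w] + self.alpha
                    lp += math.log(num / denom)   # smoothed log likelihood
                if lp > best_lp:
                    best, best_lp = sense, lp
            return best
    ```

    Training on a handful of sense-tagged contexts and classifying a new context is then a two-call affair.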
  • Review
    GUO Yong-hui, WU Bao-min, WANG Bing-xi
    2007, 21(2): 9-13.
    Part-of-speech (POS) tagging approaches usually utilize linguistic knowledge described from a single perspective. Based on a study of four POS tagging methods, namely TBED, DT, HMM and ME, we propose a novel data fusion strategy for POS tagging: the correlation voting method. Experimental results show that the linguistic knowledge relevant to POS tagging can be described more comprehensively by applying data fusion, and that correlation voting outperforms other fusion methods, with an average decrease of 27.85% in tagging error rate.
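    The correlation voting method itself is not specified in the abstract; as a simplified stand-in, the sketch below fuses the outputs of several POS taggers by plain weighted voting, where the per-tagger weights (for example, held-out accuracies) are a hypothetical input.

    ```python
    from collections import Counter

    def vote_tags(tagger_outputs, weights=None):
        """Fuse per-token POS tags from several taggers by weighted voting.

        tagger_outputs: list of tag sequences, one per tagger, all the same length.
        weights: optional per-tagger reliability weights; defaults to equal votes.
        """
        if weights is None:
            weights = [1.0] * len(tagger_outputs)
        fused = []
        for position in zip(*tagger_outputs):   # tags proposed for one token
            score = Counter()
            for tag, w in zip(position, weights):
                score[tag] += w
            fused.append(score.most_common(1)[0][0])
        return fused
    ```

    With unequal weights, a highly reliable tagger can outvote two weak ones on any given token.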
  • Review
    LIU Fei-fan, ZHAO Jun, XU Bo
    2007, 21(2): 14-21.
    Entity recognition plays an important role in many natural language processing applications. Previous studies of entity recognition have mainly focused on Named Entity Recognition (NER) and have not considered nested named entities. This paper proposes a multi-scale nested entity mention recognition system in the context of ACE (Automatic Content Extraction), which aims to identify named, nominal and pronominal mentions of entities within unstructured texts and to assign multiple attributes to all the mentions. We separate this task into two subtasks: multi-scale nested boundary detection and multiple information recognition. First, we propose an information encoding method for nested structures that provides an effective way to recast the multi-scale nested boundary detection problem as the classical sequence labeling problem. Second, a parallel two-agent classifier is presented to conduct multiple information recognition for each entity mention. Furthermore, abundant multi-level linguistic features are integrated into our machine-learning-based framework to achieve competitive performance. We evaluate the proposed framework on the ACE standard corpus through extensive experiments, obtaining an accuracy of 71% for nested boundary detection and accuracies of 89.05% and 82.17% for the two classification agents respectively.
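    One common way to recast nested boundary detection as ordinary sequence labeling, which may or may not match the encoding used in this paper, is to give each nesting depth its own BIO layer:

    ```python
    def layered_bio(tokens, mentions):
        """Encode possibly nested mention spans as one BIO tag sequence per layer.

        mentions: list of (start, end) spans over tokens, end exclusive; inner
        spans may be nested inside outer ones. Returns a list of per-layer tag
        sequences, so multi-scale boundary detection becomes several runs of a
        flat sequence labeler.
        """
        # Place outer spans first: sort by start, then by decreasing length.
        spans = sorted(mentions, key=lambda s: (s[0], -(s[1] - s[0])))
        layers = []
        for start, end in spans:
            # Reuse the first layer where this span does not collide.
            for tags in layers:
                if all(t == "O" for t in tags[start:end]):
                    break
            else:
                tags = ["O"] * len(tokens)
                layers.append(tags)
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "I"
        return layers
    ```

    A mention nested inside a longer one simply lands on the next layer down, and decoding is the reverse mapping.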
  • Review
    MAO Ting-ting, LI Li-shuang, HUANG De-gen
    2007, 21(2): 22-28.
    This paper describes a hybrid model, and the corresponding algorithm, combining support vector machines (SVM) with statistical methods to improve the performance of SVM for the task of Chinese person name recognition. In this algorithm, a training set is obtained by extracting the attributes of character-based feature vectors, and the SVM model for automatic identification of Chinese person names is set up by choosing a proper kernel function. A threshold on the distance from the test sample to the SVM hyperplane in feature space is then used to separate the SVM region from the statistical-method region: if the distance is greater than the given threshold, the test sample is classified using the SVM; otherwise, the statistical model is used. The experimental results show that the recall, precision and F-measure for recognition of Chinese person names based on the hybrid model reach 91.96%, 94.62% and 93.27% respectively in an open test. Compared with the SVM alone, the F-measure increases by 1.51%. By integrating the advantages of the two methods, the performance is clearly improved.
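    The routing rule described above can be sketched in a few lines; `svm_decision` and `stat_classify` below are hypothetical stand-ins for the trained SVM's signed decision function and the statistical back-off model, and the label strings are invented.

    ```python
    def hybrid_classify(x, svm_decision, stat_classify, threshold):
        """Route a sample to the SVM or to a statistical back-off model.

        svm_decision(x) returns the signed distance to the SVM hyperplane.
        Samples far from the hyperplane (the confident region) keep the SVM
        label; samples inside the margin band fall back to the statistical
        model, which is assumed to handle borderline cases better.
        """
        d = svm_decision(x)
        if abs(d) > threshold:
            return "name" if d > 0 else "not-name"
        return stat_classify(x)
    ```

    The threshold trades off how often the (cheaper but coarser) statistical model is consulted against SVM confidence.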
  • Review
    MAO Qi, LIAN Le-xin, ZHOU Wen-cui, YUAN Chun-feng
    2007, 21(2): 29-34.
    So far, most syntactic parsers neglect punctuation marks or oversimplify their functions, even though punctuation actually carries very important syntactic information. According to the features of punctuation in syntactic structure, this paper proposes a new concept, the separately parsed phrase, and, according to the typical character and position of punctuation in a sentence, presents a way to identify such phrases based on the ID3 decision tree algorithm. In this way, punctuation is integrated into syntactic analysis. All the experimental data sets, including the training data and test data, are derived from the Chinese Penn Treebank 5.0, and the experiments use only sentences longer than 40 Chinese words. The results indicate that accuracy and recall are improved by 1.59% and 0.93% respectively, and that the time cost is reduced by nearly 66.6%, showing that punctuation is quite useful and effective for parsing long Chinese sentences.
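    The ID3 algorithm mentioned above chooses decision tree splits by information gain. A minimal sketch of that criterion, with invented punctuation features and labels standing in for the paper's actual feature set:

    ```python
    import math
    from collections import Counter

    def entropy(labels):
        """Shannon entropy (bits) of a label multiset."""
        n = len(labels)
        return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

    def information_gain(examples, feature):
        """ID3 split criterion: entropy reduction from splitting on `feature`.

        examples: list of (feature_dict, label) pairs.
        """
        labels = [y for _, y in examples]
        base = entropy(labels)
        by_value = {}
        for feats, y in examples:
            by_value.setdefault(feats[feature], []).append(y)
        remainder = sum(len(ys) / len(examples) * entropy(ys)
                        for ys in by_value.values())
        return base - remainder
    ```

    ID3 greedily picks the feature with the highest gain at each node, so a punctuation feature that cleanly separates phrase boundaries from non-boundaries rises to the top of the tree.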
  • Review
    WANG Can-hui, ZHANG Min, MA Shao-ping
    2007, 21(2): 35-45.
    Natural language processing (NLP) has been applied to information retrieval (IR) by researchers in the hope of improving retrieval effectiveness, but most of the results are the opposite of what was hypothesized. In most cases, NLP did not increase IR precision and even had a negative effect. Even where NLP helped IR under some circumstances, the improvements were much smaller than the processing cost NLP required. Researchers have analyzed these phenomena and come to the following conclusions: IR-related tasks that require accurate results, such as question answering (QA) and information extraction (IE), are better suited to the use of NLP, and NLP needs to be optimized for IR in order to be effective. Recent research, e.g. adding NLP factors to language models, has largely confirmed these conclusions.
  • Review
    LIU Jia-bin, HU Guo-ping, CHEN Chao, SHAO Zheng-rong
    2007, 21(2): 46-51.
    A question answering system can give users a precise answer to a question presented in natural language, and a major factor influencing the system's performance is the scale of its question-answer pairs. In order to enlarge the scale of question-answer pairs and make full use of Web page resources, this paper proposes a method that uses a decision tree and a Markov model to extract question-answer pairs from Web pages. The method uses a DOM tree to represent a Web page according to its HTML tags, acquires feature values from every node of the DOM tree, and then passes the features through a classification model, built from the decision tree and the Markov model, to obtain each node's final classification. Experimental results show that the precision reaches 90.40% and the recall reaches 86.03%, and also that this model can extract information from all kinds of Web pages.
  • Review
    SU Jin-song, ZHOU Chang-le, LI Yi-hong
    2007, 21(2): 52-57.
    The annotated corpus of Song Dynasty poetry is the foundation of computer-based study of Song Dynasty poetry. In this paper, we propose a new definition of "word" in Song poetry and a new method for building the annotated corpus. Two available methods, statistical word extraction and segmentation based on rules and forms, are taken into consideration. The former is adopted to extract closely combined two-character words and to establish word lists in combination with related resources; the latter, combined with the word lists, is used to segment Song Dynasty poetry. The experimental results show that the method applied in this paper is effective.
  • Review
    ZHAO Shi-qi, LIU Ting, LI Sheng
    2007, 21(2): 58-62.
    Few of the existing document clustering methods can detect or describe document topics properly, which makes it difficult to conduct clustering based on topics. In this paper, we introduce a novel topical document clustering method called Linguistic Features Indexing Clustering (LFIC), which can identify topics accurately and cluster documents according to these topics. In LFIC, “topic elements” are defined and extracted for indexing base clusters. Additionally, linguistic features are exploited. Experimental results show that LFIC can gain a higher precision (94.66%) than some widely used traditional clustering methods.
  • Review
    HUANG Yong-guang, LIU Ting, CHE Wan-xiang, HU Xiao-guang
    2007, 21(2): 63-68.
    This paper focuses on short texts such as those occurring in mobile short messages and chat rooms. Because of their irregular style and high mutual similarity, we call them abnormal texts. We propose an efficient clustering algorithm based on a duplicated-information deletion algorithm. It takes account of the features of abnormal short texts and applies special techniques, such as feature code extraction and code compression, to solve this problem. Experiments show that a clustering system based on this algorithm can process millions of abnormal short texts per hour with high accuracy.
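    The abstract mentions feature code extraction and code compression without detail, so the following is a speculative sketch of the general idea only: compress each short text to a compact signature and bucket texts by signature, so near-duplicates cluster in a single linear pass.

    ```python
    from collections import Counter, defaultdict

    def feature_code(text, k=4):
        """Compress a short text to a small signature: its k most frequent
        characters (ties broken alphabetically), ignoring whitespace."""
        counts = Counter(c for c in text if not c.isspace())
        top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        return "".join(c for c, _ in top)

    def cluster_by_code(texts, k=4):
        """One-pass bucketing of texts by their feature code."""
        buckets = defaultdict(list)
        for t in texts:
            buckets[feature_code(t, k)].append(t)
        return dict(buckets)
    ```

    Because each text is hashed once, throughput scales linearly with the number of messages, which is consistent with the millions-per-hour figure claimed above.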
  • Review
    WU You-zheng, ZHAO Jun, XU Bo
    2007, 21(2): 69-76.
    This paper presents an unsupervised learning algorithm that learns answer patterns for the answer extraction module of Chinese question answering (QA). Given two or more questions of one question type, the algorithm can learn the corresponding answer patterns from the Internet via Web search, topic segmentation, pattern extraction, vertical clustering, horizontal clustering, etc. The experimental results show that the performance of pattern-based answer extraction for Chinese QA is improved significantly.
  • Review
    ZHOU Jun-sheng, HUANG Shu-jian, CHEN Jia-jun, QU Wei-guang
    2007, 21(2): 77-82.
    Coreference resolution plays an important role in natural language processing. Facing the fact that Chinese training corpora for coreference resolution are heavily lacking, this paper presents a new unsupervised clustering algorithm for noun phrase coreference resolution. In this approach, the coreference resolution problem is first converted into a graph clustering problem, and an objective function, the modularity function, which allows automatic selection of the number of clusters, is then chosen for graph clustering. The proposed algorithm does not make pairwise coreference decisions independently of each other. The experimental results on the Chinese ACE training corpus demonstrate that the proposed method is a feasible unsupervised algorithm for noun phrase coreference resolution.
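    The modularity function used as the clustering objective can be computed as below (Newman's Q for an undirected graph). How the graph is built from noun phrase mentions is the paper's contribution and is not shown here; the toy graph in the usage is invented.

    ```python
    def modularity(adj, communities):
        """Newman modularity Q of a partition of an undirected graph.

        adj: dict node -> set of neighbours (symmetric).
        communities: list of disjoint node sets covering the graph.
        Q = (1/2m) * sum over same-community pairs of (A_ij - k_i*k_j/2m),
        where m is the edge count and k_i the degree of node i.
        """
        two_m = sum(len(nbrs) for nbrs in adj.values())  # equals 2 * edge count
        q = 0.0
        for comm in communities:
            for i in comm:
                for j in comm:
                    a_ij = 1.0 if j in adj[i] else 0.0
                    q += a_ij - len(adj[i]) * len(adj[j]) / two_m
        return q / two_m
    ```

    Because Q peaks at a particular partition rather than growing with the number of clusters, maximizing it selects the cluster count automatically, which is the property the abstract relies on.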
  • Review
    CHANG E, HOU Han-qing, CAO Ling
    2007, 21(2): 83-88.
    Automatic version comparison and analysis of ancient books means that the differences among different versions of an ancient book are automatically found and marked by computer, assisting the textual critic through supporting tools. This article first expounds the significance of automatic version comparison and analysis of ancient books. It then describes the comparison and analysis system in detail, including subject selection, data collection, objectives, the comparison and analysis algorithm, and the overall system plan. Finally, we discuss in depth how to develop the supporting tools, including lists of ancient official titles, personal names, place names, etc. The experimental results show a recall of 92.3% and a precision of 95.2%.
  • Review
    CHEN Guang, ZHANG Hong-gang, GUO Jun
    2007, 21(2): 89-93.
    A new feature extraction method that improves the performance of a handwritten Chinese character recognition system is presented. By using enhanced weighted dynamic meshes based on nonlinear normalization, this method not only avoids the zigzags and other undesirable side effects introduced by the original nonlinear normalization method of Yamada et al., but also avoids the additional feature normalization step in the original dynamic mesh methods of Lian-Wen Jin et al. and WU Tian-lei et al. Experiments on HCL2000 show that our method achieves superior performance.
  • Review
    CAI Lian-hong, CUI Dan-dan, CAI Rui
    2007, 21(2): 94-99.
    This paper describes our work on building and analyzing a corpus for Mandarin text-to-speech (TTS) systems, named TH-CoSS. The text script consists of four parts: sentences for TTS system building, sentences for TTS system evaluation, special syllable groups, and sentences with special sentence types to convey special intonation. The finished corpus contains about 20K sentences read by one female and one male speaker. The annotation files are in XML format, including segmental and prosodic tags, and software tools have been developed as well. On the basis of the syllables in TH-CoSS, an analysis of the influence of context features on the prosody of speech is carried out.
  • Review
    JIANG Wei, GUAN Yi , WANG Xiao-long, LIU Bing-quan
    2007, 21(2): 100-105.
    In order to overcome the difficulty of fusing more features into an n-gram model, a Pinyin-to-character conversion model based on support vector machines (SVM) is proposed in this paper, providing the ability to integrate more statistical information. The excellent generalization performance of SVM effectively overcomes the overfitting problem of the traditional model, and the soft-margin strategy overcomes, to some extent, the noise problem in the corpus. Furthermore, rough set theory is applied to extract complicated long-distance features, which are fused into the SVM model as a new kind of feature, solving the difficulty traditional models have in incorporating long-distance dependencies. The experimental results show that this SVM Pinyin-to-character conversion model achieves 1.2% higher precision than a trigram model with absolute smoothing; moreover, the SVM model with long-distance features achieves 1.6% higher accuracy.
  • Review
    ZHANG Shi-lei, ZHANG Shu-wu, XU Bo
    2007, 21(2): 106-111.
    We propose a two-level unsupervised method for audio segmentation that effectively detects acoustic changes of speaker, environment and channel in a continuous audio stream. In our approach, the change detection process is divided into two levels: a region level that detects potential change regions containing candidate acoustic change points, and a boundary level that searches for and refines the true change points. At the region level, we employ a modified Generalized Likelihood Ratio metric to search for potential change regions in continuous local windows without setting any threshold. At the boundary level, we apply the T2 statistic and the Bayesian Information Criterion (BIC) algorithm to detect segment boundaries within the potential windows. Experimental results on the 1997 Broadcast News Hub4-NE Mandarin corpus show that the proposed scheme obtains a nearly 10.5% increase in recall.
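    A minimal sketch of the BIC boundary test on a one-dimensional feature stream, assuming scalar Gaussian segment models; the paper works on multi-dimensional acoustic features and adds the T2 pre-filter, both omitted here, and the penalty weight is the usual tunable λ.

    ```python
    import math
    from statistics import pvariance

    def delta_bic(samples, cut, penalty_weight=1.0):
        """Delta-BIC for one candidate change point in a 1-D Gaussian stream.

        Positive values favour modelling the two sides with separate Gaussians,
        i.e. an acoustic change at index `cut`.
        """
        n = len(samples)
        left, right = samples[:cut], samples[cut:]
        v, v1, v2 = pvariance(samples), pvariance(left), pvariance(right)
        # Penalty: the change model has 2 extra free parameters (mean, variance).
        penalty = penalty_weight * 0.5 * 2 * math.log(n)
        return 0.5 * (n * math.log(v)
                      - len(left) * math.log(v1)
                      - len(right) * math.log(v2)) - penalty

    def best_change_point(samples, min_seg=2):
        """Scan all cuts (keeping min_seg samples per side) for the BIC peak."""
        cuts = range(min_seg, len(samples) - min_seg + 1)
        return max(cuts, key=lambda c: delta_bic(samples, c))
    ```

    On a stream whose mean jumps partway through, the pooled variance is large while the per-side variances are small, so delta-BIC peaks sharply at the true boundary.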
  • Review
    LU You-fei, ZHANG Wei, ZHANG Yan, MIAO Cheng, LI Chun
    2007, 21(2): 112-116.
    Designing an office suite supporting the Uighur, Chinese and English languages is of great value to the development of information technology in minority areas. In view of this, the article first introduces the characteristics of Uighur, then analyzes and implements the essential techniques in the design of the Uighur version of Evermore Integrated Office, including automatic glyph selection, syllable-based line breaking and automatic lengthening. After these techniques were applied in the Uighur version of the office suite, testing showed that Uighur typesetting became extremely neat. These techniques also provide general guidance for Uighur script processing and for the development of other Uighur-language software.
  • Review
    MENG Fan-qiang, WU Jian, JIA Yan-min
    2007, 21(2): 117-121.
    Mongolian belongs to the complex scripts, and so far it cannot be rendered correctly in operating systems and office suites. OpenOffice.org is a multi-platform office suite that runs on Linux and Windows and invokes the ICU LayoutEngine and Uniscribe to process complex scripts on Linux and Windows respectively. In this paper, building on OpenOffice.org's support for Mongolian on Linux, we analyze the complex-script processing of OpenOffice.org on Linux and Windows, and implement Mongolian rendering in OpenOffice.org by integrating the ICU LayoutEngine with Uniscribe.
  • Review
    ZHANG Xing-liang, RUI Jian-wu, XIE Qian, CHENG Wei, WU Jian
    2007, 21(2): 122-128.
    There has been a lack of coded character standards regulating the commonly used BrdaRten characters in software development. The newly released national standards, "Information technology—Tibetan coded character set—Extension A" and "Information technology—Tibetan coded character set—Extension B", are of great importance to the standardization and globalization of software development in China. In this paper, the encoding methods adopted by the Tibetan character set defined in ISO/IEC 10646 and by the Tibetan coded character set Extensions A and B are compared, and critical implementation problems are analyzed. Finally, considering the special characteristics of Tibetan coded character set Extension B, a reasonable solution based on the Linux I18N architecture is proposed to fully support the newly released national standards.