2007 Volume 21 Issue 1 Published: 15 February 2007
  

  • Review
    QIN Ying, WANG Xiao-jie, ZHANG Su-xiang
    2007, 21(1): 1-8.
    One of the challenges in Chinese word segmentation is the combinational ambiguity problem, which poses two main obstacles: detecting combinational ambiguities and resolving them. This paper investigates the structures of combinational ambiguities and proposes a new approach for automatically detecting this type of ambiguity. Experimental results show that the approach is effective: on the tagged corpus of the January 1998 People's Daily, containing about 1 million words, it detected more than 400 combinational ambiguities, far more than common approaches detect. The resolution of 60 combinational ambiguities is then carried out using a maximum entropy model, and the effect of six kinds of features, as well as their combinations, on disambiguation performance is further studied. The average disambiguation accuracy reaches 88.05%.
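For reference, the maximum entropy model named in the abstract is conventionally the conditional log-linear classifier below; the abstract itself does not spell out the formula.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big)
```

Here $x$ is the ambiguous string with its context, $y$ a segmentation decision, $f_i$ the (six kinds of) binary features, and $\lambda_i$ their learned weights.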
  • Review
    Kang Byeong-Kwu, ZHANG Qin-long, CHEN Yi-rong, CHANG Bao-bao
    2007, 21(1): 9-16.
    This paper suggests a methodology aimed at extracting multi-word chunks for translation purposes. Our basic idea is to use a hybrid method that combines statistical measures with linguistic rules. The extraction system operates in four steps: (1) tokenization of the Chinese corpus; (2) extraction of multi-word chunks (2-gram to 10-gram) using Nagao's algorithm and the substring reduction algorithm; (3) statistical filtering that combines mutual information (or the log-likelihood ratio) with left/right entropy; (4) linguistic filtering by chunk-formation rules and a stop-word list. The hybrid method proved suitable for selecting multi-word chunks: it considerably improved extraction precision, well above that of a purely statistical method. We believe that multi-word chunks extracted in this way can effectively supplement existing translation memory databases.
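As a rough illustration of step (3), the following sketch scores candidate bigram chunks by pointwise mutual information and left/right context entropy. It is a simplification (bigrams only, names of our own choosing), not the paper's extraction system.

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy of a context distribution."""
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0

def chunk_scores(tokens):
    """Score candidate bigram chunks by PMI and left/right context entropy."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(1, n - 2):
        pair = (tokens[i], tokens[i + 1])
        left[pair][tokens[i - 1]] += 1
        right[pair][tokens[i + 2]] += 1
    scores = {}
    for pair, f in bigrams.items():
        # Pointwise mutual information: log p(xy) / (p(x) * p(y)).
        pmi = math.log((f / n) / ((unigrams[pair[0]] / n) * (unigrams[pair[1]] / n)))
        # High entropy on both sides suggests a free-standing chunk
        # rather than a fragment of a longer fixed expression.
        scores[pair] = (pmi, entropy(left[pair]), entropy(right[pair]))
    return scores
```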
  • Review
    HUANG He-yan, ZHANG Ke-liang, ZHANG Xiao-fei
    2007, 21(1): 17-22.
    In the design and implementation of specialty machine translation systems, a crucial concern is the efficient organization of domain-specific technical terms and the intelligent selection of terminological meanings on the basis of the text being processed. This paper begins with an analysis of some problems ubiquitous in technical lexicons for specialty MT systems and a brief introduction to the features of ontology-based domain-specific conceptual systems. Some important aspects of specialty MT-oriented technical lexicons are then studied, including the design of a general-purpose specialty ontology, the description of technical terms and their mapping to the specialty ontology, and the organization and application of bilingual or multilingual MT domain-specific lexicons. Finally, the paper presents some experimental work, covering the design of a draft MT-oriented specialty classification system, the mapping from technical lexicons to the specialty classification system, and the mapping from ICS (International Classification for Standards) to the MT specialty classification system. The results of the mapping experiments show that the classification system constructed in the paper has desirable coverage of MT technical lexicons.
  • Review
    JIANG Long, ZHOU Ming, Chien Lee-feng
    2007, 21(1): 23-29.
    This paper presents a novel approach to improving named entity translation by combining transliteration with web mining. In this approach, a transliteration model is used to generate translation candidates, and web information is then applied to obtain more translations. A maximum entropy (ME) model is employed to rank the translation candidates using various features such as pronunciation similarity, contextual features, co-occurrence, etc. The experimental results show that our approach improves the precision of named entity translation by a large margin.
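A minimal sketch of the ranking step, assuming a standard log-linear (maximum entropy) scorer over the kinds of features the abstract lists; the feature functions and weights here are placeholders, not the paper's.

```python
# Placeholder feature functions and weights; a standard log-linear
# (maximum entropy) scorer over candidate translations.
def rank_candidates(entity, candidates, features, weights):
    def score(cand):
        # Weighted sum of feature values, e.g. pronunciation similarity,
        # contextual match, web co-occurrence counts.
        return sum(w * f(entity, cand) for f, w in zip(features, weights))
    return sorted(candidates, key=score, reverse=True)
```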
  • Review
    WANG Hong-jun, SHI Shui-cai, YU Shi-wen, XIAO Shi-bin
    2007, 21(1): 30-37.
    Retrieving the translations of a document is very helpful for constructing bilingual parallel corpora. This paper proposes an improved approach for this purpose, which uses a statistical translation model to match bilingual word pairs, uses the weights of word pairs as features for computing similarity, and uses a new Dice-based method to compute cross-language document similarity. The approach was evaluated by measuring how often the translation of a given document was identified among the top N similar documents. Although two noisy datasets were used in the experiment, about 90% of translations were identified within the top 5 similar documents. The experimental results show that the weights of bilingual word pairs are good features for similarity computation and that this approach can effectively find a document's translation equivalent in other languages.
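A minimal sketch of one plausible weighted, Dice-based similarity of the kind the abstract describes; the paper's exact formula is not given here, so the thresholding and weighting are assumptions.

```python
def weighted_dice(src_words, tgt_words, trans_prob, threshold=0.1):
    """Weighted Dice similarity between a source and a target document,
    using statistical translation probabilities as word-pair weights.
    A sketch only; the paper's exact Dice variant is not in the abstract."""
    matched = 0.0
    for s in src_words:
        for t in tgt_words:
            p = trans_prob.get((s, t), 0.0)
            if p >= threshold:          # count only confident word pairs
                matched += p
    # Dice form: matched mass relative to the two document lengths.
    return 2.0 * matched / (len(src_words) + len(tgt_words))
```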
  • Review
    JIANG Hong-chen, ZHENG Rong, ZHANG Shu-wu, XU Bo
    2007, 21(1): 49-53.
    This paper presents an automatic language identification (LID) system that uses shifted delta cepstra (SDC) feature vectors and a universal background model (UBM). The SDC feature is created by stacking delta cepstra computed across multiple speech frames and carries much more temporal information than the conventional MFCC feature. The UBM represents the characteristics of all the languages together, and each language model is obtained by Bayesian adaptation from this UBM. Compared with the conventional GMM method, training and testing are much faster. System performance is evaluated on the OGI corpus. The best identification accuracy over 11 languages is 73.28% for 10-s utterances, 82.62% for 30-s utterances, and 85.23% for 45-s utterances. The processing speed is about 0.03 times real time.
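For concreteness, here is a sketch of SDC computation under the usual N-d-P-k parameterization (e.g. 7-1-3-7); the paper's parameter settings are not stated in the abstract.

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Shifted delta cepstra: cep is a (T, N) matrix of cepstral frames;
    returns a (T', N*k) matrix of stacked delta vectors."""
    T, _ = cep.shape
    frames = []
    for t in range(d, T - d - (k - 1) * P):
        # Delta cepstra at shifted offsets t + i*P, stacked into one
        # vector, so each frame carries temporal context of ~k*P frames.
        deltas = [cep[t + i * P + d] - cep[t + i * P - d] for i in range(k)]
        frames.append(np.concatenate(deltas))
    return np.array(frames)
```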
  • Review
    DONG Hong-hui, TAO Jian-hua, XU Bo
    2007, 21(1): 54-59.
    This paper presents a linguistic constraint model and a phrase-length constraint model to describe the prosodic phrasing process, and describes each in detail. In the linguistic constraint model, the chunk is taken as an important basic unit. An HMM is used to model the phrase-length constraints, which include the distribution of prosodic phrase lengths and of the number of prosodic words per prosodic phrase. A k-candidate method is then introduced to combine the two models, making full use of both the linguistic constraints and the phrase-length constraints. Experiments show that this approach achieves good performance, with a phrasing F-score of 82.9%.
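A sketch of the k-candidate combination idea, with hypothetical model interfaces: the linguistic model proposes k candidate phrasings and the phrase-length HMM rescores them.

```python
# Hypothetical interfaces; an illustration of k-candidate combination,
# not the authors' code.
def combine(linguistic_model, length_model, sentence, k=10):
    candidates = linguistic_model.k_best_phrasings(sentence, k)
    def total_score(phrasing):
        ling = linguistic_model.score(phrasing)
        # The HMM scores the sequence of phrase lengths
        # (prosodic words per prosodic phrase).
        length = length_model.score([len(p) for p in phrasing])
        return ling + length     # log-domain combination
    return max(candidates, key=total_score)
```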
  • Review
    JIA Yan-min, WU Jian
    2007, 21(1): 60-66.
    Document processing is a key part of script handling. For typesetting multilingual text, this paper proposes a frame-based document processing model that supports multiple text layout directions. In this model, the text layout process is encapsulated in the document formatting module, so the problem of laying out text in multiple directions is reduced to the problem of laying out text horizontally from left to right. A recursive document-formatting algorithm for multi-directional text layout is also designed for this model. The different layout directions of various scripts, including Mongolian, Tibetan, and Uighur, are supported.
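A minimal sketch of the frame idea under assumed details: content is laid out left-to-right in frame-local coordinates, and a per-script direction vector maps local positions onto the page.

```python
# Assumed details, not the paper's model: each frame lays out its
# content left-to-right in local coordinates; a direction vector
# maps local advance positions onto the page.
DIRECTIONS = {
    "ltr": (1, 0),    # Latin, modern Chinese: left-to-right rows
    "rtl": (-1, 0),   # Uighur: right-to-left rows
    "ttb": (0, 1),    # traditional Mongolian: top-to-bottom columns
}

def place(frame_origin, direction, advances):
    """Map items laid out left-to-right in local coordinates to page positions."""
    dx, dy = DIRECTIONS[direction]
    x, y = frame_origin
    positions, pos = [], 0.0
    for adv in advances:
        positions.append((x + dx * pos, y + dy * pos))
        pos += adv
    return positions
```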
  • Review
    ZHOU Ke-lan, LV Qiang, ZHANG Yu-hua, PAN Ji-si, QIAN Pei-de
    2007, 21(1): 67-73.
    The national standard GB18031 plays an important role in the evaluation of digital input methods, but many difficulties arise when the standard is applied in practice, and there is no standard for the functions of digital input method software. Establishing a scientific national standard for the software functions of digital input methods has therefore become urgent. This paper first discusses how to judge ease of learning under the GB18031 specification and gives a quantitative analysis of the difficulties of applying the selection-rate criterion. It then summarizes the characteristics of digital input systems from the viewpoint of non-professional users' requirements and makes concrete suggestions on how to revise GB18031. Finally, the paper explains the necessity of establishing a national standard for the functions of digital input methods and presents a preliminary study on how to set up such a standard.
  • Review
    FANG Gui-ming
    2007, 21(1): 74-78.
    The scheme of the Chinese phonetic alphabet (pinyin) plays an important role in Chinese information processing, and Hanzi input by pinyin is one of the most popular input methods in China. The final of a Chinese syllable contains from 1 to 4 letters. A compression scheme to shorten the finals is proposed: finals originally written with 2 to 4 letters are coded with only 2 letters from the set {a, o, e, i, u, v}, which are distinct from the 20 letters used in the initials of Chinese syllables. Hanzi input for Chinese phrases can then be done with the key combination "initial, final, initial". The scheme can be used on an alphabetic keyboard and works even better on a numeric keypad. The 4 keys for pinyin tones differ from the 6 keys for the finals, so the boundary of each syllable is clear even when the finals are omitted. On a numeric keypad, each pair of fuzzy initials or finals has 3 codes, to help users who need fuzzy codes for some Chinese characters.
  • Review
    LIU Huai-jun, CHE Wan-xiang, LIU Ting
    2007, 21(1): 79-84.
    In the natural language processing field, researchers have shown growing interest in semantic role labeling with statistical and machine-learning methods, and rich features are the most important part of a semantic parsing system. In this paper, new effective features and combination features are proposed, such as the word following the constituent, the predicate-phrase type combination, the predicate class-path combination, and so on. We then report experiments on the Chinese Proposition Bank (CPB) dataset. With the new features, the final system improves the F-score from 89.76% to 91.31%, a statistically significant improvement. Finding better features is therefore very important for semantic role labeling.
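An illustrative feature-extraction sketch; the helper names and exact feature templates are assumptions based on the features the abstract lists.

```python
# Hypothetical parse/constituent interfaces; templates assumed from
# the abstract, not the paper's exact feature set.
def extract_features(constituent, predicate, parse):
    feats = {
        "phrase_type": constituent.label,
        "path": parse.path(constituent, predicate),
        "next_word": parse.word_after(constituent),  # word following the constituent
    }
    # Combination features conjoin two atomic features into one string.
    feats["pred+phrase_type"] = predicate.lemma + "|" + feats["phrase_type"]
    feats["pred_class+path"] = predicate.verb_class + "|" + feats["path"]
    return feats
```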
  • Review
    HE Zhong-jun, LIU Qun, LIN Shou-xun
    2007, 21(1): 85-89.
  • Review
    SUN Jing-guang, CAI Dong-feng, LV De-xin, DONG Yan-ju
    2007, 21(1): 90-95.
    A question answering system provides a precise and concise answer to a natural language query. Question classification is the first task of a question answering system, and its precision has a great effect on the subsequent processing. In this paper, we present a new feature extraction method that uses HowNet as a semantic resource, realized with a maximum entropy model. We choose the interrogative words, syntactic structure, question focus words, and their first sememes as classification features. The experimental results show that the first sememe in HowNet can express the main meaning of the question focus word and serves as an important feature. The method improves the precision of question classification: classification precision reaches 92.18% on coarse classes and 83.86% on fine classes.
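A sketch of the feature template described above, with hypothetical helper functions; the focus word's first sememe in HowNet stands in for its meaning.

```python
# Hypothetical helpers throughout; an illustration of the features
# named in the abstract, not the paper's implementation.
def question_features(question, hownet):
    wh = interrogative_word(question)   # e.g. "who", "where"
    focus = focus_word(question)        # head noun the question asks about
    return {
        "wh": wh,
        "syntax": syntax_pattern(question),
        "focus": focus,
        "focus_first_sememe": hownet.first_sememe(focus),  # e.g. human|人
    }
```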
  • Review
    XU Lin-hong, LIN Hong-fei, YANG Zhi-hao
    2007, 21(1): 96-100.
    In the fields of spam filtering, information security, and automatic summarization, text orientation identification is widely used. This paper presents a mechanism for text orientation identification based on semantic comprehension. First, it acquires the semantic orientation of words by computing the semantic similarity between each word and tagged seed words in HowNet, and adopts the derogatory or commendatory terms as classification features. It then uses a support vector machine classifier to identify the text orientation. Finally, it handles negative sentences by matching negation rules, and it also identifies derogatory or commendatory intensity through degree adverbs in order to improve classification accuracy.
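A minimal sketch of the word-orientation step, assuming orientation is scored as the difference in average HowNet similarity to commendatory versus derogatory seed words; the paper's exact scoring is not given in the abstract.

```python
# sim(word, seed) is an assumed HowNet-based similarity function.
def orientation(word, pos_seeds, neg_seeds, sim):
    pos = sum(sim(word, s) for s in pos_seeds) / len(pos_seeds)
    neg = sum(sim(word, s) for s in neg_seeds) / len(neg_seeds)
    return pos - neg   # > 0 commendatory, < 0 derogatory
```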
  • Review
    ZHANG Da-kun, ZHANG Wei, FENG Yuan-yong, SUN Le
    2007, 21(1): 101-108.
    The phrase-based statistical machine translation model is still the most popular model today, but it does not take non-contiguous phrases into account. A statistical machine translation model based on non-contiguous phrases is proposed in this paper. The units of translation are extended from contiguous phrases to phrases with gaps, in order to take advantage of context dependence. With fewer phrases, the efficiency of the decoder in our model is also improved. Experiments show that, with better efficiency, the translation results of our non-contiguous phrase-based model are comparable to those of the hierarchical model.
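A toy illustration of what a non-contiguous (gapped) phrase pair might look like; the paper's actual data structures are not given in the abstract.

```python
# Toy representation of a non-contiguous phrase pair with one gap;
# None marks the gap that another phrase fills during decoding.
src = ("把", None, "打开")    # "ba X da-kai"
tgt = ("turn", None, "on")    # "turn X on"
```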
  • Review
    YU Hui-jia, LIU Yi-qun, ZHANG Min, RU Li-yun, MA Shao-ping
    2007, 21(1): 109-114.
    User log analysis is important both for Web information retrieval technologies and for commercial search engine algorithms. To better understand the search behavior of Chinese Web search users, we present an analysis of a Sogou search engine query log consisting of approximately 50 million search request entries over a period of one month. The analysis covers retrieval behavior in the distribution of individual queries, user habits within a session, and the use of advanced search functions. The conclusions may help improve Web information retrieval algorithms and search performance evaluation methods.
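A minimal sketch of sessionizing such a query log, with an assumed tab-separated field layout and a conventional 30-minute session gap; neither detail comes from the paper.

```python
import csv
from datetime import datetime, timedelta

def sessions(log_path, gap=timedelta(minutes=30)):
    """Group queries by user; a gap over `gap` starts a new session.
    Assumed log layout: user_id \t ISO timestamp \t query."""
    by_user = {}
    with open(log_path, encoding="utf-8") as f:
        for user, ts, query in csv.reader(f, delimiter="\t"):
            by_user.setdefault(user, []).append((datetime.fromisoformat(ts), query))
    for user, events in by_user.items():
        events.sort()
        session = [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] > gap:
                yield user, session
                session = []
            session.append(cur)
        yield user, session
```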
  • Review
    XU Fang, ZONG Cheng-qing, WANG Xia
    2007, 21(1): 115-119.
    This paper proposes a hybrid error-driven combination approach to chunking Chinese base noun phrases (Chinese base NPs), which combines a TBL (transformation-based learning) model and a CRF (conditional random field) model. First, we give an overview of Chinese and English base NP chunking, followed by a description of the Chinese base NP chunking task. To analyze the respective results of the two (TBL-based and CRF-based) classifiers and improve the performance of the base NP chunkers, an error-driven SVM (support vector machine) classifier is trained on the classification errors of the two classifiers. In our experiments, the hybrid method achieves the best results, with an F-measure of 89.72%, improving by 2.35% in the best case compared with other methods.
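A sketch of the error-driven combination, with assumed interfaces: the SVM arbitrates only on tokens where the TBL and CRF chunkers disagree.

```python
# Assumed interfaces; an illustration of error-driven combination,
# not the paper's implementation.
def combine_predictions(tbl_tags, crf_tags, features, svm):
    final = []
    for tbl, crf, feat in zip(tbl_tags, crf_tags, features):
        if tbl == crf:
            final.append(tbl)            # classifiers agree: keep the tag
        else:
            # SVM trained on past disagreement cases picks the label,
            # seeing both base predictions as extra features.
            final.append(svm.predict(feat + [tbl, crf]))
    return final
```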