2010 Volume 24 Issue 6 Published: 15 December 2010
  

  • Review
    WANG Meng1,2, HUANG Chu-ren2, YU Shiwen1, LI Bin3
    2010, 24(6): 3-10.
    Noun compound interpretation aims to recover the implicit semantic relation between the head and the modifier. In this paper, we present a dynamic approach that uses paraphrasing verbs to interpret the meaning of Chinese noun compounds automatically, the first such attempt in the literature. The experimental results show that this approach not only provides possible interpretations for a noun compound, but also reflects the subtle semantic differences between similar noun compounds. In addition, our research can be applied in other fields such as question answering, information retrieval, and lexicography.
    Key words: Chinese noun compounds; interpretation; paraphrase; paraphrasing verbs
  • Review
    JIANG Xin1, JIANG Yi1, FANG Miao2, WANG Rongpei1
    2010, 24(6): 10-14.
    This study proposes a new fast segmentation method for classical Chinese texts based on a tree pruning process. Firstly, word candidates of two, three, and more characters are selected with likelihood ratio statistics. Then a fast segmentation algorithm is presented and its basic flow chart is illustrated. Finally, The Classic of Tea is used to verify its validity and effectiveness. Theoretical analysis and experimental instances show that the algorithm is effective and promising for computer-aided translation of classical Chinese texts.
    Key words: segmentation; tree pruning; likelihood ratio; The Classic of Tea; computer-aided translation
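    The likelihood-ratio candidate selection mentioned in this abstract is not spelled out; the Python sketch below assumes Dunning's log-likelihood ratio over a 2x2 character co-occurrence table for two-character candidates. The threshold and the restriction to two-character pairs are illustrative, not the paper's actual settings.

        import math
        from collections import Counter

        def llr(k11, k12, k21, k22):
            # Dunning's log-likelihood ratio for a 2x2 contingency table.
            def h(*ks):
                total = sum(ks)
                return sum(k * math.log(k / total) for k in ks if k > 0)
            return 2 * (h(k11, k12, k21, k22)
                        - h(k11 + k12, k21 + k22)
                        - h(k11 + k21, k12 + k22))

        def two_char_candidates(text, threshold=10.0):
            # Score adjacent character pairs; pairs with a high LLR are word candidates.
            left, right = Counter(text[:-1]), Counter(text[1:])
            bigrams = Counter(zip(text, text[1:]))
            n = len(text) - 1
            scored = {}
            for (c1, c2), k11 in bigrams.items():
                k12 = left[c1] - k11    # c1 followed by a different character
                k21 = right[c2] - k11   # c2 preceded by a different character
                k22 = n - k11 - k12 - k21
                scored[c1 + c2] = llr(k11, k12, k21, k22)
            return {w: s for w, s in scored.items() if s >= threshold}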
  • Review
    JIAN Ping, ZONG Chengqing
    2010, 24(6): 14-23.
    A layer-based projective dependency parsing approach is presented. This novel approach works layer by layer in a bottom-up manner, where the depth of the token dependencies handled in each pass is limited to one. Within a layer the dependency graphs are searched exhaustively, while between layers the parser state is transferred deterministically. Taking the dependency layer as the parsing unit, the proposed parser has a lower computational complexity than graph-based models that search for a whole dependency graph, and it alleviates the error propagation of transition-based models to some extent. Furthermore, our parser adopts sequence labeling models to find the optimal sub-graph of a layer, which demonstrates that sequence labeling techniques are qualified for hierarchical structure analysis tasks. Experimental results indicate that the proposed approach offers desirable accuracy and an especially fast parsing speed of 2 500 words per second on the Penn Treebank.
    Key words: dependency parsing; dependency layer; sequence labeling
  • Review
    ZHANG Liang1,2, YIN Cunyan1, CHEN Jiajun1
    2010, 24(6): 23-31.
    Word similarity analysis and computation is one of the key technologies in natural language processing. It can offer substantial help to parsing, machine translation, information retrieval, etc. Recently, Chinese word similarity computation based on HowNet has become a hot research issue, though most existing methods are improvements or modifications of the approach proposed in (Liu, 2002). Based on the new HowNet (2007), with its concept frame and multi-dimensional semantic expression form, this paper proposes a new method to analyze and compute Chinese word similarity from three dimensions: the main sememe, the main sememe frame, and the concept characteristic description. This method also distinguishes semantic similarity from syntactic similarity in the computation. Experiments show that the method achieves good performance.
    Key words: semantic tree; word similarity; HowNet 2007; semantic distance
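    For background on the HowNet similarity line of work cited above, the sketch below illustrates the classic sememe-distance similarity sim(p1, p2) = alpha / (d + alpha) usually attributed to (Liu, 2002), where d is the path length between two sememes in the sememe tree. The toy tree fragment and the alpha value are assumptions for illustration; the paper's own three-dimension method is not reproduced here.

        ALPHA = 1.6  # smoothing parameter commonly used in the literature

        # child -> parent edges of a tiny, hypothetical sememe-tree fragment
        PARENT = {
            "human|人": "animate|生物",
            "animal|兽": "animate|生物",
            "animate|生物": "entity|实体",
        }

        def path_to_root(sememe):
            path = [sememe]
            while path[-1] in PARENT:
                path.append(PARENT[path[-1]])
            return path

        def sememe_distance(s1, s2):
            # Path length between two sememes via their lowest common ancestor.
            p1, p2 = path_to_root(s1), path_to_root(s2)
            for i, node in enumerate(p1):
                if node in p2:
                    return i + p2.index(node)
            return len(p1) + len(p2)  # no common ancestor in this toy fragment

        def sememe_similarity(s1, s2):
            return ALPHA / (sememe_distance(s1, s2) + ALPHA)

        print(sememe_similarity("human|人", "animal|兽"))  # 1.6 / (2 + 1.6) ≈ 0.44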
  • Review
    LIU Qinglei, GU Xiaofeng
    2010, 24(6): 31-37.
    Word (and sentence) similarity computation based on HowNet usually treats the optimal matches between the primitives or words as the basic unit, with the final result being a weighted sum of the matches. However, this approach often results in information duplication and some irrational constructions. To deal with these issues, this paper proposes to calculate the similarity of sets from statistics on the common information (commonality) and the differing information (differences) between the two sets of direct primitives. Moreover, the paper introduces this measure into the calculation of sentence similarity. The final experimental analysis shows that the proposed method is more stable and effective.
    Key words: HowNet; word similarity; sentence similarity; common information; different information
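    The commonality/difference idea in this abstract can be pictured with a minimal set-based sketch. The Jaccard-style ratio of common to total information below is an assumption, since the abstract does not give the paper's exact weighting.

        def set_similarity(primitives_a, primitives_b):
            # Similarity of two sets of direct primitives based on how much
            # information they share versus how much differs between them.
            a, b = set(primitives_a), set(primitives_b)
            common = a & b                   # shared primitives (commonality)
            diff = (a - b) | (b - a)         # primitives in only one set (differences)
            if not common and not diff:
                return 1.0                   # two empty descriptions treated as identical
            return len(common) / (len(common) + len(diff))

        # Hypothetical primitive sets of two words:
        print(set_similarity({"human|人", "occupation|职位"}, {"human|人", "study|学"}))  # 1/3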
  • Review
    LI Zhihua, REN Qiuying, GU Yan, WANG Shitong
    2010, 24(6): 37-43.
    A kernel-based nominal data classification (KNDC) method is proposed in this paper, with a new distance definition and a simple inner product computation method. Its insensitivity to outliers and its classification capability on unbalanced real-world datasets are further analyzed. The calculation of the inner product over nominal data is difficult and is often regarded as a bottleneck of SVM. KNDC possesses a lower computational complexity than SVM on nominal datasets, and its validity is discussed. Experimental results on standard datasets demonstrate that the proposed method has promising performance compared with other methods.
    Key words: kernel-based classification method; nominal dataset; dissimilarity measure; inner product calculation
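    The abstract does not give the paper's actual distance or inner-product definition. As a rough illustration of kernelizing nominal attributes, the sketch below uses the standard overlap (matching) kernel; it is not the KNDC formulation proposed in the paper.

        import numpy as np

        def overlap_kernel(x, y):
            # Fraction of nominal attributes on which the two samples agree.
            # A standard positive-definite kernel for categorical data,
            # shown here only to illustrate the general idea.
            x, y = np.asarray(x, dtype=object), np.asarray(y, dtype=object)
            return float(np.mean(x == y))

        # Hypothetical nominal samples:
        a = ["red", "smooth", "forest"]
        b = ["red", "rough",  "forest"]
        print(overlap_kernel(a, b))  # 2 of 3 attributes match -> 0.666...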
  • Review
    SHAN Bin, LI Fang
    2010, 24(6): 43-50.
    As topics evolve over time, new topics emerge and old ones decay. Much research has been devoted to detecting topic evolution automatically. Latent Dirichlet Allocation (LDA), as a recently emerged probabilistic topic model, has been widely used in topic evolution research. This paper discusses two aspects of topic evolution, i.e., topic content and topic intensity. It summarizes three methods for LDA-based topic evolution detection according to how they handle time: incorporating time into the LDA model, post-discretizing, and pre-discretizing. The three methods are also compared on several features: time granularity, on-line versus off-line operation, etc. In addition, the evaluation methods for topic evolution are introduced. Finally, the paper gives some analysis and suggestions for future research on topic evolution based on LDA.
    Key words: topic model; topic evolution; Latent Dirichlet Allocation
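    The post-discretizing strategy summarized above can be sketched as follows: fit a single LDA model on the whole collection, then measure topic intensity per time slice as the average topic proportion of the documents in that slice. The documents, time stamps, and scikit-learn pipeline below are illustrative assumptions, not any surveyed paper's setup.

        from collections import defaultdict

        import numpy as np
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.feature_extraction.text import CountVectorizer

        docs = ["topic models evolve over time", "old topics decay", "new topics emerge over time"]
        years = [2008, 2009, 2009]  # one time stamp per document

        vec = CountVectorizer()
        X = vec.fit_transform(docs)
        lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
        theta = lda.transform(X)  # per-document topic proportions

        intensity = defaultdict(lambda: np.zeros(lda.n_components))
        counts = defaultdict(int)
        for doc_theta, year in zip(theta, years):
            intensity[year] += doc_theta
            counts[year] += 1

        for year in sorted(intensity):
            print(year, intensity[year] / counts[year])  # average topic intensity per slice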
  • Review
    CAO Jie, LV Yajuan, SU Jinsong, LIU Qun
    2010, 24(6): 50-57.
    The domain adaptation problem arises when a statistical machine translation (SMT) system is used to translate domain-specific texts. When the texts to be translated and the training data come from the same domain, the SMT system can achieve good performance; otherwise, the translation quality degrades dramatically. In general, domain-specific parallel corpora are limited, while domain-mixed parallel corpora and domain-specific monolingual corpora are easy to obtain. Based on this fact, this paper proposes a new translation model that utilizes a domain-mixed parallel corpus and a domain-specific monolingual corpus to improve domain-specific translation quality. Experiments show that the proposed method significantly improves translation performance on three IWSLT evaluation test sets.
    Key words: statistical machine translation; domain adaptation; context information
  • Review
    ZHOU Keyan, ZONG Chengqing
    2010, 24(6): 57-64.
    How to exploit semantic and pragmatic information is one of the difficulties in spoken language translation research. The dialog act, as a description of shallow discourse structure, has been utilized in several types of translation systems. In this paper, we first introduce dialog act theory and several well-known dialog-act-annotated corpora. Based on annotated corpora and automatic dialog act recognition, we propose three applications of dialog acts in phrase-based translation. By introducing dialog act classification, our approach improves the consistency between the training data and the test data, between the development set and the test set, and between the source language and the target language. Further, the translation process is more effective, and the translation result reflects the intention of the source language more accurately. The experimental results on Chinese-to-English spoken language translation show that dialog acts can make the spoken language translation system more accurate and effective.
    Key words: dialog act; spoken language translation; dialog act classification
  • Review
    GUO Haoting1,2, ZHENG Fang2, LUO Canhua2, LI Yinguo1
    2010, 24(6): 64-69.
    Speaker recognition is an important and popular user authentication method in daily life due to its convenience, low cost, and ease of acceptance. However, current algorithms cannot meet the real-time requirements of embedded applications. Based on the Non-Linear Partition (NLP) algorithms used in speech recognition, a novel algorithm is proposed and applied to embedded text-dependent speaker recognition. Compared with traditional Dynamic Time Warping (DTW) based algorithms, it achieves good practical results in terms of real-time performance.
    Key words: speaker recognition; text-dependent; embedded application; Non-Linear Partition
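    The proposed Non-Linear Partition algorithm is not detailed in the abstract. For reference, the DTW baseline it is compared against is the classic dynamic programming procedure sketched below; Euclidean frame distance is assumed.

        import numpy as np

        def dtw_distance(a, b):
            # Classic dynamic time warping distance between two feature
            # sequences a (n x d) and b (m x d).
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        # Toy example with two short 2-dimensional feature sequences:
        print(dtw_distance(np.array([[0.0, 1.0], [1.0, 1.0]]),
                           np.array([[0.0, 1.0], [0.5, 1.0], [1.0, 1.0]])))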
  • Review
    YE Na, CAI Dongfeng
    2010, 24(6): 69-75.
    There are two difficulties in query-focused multi-document summarization. First, to ensure high relevancy to the query, the summary tends to be repetitive. Second, the original query needs to be expanded to fully reflect the user's intention, but current query expansion methods usually depend on external linguistic resources. To solve these problems, this paper proposes a query-focused multi-document summarization approach in which subtopics are identified by topic analysis. When selecting sentences, both the relevancy to the query and the importance of the subtopic are considered. Then, the query is expanded according to the co-occurrence of words among subtopics, without using any external knowledge. Experimental results on the DUC 2006 corpus show that the new approach achieves higher performance than the baseline system, and the query expansion method further improves the summarization quality.
    Key words: query-focused; multi-document summarization; subtopic; relevancy; query expansion
  • Review
    LI Yanan1,2, WANG Bin1, LI Jintao1
    2010, 24(6): 75-85.
    Query suggestion, i.e., generating related queries or keywords for an initial query, has been widely utilized in search engines and sponsored search systems. As one of the necessary techniques in search engines, query suggestion draws more and more attention in the NLP and IR communities. In recent years, many studies have been conducted to validate the usefulness of query suggestion and to improve its effectiveness. This paper introduces the state of the art in query suggestion, including its history, approaches, and evaluation methods. The paper analyzes the challenges, discusses possible solutions, and suggests future work.
    Key words: computer application; Chinese information processing; survey; query suggestion; information retrieval
  • Review
    LIU Xiangtao1,2, GONG Caichun3, LIU Yue1, BAI Shuo1
    2010, 24(6): 85-92.
    In the Kad network, there are hundreds of millions of shared resources, among which a considerable part can be rated as questionable information. In order to understand the characteristics of resources, especially questionable ones, in the Kad network, the file resources of peers are measured and analyzed using the Kad-network crawler Rainbow. We find that: 1) both the popularity of files and the number of filenames corresponding to a file approximately fit a Zipf distribution; 2) the severity of questionable files can be judged more accurately using co-occurring words in the multiple filenames corresponding to the same file-content hash; 3) questionable resources occupy only 6.34% of the random samples, and 74.8% of them are video files.
    Key words: peer-to-peer network; Kad network; measurement and analysis; questionable resource
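    Finding 1) above (an approximately Zipf-distributed popularity) can be checked with a simple log-log fit, sketched below with placeholder counts rather than the actual Kad measurement data: a roughly linear log-log relation with slope near -1 indicates a Zipf distribution.

        import numpy as np

        # Hypothetical file-popularity counts, already sorted by rank (most popular first).
        popularity = np.array([980, 470, 320, 250, 190, 160, 140, 120], dtype=float)
        ranks = np.arange(1, len(popularity) + 1, dtype=float)

        # Least-squares fit of log(frequency) against log(rank).
        slope, intercept = np.polyfit(np.log(ranks), np.log(popularity), 1)
        print(f"estimated Zipf exponent: {-slope:.2f}")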
  • Review
    JIANG Peng, SONG Jihua
    2010, 24(6): 92-97.
    This paper combines DF and CHI to select features of web pages related to the field of teaching Chinese as a second language (TCSL). A classifier is first constructed based on a two-step topic similarity measurement over the title and the main text. The classifier is then applied to crawling web pages related to TCSL, and the results show substantial improvements in efficiency and recall compared with traditional methods. The classifier has already been deployed in practice for collecting data for a large TCSL corpus.
    Key words: DF; CHI; classifier; focused crawler
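    The DF + CHI feature selection described above can be sketched as follows: filter terms by document frequency, then rank the survivors by the 2x2 chi-square statistic between term presence and the class label. The thresholds and toy documents below are illustrative assumptions, not the paper's settings.

        from collections import Counter

        def chi_square(A, B, C, D):
            # 2x2 chi-square: A = positive docs with term, B = negative docs with term,
            # C = positive docs without term, D = negative docs without term.
            N = A + B + C + D
            denom = (A + C) * (B + D) * (A + B) * (C + D)
            return N * (A * D - B * C) ** 2 / denom if denom else 0.0

        def select_features(docs, labels, min_df=2, top_k=100):
            pos_total = sum(labels)
            neg_total = len(labels) - pos_total
            df, df_pos = Counter(), Counter()
            for doc, y in zip(docs, labels):
                for term in set(doc):
                    df[term] += 1
                    if y == 1:
                        df_pos[term] += 1
            scored = []
            for term, n in df.items():
                if n < min_df:          # DF filter
                    continue
                A = df_pos[term]
                B = n - A
                C = pos_total - A
                D = neg_total - B
                scored.append((chi_square(A, B, C, D), term))
            return [t for _, t in sorted(scored, reverse=True)[:top_k]]

        # Tiny hypothetical corpus: 1 = related to TCSL, 0 = not related.
        docs = [["对外", "汉语", "教学"], ["汉语", "语料库"], ["足球", "新闻"], ["足球", "比赛"]]
        labels = [1, 1, 0, 0]
        print(select_features(docs, labels, min_df=1, top_k=3))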
  • Review
    CAO Fang, WU Zhongke, AO Xuefeng, ZHOU Mingquan
    2010, 24(6): 97-103.
    Vector Chinese characters are popular for their high-quality display and output under transformations such as zooming and rotation. Therefore, the vectorization of Chinese characters is a fundamental issue in Chinese information processing. We propose a vectorization algorithm to trace the outlines of the 3 755 frequently used characters in the style of Qi Gong calligraphy. A vector character includes the representation of its strokes and their sequence, which may serve as a kind of support for the study of Qi Gong calligraphy. The paper presents the details of contour extraction, stroke extraction, and the final optimization.
    Key words: vectorization; Chinese calligraphy; Qi font; stroke
  • Review
    NIE Yanzhao, LIU Yongge
    2010, 24(6): 103-108.
    Digitization of the Carapace-bone-script (oracle bone script) requires the support of an input method. To improve on the existing input methods for the Carapace-bone-script, a stroke-based coding scheme is presented. The implementation of the corresponding input method proves its feasibility, and it may serve as a more convenient alternative for inputting the Carapace-bone-script.
    Key words: Carapace-bone-script; input method; stroke
  • Review
    HUANG Jinwen, JIN Hua, WANG Fan, CHEN Binhong, HE Yongshu, CHEN Xiaowei, LIN Qingwen, HUANG Xiaoming
    2010, 24(6): 108-114.
    In order to improve the current keyboard so that it better supports Chinese Pinyin IMEs, this paper proposes a new letter layout scheme based on statistics of letter frequencies in Chinese Pinyin (the phonetic alphabet). The new scheme is compared with the current keyboard layout from three aspects: static load, dynamic load, and the alternation rate between the left and right hands. In terms of workload, there is a linear decline from the forefinger to the middle finger, the ring finger, and the little finger, which better matches the practical efficiency of each finger. The alternation rate between the left and right hands is 0.74833, which indicates a more relaxed typing condition. These statistics validate that the new design would significantly enhance the efficiency of Chinese character input.
    Key words: Chinese information processing; keyboard layout; Chinese Pinyin; Pinyin IME
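    The left/right-hand alternation rate reported above is the share of consecutive keystroke pairs typed with different hands. A minimal sketch follows; the hand assignment is a hypothetical fragment of a standard layout, not the layout proposed in the paper, and the Pinyin sample is only for illustration.

        LEFT, RIGHT = "qwertasdfgzxcvb", "yuiophjklnm"
        HAND = {c: "L" for c in LEFT}
        HAND.update({c: "R" for c in RIGHT})

        def alternation_rate(keystrokes):
            # Fraction of adjacent keystroke pairs typed with different hands.
            pairs = [(a, b) for a, b in zip(keystrokes, keystrokes[1:])
                     if a in HAND and b in HAND]
            if not pairs:
                return 0.0
            return sum(HAND[a] != HAND[b] for a, b in pairs) / len(pairs)

        print(alternation_rate("zhongwenxinxichuli"))  # Pinyin of 中文信息处理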
  • Review
    SUN Ruina1, Gulila·Altenbek2
    2010, 24(6): 114-120.
    An automatic identification system for Kazakh base noun phrases is presented. Adopting a rule-based identification method together with manual annotation, a corpus of Kazakh base noun phrases is first constructed. Then, a combined approach using statistical information and linguistic rules is presented, which predicts base noun phrase boundaries by mutual information and corrects them with base noun phrase constitution rules. Experiments show that the precision is improved from 80.2% to 82.5% by incorporating the rules.
    Key words: corpus; base noun phrase; Kazakh; mutual information; rules
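    The mutual-information boundary prediction described above can be sketched as follows: compute pointwise mutual information for adjacent word pairs and hypothesize a base noun phrase boundary wherever the score falls below a threshold. The counts, the threshold, and the omission of the rule-based correction step are simplifying assumptions.

        import math
        from collections import Counter

        def pmi_scores(sentences):
            # Pointwise mutual information for every adjacent word pair in a corpus.
            unigrams, bigrams, n = Counter(), Counter(), 0
            for words in sentences:
                unigrams.update(words)
                bigrams.update(zip(words, words[1:]))
                n += len(words)
            return {
                (w1, w2): math.log((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
                for (w1, w2), c in bigrams.items()
            }

        def predict_boundaries(words, pmi, threshold=0.0):
            # Positions i where a boundary is predicted between words[i] and words[i+1]:
            # low association suggests the two words belong to different base NPs.
            return [i for i in range(len(words) - 1)
                    if pmi.get((words[i], words[i + 1]), float("-inf")) < threshold]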
  • Review
    FAN Daoerji, BAI Fengshan, WU Huijuan
    2010, 24(6): 120-125.
    Microsoft's operating system has fully supported traditional Mongolian input, editing, and typesetting since Windows Vista. Building on the Microsoft Mongolian input method, this paper proposes a new Mongolian input algorithm based on the unique characteristics of the language. The algorithm supports automatic deformation calculation, automatic association input, automatic learning, and resource sharing. The paper presents a theory of automatic deformation and a detailed algorithm for its computation. It also discusses the details of Mongolian dictionary data storage, and describes the automatic learning algorithms and the solution for resource sharing.
    Key words: Mongolian input method; Unicode; automatic deformation; Uniscribe
  • Review
    BAI Fengshan, FAN Daoerji, JIN Yuxin, WU Wei, ZHANG Lihong
    2010, 24(6): 125-129.
    Mongolian is the language generally used by the Mongolian people in China. Most popular word processing tools do not support Mongolian because of its distinct writing style and variant glyph shapes. Linux with QTE has become a popular choice in the field of embedded system products and applications. This paper presents a Unicode-based algorithm for displaying Mongolian dots and transforming variant shapes under QTE, and defines the QTE modules that support Mongolian. The method provides a solution for processing Mongolian in embedded systems based on Linux plus QTE.
    Key words: QTE; Linux; Mongolian; Unicode