2010 Volume 24 Issue 1 Published: 15 February 2010
  

  • Review
    LI Shoushan, HUANG Chu-Ren
    2010, 24(1): 3-8.
    This paper focuses on the word boundary decision (WBD) approach to Chinese word segmentation. This new approach classifies the boundary between two adjacent characters as either a word boundary or not. Compared to state-of-the-art methods based on character tagging, this approach is easier to implement and faster to execute, while achieving competitive performance. In particular, a robust online learning module can be added to adapt a WBD system to new data quickly, enabling a reliable online Chinese segmentation system without domain or training data constraints.
    Key words: computer application; Chinese information processing; Chinese word segmentation; WBD approach; online learning
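
A minimal sketch of the boundary-classification idea described above, in Python: every gap between two adjacent characters gets a feature vector and a binary label. The feature template and the `is_boundary` classifier are illustrative assumptions, not the authors' exact design.

```python
def gap_features(sent, i):
    """Features for the gap between sent[i-1] and sent[i]."""
    return {
        "l1": sent[i - 1],            # character left of the gap
        "r1": sent[i],                # character right of the gap
        "l2": sent[max(0, i - 2):i],  # left window (up to 2 chars)
        "r2": sent[i:i + 2],          # right window (up to 2 chars)
    }

def segment(sent, is_boundary):
    """Cut `sent` at every gap the (hypothetical) classifier labels
    as a word boundary."""
    words, start = [], 0
    for i in range(1, len(sent)):
        if is_boundary(gap_features(sent, i)):
            words.append(sent[start:i])
            start = i
    words.append(sent[start:])
    return words
```

Because each gap is classified independently, online learning reduces to updating the binary classifier on newly labeled gaps, which is what makes quick domain adaptation plausible.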
  • Review
    LI Yuelun, CHANG Baobao
    2010, 24(1): 8-15.
    Chinese word segmentation is a crucial step in Chinese natural language processing (NLP). In previous research, the Maximum Entropy model and the Conditional Random Field (CRF) model have been widely used for Chinese word segmentation. In this paper, we apply the M3N (Max-Margin Markov Networks) model, a structured model introduced by B. Taskar, to Chinese word segmentation. Experiments on standard training and testing corpora show that M3N is a very useful Chinese word segmentation method, reaching a fairly high precision of 95%.
    Key words: computer application; Chinese information processing; maximum margin Markov networks (M3N); Chinese word segmentation (CWS); machine learning
  • Review
    HE Saike1, WANG Xiaojie2, DONG Yuan1,3, ZHANG Taozheng2, BAI Xue2
    2010, 24(1): 15-20.
    This paper proposes a method combining supervised learning with an unsupervised method for Chinese word segmentation (CWS), which incorporates Accessor Variety (AV) into Conditional Random Fields (CRFs). To address a flaw of Accessor Variety when training data are limited, normalization is introduced to alleviate the fluctuation of AV values in the unsupervised segmentation phase. Experiments on the Bakeoff-4 data indicate that normalized Accessor Variety is effective for both closed and open tracks.
    Key words: computer application; Chinese information processing; unsupervised segmentation; CRFs; normalized accessor variety
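
For reference, the Accessor Variety of a string is the smaller of the number of distinct characters that may precede it and that may follow it in the corpus. A minimal sketch of computing raw AV values follows; the normalization the authors apply is not reproduced, so the log-bucketing comment is only one plausible choice.

```python
from collections import defaultdict

def accessor_variety(sentences, max_len=4):
    """AV(s) = min(#distinct left accessors, #distinct right accessors)
    over all substrings s of up to max_len characters."""
    left, right = defaultdict(set), defaultdict(set)
    for sent in sentences:
        s = "^" + sent + "$"  # sentence boundaries count as accessors
        for i in range(1, len(s) - 1):
            for j in range(i + 1, min(i + 1 + max_len, len(s))):
                left[s[i:j]].add(s[i - 1])
                right[s[i:j]].add(s[j])
    return {sub: min(len(left[sub]), len(right[sub])) for sub in left}

# One plausible way to feed AV into a CRF: bucket(log(1 + av)) as a
# discrete feature of the surrounding substring; the paper's exact
# normalization may differ.
```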
  • Review
    XING Fukun1,2, SONG Rou1, LUO Zhiyong1
    2010, 24(1): 20-25.
    A statistical language model named the Symbol-and-Statistics Decoding (SSD) language model is presented in this article. The 2-gram SSD model is applied to the Chinese POS tagging task with quite good results: the precision reaches 97.08% in the closed test and 95.67% in the open test, both significantly higher than the HMM's 95.56% and 94.70%, respectively. Although the performance of the SSD model is not as good as that of conditional models such as the Maximum Entropy model and the CRF model, its training time is much shorter, which makes the SSD model more applicable to certain tasks in natural language processing.
    Key words: computer application; Chinese information processing; SSD model; HMM; POS tagging
  • Review
    WANG Bukang, WANG Hongling, YUAN Xiaohong, ZHOU Guodong
    2010, 24(1): 25-30.
    Dependency representations are simpler and more intuitive than constituent representations for Chinese parsing. This paper implements Chinese dependency-based semantic role labeling (SRL) using methods similar to those in English SRL. In the system, an effective pruning algorithm and useful features are adopted for Chinese dependency trees, and semantic role identification and classification are accomplished by a maximum entropy classifier. Two corpora are adopted to test the system: one is converted from a constituent-based corpus (CTB 5.0), and the other is the Chinese dataset provided by the CoNLL 2009 shared task. On the two datasets, the system achieves 84.30% and 81.68% labeled F1, respectively, for gold predicates, and 81.02% and 81.33% for automatic predicates.
    Key words: computer application; Chinese information processing; semantic role labeling; dependency relations; maximum entropy classifier
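
The abstract does not spell out the pruning algorithm; a common heuristic in dependency-based SRL (in the spirit of Xue and Palmer's constituent pruning, and only an assumption here) keeps as argument candidates the dependents of the predicate and of each of its ancestors:

```python
def prune_candidates(heads, predicate):
    """Argument candidates for `predicate`: its dependents plus the
    dependents of every ancestor up to the root. heads[i] is the head
    index of token i, with -1 for the root."""
    children = {}
    for tok, head in enumerate(heads):
        children.setdefault(head, []).append(tok)
    candidates, node = [], predicate
    while node != -1:
        candidates.extend(c for c in children.get(node, []) if c != predicate)
        node = heads[node]
    return candidates
```

Each surviving candidate would then be scored by the maximum entropy classifier for role identification and classification.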
  • Review
    LI Shuanghong, LI Ru, ZHONG Lijun, GUO Weiyu
    2010, 24(1): 30-37.
    Extracting the frame kernel dependency graph from a sentence is an effective way to understand its semantic information. Since only the frame dependency graph can be extracted from a sentence based on the automatic annotation of CFN, it is necessary to extract the semantic core words of each frame element in order to establish the frame kernel dependency graph. This paper proposes a method to identify and extract the core words of frame elements via multi-word chunks. On the basis of comparative analysis, we propose a strategy for integrating multi-word chunks and frame elements, together with rules for extracting the core words of frame elements from the multi-word chunk labeling. Experimental results on 6 771 frame elements show an average precision of 95.58% and an average coverage of 82.91%.
    Key words: computer application; Chinese information processing; frame element; semantic core words; multi-word chunk
  • Review
    LI Zhenghua, CHE Wanxiang, LIU Ting
    2010, 24(1): 37-42.
    We propose a high-order parsing model that uses all grandchild nodes to compose high-order features, constrains the search space by a beam-search strategy, and finds an approximately optimal dependency tree. In addition, we explore rich dependency label features and allow multiple relations for one arc during decoding. In the CoNLL 2009 international evaluation task on multilingual syntactic and semantic dependency parsing, this method ranked first in the joint task and third in the syntactic parsing task.
    Key words: computer application; Chinese information processing; beam search; high-order model; dependency parsing
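
Beam search over partial analyses is what keeps high-order features tractable here; a generic decoding skeleton follows, where `expand` and `n_steps` are hypothetical stand-ins for the parser's transition function and number of decoding steps.

```python
import heapq

def beam_search(init_state, expand, n_steps, beam_size=8):
    """Keep the beam_size highest-scoring partial analyses at each step.
    expand(state) yields (next_state, score_delta) pairs; grandchild
    (high-order) and label features would be scored inside expand."""
    beam = [(0.0, init_state)]
    for _ in range(n_steps):
        beam = heapq.nlargest(
            beam_size,
            ((score + delta, nxt)
             for score, state in beam
             for nxt, delta in expand(state)),
            key=lambda item: item[0],
        )
    return max(beam, key=lambda item: item[0])[1]
```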
  • Review
    KANG Shiyong, XU Xiaoxing
    2010, 24(1): 42-48.
    Based on the “corpus of syntax and semantics information in Modern Chinese”, this paper constructs a hierarchical sentence system for modern Chinese by integrating the sentence pattern, sentence model, and sentence stem, which were originally somewhat independent of each other. Defining [P], [SP], [SPO], and [PO] as the basic sentence patterns, the paper investigates, via decomposition analysis, the leading role of the high-frequency sentence models corresponding to the basic sentence patterns in generating complex sentence models. In addition, the paper examines the related correspondences of the complement, the adverbial, and the appositive, respectively. By exploring the combinations and mappings between simple and complex sentence patterns, as well as between simple and complex sentence models, this paper discloses a new breakthrough point for studying the correspondence between sentence pattern and sentence model.
    Key words: computer application; Chinese information processing; corpus; sentence pattern; sentence model; sentence stem; sentence system
  • Review
    CHEN Yirong, LU Qin, LI Wenjie, CUI Gaoying
    2010, 24(1): 48-54.
    A core ontology models fundamental domain knowledge and bridges the gap between an upper ontology and a domain ontology. Since the upper ontology is domain independent, many errors are introduced when mapping core terms to upper ontology concepts in automatic Chinese core ontology construction. This paper proposes an extraction method that uses terms sharing the same suffix to find hypernyms: a hypernym is the term that is most frequently shared by other terms and closest in meaning to them. These hypernyms are then used to improve the mapping of the terms to the correct concepts. Experiments show that a significant improvement in accuracy is achieved for core ontology construction.
    Key words: computer application; Chinese information processing; ontology construction; core ontology; upper ontology; domain ontology; hypernymy
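
The suffix intuition: Chinese domain terms in one class often share a head-final suffix (many disease names end in 病, for instance), and a suffix that is itself a term and heads a large group is a plausible hypernym for the group. The selection criteria in this sketch are simplified assumptions, not the paper's exact procedure.

```python
from collections import Counter

def suffix_hypernym_candidates(terms, min_len=1, max_len=3, min_group=2):
    """Count how many terms share each proper suffix; a suffix that is
    itself a term and is shared by at least min_group terms is proposed
    as a hypernym candidate for its group."""
    term_set = set(terms)
    shared = Counter()
    for t in term_set:
        for k in range(min_len, min(max_len, len(t) - 1) + 1):
            shared[t[-k:]] += 1
    return {suffix: n for suffix, n in shared.items()
            if suffix in term_set and n >= min_group}
```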
  • Review
    KANG Wei1,2, SUI Zhifang1,2
    2010, 24(1): 54-60.
    In this paper, we propose a weakly supervised method for extracting ontology concept instances and attributes from the Web. We automatically acquire co-occurrence patterns of concept instances and attributes from the Web and evaluate these patterns based on the assumption that concept instances are relevant to their attributes. We then extract candidate concept instances and attributes. The paper proposes two ways to evaluate the accuracy of the candidates: the first measure is based on the correlation between concept instances and attributes, and the second on the distributional similarity of context patterns between the candidate instances (or attributes) and the seed instances (or attributes). Experiments in the disease domain show that the precision of the top 500 and top 1 000 results reaches 94% and 93%, respectively.
    Key words: computer application; Chinese information processing; Web; domain concept instance extraction; attribute extraction; weakly supervised; contextual pattern
  • Review
    DONG Qiang, HAO Changling, DONG Zhendong
    2010, 24(1): 60-65.
    The paper introduces a HowNet-based disambiguator named VXY. The disambiguator effectively tackles the ambiguity in syntactic structures such as “削(V)苹果(X)的皮(Y)”, which appear with high frequency in Chinese. The ambiguity of this kind lies in which word, X or Y, is governed by V in the structure. The HowNet-based disambiguator VXY is not merely a demonstration of a stereotypical methodology or algorithm, but a practical tool for any structure composed of any of the 98 000 unique entries in the HowNet Chinese vocabulary. Hence, the paper presents a paradigm completely different from state-of-the-art human language technology.
    Key words: computer application; Chinese information processing; semantics; disambiguator; strong government; Chinese syntactic structure; HowNet
  • Review
    WANG Yongxin, CAI Lianhong
    2010, 24(1): 65-71.
    Automatic prosodic structure prediction is a very important component of high-quality text-to-speech systems, as it directly affects the naturalness and expressivity of synthesized speech. A text corpus annotated with both syntactic and prosodic structures is constructed. Based on the corpus, the composition of prosodic structure and the relationship between syntactic and prosodic structures are analyzed, and a prediction experiment is carried out. The results show that, although the prosodic and syntactic structures of Chinese differ, they are closely related: the prosodic structure can be predicted from the syntactic structure. The prosodic structure is also affected by the semantic information of the sentence.
    Key words: computer application; Chinese information processing; TTS; prosodic structure; syntactic structure; semantic information
  • Review
    ZHAI Haijun1, GUO Jiafeng2, WANG Xiaolei2, XU Hongbo2
    2010, 24(1): 71-77.
    Mining named entities from query logs is an important research field in data mining. Previous work proposed a seed-based framework to mine named entities from query logs by leveraging distributional similarity, which works well only when each named entity belongs to a single semantic class. In fact, named entities often belong to multiple classes. In this paper, we introduce a weakly supervised topic model to resolve the class ambiguity of named entities by leveraging weak supervision from humans. The experimental results show that our approach significantly outperforms the previous method.
    Key words: computer application; Chinese information processing; named entity; query log; topic model
  • Review
    WU Qiong1,2, TAN Songbo1, ZHANG Gang1, DUAN Miyi1, CHENG Xueqi1
    2010, 24(1): 77-84.
    This paper focuses on document-level opinion analysis, i.e. determining the overall opinion (e.g., negative or positive) of a given document. Existing studies have shown that supervised classification approaches usually perform well in this task. However, in most cases the performance decreases sharply when the model is transferred from the labeled data domain to a different target domain without labeled data, which raises the issue of cross-domain opinion analysis. In this paper, we propose an iterative algorithm that integrates the opinion orientations of documents into a graph-ranking algorithm for cross-domain opinion analysis. We apply the graph-ranking algorithm using the accurate labels of old-domain documents as well as the “pseudo” labels of new-domain documents. On top of the results of the iterative algorithm, we further improve the performance by choosing the test documents whose opinions have been determined most accurately as “seeds” and applying the EM algorithm for cross-domain opinion analysis. The experimental results indicate that the proposed algorithm improves the performance of cross-domain opinion analysis dramatically.
    Key words: computer application; Chinese information processing; cross-domain; opinion analysis; graph ranking; EM algorithm
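
The graph-ranking step can be read as score propagation over a document similarity graph, anchored to the initial labels (old-domain gold labels plus new-domain pseudo labels). A minimal sketch, with the damping factor and iteration count as illustrative assumptions:

```python
import numpy as np

def graph_rank(W, init_scores, alpha=0.85, iterations=50):
    """Propagate opinion scores over a row-normalized similarity matrix W,
    pulling each document toward its neighbours while staying anchored to
    its initial (gold or pseudo) label in [-1, 1]."""
    scores = init_scores.copy()
    for _ in range(iterations):
        scores = alpha * W.dot(scores) + (1 - alpha) * init_scores
    return scores

# e.g. scores = graph_rank(np.asarray(sim_matrix), np.asarray(labels))
```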
  • Review
    LIU Hongyu, ZHAO Yanyan, QIN Bing, LIU Ting
    2010, 24(1): 84-89.
    Sentiment analysis is a hot issue in natural language processing. This paper makes an intensive study of two stages of sentiment analysis: comment target extraction and the corresponding sentiment classification. For the first task, we use syntactic analysis to obtain the candidates, then combine web-mining-based PMI with an NN-filtering algorithm to decide the targets. For the second task, we design heuristic rules by analyzing subjective sentences, then apply these rules to predict the opinion orientation of the sentences. The method performs well in Task Three of COAE2008.
    Key words: computer application; Chinese information processing; sentiment classification; target; orientation judgment; syntactic analysis
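
The PMI filter is conventionally computed from co-occurrence counts (here, hypothetically, web hit counts), keeping a candidate target only when it is strongly associated with the reviewed object:

```python
import math

def pmi(hits_both, hits_a, hits_b, total):
    """PMI(a, b) = log [ P(a, b) / (P(a) P(b)) ], estimated from counts."""
    return math.log((hits_both * total) / (hits_a * hits_b))

# Illustrative use: keep candidate t when pmi(h_t_prod, h_t, h_prod, N)
# exceeds a tuned threshold; the paper's thresholding may differ.
```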
  • Review
    SONG Xiaolei1, WANG Suge1,2, LI Hongxia1
    2010, 24(1): 89-94.
    Comment target recognition for products is one of the important topics in text opinion extraction and sentiment analysis. For car product reviews, this paper proposes an unsupervised method to recognize comment targets without relying on additional resources. In this method, we employ fuzzy matching over word templates and part-of-speech templates, together with a pruning technique, to extract candidate evaluation objects. A bidirectional Bootstrapping approach is then used to recognize the comment targets from the candidate set. Finally, the comment targets are clustered by the K-means method to recognize the product names and the product attributes. The experimental results indicate that the F-values for comment target recognition and product name recognition reach 58.5% and 69.48%, respectively.
    Key words: computer application; Chinese information processing; comment target of product; product name; product attribute; template; K-means clustering; bidirectional Bootstrapping
  • Review
    LI Chao, WANG Huizhen, ZHU Muhua, ZHANG Li, ZHU Jingbo
    2010, 24(1): 94-99.
    Automatic multi-word term extraction has attracted more and more attention in natural language processing research. This paper proposes a Multi-Class C-value method, which uses the distribution of multi-word terms across different domains to improve the performance of multi-word term extraction. In experiments on data from the automobile, technology, and travel domains, the precisions of the top 100 multi-word terms are 12%, 12%, and 13% higher, respectively, than those of the classical C-value method.
    Key words: computer application; Chinese information processing; multi-word term extraction; Multi-Class C-value; domain information
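
For reference, the classical C-value that the proposed method extends weights a candidate's frequency by its length in words and discounts occurrences nested inside longer candidates; the multi-class extension additionally exploits the term's distribution over domains, which is not reproduced in this sketch.

```python
import math

def c_value(num_words, freq, freqs_of_longer_candidates):
    """Classical C-value of a multi-word candidate: log2(length) times its
    frequency, discounted by the average frequency of the longer candidate
    terms that contain it."""
    weight = math.log2(num_words)
    if not freqs_of_longer_candidates:
        return weight * freq
    nested = sum(freqs_of_longer_candidates) / len(freqs_of_longer_candidates)
    return weight * (freq - nested)
```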
  • Review
    XIA Yunqing1, YANG Ying2, ZHANG Pengzhou2, LIU Yufei3
    2010, 24(1): 99-104.
    Song sentiment analysis has not been satisfactorily addressed in the audio signal processing community. In this paper, the lyric is used as evidence for song sentiment analysis, and the sentiment vector space model (s-VSM) is proposed to represent a given lyric. Compared to the word-based vector space model (w-VSM), the s-VSM model successfully addresses the critical issues of text representation efficiency, ambiguity, functionality, and data sparseness. Furthermore, the two-dimension Thayer sentiment stress model, i.e. light-hearted and heavy-hearted, is extended to a four-dimension model by incorporating two extra sentiment stress levels: complicated and implied. Experiments show that 1) the s-VSM model outperforms the traditional methods, and 2) the four-dimension sentiment stress model helps to further improve the performance of song sentiment analysis.
    Key words: computer application; Chinese information processing; sentiment analysis; sentiment vector space model; sentiment stress
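
The w-VSM versus s-VSM contrast can be made concrete: instead of one dimension per surface word, a lyric is projected onto the classes of a sentiment lexicon, shrinking the space and easing sparseness. The lexicon format below is an assumption for illustration.

```python
def sentiment_vector(lyric_words, lexicon):
    """Build an s-VSM style vector whose dimensions are sentiment classes
    rather than surface words; `lexicon` maps word -> sentiment class."""
    dims = sorted(set(lexicon.values()))
    counts = dict.fromkeys(dims, 0)
    for word in lyric_words:
        if word in lexicon:
            counts[lexicon[word]] += 1
    return [counts[d] for d in dims]
```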
  • Review
    MA Yongliang, ZHAO Tiejun
    2010, 24(1): 104-110.
    In Chinese-English statistical machine translation (SMT), Chinese text usually demands Chinese word segmentation (CWS) to identify the words in a sentence. However, CWS was not developed for SMT, and its results are not necessarily optimal for SMT. In recent years, many investigations have tried to make CWS more suitable for SMT, but we explore the problem from another direction: our basic idea is to use multiple CWS results as an additional language knowledge source, and we present a simple and effective approach to using multiple CWS results for SMT. We also give experimental results over a series of combination strategies, with the best result showing a gain of 1.89 BLEU points over a state-of-the-art SMT system.
    Key words: artificial intelligence; machine translation; statistical machine translation; Chinese word segmentation; feature interpolation of translation model; multi-strategy feature blending of translation model
  • Review
    XIAO Tong, LI Tianning, CHEN Rushan, ZHU Jingbo, WANG Huizhen
    2010, 24(1): 110-117.
    Word alignment is one of the key techniques in statistical machine translation (SMT). In this paper, we propose a word realignment method, which first recognizes the inconsistent parts between the bidirectional alignments generated by the IBM models, and then refines the word alignment by realigning the inconsistent parts. To reinforce the method, a monolingual feature is used to benefit from a large-scale monolingual corpus. The effectiveness of the method is demonstrated on a state-of-the-art phrase-based SMT system. The experimental results show that our method achieves higher translation accuracy than the widely adopted heuristics-based method.
    Key words: artificial intelligence; machine translation; statistical machine translation; word alignment; word realignment; IBM models
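
Recognizing the inconsistent parts of the bidirectional alignments amounts to set operations over the two link sets; a minimal sketch (the realignment model itself is not reproduced):

```python
def split_alignments(f2e_links, e2f_links):
    """Links both IBM-model directions agree on form the consistent part;
    the symmetric difference is the inconsistent part to be realigned.
    Each alignment is a set of (source_index, target_index) pairs."""
    consistent = f2e_links & e2f_links
    inconsistent = (f2e_links | e2f_links) - consistent
    return consistent, inconsistent
```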
  • Review
    JIANG Shangpu1,2, CHEN Qunxiu1,2
    2010, 24(1): 117-123.
    Word segmentation and part-of-speech tagging constitute the first step of Japanese natural language processing tasks, such as machine translation with Japanese as the source language. In this paper, a Japanese word segmentation and POS tagging approach based on rules and statistics is proposed. Adopting a perceptron-based joint word segmentation and POS tagging algorithm as the basic framework, the method is combined with adjacency attribute features derived from heuristic rules. An experiment on a small test dataset shows that the new approach achieves an F-score of 98.2% on word segmentation and 94.8% on joint word segmentation and POS tagging. The work has already been applied successfully in a Japanese-Chinese machine translation system.
    Key words: artificial intelligence; machine translation; Japanese-Chinese machine translation system; Japanese word segmentation; Japanese POS tagging; joint word segmentation
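
The perceptron-based joint framework updates a shared weight vector whenever the decoded segmentation-plus-tagging sequence differs from the gold one. A minimal sketch of the update step, with sparse dict features as an assumption:

```python
def perceptron_update(weights, gold_features, predicted_features, lr=1.0):
    """One structured-perceptron update for joint segmentation and POS
    tagging: promote gold-analysis features, demote those of the wrongly
    predicted analysis. Feature vectors are sparse {name: count} dicts."""
    for feat, value in gold_features.items():
        weights[feat] = weights.get(feat, 0.0) + lr * value
    for feat, value in predicted_features.items():
        weights[feat] = weights.get(feat, 0.0) - lr * value
```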
  • Review
    ZHOU Qiang, LI Yumei
    2010, 24(1): 123-129.
    The paper introduces the three chunk parsing tasks of the current CIPS parsing evaluation workshop (CIPS-ParsEval-2009), organized by Tsinghua University and Northeastern University: base chunk parsing, functional chunk parsing, and event description clause recognition. The design motivation and the classification standards of the three chunk types are discussed. Based on the detailed syntactic annotations in the Tsinghua Chinese Treebank (TCT), three benchmark chunk banks automatically extracted from TCT are built. The evaluation results of the top five participating systems are also given. The analysis of their statistics and the comparison with current chunking schemes reveal some characteristics of these three chunk parsing tasks.
    Key words: computer application; Chinese information processing; base chunk; functional chunk; event description clause; chunk banks