2010 Volume 24 Issue 1 Published: 15 February 2010
  

  • Review
    LI Shoushan, HUANG Chu-Ren
    2010, 24(1): 3-8.
    This paper focuses on the word boundary decision (WBD) approach to Chinese word segmentation. This new approach classifies the boundary between two adjacent characters as either a word boundary or not. Compared to state-of-the-art methods based on character tagging, this approach is easier to implement and faster to execute, while achieving competitive performance. In particular, a robust online learning module can be added to adapt a WBD system to new data quickly, enabling a reliable online Chinese segmentation system without domain or training data constraints.
    Key words: computer application; Chinese information processing; Chinese word segmentation; WBD approach; online learning
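
A minimal sketch of the boundary-classification idea described above, in Python: every gap between two adjacent characters gets a feature vector and a binary label. The feature template and the `is_boundary` classifier are illustrative assumptions, not the authors' exact design.

```python
def gap_features(sent, i):
    """Features for the gap between sent[i-1] and sent[i]."""
    return {
        "l1": sent[i - 1],            # character left of the gap
        "r1": sent[i],                # character right of the gap
        "l2": sent[max(0, i - 2):i],  # left window (up to 2 chars)
        "r2": sent[i:i + 2],          # right window (up to 2 chars)
    }

def segment(sent, is_boundary):
    """Cut `sent` at every gap the (hypothetical) classifier labels
    as a word boundary."""
    words, start = [], 0
    for i in range(1, len(sent)):
        if is_boundary(gap_features(sent, i)):
            words.append(sent[start:i])
            start = i
    words.append(sent[start:])
    return words
```

Because each gap is classified independently, online learning reduces to updating the binary classifier on newly labeled gaps, which is what makes quick domain adaptation plausible.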
  • Review
    LI Yuelun, CHANG Baobao
    2010, 24(1): 8-15.
    Chinese word segmentation is a crucial step in Chinese natural language processing (NLP). In previous research, the Maximum Entropy model and the Conditional Random Field (CRF) model have been widely used for Chinese word segmentation. In this paper, we apply the M3N (Max-Margin Markov Networks) model, a structured model introduced by B. Taskar, to Chinese word segmentation. Experiments on standard training and testing corpora show that M3N is a very useful Chinese word segmentation method, reaching a fairly high precision of 95%.
    Key words: computer application; Chinese information processing; maximum margin Markov networks (M3N); Chinese word segmentation (CWS); machine learning
  • Review
    HE Saike1, WANG Xiaojie2, DONG Yuan1,3, ZHANG Taozheng2, BAI Xue2
    2010, 24(1): 15-20.
    This paper proposes a method combining supervised learning with an unsupervised method for Chinese word segmentation (CWS), which incorporates Accessor Variety (AV) into Conditional Random Fields (CRFs). To address a flaw of Accessor Variety when training data are limited, normalization is introduced to alleviate the fluctuation of AV values in the unsupervised segmentation phase. Experiments on the Bakeoff-4 data indicate that normalized Accessor Variety is effective for both closed and open tracks.
    Key words: computer application; Chinese information processing; unsupervised segmentation; CRFs; normalized accessor variety
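
For reference, the Accessor Variety of a string is the smaller of the number of distinct characters that may precede it and that may follow it in the corpus. A minimal sketch of computing raw AV values follows; the normalization the authors apply is not reproduced, so the log-bucketing comment is only one plausible choice.

```python
from collections import defaultdict

def accessor_variety(sentences, max_len=4):
    """AV(s) = min(#distinct left accessors, #distinct right accessors)
    over all substrings s of up to max_len characters."""
    left, right = defaultdict(set), defaultdict(set)
    for sent in sentences:
        s = "^" + sent + "$"  # sentence boundaries count as accessors
        for i in range(1, len(s) - 1):
            for j in range(i + 1, min(i + 1 + max_len, len(s))):
                left[s[i:j]].add(s[i - 1])
                right[s[i:j]].add(s[j])
    return {sub: min(len(left[sub]), len(right[sub])) for sub in left}

# One plausible way to feed AV into a CRF: bucket(log(1 + av)) as a
# discrete feature of the surrounding substring; the paper's exact
# normalization may differ.
```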
  • Review
    XING Fukun1,2, SONG Rou1, LUO Zhiyong1
    2010, 24(1): 20-25.
    A statistical language model named the Symbol-and-Statistics Decoding (SSD) language model is presented in this article. The 2-gram SSD model is applied to the Chinese POS tagging task with quite good results: the precision reaches 97.08% in the closed test and 95.67% in the open test, both significantly higher than the HMM's 95.56% and 94.70%, respectively. Although the performance of the SSD model is not as good as that of conditional models such as the Maximum Entropy model and the CRF model, its training time is much shorter, which makes the SSD model more applicable to certain tasks in natural language processing.
    Key words: computer application; Chinese information processing; SSD model; HMM; POS tagging
  • Review
    WANG Bukang, WANG Hongling, YUAN Xiaohong, ZHOU Guodong
    2010, 24(1): 25-30.
    Dependency representations are simpler and more intuitive than constituent representations for Chinese parsing. This paper implements Chinese dependency-based semantic role labeling (SRL) using methods similar to those in English SRL. In the system, an effective pruning algorithm and useful features are adopted for Chinese dependency trees, and semantic role identification and classification are accomplished by a maximum entropy classifier. Two corpora are adopted to test the system: one is converted from a constituent-based corpus (CTB 5.0), and the other is the Chinese dataset provided by the CoNLL 2009 shared task. On the two datasets, the system achieves 84.30% and 81.68% labeled F1, respectively, for gold predicates, and 81.02% and 81.33% for automatic predicates.
    Key words: computer application; Chinese information processing; semantic role labeling; dependency relations; maximum entropy classifier
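
The abstract does not spell out the pruning algorithm; a common heuristic in dependency-based SRL (in the spirit of Xue and Palmer's constituent pruning, and only an assumption here) keeps as argument candidates the dependents of the predicate and of each of its ancestors:

```python
def prune_candidates(heads, predicate):
    """Argument candidates for `predicate`: its dependents plus the
    dependents of every ancestor up to the root. heads[i] is the head
    index of token i, with -1 for the root."""
    children = {}
    for tok, head in enumerate(heads):
        children.setdefault(head, []).append(tok)
    candidates, node = [], predicate
    while node != -1:
        candidates.extend(c for c in children.get(node, []) if c != predicate)
        node = heads[node]
    return candidates
```

Each surviving candidate would then be scored by the maximum entropy classifier for role identification and classification.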
  • Review
    LI Shuanghong, LI Ru, ZHONG Lijun, GUO Weiyu
    2010, 24(1): 30-37.
    Extracting the frame kernel dependency graph from a sentence is an effective way to understand its semantic information. Since only the frame dependency graph can be extracted from a sentence based on the automatic annotation of CFN, it is necessary to extract the semantic core words of each frame element in order to establish the frame kernel dependency graph. This paper proposes a method to identify and extract the core words of frame elements via multi-word chunks. On the basis of comparative analysis, we propose a strategy for integrating multi-word chunks and frame elements, together with rules for extracting the core words of frame elements from the multi-word chunk labeling. Experimental results on 6 771 frame elements show an average precision of 95.58% and an average coverage of 82.91%.
    Key words: computer application; Chinese information processing; frame element; semantic core words; multi-word chunk
  • Review
    LI Zhenghua, CHE Wanxiang, LIU Ting
    2010, 24(1): 37-42.
    We propose a high-order parsing model that uses all grandchild nodes to compose high-order features, constrains the search space by a beam-search strategy, and finds an approximately optimal dependency tree. In addition, we explore rich dependency label features and allow multiple relations for one arc during decoding. In the CoNLL 2009 international evaluation task on multilingual syntactic and semantic dependency parsing, this method ranked first in the joint task and third in the syntactic parsing task.
    Key words: computer application; Chinese information processing; beam search; high-order model; dependency parsing
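
Beam search over partial analyses is what keeps high-order features tractable here; a generic decoding skeleton follows, where `expand` and `n_steps` are hypothetical stand-ins for the parser's transition function and number of decoding steps.

```python
import heapq

def beam_search(init_state, expand, n_steps, beam_size=8):
    """Keep the beam_size highest-scoring partial analyses at each step.
    expand(state) yields (next_state, score_delta) pairs; grandchild
    (high-order) and label features would be scored inside expand."""
    beam = [(0.0, init_state)]
    for _ in range(n_steps):
        beam = heapq.nlargest(
            beam_size,
            ((score + delta, nxt)
             for score, state in beam
             for nxt, delta in expand(state)),
            key=lambda item: item[0],
        )
    return max(beam, key=lambda item: item[0])[1]
```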
  • Review
    KANG Shiyong, XU Xiaoxing
    2010, 24(1): 42-48.
    Based on the “corpus of syntax and semantics information in Modern Chinese”, this paper constructs a hierarchical sentence system for modern Chinese by integrating the sentence pattern, sentence model, and sentence stem, which were originally somewhat independent of each other. Defining [P], [SP], [SPO], and [PO] as the basic sentence patterns, the paper investigates, via decomposition analysis, the leading role of the high-frequency sentence models corresponding to the basic sentence patterns in generating complex sentence models. In addition, the paper examines the related correspondences of the complement, the adverbial, and the appositive, respectively. By exploring the combinations and mappings between simple and complex sentence patterns, as well as between simple and complex sentence models, this paper discloses a new breakthrough point for studying the correspondence between sentence pattern and sentence model.
    Key words: computer application; Chinese information processing; corpus; sentence pattern; sentence model; sentence stem; sentence system
  • Review
    CHEN Yirong, LU Qin, LI Wenjie, CUI Gaoying
    2010, 24(1): 48-54.
    A core ontology models fundamental domain knowledge and bridges the gap between an upper ontology and a domain ontology. Since the upper ontology is domain independent, many errors are introduced when mapping core terms to upper ontology concepts in automatic Chinese core ontology construction. This paper proposes an extraction method that uses terms sharing the same suffix to find hypernyms: a hypernym is the term that is most frequently shared by other terms and closest in meaning to them. These hypernyms are then used to improve the mapping of the terms to the correct concepts. Experiments show that a significant improvement in accuracy is achieved for core ontology construction.
    Key words: computer application; Chinese information processing; ontology construction; core ontology; upper ontology; domain ontology; hypernymy
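
The suffix intuition: Chinese domain terms in one class often share a head-final suffix (many disease names end in 病, for instance), and a suffix that is itself a term and heads a large group is a plausible hypernym for the group. The selection criteria in this sketch are simplified assumptions, not the paper's exact procedure.

```python
from collections import Counter

def suffix_hypernym_candidates(terms, min_len=1, max_len=3, min_group=2):
    """Count how many terms share each proper suffix; a suffix that is
    itself a term and is shared by at least min_group terms is proposed
    as a hypernym candidate for its group."""
    term_set = set(terms)
    shared = Counter()
    for t in term_set:
        for k in range(min_len, min(max_len, len(t) - 1) + 1):
            shared[t[-k:]] += 1
    return {suffix: n for suffix, n in shared.items()
            if suffix in term_set and n >= min_group}
```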
  • Review
    KANG Wei1,2, SUI Zhifang1,2
    2010, 24(1): 54-60.
    In this paper, we propose a weakly supervised method for extracting ontology concept instances and attributes from the Web. We automatically acquire co-occurrence patterns of concept instances and attributes from the Web and evaluate these patterns based on the assumption that concept instances are relevant to their attributes. We then extract candidate concept instances and attributes. The paper proposes two ways to evaluate the accuracy of the candidates: the first measure is based on the correlation between concept instances and attributes, and the second on the distributional similarity of context patterns between the candidate instances (or attributes) and the seed instances (or attributes). Experiments in the disease domain show that the precision of the top 500 and top 1 000 results reaches 94% and 93%, respectively.
    Key words: computer application; Chinese information processing; Web; domain concept instance extraction; attribute extraction; weakly supervised; contextual pattern
  • Review
    DONG Qiang, HAO Changling, DONG Zhendong
    2010, 24(1): 60-65.
    The paper introduces a HowNet-based disambiguator named VXY. The disambiguator effectively tackles the ambiguity in syntactic structures such as “削(V)苹果(X)的皮(Y)”, which appear with high frequency in Chinese. The ambiguity of this kind lies in which word, X or Y, is governed by V in the structure. The HowNet-based disambiguator VXY is not merely a demonstration of a stereotypical methodology or algorithm, but a practical tool for any structure composed of any of the 98 000 unique entries in the HowNet Chinese vocabulary. Hence, the paper presents a paradigm completely different from state-of-the-art human language technology.
    Key words: computer application; Chinese information processing; semantics; disambiguator; strong government; Chinese syntactic structure; HowNet
  • Review
    WANG Yongxin, CAI Lianhong
    2010, 24(1): 65-71.
    Automatic prosodic structure prediction is a very important component of high-quality text-to-speech systems, as it directly affects the naturalness and expressivity of synthesized speech. A text corpus annotated with both syntactic and prosodic structures is constructed. Based on the corpus, the composition of prosodic structure and the relationship between syntactic and prosodic structures are analyzed, and a prediction experiment is carried out. The results show that, although the prosodic and syntactic structures of Chinese differ, they are closely related: the prosodic structure can be predicted from the syntactic structure. The prosodic structure is also affected by the semantic information of the sentence.
    Key words: computer application; Chinese information processing; TTS; prosodic structure; syntactic structure; semantic information
  • Review
    ZHAI Haijun1, GUO Jiafeng2, WANG Xiaolei2, XU Hongbo2
    2010, 24(1): 71-77.
    Mining named entities from query logs is an important research field in data mining. Previous work proposed a seed-based framework to mine named entities from query logs by leveraging distributional similarity, which works well only when each named entity belongs to a single semantic class. In fact, named entities often belong to multiple classes. In this paper, we introduce a weakly supervised topic model to resolve the class ambiguity of named entities by leveraging weak supervision from humans. The experimental results show that our approach significantly outperforms the previous method.
    Key words: computer application; Chinese information processing; named entity; query log; topic model
  • Review
    WU Qiong1,2, TAN Songbo1, ZHANG Gang1, DUAN Miyi1, CHENG Xueqi1
    2010, 24(1): 77-84.
    This paper focuses on document-level opinion analysis, i.e. determining the overall opinion (e.g., negative or positive) of a given document. Existing studies have shown that supervised classification approaches usually perform well in this task. However, in most cases the performance decreases sharply when the model is transferred from the labeled data domain to a different target domain without labeled data, which raises the issue of cross-domain opinion analysis. In this paper, we propose an iterative algorithm that integrates the opinion orientations of documents into a graph-ranking algorithm for cross-domain opinion analysis. We apply the graph-ranking algorithm using the accurate labels of old-domain documents as well as the “pseudo” labels of new-domain documents. On top of the results of the iterative algorithm, we further improve the performance by choosing the test documents whose opinions have been determined most accurately as “seeds” and applying the EM algorithm for cross-domain opinion analysis. The experimental results indicate that the proposed algorithm improves the performance of cross-domain opinion analysis dramatically.
    Key words: computer application; Chinese information processing; cross-domain; opinion analysis; graph ranking; EM algorithm
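
The graph-ranking step can be read as score propagation over a document similarity graph, anchored to the initial labels (old-domain gold labels plus new-domain pseudo labels). A minimal sketch, with the damping factor and iteration count as illustrative assumptions:

```python
import numpy as np

def graph_rank(W, init_scores, alpha=0.85, iterations=50):
    """Propagate opinion scores over a row-normalized similarity matrix W,
    pulling each document toward its neighbours while staying anchored to
    its initial (gold or pseudo) label in [-1, 1]."""
    scores = init_scores.copy()
    for _ in range(iterations):
        scores = alpha * W.dot(scores) + (1 - alpha) * init_scores
    return scores

# e.g. scores = graph_rank(np.asarray(sim_matrix), np.asarray(labels))
```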
  • Review
    LIU Hongyu, ZHAO Yanyan, QIN Bing, LIU Ting
    2010, 24(1): 84-89.
    Sentiment analysis is a hot issue in natural language processing. This paper makes an intensive study of two stages of sentiment analysis: comment target extraction and the corresponding sentiment classification. For the first task, we use syntactic analysis to obtain the candidates, then combine web-mining-based PMI with an NN-filtering algorithm to decide the targets. For the second task, we design heuristic rules by analyzing subjective sentences, then apply these rules to predict the opinion orientation of the sentences. The method performs well in Task Three of COAE2008.
    Key words: computer application; Chinese information processing; sentiment classification; target; orientation judgment; syntactic analysis
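
The PMI filter is conventionally computed from co-occurrence counts (here, hypothetically, web hit counts), keeping a candidate target only when it is strongly associated with the reviewed object:

```python
import math

def pmi(hits_both, hits_a, hits_b, total):
    """PMI(a, b) = log [ P(a, b) / (P(a) P(b)) ], estimated from counts."""
    return math.log((hits_both * total) / (hits_a * hits_b))

# Illustrative use: keep candidate t when pmi(h_t_prod, h_t, h_prod, N)
# exceeds a tuned threshold; the paper's thresholding may differ.
```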
  • Review
    SONG Xiaolei1, WANG Suge1,2, LI Hongxia1
    2010, 24(1): 89-94.
    Comment target recognition for products is one of the important topics in text opinion extraction and sentiment analysis. For car product reviews, this paper proposes an unsupervised method to recognize comment targets without relying on additional resources. In this method, we employ fuzzy matching over word templates and part-of-speech templates, together with a pruning technique, to extract candidate evaluation objects. A bidirectional Bootstrapping approach is then used to recognize the comment targets from the candidate set. Finally, the comment targets are clustered by the K-means method to recognize the product names and the product attributes. The experimental results indicate that the F-values for comment target recognition and product name recognition reach 58.5% and 69.48%, respectively.
    Key words: computer application; Chinese information processing; comment target of product; product name; product attribute; template; K-means clustering; bidirectional Bootstrapping
  • Review
    LI Chao, WANG Huizhen, ZHU Muhua, ZHANG Li, ZHU Jingbo
    2010, 24(1): 94-99.
    Automatic multi-word term extraction has attracted more and more attention in natural language processing research. This paper proposes a Multi-Class C-value method, which uses the distribution of multi-word terms across different domains to improve the performance of multi-word term extraction. In experiments on data from the automobile, technology, and travel domains, the precisions of the top 100 multi-word terms are 12%, 12%, and 13% higher, respectively, than those of the classical C-value method.
    Key words: computer application; Chinese information processing; multi-word term extraction; Multi-Class C-value; domain information
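
For reference, the classical C-value that the proposed method extends weights a candidate's frequency by its length in words and discounts occurrences nested inside longer candidates; the multi-class extension additionally exploits the term's distribution over domains, which is not reproduced in this sketch.

```python
import math

def c_value(num_words, freq, freqs_of_longer_candidates):
    """Classical C-value of a multi-word candidate: log2(length) times its
    frequency, discounted by the average frequency of the longer candidate
    terms that contain it."""
    weight = math.log2(num_words)
    if not freqs_of_longer_candidates:
        return weight * freq
    nested = sum(freqs_of_longer_candidates) / len(freqs_of_longer_candidates)
    return weight * (freq - nested)
```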
  • Review
    XIA Yunqing1, YANG Ying2, ZHANG Pengzhou2, LIU Yufei3
    2010, 24(1): 99-104.
    Song sentiment analysis has not been satisfactorily addressed in the audio signal processing community. In this paper, the lyric is used as evidence for song sentiment analysis, and the sentiment vector space model (s-VSM) is proposed to represent a given lyric. Compared to the word-based vector space model (w-VSM), the s-VSM model successfully addresses the critical issues of text representation efficiency, ambiguity, functionality, and data sparseness. Furthermore, the two-dimension Thayer sentiment stress model, i.e. light-hearted and heavy-hearted, is extended to a four-dimension model by incorporating two extra sentiment stress levels: complicated and implied. Experiments show that 1) the s-VSM model outperforms the traditional methods, and 2) the four-dimension sentiment stress model helps to further improve the performance of song sentiment analysis.
    Key words: computer application; Chinese information processing; sentiment analysis; sentiment vector space model; sentiment stress
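
The w-VSM versus s-VSM contrast can be made concrete: instead of one dimension per surface word, a lyric is projected onto the classes of a sentiment lexicon, shrinking the space and easing sparseness. The lexicon format below is an assumption for illustration.

```python
def sentiment_vector(lyric_words, lexicon):
    """Build an s-VSM style vector whose dimensions are sentiment classes
    rather than surface words; `lexicon` maps word -> sentiment class."""
    dims = sorted(set(lexicon.values()))
    counts = dict.fromkeys(dims, 0)
    for word in lyric_words:
        if word in lexicon:
            counts[lexicon[word]] += 1
    return [counts[d] for d in dims]
```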
  • Review
    MA Yongliang, ZHAO Tiejun
    2010, 24(1): 104-110.
    In Chinese-English statistical machine translation (SMT), Chinese text usually demands Chinese word segmentation (CWS) to identify the words in a sentence. However, CWS was not developed for SMT, and its results are not necessarily optimal for SMT. In recent years, many investigations have tried to make CWS more suitable for SMT, but we explore the problem from another direction: our basic idea is to use multiple CWS results as an additional language knowledge source, and we present a simple and effective approach to using multiple CWS results for SMT. We also give experimental results over a series of combination strategies, with the best result showing a gain of 1.89 BLEU points over a state-of-the-art SMT system.
    Key words: artificial intelligence; machine translation; statistical machine translation; Chinese word segmentation; feature interpolation of translation model; multi-strategy feature blending of translation model
  • Review
    XIAO Tong, LI Tianning, CHEN Rushan, ZHU Jingbo, WANG Huizhen
    2010, 24(1): 110-117.
    Word alignment is one of the key techniques in statistical machine translation (SMT). In this paper, we propose a word realignment method, which first recognizes the inconsistent parts between the bidirectional alignments generated by the IBM models, and then refines the word alignment by realigning the inconsistent parts. To reinforce the method, a monolingual feature is used to benefit from a large-scale monolingual corpus. The effectiveness of the method is demonstrated on a state-of-the-art phrase-based SMT system. The experimental results show that our method achieves higher translation accuracy than the widely adopted heuristics-based method.
    Key words: artificial intelligence; machine translation; statistical machine translation; word alignment; word realignment; IBM models
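
Recognizing the inconsistent parts of the bidirectional alignments amounts to set operations over the two link sets; a minimal sketch (the realignment model itself is not reproduced):

```python
def split_alignments(f2e_links, e2f_links):
    """Links both IBM-model directions agree on form the consistent part;
    the symmetric difference is the inconsistent part to be realigned.
    Each alignment is a set of (source_index, target_index) pairs."""
    consistent = f2e_links & e2f_links
    inconsistent = (f2e_links | e2f_links) - consistent
    return consistent, inconsistent
```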
  • Review
    JIANG Shangpu1,2, CHEN Qunxiu1,2
    2010, 24(1): 117-123.
    Word segmentation and part-of-speech tagging constitute the first step of Japanese natural language processing tasks, such as machine translation with Japanese as the source language. In this paper, a Japanese word segmentation and POS tagging approach based on rules and statistics is proposed. Adopting a perceptron-based joint word segmentation and POS tagging algorithm as the basic framework, the method is combined with adjacency attribute features derived from heuristic rules. An experiment on a small test dataset shows that the new approach achieves an F-score of 98.2% on word segmentation and 94.8% on joint word segmentation and POS tagging. The work has already been applied successfully in a Japanese-Chinese machine translation system.
    Key words: artificial intelligence; machine translation; Japanese-Chinese machine translation system; Japanese word segmentation; Japanese POS tagging; joint word segmentation
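
The perceptron-based joint framework updates a shared weight vector whenever the decoded segmentation-plus-tagging sequence differs from the gold one. A minimal sketch of the update step, with sparse dict features as an assumption:

```python
def perceptron_update(weights, gold_features, predicted_features, lr=1.0):
    """One structured-perceptron update for joint segmentation and POS
    tagging: promote gold-analysis features, demote those of the wrongly
    predicted analysis. Feature vectors are sparse {name: count} dicts."""
    for feat, value in gold_features.items():
        weights[feat] = weights.get(feat, 0.0) + lr * value
    for feat, value in predicted_features.items():
        weights[feat] = weights.get(feat, 0.0) - lr * value
```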
  • Review
    ZHOU Qiang, LI Yumei
    2010, 24(1): 123-129.
    The paper introduces the three chunk parsing tasks of the current CIPS parsing evaluation workshop (CIPS-ParsEval-2009), organized by Tsinghua University and Northeastern University: base chunk parsing, functional chunk parsing, and event description clause recognition. The design motivation and the classification standards of the three chunk types are discussed. Based on the detailed syntactic annotations in the Tsinghua Chinese Treebank (TCT), three benchmark chunk banks automatically extracted from TCT are built. The evaluation results of the top five participating systems are also given. The analysis of their statistics and the comparison with current chunking schemes reveal some characteristics of these three chunk parsing tasks.
    Key words: computer application; Chinese information processing; base chunk; functional chunk; event description clause; chunk banks