Journal of Chinese Information Processing

Review

A Survey on Chinese Chunk Parsing

LI Yegang1,2, HUANG Heyan1

2013, 27(3): 1-9.

Chunking, a typical form of shallow parsing, supplies syntactic information to many language information processing systems and bridges lexical analysis, syntactic parsing, and semantic parsing. This paper surveys research on chunking in several aspects: the definition and classification of chunks, chunk identification, chunk annotation and evaluation, and the internal relationships within chunks. Finally, it draws conclusions and discusses future work.
Key words: Chinese information processing; shallow parsing; chunk parsing; chunk identification

Review

A Survey of Syntactic Parsing Based on Statistical Learning

WU Weicheng1, ZHOU Junsheng1, QU Weiguang1,2

2013, 27(3): 9-20.

Syntactic parsing is one of the fundamental problems in natural language processing. In recent years, much effort has been devoted to it, resulting in a variety of approaches based on statistical learning. This paper systematically summarizes and classifies these approaches from the viewpoint of statistical learning models and algorithms, focusing on the analysis and comparison of the different types of models and algorithms. Current research on Chinese syntactic parsing is also presented. Finally, we outline future directions and trends in syntactic parsing research, especially for Chinese.
Key words: syntactic parsing; statistical learning model; generative model; discriminative model; shift-reduce; data oriented parsing

Review

Survey of Discourse Analysis Methods

XU Fan, ZHU Qiaoming, ZHOU Guodong

2013, 27(3): 20-33.

Discourse, a unit of text analysis beyond the word and the sentence, plays a crucial role in natural language understanding and generation. This paper surveys state-of-the-art research on Chinese and English discourse analysis from the perspective of computational linguistics, covering its applications, the process of constructing a full Chinese and English discourse parser under different discourse theories, discourse corpora and evaluation, and algorithms with their detailed implementation. The paper also outlines several directions for further research on discourse analysis.
Key words: discourse; discourse analysis; corpus; evaluation

Review

A Survey of Narrative Generation Approaches

ZHU Feng1,2, CAO Cungen1

2013, 27(3): 33-41.

With the rapid development of artificial intelligence and natural language processing, research on narrative generation is attracting growing attention. This paper introduces the concepts, background, and current research status of narrative generation. From the perspective of research methodology, it surveys the field and groups related work into three major types: the automated planning based approach, the commonsense knowledge and ontology based approach, and the story grammar based approach. The fundamental ideas and key techniques of these approaches are analyzed, and their limitations and future work are discussed.
Key words: narrative generation; story generation; narrative intelligence; natural language generation

Review

Research on Phonetic Symbols of Phonograms in Chinese Mandarin

HU Renfen1, CAO Bing2, DU Jianyi3

2013, 27(3): 41-48.

Most Chinese characters are phonograms. A phonogram was created by combining an ideographic symbol with a phonetic symbol to indicate the meaning and the pronunciation of the character. Over time, however, some phonetic symbols have ceased to indicate the pronunciation accurately, which makes it difficult for people to read characters correctly. This paper applies a mathematical modeling approach to 3500 commonly used Chinese characters in Mandarin. By integrating linguistic theory with computer science methods, it aims at a systematic and comprehensive study of the phonetic symbols in phonograms, so as to provide reference and evidence for the formulation of Chinese character specifications and for language teaching.
Key words: Chinese characters; phonogram; phonetic symbols; phonetic indication; cluster analysis

Review

Improvements on Mandarin Pronunciation Evaluation

QI Xin1, XIAO Yunpeng1,2, YE Weiping1

2013, 27(3): 48-56.

In this paper, Power-Normalized Cepstral Coefficients (PNCC) are introduced into a Mandarin pronunciation evaluation system to reduce the impact of background noise. The results show that the score correlation based on PNCC increases by 6.6% compared with classical MFCC. Different initial-final acoustic model structures for Chinese syllables are then investigated for Mandarin pronunciation evaluation. An initial-medial and final (IMF) modeling is applied, resulting in a 5.6% reduction in the error rate and an increase of 0.056 in score correlation. Finally, the number of states in the HMM is discussed for pronunciation scoring, and mixed score computing schemes based on either models or scores are proposed. Test results show that the score correlation with the experts increases by 0.021 and 0.017, respectively.
Key words: Mandarin pronunciation evaluation; PNCC; initial-medial and final; HMM states
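
The abstract above reports gains in score correlation with expert ratings but does not say which correlation coefficient is used. As a minimal sketch, assuming the Pearson coefficient (the function name `pearson` and any scores passed to it are illustrative and not taken from the paper), the measure could be computed as follows:

```python
import math
from typing import Sequence


def pearson(machine: Sequence[float], expert: Sequence[float]) -> float:
    """Pearson correlation between machine scores and expert scores.

    Assumption: the paper's "score correlation" is taken to be Pearson's r;
    the abstract itself does not specify the coefficient.
    """
    n = len(machine)
    if n != len(expert) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    mean_m = sum(machine) / n
    mean_e = sum(expert) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((m - mean_m) * (e - mean_e) for m, e in zip(machine, expert))
    var_m = sum((m - mean_m) ** 2 for m in machine)
    var_e = sum((e - mean_e) ** 2 for e in expert)
    if var_m == 0 or var_e == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_m * var_e)
```

Under this reading, an increase of 0.056 would mean that the coefficient between system scores and expert scores moves 0.056 closer to 1.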

Review

Out-of-vocabulary Word Rejection Based on Output Probability Distribution

HUANG Shilei1,2, LIU Yi2, CHENG Gang2

2013, 27(3): 56-61.

This paper proposes an out-of-vocabulary (OOV) word rejection method based on the Output Probability Distribution (OPD) of phoneme HMMs in word verification. Compared with the input vector of the dynamic garbage model, the OPD vector contains more information than the sorted probabilities. The confidence score of each phoneme is calculated by an SVM with OPD vectors as input to decide whether to accept or reject the hypothesis. Experimental results show that the proposed method achieves an 11.0% decrease in EER compared with the conventional dynamic garbage model in a word verification task.
Key words: speech recognition; word verification; confidence measure

Review

Research on the Corpus Effect to the Chinese Noun Phrase Anaphora Resolution

GAO Junwei, KONG Fang, ZHU Qiaoming, LI Peifeng

2013, 27(3): 61-69.

Coreference is a common phenomenon in natural language and plays an important role in keeping text clear and explicit. Coreference resolution is the task of detecting these phenomena automatically. In recent years a great deal of research has addressed this task for English, with substantial achievements, but much less work has been done for Chinese. One problem is the lack of public Chinese corpora for this research, apart from ACE2005, OntoNotes, and similar resources. To study the effect of the corpus on Chinese noun phrase anaphora resolution, we present one Chinese noun phrase coreference resolution system based on a supervised learning approach and another based on an unsupervised clustering approach. Using these two platforms, we discuss how the quantity and the quality of the corpus affect Chinese noun phrase coreference resolution.
Key words: coreference resolution; noun phrase; unsupervised; clustering; corpus

Review

Identify Sentiment-Objects from Chinese Sentences Based on Cascaded Conditional Random Fields

ZHENG Minjie1, LEI Zhicheng2, LIAO Xiangwen2, CHEN Guolong2

2013, 27(3): 69-77.

Sentiment-object extraction aims to identify the targets of opinions expressed in sentiment sentences. However, previous research fails to extract compound targets and unknown words. In this paper, a cascaded CRFs model is presented to deal with this problem. The method first acquires an opinion target set using a lower-level CRFs model. Then, middle-level models are employed to obtain a candidate set by filtering noise, complementing missing candidate targets, and merging compound noun phrases. Finally, the opinion target set is extracted by the higher-level model, using the candidate set from the middle-level models as input. Experiments show that our method outperforms linear-chain CRFs by 1.62% in precision, 5.75% in recall, and 4.17% in F1 measure. The method is also effective in identifying compound targets and unknown targets.
Key words: sentiment-objects; cascaded conditional random fields; noise reduction model; complement model

Review

Automatic Text Error Detection in Domain Question Answering

LIU Liangliang1,2, WANG Shi1, WANG Dongsheng1,2, WANG Pingze1,2, CAO Cungen1

2013, 27(3): 77-84.

Automatic text proofreading is an important research issue in NLP and remains a challenge. This paper analyzes the types and causes of Chinese text errors and proposes an automatic typo detection method based on the user query logs of a domain question answering system. First, word segmentation is performed on the corpus; then the fragments in the segmentation result are merged. After clustering the multi-character words and the merged strings, the approach obtains typo pairs automatically through contextual analysis of similar strings. Experiments show that the method achieves a recall of 71.32% and an accuracy of 82.6% on real question answering system logs.
Key words: automatic text proofreading; question answering system; non-word error; real-word error; typo pair

Review

Example Phrase Based Chinese-Tibetan Computer Aided Translation

XIONG Wei1,2, WU Jian1, LIU Huidan1,2, ZHANG Liqiang1

2013, 27(3): 84-91.

At present, research on Chinese-Tibetan machine translation focuses on rule-based methods. Because parallel corpora and other resources for Chinese and Tibetan are scarce, it is almost impossible to carry out statistical experiments on Chinese-Tibetan machine translation. To meet the practical needs of Chinese-Tibetan computer aided translation, this paper proposes an example phrase based machine translation method. It takes full advantage of the existing parallel corpus resources, using word alignment information to improve translation quality. By allowing the retrieval of arbitrarily long phrase examples, the approach performs better than the sentence-level example-based method. On the test data, the method achieves performance comparable to Moses, and its recall of translation phrases improves by 9.71% over Moses. The translation speed is about 0.175 s per sentence, which meets the requirements of a computer aided translation system.
Key words: machine translation; computer aided translation; phrase-based translation; example-based translation

Review

A Parallel Pages Mining Approach: Combining URL Patterns and HTML Structures

LIU Qi, LIU Yang, SUN Maosong

2013, 27(3): 91-100.

Parallel corpora are a fundamental resource for statistical machine translation, cross-lingual information retrieval, and other information processing technologies. Although the amount of parallel data on the web is continually increasing, the heterogeneity and complexity of parallel websites still make it a challenge to collect such parallel texts. This paper presents a new approach to mining parallel web pages that combines URL patterns with HTML structure. First, we use HTML structure to recursively visit parallel pages. Then, URL patterns are used to optimize the traversal order over the topology of the parallel website. The result is an efficient and accurate parallel page mining system. Compared with the traditional approach, experiments on two parallel websites (www.un.org and www.gov.hk) show that this approach saves more than 50% of the processing time and improves accuracy by 15%, resulting in a significant increase in the translation quality of the MT system.
Key words: parallel pages mining; parallel corpus; URL pattern; HTML structure

Review

Constructing Word Association Network by Crowdsourcing

DING Yu, CHE Wanxiang, LIU Ting, ZHANG Meishan

2013, 27(3): 100-107.

Dictionaries are crucial to natural language processing. They are a fundamental resource for Chinese word segmentation, POS tagging, parsing, and so on. This paper presents a method for building a semantic relevance dictionary through crowdsourcing, which is triggered indirectly by word association. Compared with traditional dictionaries, the resulting word association network has the following advantages: 1) low cost; 2) it is Internet oriented and easy to expand; 3) word relationships are determined from the perspective of human cognition and are consistent with human intuition. In addition to describing how the word association network is built, we analyze the data obtained and compare it with HowNet, TongYiCi CiLin, and word n-grams from Weibo to show its characteristics.
Key words: crowdsourcing; semantic relevance dictionary; word association network

Review

Definite Null Instantiation Detection in FrameNet

LEI Zhangzhang1, WANG Ning1, LI Ru1,2, WANG Zhiqiang1

2013, 27(3): 107-113.

In FrameNet, definite null instantiation detection aims to find null instantiations of frame elements that need to be filled in a frame semantic annotation corpus, which benefits text understanding. This paper proposes a simple two-stage pipeline for definite null instantiation recognition: the first stage uses a rule-based approach to detect null instantiations in a corpus that has been labeled with semantic roles, and the second stage uses maximum entropy to predict the type of each null instantiation detected in the first stage. Results on the test data from SemEval-2010 Task 10 show that the recall of null instantiation detection and the precision of null instantiation classification are 60.1% and 53.5%, respectively, close to the best result of the evaluation.
Key words: FrameNet; definite null instantiation recognition; maximum entropy

Review

Document Clustering Based on Word Sense Cluster

TANG Guoyu1, XIA Yunqing1, ZHANG Min2, ZHENG Fang1

2013, 27(3): 113-120.

Document representation is the key part of document clustering, and this paper aims to improve it. Synonymy and polysemy are two challenging issues in document representation. Inspired by the observation that both are mainly related to word sense, we present a novel model, the Sense Cluster Model (SCM), which addresses both issues by representing documents with word sense clusters. In SCM, word sense clusters are first constructed from the development dataset by 1) word sense induction, to automatically discover the different senses of each word from raw text; and 2) word sense clustering, to recognize identical or similar words. Then, after word sense disambiguation, a probability distribution over word sense clusters is generated to represent each document. Experiments on benchmark data show that SCM outperforms both the baseline and the classic topic model LDA on the document clustering task.
Key words: word sense; document representation; topic model
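
The final representation step described above, a probability distribution over word sense clusters per document, can be illustrated with a short sketch. It assumes that word sense induction, clustering, and disambiguation have already assigned each token a cluster id in the range 0 to `num_clusters - 1`. The function name and the plain relative-frequency estimate are assumptions made for illustration; the abstract does not state how the distribution is estimated.

```python
from collections import Counter
from typing import List, Sequence


def sense_cluster_distribution(cluster_ids: Sequence[int],
                               num_clusters: int) -> List[float]:
    """Represent a document as relative frequencies over sense clusters.

    cluster_ids: one sense-cluster id per disambiguated token (hypothetical
    input; producing it is outside the scope of this sketch).
    """
    for cid in cluster_ids:
        if not 0 <= cid < num_clusters:
            raise ValueError(f"cluster id {cid} is out of range")
    total = len(cluster_ids)
    if total == 0:
        # An empty document has no evidence for any cluster.
        return [0.0] * num_clusters
    counts = Counter(cluster_ids)
    return [counts.get(k, 0) / total for k in range(num_clusters)]
```

For example, a document whose four tokens map to clusters 0, 2, 2, and 1 would be represented over three clusters as `[0.25, 0.25, 0.5]`, and documents could then be compared or clustered on these vectors.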

Review

Semi-Supervised Sentiment Classification with an Ensemble Strategy

GAO Wei, WANG Zhongqing, LI Shoushan

2013, 27(3): 120-127.

Sentiment classification aims to predict the sentiment orientation expressed in a text. In this paper, we investigate semi-supervised approaches to sentiment classification in an ensemble learning framework, where abundant unlabeled data are leveraged together with a small amount of labeled data to enhance classification performance. To improve the performance of semi-supervised learning, we propose a novel ensemble method based on label consistency. Specifically, we combine two popular semi-supervised methods, co-training with random feature subspaces and label propagation, to generate pseudo-labeled data for updating the initial labeled data. First, the unlabeled data are labeled by the two semi-supervised learning approaches separately. Then, the unlabeled samples with consistent labels are taken as pseudo-labeled data. Finally, the labeled data are updated with the pseudo-labeled data. Experiments show that our approach effectively reduces the error in the pseudo-labeled data and thus achieves much better performance than several other approaches to semi-supervised sentiment classification.
Key words: sentiment classification; semi-supervised learning; ensemble learning
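
The three-step label consistency procedure in the abstract above can be sketched as follows. The sketch assumes the two semi-supervised learners (co-training and label propagation in the paper) have already produced one predicted label per unlabeled sample. Both learners are outside the scope of the sketch, and the function names are hypothetical rather than taken from the paper.

```python
from typing import Hashable, List, Sequence, Tuple


def consistent_pseudo_labels(labels_a: Sequence[Hashable],
                             labels_b: Sequence[Hashable]
                             ) -> List[Tuple[int, Hashable]]:
    """Return (index, label) for samples on which both learners agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both learners must label the same samples")
    return [(i, a) for i, (a, b) in enumerate(zip(labels_a, labels_b))
            if a == b]


def update_labeled_data(labeled: Sequence[Tuple[str, Hashable]],
                        unlabeled: Sequence[str],
                        labels_a: Sequence[Hashable],
                        labels_b: Sequence[Hashable]):
    """Move agreed-upon samples from the unlabeled pool to the labeled set."""
    agreed = consistent_pseudo_labels(labels_a, labels_b)
    agreed_idx = {i for i, _ in agreed}
    new_labeled = list(labeled) + [(unlabeled[i], y) for i, y in agreed]
    remaining = [x for i, x in enumerate(unlabeled) if i not in agreed_idx]
    return new_labeled, remaining
```

Samples on which the two learners disagree stay in the unlabeled pool, which is one plausible reading of how requiring consistency reduces errors in the pseudo-labeled data.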

2013 Volume 27 Issue 3 Published: 15 June 2013