Journal of Chinese Information Processing

Language Analysis and Generation

Towards a Rule-based Approach to Automatic Interpretation of Chinese Noun Compounds

WEI Xue, YUAN Yulin

2014, 28(3): 1-10.

This paper proposes a rule-based approach to the automatic interpretation of Chinese ‘N+N’ compounds. The procedure has four steps: 1) establish semantic class patterns for noun compounds according to the semantic classification in the Semantic Knowledge-base of Contemporary Chinese; 2) reveal the semantic relation between the nouns in an ‘N+N’ compound by taking the Agentive Role or Telic Role of one of the nouns as the paraphrasing verb; 3) design one or more interpretation templates for every semantic class pattern, and build a database of ‘N+N’ combinations that records the semantic class patterns and the paraphrasing verbs; 4) build a Noun_Verb database, which contains the Agentive Role and/or Telic Role of each noun, using HowNet. Based on these two databases, a mechanism is built to generate interpretations of Chinese noun compounds automatically.

Language Analysis and Generation

Cohesion-driven Discourse Coherence Modeling

XU Fan1, ZHU Qiaoming2, ZHOU Guodong2, WANG Mingwen1

2014, 28(3): 11-21.

This paper systematically explores the impact of cohesion theory on Discourse Coherence Modeling (DCM). Unlike the state-of-the-art supervised entity-based and discourse relation-based grid models, our unsupervised model shows the importance to DCM of the theme-rheme structure, a cohesion theory from systemic-functional grammar, and the suitability of a theme- and coreference-based filtering mechanism for discourse consistency in DCM. Evaluation on three publicly available benchmark data sets, via sentence ordering and summary coherence rating tasks, shows the effectiveness of both theme-rheme structure and coreference resolution in DCM. It also shows that our system significantly outperforms the state-of-the-art ones.

Language Analysis and Generation

Research on Pragmatic Function of Verbs Addressing New Branch Topic

JI Cui, LU Dawei, SONG Rou

2014, 28(3): 22-27.

Chinese is a topic-prominent language. In Chinese discourse, a single topic can be discussed at length, but the topic can also change. This paper focuses on a specific kind of topic change called a New Branch Topic, in which part of the comment on the original topic raises a new topic, while the new topic and its comments cannot form a sentence with the original topic. This paper discusses the capacity of verbs to take an object that serves as a New Branch Topic, classifies these verbs by semantic category, and reports the semantic distribution statistics of all verbs with this function in Fortress Besieged.

Language Analysis and Generation

Sentiment Orientation Analysis of English Sentences with Modality

CHEN Zhongshuai, LIU Yang, YU Xiaohui

2014, 28(3): 28-35.

This paper analyzes the sentiment orientation of English sentences with modality. Such sentences are widely used in English and make up a significant proportion of typical review corpora. Because of the unique characteristics of modality, they are challenging for a general sentiment analysis system to handle. This paper identifies these sentences with the help of POS tagging and presents a new modal feature that has rarely been discussed in previous studies. To further improve accuracy, we develop a novel method that effectively combines phrases sharing similar modal meanings. The experimental results show that the F-score of the proposed method is 4% and 7% higher than that of classic methods in the two-class and three-class sentiment orientation classification tasks, respectively.

Language Analysis and Generation

Semantic Role Labeling of Chinese FrameNet Based on Conditional Random Fields

SONG Yijun1, WANG Ruibo1, LI Jihong1, LI Guochen2

2014, 28(3): 36-47.

Given a predicate word and its frame, semantic role labeling for Chinese FrameNet can be divided into two steps: boundary identification of semantic roles and classification of semantic roles. In this paper, these tasks are formalized as a word-level sequence labeling problem using the IOB2 strategy. We apply a conditional random field model to the automatic labeling experiments, with the word as the basic tagging unit. We extract 15 new base-chunk features by running the base chunk parser of Tsinghua University on the sentences, and these features are likewise mapped onto the word sequence. Experiments show that the overall F1 score of semantic role labeling increases by nearly 1% over the baseline, which is significant at the 0.05 level under a t-test.
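
As an illustration of the IOB2 strategy mentioned in this abstract, the following minimal sketch shows how role spans over a word sequence can be turned into one tag per word and recovered again. It is not the authors' code: the function names, the span format (start, exclusive end, role) and the role names "Agent" and "Theme" are assumptions made for the example.

```python
def spans_to_iob2(num_words, spans):
    """Encode role spans as IOB2 tags, one tag per word.

    spans: list of (start, end_exclusive, role) tuples (hypothetical format).
    """
    tags = ["O"] * num_words
    for start, end, role in spans:
        tags[start] = f"B-{role}"          # first word of a role span
        for i in range(start + 1, end):
            tags[i] = f"I-{role}"          # remaining words inside the span
    return tags


def iob2_to_spans(tags):
    """Decode IOB2 tags back into (start, end_exclusive, role) spans."""
    spans = []
    start, role = None, None
    # A trailing "O" sentinel closes any span that reaches the last word.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = (
            start is not None and tag.startswith("I-") and tag[2:] == role
        )
        if start is not None and not continues:
            spans.append((start, i, role))
            start, role = None, None
        if tag.startswith("B-"):
            start, role = i, tag[2:]
    return spans


# Hypothetical five-word sentence with two role spans.
example_tags = spans_to_iob2(5, [(0, 2, "Agent"), (3, 5, "Theme")])
```

With this encoding, boundary identification and role classification collapse into a single per-word tagging decision, which is the form a sequence model such as a conditional random field consumes.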

Language Analysis and Generation

Frame Selection for Unknown Lexical Units from Chinese FrameNet

CHEN Xueli1, LI Ru1,2, WANG Sai1, WANG Zhiqiang1

2014, 28(3): 48-54.

The low coverage of Chinese FrameNet leads to many unknown lexical units and restricts frame semantic analysis for Chinese. To identify frames for unknown lexical units, this paper proposes two methods based on Tongyici CiLin: an Average Semantic Similarity method and a Maximum Entropy (ME-based) method, both of which combine static and dynamic features. Experiments show that the two methods can effectively identify the frames of unknown lexical units: the accuracy of the similarity-based method is 78.61% when the Top-4 candidates are considered; the Top-1 accuracy of the ME-based method on the same test set is 87.29% (and 75% on a separate set of news texts).

Information Retrieval and Social Computing

Research on Microblog Information Diffusion Network Structural Properties

WANG Xiaoming, WANG Li, YANG Jingzong

2014, 28(3): 55-61.

Microblogs are widely used nowadays, and the interaction structure among their users is complex. This paper proposes a novel method to analyze the properties of microblog information diffusion networks. We first define the information source. Then information diffusion networks for six different topic events are visualized and analyzed. The information diffusion network is modeled as a directed acyclic graph, and three motif structures are defined to represent information scattering, information gathering and information transmitting, respectively. According to the Spearman rank correlation coefficient, the distributions of the three motif structures are quite different from each other. As for the evolution of the information diffusion network, it is found that the information scattering structure has the largest count at each snapshot.

Information Retrieval and Social Computing

Research on Detecting Spammer in Micro-blogs

LI Heyuan1,2, YU Xiaoming1, LIU Yue1, CHENG Xueqi1, CHENG Gong3

2014, 28(3): 62-67.

Micro-blogs have changed the way people obtain information. However, they have been infiltrated by a large amount of spam, which poses a challenge to normal users. In this paper, we study spam in Chinese micro-blogs. We analyze the behavior of spam users and propose 7 new features for detecting them. We then describe how to apply these features to spammer detection with an SVM classifier. The experimental results indicate that the accuracy and recall of the proposed method are satisfactory.

Information Retrieval and Social Computing

WAN Shengxian1,2, GUO Jiafeng1, LAN Yanyan1, CHENG Xueqi1

2014, 28(3): 68-74.

Tweet popularity prediction in social networks is very important for applications such as information recommendation and viral marketing. This paper proposes a new approach to tweet popularity prediction based on propagation simulation. A maximum entropy model is first used to learn the probabilities of users' retweeting behaviors, and then the independent cascade model is used to simulate the diffusion of tweets in a real social network. This approach benefits from using more information about the social network structure and its users. Experiments on a Twitter dataset show that our approach is better in both precision and stability than the baselines.
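
The propagation-simulation idea can be sketched with a small Monte Carlo version of the independent cascade model. This is only an illustration under stated assumptions, not the paper's implementation: the follower graph, the per-edge retweet probabilities (which the paper learns with a maximum entropy model but which are simply given as a dictionary here) and all function names are hypothetical.

```python
import random


def simulate_cascade(followers, retweet_prob, author, rng):
    """Run one independent cascade and return the set of users reached.

    followers: dict mapping a user to the users who see that user's posts.
    retweet_prob: dict mapping (poster, follower) to a retweet probability
                  (assumed given here; missing edges count as 0.0).
    """
    active = {author}
    frontier = [author]
    while frontier:
        next_frontier = []
        for poster in frontier:
            for follower in followers.get(poster, []):
                # Each newly active user gets exactly one chance per follower.
                p = retweet_prob.get((poster, follower), 0.0)
                if follower not in active and rng.random() < p:
                    active.add(follower)
                    next_frontier.append(follower)
        frontier = next_frontier
    return active


def expected_retweets(followers, retweet_prob, author, runs=1000, seed=0):
    """Estimate popularity as the mean number of retweeters over many runs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(runs):
        reached = simulate_cascade(followers, retweet_prob, author, rng)
        total += len(reached) - 1  # exclude the original author
    return total / runs


# Hypothetical three-user chain: a's post is seen by b, b's by c.
chain = {"a": ["b"], "b": ["c"]}
certain = {("a", "b"): 1.0, ("b", "c"): 1.0}
estimate = expected_retweets(chain, certain, "a", runs=50)
```

Averaging over many simulated cascades gives a popularity estimate that depends on both the network structure and the per-user retweet probabilities, which is the kind of information the abstract says the approach benefits from.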

Information Retrieval and Social Computing

Research on Long-tail Query Search Performance Evaluation

HUO Shuai, ZHANG Min, LIU Yiqun, MA Shaoping, JIN Yijiang, RU Liyun

2014, 28(3): 75-80.

Search engines are committed to helping people find target information accurately and quickly, so the evaluation of search performance has become increasingly vital. This paper deals with performance evaluation for rare queries, which has received little attention. First, three types of features are extracted after analyzing the characteristics of rare queries. Second, the correlation among the features is analyzed and different feature combinations are tested. Then, two data balancing approaches are proposed to alleviate the serious imbalance of the data set. Finally, an evaluation method for rare queries is put forward and then improved. The experimental results show that the proposed evaluation approach is effective, achieving encouraging precision in identifying non-relevant results.

Machine Translation

A Survey of Automatic Machine Translation Evaluation

LI Liangyou, GONG Zhengxian, ZHOU Guodong

2014, 28(3): 81-91.

With the development of machine translation, automatic evaluation methods have received more and more attention. Because so many related methods and technologies have been proposed, organizing and describing them under a principled classification is a major challenge. This paper focuses on three types of methods: checkpoint-based methods, string-matching methods and machine learning based methods. It enumerates several representative approaches for each type, describing the principles of the metrics and analyzing their advantages and shortcomings. In addition, evaluation with limited references is introduced as a special category, which plays an important role in increasing the degree of automation as well as boosting performance. Furthermore, some well-known evaluation metric campaigns are introduced. Finally, we show the trends in current research on automatic evaluation and point out some relevant problems for future study.

Minority Language Information Processing

The Algorithm of Spelling Check Based on TSRM

ZHU Jie1,2, LI Tianrui1, LIU Shengjiu1

2014, 28(3): 92-98.

As a fundamental issue in text processing, spelling check is used in a wide range of fields, such as word processing, character recognition, voice recognition and search engines. According to the word formation rules of Tibetan phonetic features, the paper proposes an algorithm for spelling check of Tibetan syllables via a simplified model of Tibetan syllable rules. The results of two experiments verify the effectiveness of the algorithm. Without considering the special cases of Tibetan syllables, the accuracy of spelling error detection reaches 99.8%.

Minority Language Information Processing

Tibetan Syntax Formal Description Based on FUG

TashiGyal1, DuoLa2

2014, 28(3): 99-103.

To meet the practical needs of Tibetan natural language processing, this paper adopts complex feature sets and functional unification for the formal description of Tibetan sentences. In light of modern linguistic theory, this paper explores a frame representation for the functional unification of Tibetan lexical, syntactic and semantic rules.

Minority Language Information Processing

Study on Recognition Algorithms for Tibetan Construction Elements

Bianba wangdui, Zhuoga, CHEN Yanli, WU Qiang

2014, 28(3): 104-111.

To implement a Tibetan sorting algorithm, the construction elements that compose a Tibetan syllable must first be recognized, so that sorting can then proceed according to their priority. Through a study of Tibetan morpheme structure, spelling laws and grammar rules, a novel algorithm is designed for recognizing the construction elements of modern Tibetan. The algorithm handles ambiguity, double vowels and abbreviations in special Tibetan syllables. In addition, corresponding processing is adopted to guarantee correct recognition for the Tibetan Standard of China. Tests show that the algorithm can meet the practical demands of recognizing Tibetan construction elements.

Minority Language Information Processing

A Uyghur Online-Handwritten Word Recognition System

Riyiman Tursun, Wushour Silamu

2014, 28(3): 112-115.

This paper presents a system for Uyghur online handwritten word recognition. According to the characteristics of Uyghur word handwriting, the system adopts a strategy based on multiple classifier combination, using a Gaussian Mixture Model for the static image and a Hidden Markov Model for the dynamic writing trajectory of the handwritten word, respectively. The combination of multiple classifiers improves the recognition accuracy effectively. In the preliminary experiments, our system achieves an accuracy of 97% and 99%, respectively.

Speech Recognition and Analysis

Chinese Stop Detection Based on Energy Change Rate

ZHANG Lianhai, CHEN Bin, QU Dan, LI Bicheng

2014, 28(3): 116-122.

To address the unreliability of burst spectrum features, a Chinese stop detection method based on energy change rate features is proposed. The energy change rate features are first extracted from Seneff's auditory spectrum and then transformed by the Fisherface approach. Finally, a KNN classifier is used to perform stop detection. Tested by leave-one-out cross validation, the method shows high stability and good generalization: the accuracy is 96.39% for clean speech and 88.07% for noisy speech at an SNR of 10 dB.
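
The evaluation protocol named in this abstract, a KNN classifier scored by leave-one-out cross validation, can be sketched as follows. This is a generic illustration, not the authors' pipeline: it omits the auditory spectrum and Fisherface steps entirely, and the two-dimensional feature vectors, the labels and the function names are made up for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Predict a label by majority vote among the k nearest training samples.

    train: list of (feature_vector, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(samples, k=3):
    """Hold out each sample in turn, train on the rest, and score the hit rate."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]   # everything except sample i
        if knn_predict(train, features, k) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical, well-separated toy data: three "stop" and three "other" frames.
toy_samples = [
    ((0.0, 0.1), "stop"), ((0.1, 0.0), "stop"), ((0.1, 0.1), "stop"),
    ((5.0, 5.1), "other"), ((5.1, 5.0), "other"), ((5.1, 5.1), "other"),
]
toy_accuracy = leave_one_out_accuracy(toy_samples, k=3)
```

Leave-one-out uses every sample as a test case exactly once, which makes it a common choice when the labeled set is small.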

Speech Recognition and Analysis

Introduction to Automatic Labeling/Retrieving System for Acoustic Parameters

ZHOU Xuewen, HU He

2014, 28(3): 123-128.

This paper presents an automatic labeling/retrieving system for acoustic parameters. By using the system, phonetic analysts can dramatically reduce errors in labeling and retrieving acoustic parameters, improve working efficiency, ensure the repeatability and verifiability of phonetic data, and promote standardization in establishing acoustic parameter databases.

Speech Recognition and Analysis

Effects of Topic Transition and Sentence Length on Acoustic Cues in Mandarin Chinese

WU Qian, WANG Bei

2014, 28(3): 129-135.

This paper investigates the effects of topic transition type and sentence length on pause, final lengthening and pitch reset at prosodic phrase boundaries between two clauses. The discourses contained two sentences each. The second sentence is manipulated to control length (long vs. short) and topic transition type (continuation, elaboration or shift). The results from twenty native speakers show that: 1) Both topic transition and sentence length have significant effects on pause duration and pitch reset, but not on pre-boundary lengthening, with no interaction between them. More specifically, a longer pause and a larger pitch reset occur when the second sentence is long. Pause duration and pitch reset increase more under topic shift than under topic elaboration and continuation. 2) A weak negative correlation is found between pause duration and pre-boundary lengthening, and a weak positive correlation between pause duration and pitch reset. 3) Compared with male speakers, female speakers use both pitch and duration variation to mark topic transition type in a more systematic way. These results suggest that the effect of sentence length on acoustic cues at intonational phrase boundaries is probably articulatory, whereas that of topic transition type is communicative.

Information Extraction and Text Mining

Chinese Comparative Sentences Identification and Comparative Elements Extraction Based on Semantic Classification

ZHOU Hongzhao, HOU Mingwu, HOU Min, TENG Yonglin

2014, 28(3): 136-141.

Comparison is a common form of expression used to assess which of several things is better, or whether they are identical (or similar) in some respect. How to identify comparative sentences and extract the compared elements automatically is a novel and practical research problem in the sentiment analysis field. Based on the interdependence between comparative sentences and comparative elements, we propose a method that accomplishes the two identification tasks simultaneously. According to the semantic classification of words and comparative sentences, we construct a lexicon system consisting of a domain lexicon, a sentiment lexicon, a mark lexicon and a common lexicon, and then build a rule base for comparative sentence identification and comparative element extraction. On the test corpus published by the Fourth Chinese Opinion Analysis Evaluation (COAE2012), the experiments demonstrate promising results for the proposed method.

Information Extraction and Text Mining

Topic Evolutionary Analysis for Dynamic Topic Number

FANG Ying1,2, HUANG Heyan1, XIN Xin1, WEI Xiaochi1, ZHUANG Kun1

2014, 28(3): 142-149.

Topic evolution, used to analyze trends in topic change, is of significance in both application and research. The ILDA (Infinite Latent Dirichlet Allocation) model extends the LDA (Latent Dirichlet Allocation) model with a Dirichlet process. The ILDA model can not only infer the latent variables, but also update the hyperparameters and change the number of topics dynamically. In existing topic evolution systems, the number of topics is pre-defined and cannot change. The method based on the ILDA model aims to resolve this by enabling the following: different topics for classification in each cycle, topic association between adjacent cycles, and sub-topic strength calculation over the time sequence. The experiments show that the dynamic updating of the parameters meets practical demands, resulting in a satisfactory topic evolution analysis.

2014 Volume 28 Issue 3 Published: 10 March 2014