Journal of Chinese Information Processing

Language Analysis and Generation

Lexical Issues in Chinese Information Processing: in the Background of Sentence-based Diagram Treebank Construction

PENG Weiming, SONG Jihua, YU Shiwen

2014, 28(2): 1-7.

This paper compares the Sentence-based Diagram Treebank with existing lexical specifications with respect to word segmentation units and POS tagging, revealing the disconnect between automatic lexical analysis and parsing in current Chinese information processing. It describes the parsing strategy for special structures such as nonce formations and idioms in the Diagram Treebank, as well as their linguistic rationale. It also explores how Chinese word-class theories such as “For All Words, the Word-class Is Based on the Sentence” and “Referentiality” can be implemented in Chinese information processing.

Language Analysis and Generation

Chinese Sentential Semantic Type Recognition Based on Predicate and Sentential Semantic Type Chunk

WANG Qian, LUO Senlin, HAN Lei, PAN Limin

2014, 28(2): 8-16.

According to modern Chinese semantics, there are four sentential semantic types (single, complex, compound and multiple). Sentential semantic type recognition, which aims to capture the overall semantic structure of a sentence, is an important step in parsing the whole sentential semantic structure. This paper proposes a method for recognizing the four semantic types based on predicates and sentential semantic type chunks. The method first identifies some single-type sentences by the number of predicates in each sentence. For the remaining sentences, the C4.5 algorithm is applied to obtain the maximum number of sentential-semantic-type chunks of predicates in the sentential semantic structure, and the semantic type of each sentence is then identified by combining this with the top sentence node of the syntactic structure. The experimental data comprise 10221 sentences chosen from the Beijing Forest Studio-Chinese Tag Corpus. The accuracy of sentential semantic type recognition reaches 97.6% in the open test.

Language Analysis and Generation

Predicting Implicit Discourse Relation Based on Functional Connective

CHE Tingting, HONG Yu, ZHOU Xiaopei, YAN Weirong, YAO Jianmin, ZHU Qiaoming

2014, 28(2): 17-27.

A functional connective is a lexical feature that directly expresses the internal semantic relations, structural characteristics and contextual development of discourse units. Based on functional connectives, this paper puts forward a method for predicting implicit discourse relations. First, the method mines functional connectives at the word and phrase level and assigns each to a discourse relation category. Second, it builds a concept model for each type of functional connective to describe the attributes of the arguments that the connective links, and establishes a mapping between argument concepts and discourse relations. Finally, the implicit discourse relation is predicted by using a statistical strategy to recognize the concept model of the arguments and then applying the “concept-relation” mapping. Experimental results show that this method, which constructs concept models based on functional connectives, achieves significant performance improvements over the existing classification method based on supervised learning.

Language Analysis and Generation

Chinese Discourse Relation Semantic Taxonomy and Annotation

ZHANG Muyu, QIN Bing, LIU Ting

2014, 28(2): 28-36.

Discourse relations are an important part of discourse semantic analysis. This paper analyzes the differences between Chinese and English discourse, and then presents in detail the first Chinese discourse relation taxonomy, built on English discourse relation research. To test the rationality of the hierarchy, we conduct annotation experiments on Chinese internet news texts and analyze the difficulties that arose during annotation, together with their resolutions, to lay a foundation for future discourse semantic analysis.

Language Analysis and Generation

A New Measure of Semantic Similarity Based on Hierarchical Network of Concepts

WU Zuoyan, WANG Yu

2014, 28(2): 37-43.

A new measure based on the Hierarchical Network of Concepts (HNC) theory is put forward to compute semantic similarity in natural language. Based on the coding rules and the map theory embodied in the concept expressions at the vocabulary relation level of HNC, the method integrates concept connotation, outward features, classification and symbol combination to calculate semantic similarity. The method is compared with currently popular HowNet-based similarity methods against subjective human judgments. Experiments show that the method performs well and distinguishes differences between words more accurately.

Machine Translation

Three Ways to Incorporate Bilingual Phrases into Dependency-to-String Model

XIE Jun, LIU Qun

2014, 28(2): 44-50.

The dependency-to-string model uses translation rules based on head-dependents relations, each of which consists of a head and all its dependents. The model is good at capturing sentence patterns and phrase patterns in the source language, but fails to capture non-compositional phenomena (such as idioms and collocations) that phrases capture easily. To improve performance, we propose three ways to incorporate syntactic phrases, generalized syntactic phrases and non-syntactic phrases into the model. Experiments show that the model gains up to about 1.0 BLEU points by incorporating these three kinds of phrases.

Machine Translation

An Improved HMM Based Word Alignment Method

LIU Ying, JIANG Wei

2014, 28(2): 51-55.

This paper improves HMM-based word alignment by introducing syntactic knowledge. The HMM is combined with English phrase structure tree distance to align Chinese-English words. Experiments show that the improved HMM reduces the word alignment error rate and improves the BLEU score of statistical machine translation.

Minority Language Information Processing

Discriminative Tibetan Part-of-Speech Tagging with Perceptron Model

HUA que-cai-rang, LIU Qun, ZHAO Haixing

2014, 28(2): 56-60.

This paper describes a discriminative method for Tibetan part-of-speech tagging with a perceptron training model. We focus on how to build feature templates that fit Tibetan lexical features, how to train the discriminative model, and how to perform part-of-speech tagging. The method achieves a precision of 98.26% on a manually created test corpus, which shows that it is a practical solution for Tibetan natural language processing.

Minority Language Information Processing

Tibetan Word Segmentation Based on Discriminative Classification and Reranking

SUN Meng, HUA que-cai-rang, CAI Zhijie, JIANG Wenbin, LV Yajuan, LIU Qun

2014, 28(2): 61-65.

This paper presents a discriminative-model-based approach to Tibetan word segmentation and investigates the influence of word-formation granularity and reranking strategy on segmentation. For word-formation granularity, we compare three word-formation units: the basic Tibetan character, the basic Tibetan character-tsheg, and the syllable. The experimental results show that using the syllable as the word-formation unit obtains the highest F-measure, 91.21%. With a reranking strategy based on a word lattice and shortest path, we further boost the F-measure to 96.25%.
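
The shortest-path idea named in this abstract can be illustrated with a minimal sketch: every candidate word spanning a run of syllables becomes an edge in a lattice, and the segmentation is the cheapest path from the first syllable to the last. The lexicon, the costs, the `unknown_cost` fallback, and the function name are all hypothetical; the abstract does not specify how the paper builds its lattice or scores its paths.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, ...]


def shortest_path_segment(syllables: List[str],
                          word_cost: Dict[Word, float],
                          unknown_cost: float = 10.0) -> List[Word]:
    """Segment a syllable sequence by the minimum-cost path through a lattice.

    word_cost maps a candidate word (a tuple of syllables) to a hypothetical
    cost; a single syllable missing from the lexicon gets unknown_cost so that
    a complete path always exists.
    """
    n = len(syllables)
    best = [float("inf")] * (n + 1)  # best[i]: cheapest cost to cover syllables[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            candidate = tuple(syllables[start:end])
            if candidate in word_cost:
                cost = word_cost[candidate]
            elif end - start == 1:
                cost = unknown_cost
            else:
                continue  # multi-syllable spans must be in the lexicon
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words: List[Word] = []
    pos = n
    while pos > 0:
        words.append(tuple(syllables[back[pos]:pos]))
        pos = back[pos]
    return words[::-1]
```

With a toy lexicon in which the pair `("a", "b")` is cheaper than treating `"b"` as unknown, the sequence `["a", "b", "c"]` is split into `("a", "b")` and `("c",)`.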

Minority Language Information Processing

Spelling Check Method of Uyghur Languages Based on Dictionary and Statistics

Maihefureti, Aishan Wumaier, Maierhaba Aili, Tuergen Yibulayin, ZHANG Jian

2014, 28(2): 66-71.

In this paper, we present a spelling check method for the Uyghur language based on a combination of dictionary and statistics. First, we describe a dictionary-based stemming method. Second, we propose an N-gram-based model to judge the suffixes attached to a stem, detecting misspelled unknown words at the same time. Finally, we present a spelling check method based on a hybrid strategy that combines different methods. The method achieves improvements in accuracy and reliability.

Information Extraction and Text Mining

Drug Name Entity Recognition Based on Feature Coupling Generalization

HE Linna, YANG Zhihao, LIN Hongfei, LI Yanpeng, TANG Lijuan

2014, 28(2): 72-77.

Drug name recognition aims to find drug names in biomedical texts, a technology in high demand given the overwhelming volume of drug research. We adopt a semi-supervised learning method to build a dictionary and then combine the dictionary with the Conditional Random Field (CRF) method to recognize drug name entities. First, we extract a drug name dictionary using template matching, and then use Feature Coupling Generalization (FCG) to filter the dictionary. Finally, we combine the dictionary with the CRF method to recognize the drug entities. Our method achieves an F-score of 0.7673 on the drug name recognition corpus.

Information Extraction and Text Mining

Fast New Words Extraction from Multi-lingual Web Texts

LIU Bingyang, LIU Qian, ZHANG Jin, LIU Xinran, CHENG Xueqi

2014, 28(2): 78-84.

Extracting new words from web texts is a key problem in information processing, with direct applications in information retrieval, public opinion analysis, dictionary compilation, Chinese word segmentation and other fields. A language-independent method is implemented to extract new words from web texts quickly: it encodes multi-lingual texts into a uniform binary stream, extracts repeated strings, and calculates adjacency variety and a string integrity measurement. Two suffix trees in a 4-bit based structure are used to calculate these statistics in linear time. The method outputs new words and their order for both Chinese and English web texts.
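
To make the adjacency variety statistic concrete, the sketch below counts, for every repeated substring, how many distinct characters appear immediately to its left and right. It is a naive character-level scan written for illustration only: it does not use the binary encoding or the 4-bit suffix trees described in the abstract, it does not run in linear time, and the names `adjacency_variety`, `max_len` and `min_count` are assumptions.

```python
from collections import defaultdict
from typing import Dict, Tuple


def adjacency_variety(text: str, max_len: int = 4,
                      min_count: int = 2) -> Dict[str, Tuple[int, int]]:
    """Return {substring: (left_variety, right_variety)} for repeated substrings.

    left_variety is the number of distinct characters seen directly before the
    substring, right_variety the number seen directly after it. A candidate
    that occurs in many different contexts is more likely to be a real word.
    """
    left = defaultdict(set)
    right = defaultdict(set)
    count = defaultdict(int)
    n = len(text)
    for i in range(n):
        for k in range(1, max_len + 1):
            if i + k > n:
                break
            s = text[i:i + k]
            count[s] += 1
            if i > 0:
                left[s].add(text[i - 1])   # neighbour on the left
            if i + k < n:
                right[s].add(text[i + k])  # neighbour on the right
    # Keep only strings that repeat at least min_count times.
    return {s: (len(left[s]), len(right[s]))
            for s, c in count.items() if c >= min_count}
```

In the string `"xabyzabwabq"`, the substring `"ab"` occurs three times with three different left neighbours and three different right neighbours, so its variety is `(3, 3)`, while a one-off string such as `"xa"` is filtered out.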

Information Extraction and Text Mining

Chinese Multi-Document Opinion Summarization via PageRank

LIN Liyuan, WANG Zhongqing, LI Shoushan, ZHOU Guodong

2014, 28(2): 85-90.

Opinion summarization aims to condense text data into a summary of the expressed opinions. This study focuses on multi-document opinion summarization, where the main task is to generate a summary for a given set of reviews of the same product. We first collect and annotate a Chinese multi-document corpus of product reviews. Then, we propose a novel PageRank framework for generating opinion summaries, which has the advantage of considering both the topic relation and the opinion relation among reviews. Empirical studies on the corpus demonstrate that the proposed method substantially outperforms existing approaches in terms of ROUGE.
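
As background for the PageRank framework mentioned above, here is a minimal sketch of standard weighted PageRank by power iteration over a graph whose nodes could be review sentences. The edge weights, the damping factor of 0.85, and the iteration count are illustrative assumptions; how the paper combines topic and opinion relations into edge weights is not stated in the abstract.

```python
from typing import List


def pagerank(weights: List[List[float]], damping: float = 0.85,
             iterations: int = 50) -> List[float]:
    """Weighted PageRank by power iteration.

    weights[i][j] is a non-negative (hypothetical) relation strength from
    node i to node j. Scores always sum to 1.
    """
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_sum[i] == 0:
                # Dangling node: spread its score evenly over all nodes.
                share = damping * scores[i] / n
                for j in range(n):
                    new[j] += share
            else:
                for j in range(n):
                    if weights[i][j]:
                        new[j] += damping * scores[i] * weights[i][j] / out_sum[i]
        scores = new
    return scores
```

A summarizer built on this would typically pick the highest-scoring nodes as summary sentences; in a three-node graph where node 0 is linked to both other nodes and they are not linked to each other, node 0 receives the highest score.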

Information Extraction and Text Mining

The Effect of TongYiCi CiLin in Chinese Entity Relation Extraction

LIU Dandan, PENG Cheng, QIAN Longhua, ZHOU Guodong

2014, 28(2): 91-99.

Semantic information plays an important role in extracting semantic relations between named entities. Taking “TongYiCi CiLin” as an example, this paper systematically investigates the effectiveness of lexical semantic information in tree kernel-based Chinese semantic relation extraction, particularly the influence of different levels of semantic information and of polysemy, as well as the redundancy between lexical semantic information and entity type information. Relation extraction experiments on the ACE2005 Chinese corpus show that semantic information can significantly improve extraction performance when entity types are unknown, while when entity types are known, semantic information can still noticeably enhance extraction performance for some relation types. This implies a certain degree of complementarity between “CiLin” semantic information and entity type information in Chinese semantic relation extraction.

Information Extraction and Text Mining

Event Relation Recognition by Event Term and Entity Inference

YANG Xuerong, HONG Yu, MA Bin, YAO Jianmin, ZHU Qiaoming

2014, 28(2): 100-108.

Event relation recognition is a natural language processing technology that detects event relations in streams of text. Its key is to detect latent logical relations between events (deciding whether or not a logical relation holds between them) by analyzing the discourse structure and semantic features of the events, using semantic relation recognition and inference techniques. This paper proposes an event relation recognition method based on inference over event terms and entities under the same topic. Compared with the method based on dependency cue inference, the proposed method achieves a 15.34% improvement.

Information Extraction and Text Mining

Recognizing Textual Entailment Based on the Multi-features

ZHAO Hongyan, LIU Peng, LI Ru, WANG Zhiqiang

2014, 28(2): 109-115.

Recognizing textual entailment is an effective way to handle the fact that natural language can state the same meaning in various ways. Although many textual entailment recognition models have been proposed, recognition accuracy remains unsatisfactory because of the complex factors involved in textual entailment. Treating textual entailment as a binary classification problem, this paper extracts lexical, syntactic dependency and FrameNet semantic features to train an SVM classifier for textual entailment recognition. Evaluated on the international RTE3 test set, the method achieves 78.1% precision on positive entailments, which is higher than the best RTE3 result.

Information Extraction and Text Mining

Feature Selection for Skewed Text Categorization

LIU Zhenyan, MENG Dan, WANG Weiping, WANG Yong

2014, 28(2): 116-121.

Existing feature selection methods are not appropriate for a skewed corpus, in which most samples belong to a majority class and far fewer belong to a minority class. The reason is that these methods select features without considering the relative distribution of each class. As a result, most of the features they select come from the majority class, which tends to cause minority class samples to be misclassified. This paper analyzes the characteristics of the skewed corpus and finds two important factors that influence feature selection on skewed data: category distribution and category difference. The category distribution factor indicates the difference in category frequency across the whole dataset, and the category difference factor indicates the relative difference in document frequency between classes. A new feature selection function called Relative Category Difference (RCD) is then constructed from the two factors. Experimental results show that the new feature selection method outperforms other methods on skewed text categorization.

Information Retrieval and Social Computing

CICF: A Context Information Based Collaborative Filtering Algorithm

2014, 28(2): 122-128.

Collaborative Filtering (CF) can satisfy users' preferences and provide personalized guidance. Although it is a key technique in current Internet recommendation engines, it suffers from severe sparsity of user ratings. Considering the rich context information in users' rating histories, this paper uses two kinds of context information to address the sparsity issue: the effect of hierarchical structure on users' potential preferences and the dynamic effect of users' short-term ratings. An integrated model, CICF, is then proposed based on these two features. Experimental results on Yahoo! Music ratings show that CICF significantly improves prediction performance compared to the baseline method. Furthermore, CICF is shown to effectively mitigate the rating sparsity issue.

Information Retrieval and Social Computing

LDA-CF: A Mixture Model for Collaborative Filtering

LIAN Tao, MA Jun, WANG Shuaiqiang, CUI Chaoran

2014, 28(2): 129-135.

Recommender systems are an important tool for overcoming information overload, and the most popular approach is collaborative filtering. This paper presents a mixture model for collaborative filtering named LDA-CF, which combines latent factor models and neighborhood methods. First, we convert the ratings matrix into a collection of pseudo-documents and use the LDA topic model to identify user and item latent factor vectors. Then we compute user-item similarities in the low-dimensional latent factor space. Finally, we employ neighborhood methods to predict unobserved ratings. Experiments on the MovieLens 100k dataset demonstrate that LDA-CF outperforms neighborhood methods on the rating prediction task in terms of MAE.
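
The last two steps of this pipeline (similarity in a latent space, then neighborhood prediction) and the MAE metric can be sketched as follows. This is only an illustration under stated assumptions: the latent vectors are taken as given rather than produced by LDA, the neighborhood is user-based with cosine similarity, and the function names and the `k` parameter are hypothetical rather than taken from the paper.

```python
import math
from typing import Dict, List, Optional


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two latent factor vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_rating(target: str, item: str,
                   factors: Dict[str, List[float]],
                   ratings: Dict[str, Dict[str, float]],
                   k: int = 2) -> Optional[float]:
    """Similarity-weighted average of the k nearest users' ratings for item."""
    neighbors = [(cosine(factors[target], factors[user]), user_ratings[item])
                 for user, user_ratings in ratings.items()
                 if user != target and item in user_ratings]
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    top = [(sim, r) for sim, r in neighbors[:k] if sim > 0]
    if not top:
        return None  # no similar user has rated this item
    return sum(sim * r for sim, r in top) / sum(sim for sim, _ in top)


def mae(predicted: List[float], actual: List[float]) -> float:
    """Mean absolute error between predicted and actual ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
```

Working in the low-dimensional latent space means two users can be judged similar even when they have rated no items in common, which is the usual motivation for combining latent factors with neighborhood methods.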

Information Retrieval and Social Computing

Microblog Retrieval via Author Based Microblog Expansion

LI Rui, WANG Bin

2014, 28(2): 136-143.

In recent years, microblogging has developed impressively, and microblog retrieval has become an important research topic. Microblog texts are short, quickly updated, and circulated over the social network, which makes microblog search different from traditional web search. In this paper, we first compare the traditional vector space model, the probabilistic model and the basic language model in microblog search. Then we propose to expand the microblog text with author information to improve retrieval. To address the problems that short documents cause in topic model training, we use the author's topic model to further extend the microblog content. Tested on a Twitter data set, the results show that the proposed author model improves retrieval effectiveness in the microblog search task.

Information Retrieval and Social Computing

Search Behavior Study Based on the Mobile Search Log

WAN Fei, ZHAO Xi, LIANG Xun, PAN Deng, NI Zhihao

2014, 28(2): 144-150.

With the rapid development of the mobile web, the number of mobile search engine users has been growing sharply. Analyzing users' behavior is of great significance for improving the mobile search engine and user satisfaction. This paper selects the log of a mobile search engine from the first week of June 2011 and analyzes mobile search user behavior. From the perspectives of query words, sessions and clicks, we examine the length and frequency of query words, the ratio of question queries and URL queries, the number of queries in a session, the modification of query words, and click distribution. Furthermore, we compare the results for the mobile search engine with those for internet search engines. These findings are of substantial significance for the improvement and optimization of the mobile search engine.

Information Retrieval and Social Computing

A Survey of Conversion-based Internet Advertising Model

GU Zhiyu, QIN Tao, WANG Bin

2014, 28(2): 151-158.

Conversion-based advertising, which evaluates the effectiveness of an advertisement and charges according to the conversions that occur after a user views the advertisement, leverages the unique power of Internet advertising and is becoming the trend for its future development. This paper introduces the scheme of conversion-based advertising, analyzes its industrial application, and summarizes research in this field, including auction mechanisms for CPA advertising, conversion rate estimation, conversion-based ad ranking, etc. Finally, we analyze the existing problems and present directions for further study.

2014 Volume 28 Issue 2 Published: 10 February 2014