Journal of Chinese Information Processing

Survey

A Survey of Coreference Resolution Research Methods

SONG Yang, WANG Houfeng

2015, 29(1): 1-12.

Coreference resolution is a challenging problem that has long attracted the attention of NLP researchers. Over the past twenty years, many advanced NLP techniques have been applied to it, and some have achieved significant improvements. In this paper, we first introduce the basic concepts and formalize the problem. We then summarize the research strategies adopted in recent decades, highlighting feature engineering, which lies at the core of coreference resolution. Finally, we describe recent evaluations for this task and discuss some key issues and future prospects.

Survey

Chinese Compound Sentences Processing: Past 20 Years

WU Fengwen

2015, 29(1): 13-18.

The study of Chinese compound sentences is essential to information processing. This paper summarizes past research on compound sentences, including compound sentence modeling, relation marker recognition, structure recognition, compound sentence parsing, and corpus construction. It also outlines the prospects and possible research trends for further studies.

Language Analysis and Language Resources Construction

A Corpus-Based Study on Personal Names and Terms of Address in Chinese Classical Novels

XIONG Dan, LU Qin, LUO Fengzhu, SHI Dingxu, ZHAO Tiancheng

2015, 29(1): 19-27.

Personal names and terms of address are important parts of named entities, and their recognition is an essential issue in natural language processing. This paper presents a classification and annotation scheme for personal names and terms of address, from the perspective of named entity recognition and information extraction, on a corpus of four Chinese classical novels. Personal names and terms of address are categorized into simple types and compound types. The compound type is further divided into four subtypes: fixed expressions, appositive constructions, subordinate constructions of affiliation, and other subordinate constructions. This paper also presents a comparative analysis of these types and of the characteristics of the four novels, based on full statistics of the annotated corpus.

Language Analysis and Language Resources Construction

Towards Data Structure Analysis of Half-Returned Feature of Understanding Garden Path Phenomenon

DU Jiali, YU Pingfang

2015, 29(1): 28-37.

This paper discusses the data structure of the garden path phenomenon (GPP). The data structure of GPP is a cognitive tree-like structure rather than any of the other structures, e.g., the word set structure of the pre-grammar condition, the linear grammatical structure of syntactic understanding, or the ambiguous map-like structure of semantically matched multiple cognition. The distinctive structural features of GPP are as follows: (1) in early understanding, the data structure of GPP shows a linear feature; (2) in medium-term understanding, the semantic trigger point breaks down the original model, and the data structure of GPP is a word set structure; (3) in late understanding, the processing breakdown results in backtracking, and GPP ends with a tree-like data structure; (4) the dynamic understanding of GPP is an integration of two structures other than the map-like one, and the activation of the semantic trigger point brings additional cognitive load. The difference between the tree-like data structure of GPP and the map-like data structure of ambiguity reflects the dissimilarity between these two syntactic phenomena from the perspective of data structure, which provides theoretical support for computational linguistics to interpret GPP.

Language Analysis and Language Resources Construction

An Analysis of Discourse Rhetorical Structure Influence on Focus Distribution

ZHAO Jianjun, YANG Xiaohong, YANG Yufang

2015, 29(1): 38-43.

Based on 30 narrative texts in Mandarin Chinese with sentence focus annotated by 20 subjects, a statistical analysis is conducted to examine the influence of discourse rhetorical structure on focus distribution. The results show that about 30% of the sentences in the narrative discourse have no focus. Nuclearity has a marked influence on focus distribution: about 80% of the nucleus sentences have focus, but only 60% of the satellite sentences do. Sentences at the highest level of the hierarchy have less focus. The narrative discourses consist of ten main rhetorical relations, among which the conjunction relation and the elaboration relation have the most sentences with focus and the attribution relation has the fewest.

Language Analysis and Language Resources Construction

Rule Based Identification of Compound Sentences Relation Words

JIA Suimin, LEI Lili, HU Mingsheng

2015, 29(1): 44-48.

Automatically identifying the relation words of compound sentences is a fundamental issue in Chinese information processing. This paper describes a rule-based method for the automatic identification of compound sentence relation words. To construct the rules, 12 features are summarized from the corpus. A matching algorithm is then described to obtain the candidate relation word sequence. Finally, the context of the relation words is matched against the rules. Experimental results show that this method achieves an accuracy of 70.9%.

Language Analysis and Language Resources Construction

An Acoustic Study of Nasalized Vowel in Nasal Coda Syllables

SUN Ruixin

2015, 29(1): 49-56.

The vowel in a nasal coda syllable becomes nasalized. The issue is how to measure the degree of nasalization. Following a detailed acoustic analysis of the speech sound, this paper puts forward a method based on the bandwidth of the formants and the duration of the nasalized part of the vowel. We find that the degree of nasalization of vowels in alveolar nasal syllables is lower than that of vowels in velar nasal syllables: 0.410 for the former and 0.718 for the latter. The highest degree is found in the high vowels, which are easily nasalized.

Language Analysis and Language Resources Construction

Construction of Information Extraction-orientated Chinese Cross Document Coreference Corpus

ZHAO Zhiwei, QIAN Longhua, ZHOU Guodong

2015, 29(1): 57-66.

Cross Document Coreference (CDC) resolution is an important step in information integration and information fusion. A CDC corpus is therefore indispensable for research on and evaluation of CDC resolution. Given that no information-extraction-oriented Chinese CDC corpus is publicly available, this paper describes how to build a CDC corpus covering all the ACE entity types, based on the ACE2005 Chinese corpus, via automatic generation and manual annotation. The corpus is made publicly available to advance research on Chinese CDC resolution. In addition, this paper analyses the types and characteristics of CDC in Chinese text and proposes two metrics, “variation perplexity” and “ambiguity perplexity”, to evaluate the difficulty of Chinese CDC resolution, providing some insights for further CDC research.

Machine Translation

A Bilingual Chunk Alignment Algorithm for Computer Aided Translation

YU Jingsong, WANG Huilin, WU Shenglan

2015, 29(1): 67-74.

Automatic bilingual chunk alignment is of significant value for machine translation, computer-aided translation, and other fields. In this paper, a chunk partition scoring method based on the Degree of Adhesion and the Degree of Relaxation is proposed so that the chunk partitions of the source language and the target language benefit each other. A novel bilingual chunk alignment algorithm is then proposed. Compared with previous work, this algorithm does not require bilingual chunk partitions; instead, the chunk partition score is calculated dynamically during the alignment search. For this approach, precision is far more important than recall.

Machine Translation

Construction of Chinese Sentence-Category Dependency Treebank and Its Application

WANG Huilan, ZHANG Keliang

2015, 29(1): 75-81.

Aiming at applications in machine translation, this paper studies the construction of a Chinese Sentence-Category Dependency Treebank (CSCDT) based on the theory of the Hierarchical Network of Concepts (HNC). The conceptual category tagset and the Sentence-Category relation tagset for the treebank are presented together with an example tree from CSCDT. Two advantages of CSCDT over other Chinese treebanks are discussed. In addition, translation templates from Sentence-Category dependency subtrees to strings are defined in order to construct a translation template library for Chinese-English machine translation.

Information Extraction and Text Mining

Term Extraction Based on Information Entropy and Word Frequency Distribution Variety

LI Lishuang, WANG Yiwen, HUANG Degen

2015, 29(1): 82-87.

A term extraction system based on information entropy and word frequency distribution variety is presented. Information entropy measures the completeness of terms, while word frequency distribution variety measures their domain relevance. With simple linguistic rules as an additional filter, the automatic term extraction system integrates information entropy into the word frequency distribution variety formula. A preliminary experiment on an automotive-domain corpus indicates that the precision is 73.7% when 1,300 terms are extracted. The results show that the proposed approach can effectively recognize lower-frequency terms and that the recognized terms are largely complete.
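
The abstract does not give the entropy formula. As a minimal sketch, assuming the common formulation in which the entropy of the words adjacent to a candidate indicates whether the candidate is a complete unit, the idea can be illustrated as follows. The function names and the English toy sentence are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Shannon entropy (in bits) of a list of neighboring tokens."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum of p * log2(1/p) over the observed neighbors.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def boundary_entropies(tokens, candidate):
    """Return (left, right) neighbor entropy of a candidate token sequence.

    Assumption: a low entropy on one side suggests the candidate is a
    fragment of a longer term that extends in that direction.
    """
    n = len(candidate)
    left, right = [], []
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == candidate:
            if i > 0:
                left.append(tokens[i - 1])
            if i + n < len(tokens):
                right.append(tokens[i + n])
    return neighbor_entropy(left), neighbor_entropy(right)
```

In a toy token list such as "the brake pad is new a brake pad was old", the candidate "brake" is always followed by "pad", so its right entropy is zero, whereas "brake pad" has varied neighbors on both sides. Under the stated assumption, this marks "brake pad" as the more complete candidate.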

Information Extraction and Text Mining

Extracting Part-Whole Relations Based on Coordinate Structure

XIA Fei, CAO Xinyu, FU Jianhui, WANG Shi, CAO Cungen

2015, 29(1): 88-96.

Automatic discovery of part-whole relations from the Web is a fundamental and critical problem in knowledge engineering. This paper proposes a graph-based method for extracting part-whole relations from the Web. First, we download snippets from Google using part-whole query patterns. We then build a graph by extracting word pairs in coordinate structures from these snippets, with the co-occurring words as nodes and the frequency counts as edge weights. A hierarchical clustering method is used to cluster the correct parts, and it is optimized by five methods of adjusting the edge weights: reducing the weight of comma-edges, cutting low-frequency edges, increasing the weight of edges in a loop, increasing the weight of edges whose two nodes share the same suffix, and increasing the weight of edges whose two nodes share the same prefix. Experimental results show that the five methods increase recall substantially.
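
The graph construction described above can be sketched roughly as follows. This is only an illustration under assumptions: the word pairs are taken as already extracted, the graph is treated as undirected, and the pruning threshold `min_count` is a hypothetical parameter. The clustering step and the other four weight adjustments are not shown.

```python
from collections import Counter

def build_graph(pairs):
    """Build an undirected weighted graph from coordinate word pairs.

    Nodes are words; the weight of an edge is the number of times the
    two words were seen together in a coordinate structure.
    """
    weights = Counter()
    for a, b in pairs:
        if a != b:  # ignore degenerate self-pairs
            weights[frozenset((a, b))] += 1
    return weights

def cut_low_frequency(weights, min_count):
    """Drop edges seen fewer than min_count times (hypothetical threshold)."""
    return Counter({edge: w for edge, w in weights.items() if w >= min_count})
```

For example, the pairs ("wheel", "engine") and ("engine", "wheel") contribute to the same edge, which then survives `cut_low_frequency(graph, 2)`, while an edge seen only once is removed.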

Information Extraction and Text Mining

Research on Extensible Web Key Information Extraction

GUO Shaohua, GUO Yan, LI Haiyan, LIU Yue, ZHANG Jin, CHENG Xueqi

2015, 29(1): 97-103.

This paper presents an extensible framework for web key information extraction. The framework combines automatic information extraction algorithms with template detection algorithms, substantially improving the precision and efficiency of extraction. Some key parts of the framework can be replaced as required, which gives it excellent extensibility. This paper also describes an orthogonal filter algorithm that improves the precision of template generation. The experiments provide positive results for this method.

Information Extraction and Text Mining

A Tri-training Based Semi-supervised Multi-label Learning for Text Categorization

GAO Jiawei, LIANG Jiye, LIU Yanglei, LI Ru

2015, 29(1): 104-110.

Multi-label learning was proposed to deal with the ambiguity problem in which a single sample is associated with multiple concept labels simultaneously, and semi-supervised multi-label learning has become a new research direction in recent years. To further exploit the information in unlabeled samples, a semi-supervised multi-label learning algorithm based on Tri-training (MKSMLT) is proposed. It adopts the ML-kNN algorithm to obtain more labeled samples, and then employs the Tri-training algorithm, using three classifiers to rank the unlabeled samples. Experimental results show that the proposed algorithm can effectively improve classification performance.

Information Extraction and Text Mining

Retrospective Topic Identification Model for Short Text Information Flow

ZHOU Hong, LIU Jinling, WANG Xingong

2015, 29(1): 111-117.

In recent years, short text information flows have appeared in some public media. For this kind of data, a retrospective topic identification model with an improved weight estimation is presented. It employs the BIC value in clustering to improve clustering accuracy. By dividing the data into time segments and removing isolated information points, the efficiency of the algorithm is further improved. The experimental results show that this method achieves good accuracy and efficiency in topic detection for short text information flows.

Information Extraction and Text Mining

Multi-granularity Topic Structure Modeling in Text Stream

CHEN Qian, GUO Xin, WANG Suge, ZHANG Hu

2015, 29(1): 118-125.

Topic detection is widely used in text mining and NLP, and topic structure modeling is its basis. In this paper, we propose a semantic hierarchical topic structure model to describe multi-granularity topic structure. The model utilizes the characteristics of a domain ontology, with each concept in the ontology mapped to a topic. The concepts in the concept list are represented as leaf nodes of the topic tree, and the nodes in each layer can be treated as a multinomial mixture distribution over the nodes of the layer below. This fine-grained structure adapts easily to the multi-granularity topic structure of real-world text streams. Experiments show that the model reflects the rich multi-granularity semantic features of topics.

Information Extraction and Text Mining

Typed N-gram for Online SVM Based Chinese Spam Filtering

SHEN Yuanfu, SHEN Yuewu

2015, 29(1): 126-132.

In this paper, we propose the Mix-grams method to improve the online SVM filter for spam filtering. Although the online SVM classifier performs well on online spam filtering, its computational cost is high compared with other methods such as logistic regression. We therefore propose a type-based n-gram extraction method to reduce the feature dimension of the online SVM filter. Experimental results demonstrate that the method improves filter performance and reduces the computational cost of the online SVM filter.

Information Extraction and Text Mining

Reversible Text Watermarking Algorithm Based on Prediction Error Expansion

FEI Wenbin, TANG Xianghong, WANG Jing, LIN Xinjian

2015, 29(1): 133-138.

To avoid the permanent change to text content caused by watermark embedding, this paper proposes a reversible watermarking algorithm for Chinese text documents based on prediction error expansion, inspired by reversible watermarking for images. Taking the sentence as the unit, the algorithm selects the words to be replaced according to their degree of context collocation, and then performs the embedding by prediction error expansion and a chaos sequence. Results show that the algorithm not only offers higher security but also extracts the watermark effectively while restoring the original text exactly.

Syntactic, Semantic Analysis and Social Computation

A Chinese Parsing Method Based on Interdependent and Structured Input and Output Spaces

ZHAO Guorong, WANG Wenjian

2015, 29(1): 139-145.

Chinese syntax has a complex structure and high-dimensional features, and the best known Chinese parsing performance is still inferior to that of Western languages. To improve the efficiency and accuracy of Chinese parsing, we propose an approach based on structural support vector machines (structural SVMs) with L2-norm soft margin optimization. By constructing the structural function ψ(x,y), the input information of the syntactic tree can be mapped well. Since Chinese syntax exhibits strong correlation, we use the parent nodes of phrase structure trees to enrich the structural information of ψ(x,y). Experimental results on the PCTB benchmark dataset demonstrate that the proposed approach is effective and efficient compared with classical structural SVMs and the Berkeley Parser.

Syntactic, Semantic Analysis and Social Computation

Event Information Enhanced Question Semantic Representation for Chinese Question Answering System

WEI Chuyuan, ZHAN Qiang, FAN Xiaozhong, MAO Yu, ZHANG Dakui

2015, 29(1): 146-154.

Understanding complex questions is a challenging issue in question answering systems. For complex questions containing event (action) information, this paper presents a question semantic representation (QSR) model based on semantic chunks. The semantic components of a complex question are labeled abstractly as the question focus, the question topic, and the question event. A Semantic Structure of Question Event is then created to represent the semantic information of the question event, including the question focus chunk, the question topic chunk, and the question event chunk. To map an interrogative sentence into this representation, a Conditional Random Fields model is adopted for automatic semantic labeling. The results show that the automatic semantic labeling gains better performance.

Syntactic, Semantic Analysis and Social Computation

Phrase-level Sentiment Analysis Approach Based on Yet Another CRFs

Odbal, WANG Zengfu

2015, 29(1): 155-162.

This paper treats phrase-level sentiment analysis as a sequence annotation problem and proposes an extension of conditional random fields, YACRFs, to annotate the sentiment orientation of phrases. In contrast to previous work focusing on linear-chain CRFs, which correspond to finite-state machines with efficient exact inference algorithms, we wish to label sequence data in multiple interacting ways, for example, performing word-based semantic orientation tagging and phrase-level sentiment analysis simultaneously and increasing joint accuracy by sharing information between them. The proposed model combines word-level emotional orientation analysis with phrase analysis by incorporating polarity word features, phrase rule templates, and part-of-speech characteristics. Experiments show that the proposed model performs best, with an accuracy of 81.07%. When the results are applied to sentence-level sentiment analysis, it again achieves the best accuracy, 94.30%.

Syntactic, Semantic Analysis and Social Computation

Opinion Target and Polarity Extraction Based on Iterative Two-Stage CRF Model

ZHANG Sheng, LI Fang

2015, 29(1): 163-169.

As a new medium, microblogging plays an indispensable role in people’s lives. To extract sentiment information from microblogs, this paper introduces a two-stage CRF model and an iterative two-stage CRF model. The two-stage CRF model reaches an F-score of 0.505 on the COAE2014 evaluation data, and the iterative two-stage CRF model reaches an F-score of 0.513 through improved recall.

Other Language in/around China

Mining Tibetan Web Text Resources and Its Application

LIU Huidan, NUO Minghua, MA Longlong, WU Jian, HE Yeping

2015, 29(1): 170-177.

Based on link analysis and Tibetan encoding detection, this paper focuses on mining Tibetan text resources on the internet with a crawler and analyzes the distribution of Tibetan text. Statistics show that more than 50% of inland Tibetan web sites are held by organizations in Qinghai province, and about 87% of web pages belong to 31 large web sites. People prefer Unicode to legacy encodings for their new web pages. It is practical to extract Tibetan text from the pages using natural tag information such as HTML elements, column information, and punctuation. The text can be used to build a raw corpus, a text classification corpus, an internet word/phrase corpus, and so on. Word frequency statistics and language models can also be derived. In addition, some bilingual corpora can be extracted.

Other Language in/around China

Research on Mongolian Spoken Term Detection Based on Phoneme Confusion Network

BAO Feilong, GAO Guanglai, BAO Yulai

2015, 29(1): 178-182.

To deal with out-of-vocabulary detection in a Mongolian spoken term detection system, this paper proposes a Mongolian spoken term detection method based on a phoneme confusion network. The confidence measure is improved by incorporating a phoneme confusion matrix. Experimental results show that our method performs satisfactorily on Mongolian out-of-vocabulary detection, with a 6% improvement in precision and a 2.69% improvement in recall.

Other Language in/around China

Study on Tibetan Web Community Search

CHEN Xinyi, XIA Jianhua, DU Yuxiang, WAN Fucheng, YU Hongzhi

2015, 29(1): 183-190.

This paper analyzes the degree distribution of the Tibetan Web community and reveals the defects of the maximum degree-first search algorithm. It proposes a more efficient bisection degree search algorithm, as well as a hybrid strategy combining maximum degree and bisection degree search. Following the community division principle, this paper designs and implements a search algorithm for the Tibetan web community. The results show that the proposed methods are better than other search algorithms in terms of average search steps and average query informativeness.

Other Language in/around China

Study on the Sorting Algorithm of Tibetan Dictionary

Bianba Wangdui, Drolkar, DONG Zhicheng, WU Qiang, WANG Longye

2015, 29(1): 191-196.

In this paper, a sorting algorithm for contemporary Tibetan syllables is presented, which uses the Cartesian product on the basis of a defined priority of Tibetan components. This method conforms to Tibetan morphology and syntax. Finally, all grammar rules related to the Tibetan syllable ‘ ’ are tested, which shows that the algorithm meets the demands of the contemporary Tibetan dictionary.

Other Language in/around China

Research on Slavic Mongolian Word Segmentation Based on Dictionary and Rule

SHI Jianguo, HOU Hongxu, BAO Feilong

2015, 29(1): 197-202.

Slavic Mongolian, also known as Cyrillic Mongolian or new Mongolian, is the daily language in Mongolia. This paper explores Slavic Mongolian word segmentation by combining a dictionary with rules. We first use the dictionary to preprocess words that are high-frequency or inconsistent with the rules, and then process the remaining words with rules to generate n-best candidates for the final decision. Combining the two methods takes advantage of both and achieves excellent performance in Slavic Mongolian word segmentation.

2015 Volume 29 Issue 1 Published: 10 January 2015