2014 Volume 28 Issue 5 Published: 10 May 2014
  

  • Language Analysis and Cognitive Computation
    YUAN Yulin, LI Qiang
    2014, 28(5): 1-12.
    The “tennis problem” concerns linking words such as racquet, ball and net, which stand in a situational association, and finding the semantic and inferential relations among them. It is a worldwide challenge for the construction of language knowledge bases and resources in natural language processing. Aiming at this problem, this paper reviews several mainstream lexical and conceptual knowledge bases (including WordNet, VerbNet, FrameNet and ConceptNet), illustrates their limitations on this problem, and explains why they cannot solve it. The paper further proposes adopting the descriptive system of knowledge based on Generative Lexicon theory, i.e., the qualia structure of nouns, and using qualia structures and the relevant syntactic combinations to build a noun-based, entity-oriented lexical network. Such a conceptual network may make up for the inadequacy of the above-mentioned knowledge bases and provide a lexical-conceptual knowledge base for natural language processing.
  • Language Analysis and Cognitive Computation
    YANG Xiaofang, JIANG Minghu
    2014, 28(5): 13-23.
    This paper constructs the framework of a voluntary speech neural prosthesis based on phoneme imagery EEG signals, to make brain-computer interface (BCI) speech production more natural and fluent. EEG signals are recorded from three healthy subjects while they imagine both the vocalization and the place of articulation of four vowels and four consonants in Mandarin Chinese, with a no-imagination state as control. To process the EEG data, this study performs spectral, temporal and spatial analyses to extract the optimal phoneme imagery features for pairwise SVM classification between every two tasks. The results reveal that the phoneme imagery effect is demonstrated in the frequency range of 2~10 Hz, the time interval of 300~500 ms after stimulus onset, and spatial patterns with strong activity mainly covering the sensorimotor cortical region. The study also finds a high correlation between the pairwise classification accuracies and the Jaccard distances between the experimental stimuli, computed from binary descriptions of articulation control. The experiment confirms the hypothesis that phoneme imagery can be characterized as a complex motor imagery task and that, with the highest classification accuracy between speech imagery and non-imagery tasks reaching 83% averaged across subjects, scalp-level speech motor imagery signals possess an untapped potential to control a neural utterance synthesizer for communication BCIs.
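    The Jaccard distance mentioned in this abstract can be sketched in a few lines. The feature inventory below is hypothetical (the paper's actual binary articulation descriptions are not given here); only the distance measure itself is standard:

    ```python
    def jaccard_distance(a, b):
        """Jaccard distance between two binary feature sets:
        1 - |intersection| / |union|."""
        union = a | b
        if not union:
            return 0.0
        return 1.0 - len(a & b) / len(union)

    # Hypothetical binary articulation descriptions of stimuli
    # (a feature is "on" if it is in the set); illustrative only.
    stimuli = {
        "a": {"vowel", "open", "unrounded"},
        "u": {"vowel", "close", "rounded", "back"},
        "b": {"consonant", "bilabial", "stop"},
        "s": {"consonant", "alveolar", "fricative"},
    }
    d = jaccard_distance(stimuli["a"], stimuli["u"])
    ```

    Under the paper's hypothesis, stimulus pairs with larger distances of this kind correlate with higher pairwise classification accuracy.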
  • Language Analysis and Cognitive Computation
    ZHAO Yiyi, LIU Haitao
    2014, 28(5): 24-31.
    Network structure has been widely applied in language studies with the coming of the big data era. Since language is a multi-level system of symbols, different language units exhibit networks of different structure and function. This paper surveys the construction methods for the word co-occurrence network (based on the adjacency of words), the syntactic network (based on dependency grammar) and the semantic network (based on conceptual relations) for the same text. It reveals that the syntactic network's diameter and average path length are much smaller than those of the co-occurrence network, and that content words occupy the central node positions in the semantic network. This suggests that linguistic theory should be applied in network analysis, as it helps explain the differences among the various language networks.
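    The two metrics compared in this abstract, diameter and average path length, can be computed by breadth-first search. This is a minimal sketch on a toy co-occurrence network (the five-word graph is invented for illustration, not from the paper):

    ```python
    from collections import deque

    def shortest_paths(graph, source):
        """BFS distances from source in an undirected, unweighted graph."""
        dist = {source: 0}
        q = deque([source])
        while q:
            u = q.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return dist

    def diameter_and_avg_path(graph):
        """Diameter and average shortest-path length over connected pairs."""
        lengths = []
        for s in graph:
            d = shortest_paths(graph, s)
            lengths += [d[t] for t in d if t != s]
        return max(lengths), sum(lengths) / len(lengths)

    # Toy word co-occurrence network: adjacent words in text are linked.
    net = {
        "the": {"cat", "mat"},
        "cat": {"the", "sat"},
        "sat": {"cat", "on"},
        "on":  {"sat", "mat"},
        "mat": {"on", "the"},
    }
    diam, apl = diameter_and_avg_path(net)
    ```

    Running the same computation over co-occurrence, syntactic and semantic networks built from one text is what allows the comparison reported above.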
  • Morphological, Syntactic and Semantic Analysis/Application
  • Morphological, Syntactic and Semantic Analysis/Application
    QIAN Yili , FENG Zhiru
    2014, 28(5): 32-38.
    This paper proposes a Chinese prosodic phrase prediction method based on a CRF model over Chinese chunks, which reflect shallow syntactic information. The chunk definition and its tagging algorithm are first described, and then the CRF is applied over the chunk-annotated corpus to predict prosodic phrase boundaries. The experimental results show that, after labeling the chunk structure, the F-score of the CRF model for prosodic phrase identification increases by nearly 10%.
  • Morphological, Syntactic and Semantic Analysis/Application
    SUN Jing, FANG Yan, DING Bin, ZHOU Guodong
    2014, 28(5): 39-45.
    This paper proposes a different approach to lexical analysis that analyzes the internal structures of words, presenting a word structure analysis method based on an extended word tag set. We first describe the characteristics of the internal structures of words. By treating the prefixes and suffixes within word structures as special words, we identify the internal structures of words through the detection of prefixes and suffixes. This converts the identification of word-internal structures into a sequence tagging problem, for which we adopt a CRF model over the extended word tag set. Experiments show that the method achieves higher accuracy both in overall performance and in the identification of each layer of structure.
  • Morphological, Syntactic and Semantic Analysis/Application
    LIU Dongming, YANG Erhong
    2014, 28(5): 46-50.
    The word, as the smallest semantic unit, has a complex relationship with text domains; in particular, it is often difficult to assign an exact domain to commonly used words. In fact, establishing a clear-cut relationship between a word and a domain is not always necessary for real applications: satisfactory results can be achieved by quantifying the domain property of words. In this paper, we propose an unsupervised method for quantifying the domain property of words based on word association information in a large-scale corpus. We validate the proposed domain property value by comparing it against the classical TF*IDF measure in a topic detection application.
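    The TF*IDF baseline that the proposed measure is compared against is standard and can be sketched directly. The toy documents below are invented for illustration:

    ```python
    import math

    def tf_idf(term, doc, corpus):
        """Classical TF*IDF: relative term frequency in a document times
        the log inverse document frequency over the corpus."""
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in corpus if term in d)
        idf = math.log(len(corpus) / df) if df else 0.0
        return tf * idf

    # Toy tokenized documents (illustrative only).
    docs = [
        ["engine", "thrust", "engine"],
        ["engine", "fuel"],
        ["stock", "market", "fuel"],
    ]
    score = tf_idf("thrust", docs[0], docs)
    ```

    A term concentrated in one document scores high, while a term spread across the corpus scores low, which is exactly the behavior a word-level domain property measure must improve upon for topic detection.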
  • Morphological, Syntactic and Semantic Analysis/Application
    YU Dong, XUN Endong
    2014, 28(5): 51-59.
    This paper introduces a knowledge-based unsupervised method for acronym disambiguation, using word embeddings for the semantic representation of acronyms. In the first stage, significantly similar documents are clustered and used as training data. Each cluster corresponds to an interpretation of an acronym, so it can be treated as a semantic tag. The word embeddings are then trained several times, and the semantic relation between two words is calculated as the average cosine similarity of their vectors. In the second stage, the paper proposes feature word expansion and linearly weighted semantic similarity to improve system performance. By calculating semantic similarities between documents and interpretations, implicit semantics can be mined as new feature words, and the feature words are linearly weighted by their semantic similarity to the specific interpretation. Experimental results on 25 acronyms show that feature word expansion improves the system F-score by 4% and semantic weighting gains a further 2%, yielding a final system F-score of 89.40%.
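    The core quantity in this abstract, cosine similarity averaged over several embedding training runs, can be sketched as follows. The tiny 3-dimensional vectors are invented placeholders, not the paper's embeddings:

    ```python
    import math

    def cosine(u, v):
        """Cosine similarity between two dense vectors."""
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def avg_similarity(w1, w2, runs):
        """Average cosine similarity over embeddings from several training
        runs, as the abstract describes for stabilizing word relations."""
        return sum(cosine(r[w1], r[w2]) for r in runs) / len(runs)

    # Two hypothetical training runs with toy 3-d embeddings.
    runs = [
        {"bank": [0.9, 0.1, 0.0], "finance": [0.8, 0.2, 0.1]},
        {"bank": [0.7, 0.3, 0.1], "finance": [0.9, 0.1, 0.0]},
    ]
    sim = avg_similarity("bank", "finance", runs)
    ```

    Averaging over runs damps the variance that embedding training introduces through random initialization, so the resulting similarity is a steadier signal for matching a document to an interpretation.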
  • Morphological, Syntactic and Semantic Analysis/Application
    WANG Meng,YU Shiwen
    2014, 28(5): 60-65.
    Concept acquisition from corpora has become increasingly important in NLP. This paper presents a new concept representation based on classifier words: concepts are modeled as vectors with one component per classifier word, and a weighting scheme assigns each classifier word a weight within a concept. Experiments on identifying concept similarities via clustering show that classifier words can categorize most concept classes.
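    A concept-as-vector-over-classifier-words representation might look like the sketch below. Both the log-count weighting scheme and the pinyin classifier inventory are assumptions for illustration; the paper's actual weighting is not reproduced here:

    ```python
    import math
    from collections import Counter

    def concept_vector(concept_contexts, classifiers):
        """Represent a concept as a vector over classifier words, weighting
        each classifier by a log-scaled co-occurrence count (hypothetical
        weighting scheme)."""
        counts = Counter(w for w in concept_contexts if w in classifiers)
        return [math.log(1 + counts[c]) for c in classifiers]

    # Hypothetical classifier inventory (pinyin for 个/条/张/本) and the
    # classifier tokens observed near one concept in a corpus.
    classifiers = ["ge", "tiao", "zhang", "ben"]
    contexts = ["ge", "ben", "ben", "other", "ben"]
    vec = concept_vector(contexts, classifiers)
    ```

    Concepts with similar classifier profiles end up close in this space, which is what makes clustering over such vectors recover concept classes.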
  • Morphological, Syntactic and Semantic Analysis/Application
    JIA Yuxiang, WANG Haoshi, ZAN Hongying, YU Shiwen, WANG Zhimin
    2014, 28(5): 66-73.
    Selectional preference describes the semantic preference of a predicate for its arguments. It is important lexical knowledge that can be applied to the syntactic and semantic analysis of natural language. This paper studies the automatic acquisition of Chinese selectional preferences and proposes a HowNet-based method and an LDA (Latent Dirichlet Allocation) based method. A comparative study shows that the former acquires more interpretable knowledge while the latter achieves better performance in application. The two methods are complementary and may be combined in practice.
  • Language Resources Construction
  • Language Resources Construction
    YANG Jiang, LI Wei, PENG Shiyu
    2014, 28(5): 74-82.
    This paper introduces the construction of a Chinese Semantic Orientation Corpus (CSOC) by presenting its research background, design plan, annotation system and processing steps. The CSOC is an unbalanced synchronic monolingual corpus for research on linguistic subjective expressions. Shipped with a concordancer and a retrieval and visualization toolkit, the one-million-Chinese-character corpus is specially designed according to a multi-dimensional descriptive system of linguistic subjectivity. It is characterized by its high quality, linguistic motivation and usability for both linguistics and natural language processing.
  • Language Resources Construction
    YAO Yuanlin, WANG Shuwei, XU Ruifeng, LIU Bin, GUI Lin, LU Qin, WANG Xiaolong
    2014, 28(5): 83-91.
    Research on text emotion analysis has made substantial progress in recent years. However, emotion-annotated corpora remain underdeveloped, especially for micro-blog text. To support the analysis of emotion expression in Chinese micro-blog text and the evaluation of emotion classification algorithms, an emotion-annotated corpus of Chinese micro-blog text is designed and constructed. Based on observation and analysis of emotion expression in micro-blog text, a set of emotion annotation specifications is developed. Following these specifications, annotation is first performed at the micro-blog level, recording whether the micro-blog text expresses emotion and, if so, the corresponding emotion categories. Next, sentence-level annotation is conducted, recording whether each sentence expresses emotion, its emotion categories, and the strength of each category. Currently, this corpus consists of 14,000 micro-blogs, totaling 45,431 sentences. It was used as the standard resource in the NLP&CC2013 Chinese micro-blog emotion analysis evaluation, greatly facilitating research on emotion analysis.
  • Language Resources Construction
    ZHANG Guiping, DIAO Lina, WANG Peiyan
    2014, 28(5): 92-101.
    Semantic knowledge base construction is essential to natural language processing, but building such knowledge bases for specific domains is difficult. Based on an analysis of the basic characteristics of aviation terminology, this paper builds an aviation terminology semantic knowledge base according to HowNet and KDML, summarizing the foundational rules and the event role/feature selection rules in the process. Finally, we demonstrate the validity of the constructed knowledge base by its good results in term similarity calculation.
  • Discourse Analysis
  • Discourse Analysis
    ZHOU Qiang, ZHOU Xiaocong
    2014, 28(5): 102-111.
    There are few explicit discourse connectives in Chinese texts, which brings a new challenge to the traditional connective-grounded coherence annotation scheme. This paper proposes a new idea to deal with the problem: we introduce the topic chain (TC) as the main coherence representation and design several topic-comment relations to describe the complex event relations among TC-linked sentences. A new coherence annotation scheme based on TCs and connectives is built accordingly. Tentative confirmatory experiments on the Tsinghua Chinese Treebank (TCT) data set show that more than 76% and 50% of Chinese complex sentences have TCs and connectives, respectively, and that the two can co-occur in most Chinese sentences. These phenomena verify the feasibility and availability of the scheme.
  • Discourse Analysis
    LU Dawei, SONG Rou, SHANG Ying
    2014, 28(5): 112-124.
    Despite substantial research on language understanding from a cognitive perspective, Chinese remains relatively unexplored. Since cognitive experiments are too complicated to replicate on a large scale, it is difficult to quantify their generalizability and their coverage of language facts. We construct a Generalized Topic Structure Cognitron (GTSC) by simulating the human cognitive process of complementing the topic-comment information of Chinese punctuation clauses (P-clauses). Through quantitative analysis of large-scale Chinese texts by means of the GTSC, we study the memory resources required and the cognitive limitations in P-clause understanding. The features adopted in generalized topic structure analysis include the depth of the P-clause, the returning degree within the topic structure, the depth of the topic stack, the returning degree of the topic stack and the number of lay-down areas. The statistics of generalized topic structures produced by the GTSC can be explained reasonably from a cognitive perspective. This paper reveals the cognitive ability and limitations of Chinese speakers in topic-comment information processing, and shows that the GTSC is a reasonable model of the cognitive processing of topics in Chinese.
  • Machine Translation
  • Machine Translation
    LI Qiang, HE Yanlong, LUAN Shuang, XIAO Tong, ZHU Jingbo
    2014, 28(5): 125-132.
    This paper addresses the word deletion issue in phrase-based machine translation. After attributing word deletion errors to three causes from the perspective of human reading, we propose to introduce constraints on unaligned source-language words during phrase extraction. Two methods are presented for the design of the constraints: a frequency-based method and a part-of-speech-based method. Automatic and human evaluations demonstrate promising improvements in translation quality on both the Chinese-to-English and the English-to-Chinese translation tasks, on the basis of more compact phrase tables.
  • Machine Translation
    WU Peihao, XU Jinan, XIE Jun, ZHANG Yujie
    2014, 28(5): 133-140.
    This paper proposes a method to integrate case frame into Japanese to Chinese chunk-based dependency-to-string model. Firstly, case frames are acquired from Japanese chunk-based dependency analysis results. Secondly, case frames are used to constrain the rule extraction and the decoding in chunk-based dependency-to-string model. Experimental results show that the proposed method performs well on long structural reordering and lexical translation, and achieves better performance than hierarchical phrase-based model and word-based dependency-to-string model on Japanese to Chinese test sets.
  • Machine Translation
    GAO Enting, DUAN Xiangyu, CHAO Jiayuan, ZHANG Min
    2014, 28(5): 141-147.
    SMT usually learns translation models with heuristics, which leads to large models and potentially less accurate model parameters due to the poor theoretical justification of the heuristics. This paper presents a variational Bayesian inference-based training method to address these two issues, aiming to learn a compact translation model with more accurate translation probabilities. This is achieved by estimating translation model parameters using variational Bayesian EM over alignments obtained by forced decoding. Experimental results on the Chinese-English NIST translation data show that the proposed method is very effective, pruning out more than 95% (76%) of rules with a significant improvement in BLEU score for syntax-based (phrase-based) SMT.
  • Sentiment Analysis and Social Computation
  • Sentiment Analysis and Social Computation
    PAN Yanxi, YAO Tianfang
    2014, 28(5): 148-154.
    This paper investigates how to automatically recognize customer opinions towards certain automobiles in microblogs. Since microblogs contain many advertisements and car release announcements, customer-generated opinion sentences are sparse; this paper therefore proposes an SVM classifier-based method that combines microblog data and car review data for training. The selected features include words, the number of opinion words, words that stand in certain relations to opinion targets, and microblog-specific features such as emoticons and user type. Experimental results indicate that the opinion word feature and some of the microblog-specific features boost the performance of the classifier. In addition, the classifier trained on both kinds of data outperforms the one trained only on microblog data.
  • Sentiment Analysis and Social Computation
    LIANG Jun, CHAI Yumei, YUAN Huibin, ZAN Hongying, LIU Ming
    2014, 28(5): 155-161.
    Chinese micro-blog sentiment analysis aims to discover user attitudes towards hot events. Most current studies analyze micro-blog sentiment with traditional algorithms such as SVM and CRF based on hand-engineered features. This paper explores the feasibility of Chinese micro-blog sentiment analysis by deep learning. We avoid task-specific features and use recursive neural networks to discover features relevant to the task. We propose a novel model, the sentiment polarity transition model, based on the relationship between neighboring words of a sentence, to strengthen text association. The proposed method achieves performance close to state-of-the-art methods based on hand-engineered features, while saving substantial manual annotation work.
  • Other Languages in/around China
  • Other Languages in/around China
    Wu lan,Dabhurbayar,GUAN Xiaoda,ZHOU Qiang
    2014, 28(5): 162-169.
    Syntactic analysis occupies a critical position in natural language processing. This paper summarizes the difficulties in Mongolian phrase structure analysis, then develops a phrase annotation scheme according to the characteristics of Mongolian and constructs a Mongolian phrase-structure treebank. The Mongolian parser trained on this corpus achieves an accuracy of 62%.
  • Other Languages in/around China
    WANG Tianhang, SHI Shumin, LONG Congjun, HUANG Heyan, LI Lin
    2014, 28(5): 170-175.
    Tibetan chunking aims to identify syntactic constituents in Tibetan sentences to facilitate further sentence analysis. According to the unique characteristics of Tibetan, the paper puts forward an error-driven learning strategy to identify chunk boundaries based on the description system of Tibetan syntactic functional chunks. The specific idea is as follows: we first recognize chunk boundaries with a Conditional Random Fields (CRFs) model; the recognition result is then refined through the Transformation-Based Error-driven Learning (TBL) method and a CRF error-driven method, which increase the F value by 1.65% and 8.36%, respectively. Finally, we combine these two error-driven techniques. In experiments on a Tibetan corpus of 18,073 words, the precision, recall and F value reach 94.1%, 94.76% and 94.43%, respectively.
  • Other Languages in/around China
    LONG Congjun, KANG Caijun, LI Lin, JIANG Di
    2014, 28(5): 176-181.
    Semantic role labeling is of great significance for natural language processing, and substantial achievements have been made for both English and Chinese. However, both the resource construction and the technology for Tibetan semantic role labeling are still at an initial stage. Tibetan has rich syntactic markers which naturally segment a sentence into different semantic chunks, and there are certain relationships between these chunks and semantic roles. Accordingly, this paper proposes a semantic role labeling strategy for Tibetan based on semantic chunking, combining rule-based and statistical means. To realize the semantic role labeling, a classification system of Tibetan semantic roles is designed, and the acquisition of rules is discussed, including a manually built initial rule set and rule sets expanded through Transformation-Based Error-driven Learning (TBL). Meanwhile, a Conditional Random Fields (CRFs) model is adopted for statistical decisions. Experimental results show that the proposed method achieves 82.78% precision, 85.71% recall and 83.91% F-measure.
  • Other Languages in/around China
    ZHANG Liwen, Nurmemet Yolwas, Wushour Silamu
    2014, 28(5): 182-186.
    In the age of big data, it is of great importance to locate key sensitive information in ever-increasing volumes of audio and video. Although speech retrieval technology has been well studied for Chinese and English, Uyghur speech retrieval is still in its infancy. This paper investigates the issue and establishes a Uyghur speech retrieval system using large-vocabulary continuous speech recognition, confusion networks derived from lattices, inverted indexing and relevance estimation. Experimental results show that, at a speech recognition accuracy of 82.1%, the system recall reaches 97.0% and 79.1%, with false alarm rates of 13.5% and 8.5%, respectively.
  • Other Languages in/around China
    Patigul Imam, Maihemuti Maimaiti, Turgun Ibrayim, Kaharjan Abdurixit
    2014, 28(5): 187-191.
    Uyghur POS tagging is essential for subsequent tasks such as Uyghur sentence analysis, semantic analysis and discourse analysis. In this paper, the perceptron training algorithm and the Viterbi algorithm are used for Uyghur POS tagging, and the context information of the words is employed. Experimental results show that this method performs well on Uyghur POS tagging.
  • Other Languages in/around China
    Azragul, Nurahmat, Yusup Abaydula
    2014, 28(5): 192-197.
    This paper studies key technologies for the construction of a modern Uyghur corpus, in particular the collection and pre-processing of modern Uyghur texts, statistical techniques for the corpus, Uyghur stemming and the analysis of the corpus data. To develop a candidate list of modern Uyghur common words, the paper examines words along two dimensions, frequency and distribution, specifically including word types, frequency, frequency rate, document coverage and word length.
  • Other Languages in/around China
    YAN Danhui, BI Yude
    2014, 28(5): 198-205.
    Named Entity Recognition (NER) is an important task in information extraction, mainly covering the recognition of person names, location names and organization names. Studies on English and Chinese NER began relatively early, mainly using rule-based or statistical methods, while there are far fewer studies on Vietnamese NER and no domestic ones. This paper presents a rule-based method to recognize Vietnamese named entities on the basis of their linguistic formation. Experimental results validate the effectiveness of the method.
  • Information Extraction and Text Mining
  • Information Extraction and Text Mining
    YANG Xuerong, HONG Yu, CHEN Yadong, WANG Xiaobin, YAO Jianmin, ZHU Qiaoming
    2014, 28(5): 206-214.
    Event relation detection aims to detect logical relations between pairwise events. The key is to identify latent logical relations between events by analyzing their discourse structure and semantic features, using techniques of semantic relation recognition and inference. In this paper, we build an overall framework for the event relation detection task, including the task definition, the relation hierarchy, corpus acquisition and evaluation. Meanwhile, we propose a cross-scenario inference method to predict the relation between pairwise events, following the basic hypothesis that events expressing the same scenario normally trigger similar relations. Finally, we experiment on four general semantic relations, Expansion, Comparison, Contingency and Temporal, achieving an accuracy of 54.21%.
  • Information Extraction and Text Mining
    LI Yancui, GU Jingjing, ZHOU Guodong
    2014, 28(5): 215-222.
    Punctuation analysis plays an important role in sentence and discourse analysis, in which the functional classification of the comma is the key and most challenging issue. This paper explores automatic Chinese comma classification by adding the classification labels of the Chinese colon and semicolon as new features. First, we describe the classification method for the comma, colon and semicolon. Then the corpora of commas, colons and semicolons are introduced. Finally, we investigate comma classification by adding Chinese colon and semicolon labels, separately and jointly, as new features. Experimental results show that the accuracy of comma classification improves in all three cases.