2016 Volume 30 Issue 6 Published: 15 December 2016
  

  • Review
    SUN Maosong; CHEN Xinxiong
    2016, 30(6): 1-6.
    This paper addresses the necessity and effectiveness of encoding a human-annotated knowledge base into a neural network language model, using HowNet as a case study. Traditional word embeddings are derived from neural network language models trained on large-scale unlabeled text corpora, which suffer from two problems: the quality of the resulting vectors for low-frequency words is unsatisfactory, and sense vectors for polysemous words are unavailable. We propose neural network language models that systematically learn embeddings for all the semantic primitives defined in HowNet and, consequently, obtain word vectors, in particular for low-frequency words, as well as word sense vectors in terms of the semantic primitive vectors. Preliminary experimental results show that our models improve performance on both word similarity and word sense disambiguation tasks. It is suggested that research on neural network language models incorporating human-annotated knowledge bases will be a critical issue deserving attention in the coming years.
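    The core idea above — giving a rare or unseen word a vector composed from the vectors of its HowNet semantic primitives (sememes) — can be illustrated with a minimal sketch. The sememe inventory, the toy annotations, and the composition-by-averaging step below are illustrative assumptions, not the authors' actual model.

```python
import numpy as np

# Hypothetical sememe (semantic primitive) vectors; in the paper these would be
# learned jointly with a neural language model, here they are random placeholders.
rng = np.random.default_rng(0)
sememe_vecs = {s: rng.normal(size=50) for s in ["human", "occupation", "teach", "institute"]}

# Hypothetical HowNet-style annotation: word -> list of sememe sets, one per sense.
word_senses = {
    "教师": [["human", "occupation", "teach"]],   # "teacher"
    "学院": [["institute", "teach"]],             # "college"
}

def sense_vector(sememes):
    """Compose a sense vector by averaging its sememe vectors."""
    return np.mean([sememe_vecs[s] for s in sememes], axis=0)

def word_vector(word):
    """A word vector is the average of its sense vectors, so even a low-frequency
    word gets a usable vector as long as it is annotated in HowNet."""
    return np.mean([sense_vector(s) for s in word_senses[word]], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(word_vector("教师"), word_vector("学院")))
```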
  • Review
    KANG Shiyong; ZHANG Chen
    2016, 30(6): 7-14.
    Drawing on linking theory and event structure theory, this paper studies the correlation among lexical semantic categories, semantic roles and syntactic elements. We annotate the texts of the Chinese textbooks for primary and middle schools published by the People's Education Press to build an annotated corpus. Based on this corpus, we analyze the correlation between lexical semantic categories and semantic roles, and summarize the correlation characteristics of each lexical semantic category. We hope this study will benefit automatic syntactic parsing and semantic analysis.
  • Review
    PARK Minjun; YUAN Yulin
    2016, 30(6): 15-25.
    This study describes a propositional representation model for the Chinese BI (比) structure. The model is based on seven types of Comparative Elements (CEs), enhancing the resolution of the existing five-CE framework for analyzing comparatives. The model is fully visualized as the relational structure of two propositional descriptions, based on which we reveal three basic patterns of comparison and explicitly define the standard of asymmetrical comparison. Consequently, it provides an intuitive and easy way to analyze the complex, multi-layer predications embedded in the BI structure, which are the most elusive and tricky part of comparative relation extraction. Moreover, the model is compatible with the OWL ontology language owing to its basis in propositional logic. Accordingly, a small-scale ontology is built to demonstrate automatic relation extraction for BI comparatives.
  • Review
    TIAN Yuanhe; LIU Yang
    2016, 30(6): 26-34.
    In previous research on sense prediction for Chinese unknown words, lexical knowledge related to word formation has been used but not treated as a valuable form of knowledge representation in its own right. On the basis of morphemic concepts, this paper provides a multi-level solution to knowledge representation for Chinese unknown words. A model based on a Bayesian network is also constructed to analyze the semantic word formation of Chinese unknown words, effectively predicting their multi-level lexical knowledge. This kind of lexical knowledge representation is simple, intuitive and easy to extend. Experimental results show that this knowledge representation is of substantial value for sense guessing of Chinese unknown words and can meet application needs at different levels.
  • Review
    LIN Ju; XIE Yanlu; ZHANG Jinsong; ZHANG Wei
    2016, 30(6): 35-39.
    Prosodic boundaries play an important role in the naturalness and intelligibility of verbal expression, so prosody modeling is an important aspect of speech synthesis and understanding. Focusing on the interaction of adjacent tones, we propose a method of prosodic boundary detection based on tone nucleus features and a DNN model. The method computes boundary-related parameters from the tone nucleus features, which are then modeled by a deep neural network. For comparison, the baseline system uses syllable-level acoustic features. The experimental results show a relative 4% improvement achieved by the proposed method.
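    A minimal sketch of the classification step described above, with a small feed-forward network standing in for the paper's DNN. The six "tone-nucleus" features and the random labels are placeholders; real feature extraction from speech is outside this sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for boundary-related parameters derived from tone-nucleus
# features of adjacent syllables (e.g. F0 of the preceding and following
# tone nuclei and the pitch reset across the juncture).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))          # 6 hypothetical tone-nucleus features
y = rng.integers(0, 2, size=200)       # 1 = prosodic boundary, 0 = none

# A small feed-forward network as a stand-in for the deep model.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=1)
clf.fit(X, y)
print(clf.predict(X[:5]))
```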
  • Review
    QIU Likun; HUANG Kun; HE Baorong; KANG Shiyong
    2016, 30(6): 40-48.
    Negative expressions play an important role in deep semantic representation. Using corpus-based methods, this paper analyzes negative expressions and their usage in contemporary Chinese. First, we collect negative expressions and classify them into three types: explicit negatives, implicit negatives and negative constructions. Second, we analyze the rules governing negative expressions, covering those used in single-predicate structures, modality elements, predicate-complement structures, verbal coordinate structures, serial verb structures and pivotal sentences; we especially focus on the effect of negative expressions in multi-predicate structures on the meanings of propositions. An annotation scheme is also developed under the deep semantic representation framework. Finally, we investigate the distribution of negative expressions in multi-domain treebanks.
  • Review
    RAO Gaoqi; LI Yuming
    2016, 30(6): 49-58.
    Based on a diachronic corpus of modern Chinese newspapers spanning 70 years, statistical measures are applied to detect steady-state words. Altogether, 3,013 words are identified as candidates according to their corpus coverage, time sensitivity and diachronic classification. Among them, verbs and nouns each account for about one third, and the rest consists of adjectives and function words. The average word length is 1.7 characters; the words fall within the top 7,609 of the frequency list and cover 90% of the corpus. Basic morphemes and core words shape the features of the set in terms of POS and length.
  • Review
    MA Jianjun; PEI Jiahuan; HUANG Degen
    2016, 30(6): 59-66.
    The study of the automatic identification of English functional noun phrases (NPs) can transform the task of resolving the structural ambiguity caused by noun phrases into an NP chunking task. Functional noun phrases are noun phrases defined by their syntactic functions in clauses. On a corpus from the business domain, this study identifies both the scope of NP chunks and their syntactic function types by refining the part-of-speech (POS) tagset and adopting a conditional random fields (CRFs) model combined with semantic information. The Penn Treebank tagset is modified in pre-processing, and semantic features are added to the CRFs model to improve the recognition of adjunct types of noun phrases. Test results show that the system achieves an F-score of 89.04% in the open test using our gold-standard tags, and that refining the POS tagset is a better approach for NP chunking, increasing the F-score by 2.21% compared with the model using the Penn Treebank POS tags. This knowledge of English functional noun phrases is then combined with the NiuTrans SMT system, slightly improving English-Chinese translation performance.
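    The CRF chunking setup described above can be sketched roughly as follows, using the sklearn-crfsuite package as a generic CRF implementation. The toy sentence, the refined-POS/semantic feature names, and the function-encoding BIO tags are all assumptions for illustration, not the paper's actual feature set.

```python
import sklearn_crfsuite   # pip install sklearn-crfsuite

def token_features(sent, i):
    """Features for one token: word form, refined POS, and a coarse semantic
    class placeholder, plus the same for the left neighbour."""
    word, pos, sem = sent[i]
    feats = {"w": word, "pos": pos, "sem": sem}
    if i > 0:
        pw, pp, ps = sent[i - 1]
        feats.update({"-1:w": pw, "-1:pos": pp, "-1:sem": ps})
    else:
        feats["BOS"] = True
    return feats

# One toy sentence: (word, refined POS, semantic class) with BIO chunk tags
# that also encode the NP's syntactic function (here SBJ = subject).
sent = [("The", "DT", "det"), ("company", "NN", "org"), ("expanded", "VBD", "act")]
tags = ["B-NP-SBJ", "I-NP-SBJ", "O"]

X = [[token_features(sent, i) for i in range(len(sent))]]
y = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```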
  • Review
    YE Dashu; HUANG Peijie; DENG Zhenpeng; HUANG Qiang
    2016, 30(6): 67-74.
    This paper applies product feature mining to the dialogue system of a mobile phone recommendation assistant, enhancing the system's focus during the interaction. A CBOW (continuous bag of words) language model is used to represent semantic clues. A feature framework with an exponentially elongated static window is introduced to capture important features of interactions between words at varying distances. We finally use a convolutional neural network (CNN) to perform the product feature mining task. The word embeddings representing semantic clues give the relation between the current word and the product feature, while the feature framework alleviates word ambiguity. Experiments show that our model outperforms state-of-the-art methods on product feature mining.
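    A minimal sketch of the CBOW step only, using gensim (>= 4.0), where sg=0 selects the CBOW architecture; the segmented review sentences and the query word are made up. In the paper these embeddings would then feed the exponential-window feature framework and the CNN feature miner, which are not reproduced here.

```python
from gensim.models import Word2Vec   # gensim >= 4.0

# Toy segmented dialogue/review sentences about mobile phones.
sentences = [
    ["这", "款", "手机", "屏幕", "很", "清晰"],
    ["电池", "续航", "不错", "但", "屏幕", "偏", "小"],
    ["拍照", "效果", "和", "电池", "都", "满意"],
]

# sg=0 selects CBOW, the language model used to encode semantic clues
# between words and candidate product features.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=0, epochs=50)

# Word embeddings such as these would then be consumed by the CNN.
print(model.wv.most_similar("屏幕", topn=3))
```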
  • Review
    ZHAO Hongyan; LI Ru; ZHANG Sheng; ZHANG Liwen
    2016, 30(6): 75-83.
    Frame identification is a basic task in semantic role labeling, which assigns a correct frame to the labeled target word based on its semantic scene. At present, the state-of-the-art methods are primarily based on statistical machine learning, in which performance depends heavily on the quality of the extracted features. This paper proposes a DNN-based frame identification method that tries to capture the context of the target word automatically. Experiments on the Chinese FrameNet and the People's Daily (March 2003) show 79.64% and 78.58% accuracy, respectively.
  • Review
    AN Bo; HAN Xianpei; SUN Le; WU Jian
    2016, 30(6): 84-89.
    Triple classification is crucial for knowledge base completion and relation extraction. However, the state-of-the-art methods for triple classification fail to handle 1-to-n, m-to-1 and m-to-n relations. In this paper, we propose TCSF (Triple Classification based on Synthesized Features), which jointly exploits the triple distance, the prior probability of the relation, and the context compatibility between the entity pair and the relation. Experimental results on four datasets (WN11, WN18, FB13, FB15K) show that TCSF achieves significant improvements over TransE and other state-of-the-art triple classification approaches.
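    The synthesis of the three signals named above can be sketched as a single scoring function. The embeddings, the prior and compatibility tables, the combination weights, and the decision threshold below are all toy assumptions; only the TransE-style distance ||h + r - t|| follows a standard formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 20

# Toy TransE-style embeddings for entities and relations.
ent = {e: rng.normal(size=dim) for e in ["Beijing", "China", "Paris"]}
rel = {"capital_of": rng.normal(size=dim)}

# Hypothetical statistics gathered from a training KB.
relation_prior = {"capital_of": 0.02}                     # P(r)
type_compat = {("city", "country", "capital_of"): 0.9}    # entity-pair/relation compatibility

def transe_distance(h, r, t):
    """TransE score: a correct triple should satisfy h + r ≈ t."""
    return float(np.linalg.norm(ent[h] + rel[r] - ent[t]))

def tcsf_score(h, r, t, h_type, t_type):
    """Synthesize distance, relation prior, and context compatibility
    into one score (weights are made up for illustration)."""
    d = transe_distance(h, r, t)
    prior = relation_prior.get(r, 1e-4)
    compat = type_compat.get((h_type, t_type, r), 0.1)
    return -d + 2.0 * np.log(prior) + 3.0 * compat

# Classify a triple as true if its score clears a tuned threshold.
score = tcsf_score("Beijing", "capital_of", "China", "city", "country")
print(score > -15.0)
```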
  • Review
    LI Bin; SONG Li; YIN Siqi; QU Weiguang; WANG Meng
    2016, 30(6): 90-99.
    As an important theory in cognitive science, prototype theory holds that properties can be used to distinguish the central and peripheral members of a category. However, there has been little quantitative evidence to support the theory. In this paper, we use a cognitive property bank of 230,000 "word-property" pairs to examine the theory on three categories: bird, fruit and transportation. The results show that in Chinese, the typical members of bird are sparrow and swallow, which share many properties with bird, whereas penguin and ostrich share very few properties with bird, notably lacking the key property fly. The data in the cognitive property bank basically support prototype theory, but we also notice that "little bird" has many properties, which makes it a candidate typical member of the category. We also distinguish between tree-based ontology and graph-based categorization by means of a bipartite graph.
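    The typicality comparison described above reduces to measuring how much of the category's property set each member shares. The property sets below are invented excerpts, not entries from the actual property bank, and the overlap ratio is only one of several possible typicality measures.

```python
# Hypothetical excerpts from a "word-property" bank for the category "bird".
properties = {
    "bird":    {"can_fly", "has_feathers", "lays_eggs", "has_beak", "sings"},
    "sparrow": {"can_fly", "has_feathers", "lays_eggs", "has_beak", "small"},
    "penguin": {"swims", "has_feathers", "lays_eggs", "has_beak", "lives_in_cold"},
}

def overlap(member, category="bird"):
    """Share of the category's properties that the member also has;
    a crude proxy for typicality under prototype theory."""
    cat = properties[category]
    return len(properties[member] & cat) / len(cat)

for m in ("sparrow", "penguin"):
    print(m, round(overlap(m), 2))   # sparrow scores higher: more typical
```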
  • Review
    DU Jiali; YU Pingfang
    2016, 30(6): 100-116.
    Based on a time-restricted experiment in which 126 English-major sophomores were required to decode 100 garden path sentences and control sentences, this article investigates the breakdown effect produced by Chinese learners of English in garden path sentence processing, quantifies the intensity of the breakdown effect, and makes a comparative study against a machine translation system with the Stanford parser. The garden path phenomenon is a conscious and controlled behavior. The encoding and decoding reflect both processing breakdown and cognitive overload, as well as the complex psychological and cognitive activities of human beings. The experiment shows that breakdown effects appear asymmetrically, with the highest frequency and intensity occurring in the multi-category breakdown, in contrast to the complementizer breakdown, object breakdown and embedded breakdown. In the human-computer comparative study, the machine's program decoding and the learners' cognitive decoding are not found to be completely resonant or absolutely co-occurrent.
  • Review
    YANG Siqin; JIANG Minghu
    2016, 30(6): 117-125.
    Adopting the event-related potential (ERP) technique and measuring reaction time, error rate and the N400, this paper investigates whether advanced Chinese-English bilinguals retrieve their second language while processing their native language. The results reveal that, under the implicit conditions, English pronunciation had no effect on reaction time. In the ERP results, when bilinguals made semantically related judgments, the N400 evoked in the language areas showed no significant difference across the implicit English pronunciation conditions. However, for semantically unrelated judgments, the N400 differed significantly across the implicit English pronunciation conditions. It is concluded that when advanced bilinguals make comparatively complex semantic judgments, the second language can be retrieved unconsciously.
  • Review
    TIAN Kun; KE Yonghong; SUI Zhifang
    2016, 30(6): 126-132.
    In semantic role annotation, searching for similar annotated sentences is a common way to analyze the corpus. Existing methods cannot take full advantage of verbs and related elements, so they fail to meet this need. This article develops a new verb-centered method to calculate Chinese sentence similarity. Based on semantic role annotation, the algorithm finds similar sentences by analyzing the semantic roles, matching the annotated sentences, and computing the similarity between the matched sentences. To obtain better results, the article also compares several word similarity measures, including algorithms based on HowNet and on distributed representations, and applies the best one in our algorithm. The experimental results indicate that the sentence similarity algorithm based on semantic role annotation performs better than traditional methods.
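    A minimal sketch of the verb-centered, role-aligned similarity described above. The role labels, the verb weight, and the character-overlap word similarity are placeholders; the paper itself compares HowNet-based and distributed-representation word similarity and keeps the better one.

```python
from difflib import SequenceMatcher

def word_sim(a, b):
    """Placeholder word similarity (character overlap); swap in a HowNet-based
    or embedding-based measure as the paper does."""
    return SequenceMatcher(None, a, b).ratio()

def sentence_sim(roles_a, roles_b, verb_weight=0.5):
    """Similarity of two role-annotated sentences, centred on the verb.
    roles_* map role labels (V, A0, A1, ...) to their filler strings."""
    verb = word_sim(roles_a["V"], roles_b["V"])
    shared = [r for r in roles_a if r != "V" and r in roles_b]
    if shared:
        args = sum(word_sim(roles_a[r], roles_b[r]) for r in shared) / len(shared)
    else:
        args = 0.0
    return verb_weight * verb + (1 - verb_weight) * args

s1 = {"V": "购买", "A0": "他", "A1": "一台电脑"}
s2 = {"V": "购置", "A0": "她", "A1": "一台笔记本"}
print(round(sentence_sim(s1, s2), 3))
```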
  • Review
    ZAN Hongying; XU Hongfei; ZHANG Kunli; SUI Zhifang
    2016, 30(6): 133-139.
    With the rapid development of the Internet, internet slang has become common, and new slang terms are constantly appearing. To deal with this challenge for natural language processing tasks such as sentiment analysis, product recommendation and question answering, an internet slang dictionary is necessary. This paper analyzes the problems encountered when collecting and annotating micro-blog texts, together with other internet resources, to build such a dictionary and the related corpus. Further, the potential applications of the dictionary and corpus are discussed.
  • Review
    WANG Shan; LIU Rui
    2016, 30(6): 140-146.
    The construction of a speech corpus is the foundation of research on spoken language. In this paper, a small-scale corpus is constructed from two representative talk shows, QiangqiangSanrenxing and LuYuYouyue. An annotation scheme consisting of 5 primary categories and 16 subtypes is developed to annotate the conversational structures. According to the statistics, there are 309 interrupted structures, 141 inserted structures, 111 repetitive structures, 653/589 question-answer structures, and 51/21 obstruction-correction structures, reflecting the unbalanced distribution of conversational structures. The form, nature and communicative tasks of the talk shows are the main factors influencing this distribution. In addition, conversational structures show certain patterns, so a trigram analysis is carried out to explore their combinations. It is found that the highest-frequency combination in the corpus is the question-answer adjacency pair, in addition to a large number of contingent combinations. The combination patterns of conversational structures not only reflect the style of the talk shows, but also help to analyze the functional modules in conversation and the formation of conversation strategies, and thus help us understand the operational mechanisms of conversation more deeply.
  • Review
    LU Dawei; WANG Xingyou; YUAN Yulin
    2016, 30(6): 147-155.
    Semantic knowledge resources containing extensive linguistic information are an important interface between linguistics and language engineering. In this paper, we study the automatic expansion of semantic knowledge resources, taking the Adjective Syntactic-Semantic Dictionary as an example. We aim to extend the vocabulary of the dictionary and its syntactic patterns using a large corpus. Specifically, our method classifies the words in the dictionary into 97 categories by their syntactic patterns and maps new words that do not exist in the dictionary into these categories, so that the whole task can be treated as a multi-class classification problem. The method rests on the fact that new words and dictionary words exhibit similar syntactic patterns in a large corpus. We construct the training data by distant supervision so as to reduce the effort of manual annotation. The training process combines shallow learning and a deep neural network, achieving promising results. The experimental results show that the deep neural network is able to learn the syntactic information and effectively improves the accuracy of the mapping task.
  • Review
    YAN Rong; GAO Guanglai
    2016, 30(6): 156-163.
    Classical pseudo-relevance feedback (PRF) usually takes the document as the unit, and this large extraction unit can decrease the quality of expansion. Applying topic analysis techniques, this paper proposes to use the semantic content of text as the expansion unit. Based on the proposed pseudo-document description of each document in the collection, the expansion terms are selected by implicit diversification at the finer level of document content. The experimental results on the real NTCIR-8 dataset show a clear improvement in PRF performance.
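    The general shape of the expansion step — scoring candidate terms over smaller content units rather than whole documents — can be sketched as follows. This is a deliberately simplified unit-frequency scorer under made-up data; it does not reproduce the paper's topic analysis or implicit diversification.

```python
from collections import Counter

def expansion_terms(pseudo_relevant_units, query_terms, k=5):
    """Score candidate terms by how many distinct pseudo-relevant content
    units (e.g. topic segments instead of whole documents) they occur in,
    then keep the top-k terms not already in the query."""
    df = Counter()
    for unit in pseudo_relevant_units:
        for term in set(unit):
            df[term] += 1
    candidates = [(t, c) for t, c in df.items() if t not in query_terms]
    candidates.sort(key=lambda x: -x[1])
    return [t for t, _ in candidates[:k]]

units = [
    ["语言", "信息", "处理", "模型"],
    ["语言", "资源", "建设", "处理"],
    ["信息", "检索", "反馈", "扩展"],
]
print(expansion_terms(units, query_terms={"信息", "检索"}))
```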
  • Review
    LI Guochen; LIU Shulin; YANG Zhizhuo; LI Ru; ZHANG Hu; QIAN Yili
    2016, 30(6): 164-172.
    Reading comprehension QA for the Chinese college entrance examination is much more difficult than general reading comprehension QA, in that it requires deeper linguistic analysis to understand the question and the semantic correlation between answers and questions. This paper proposes to extract candidate answer sentences by frame semantic matching and frame-to-frame semantic relations, and a manifold-ranking model is applied to propagate the frame semantic relevancy and select the top four candidate answers. The accuracy and recall on the Beijing college entrance examinations of the past twelve years are 53.65% and 79.06%, respectively.
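    The propagation step named above follows the standard manifold-ranking iteration f ← αSf + (1−α)y over a sentence affinity graph; the sketch below shows that iteration on a toy affinity matrix. The matrix values, the seed vector and α are made up, not the paper's frame-semantic relevancy scores.

```python
import numpy as np

def manifold_ranking(W, y, alpha=0.85, iters=100):
    """Iterate f = alpha * S @ f + (1 - alpha) * y, where S is the
    symmetrically normalized affinity matrix of candidate sentences."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))      # D^{-1/2} W D^{-1/2}
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f

# Toy affinity between 4 candidate answer sentences.
W = np.array([[0.0, 0.8, 0.1, 0.0],
              [0.8, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.7],
              [0.0, 0.1, 0.7, 0.0]])
y = np.array([1.0, 0.0, 0.0, 0.0])       # sentence 0 matches the question's frame
print(manifold_ranking(W, y).round(3))   # ranking scores; keep the top four
```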
  • Review
    WANG Yaohua; LI Zhoujun; HE Yueying; CHAO Wenhan; ZHOU Jianshe
    2016, 30(6): 173-181.
    Building on existing methods, including the LDA model, paragraph vectors and word vectors, we extract four kinds of text semantic dispersion representations and apply them to automatic essay scoring. This paper gives a vector form of text semantic dispersion from a statistical point of view and a matrix form that captures how the semantics are dispersed across the text, and experiments with multiple linear regression, a convolutional neural network and a recurrent neural network. The results show that, on a test set of 50 essays, adding the text semantic dispersion features reduces the root mean square error by 10.99% and increases the Pearson correlation coefficient by a factor of 2.7.
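    One plausible reading of a "statistical" dispersion feature — the average distance of sentence vectors from the essay centroid, fed to a linear regressor — is sketched below. The sentence vectors, essay scores, and the single-feature regression are illustrative assumptions, not the paper's four representations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def semantic_dispersion(sentence_vecs):
    """Average cosine distance of each sentence vector from the essay's
    centroid vector: one simple statistical view of semantic dispersion."""
    V = np.asarray(sentence_vecs)
    centroid = V.mean(axis=0)
    cos = V @ centroid / (np.linalg.norm(V, axis=1) * np.linalg.norm(centroid))
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(3)
# 50 toy essays: each a stack of sentence vectors (e.g. from LDA or doc2vec).
essays = [rng.normal(size=(int(rng.integers(5, 15)), 100)) for _ in range(50)]
scores = rng.uniform(60, 95, size=50)          # hypothetical human scores

X = np.array([[semantic_dispersion(e)] for e in essays])
reg = LinearRegression().fit(X, scores)
rmse = float(np.sqrt(np.mean((reg.predict(X) - scores) ** 2)))
print(round(rmse, 2))
```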
  • Review
    HUANG Peijie; WANG Jundong; KE Zixuan; LIN Piyuan
    2016, 30(6): 182-189.
    Due to the short length, diversity, openness and colloquial nature of out-of-domain (OOD) utterances, dialogue act (DA) recognition for OOD utterances remains a challenge in domain-specific spoken dialogue systems. This paper proposes an effective DA recognition method using a random forest and external information. An unlabeled Weibo dataset, which is not domain-specific yet shares the colloquialism and diversity of spoken dialogue, is used to train word embeddings by unsupervised learning. The trained word embeddings provide similarity computation for out-of-vocabulary (OOV) words in the training and test OOD utterances. Evaluation on a Chinese dialogue corpus in a restricted domain shows that the proposed method outperforms several state-of-the-art short text classification methods for DA recognition.
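    A minimal sketch of the classifier side: average the pre-trained embeddings of an utterance's tokens and feed the result to a random forest. The embedding table is random here (standing in for vectors trained on unlabeled Weibo text), and the utterances and DA labels are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
# Stand-in for word embeddings pre-trained on unlabeled Weibo text.
embed = {w: rng.normal(size=50) for w in ["你好", "谢谢", "再见", "多少", "钱", "这个"]}

def utterance_vector(tokens):
    """Average the embeddings of known tokens; unseen tokens are skipped."""
    vecs = [embed[t] for t in tokens if t in embed]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

train = [(["你好"], "greeting"),
         (["这个", "多少", "钱"], "other_question"),
         (["谢谢", "再见"], "closing")]
X = np.array([utterance_vector(t) for t, _ in train])
y = [label for _, label in train]

clf = RandomForestClassifier(n_estimators=100, random_state=4).fit(X, y)
print(clf.predict([utterance_vector(["再见"])]))
```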
  • Review
    Ayiguli Halike; Hasan Wumaier; Tuergen Yibulayin;
    Kahaerjiang Abiderexiti; Maihemuti Maimaiti
    2016, 30(6): 190-200.
    The generalization ability of Chinese-Uyghur statistical machine translation systems for time expressions, numerals and quantifiers is relatively weak. This paper uses a corpus-based approach to mine and extract Chinese time expressions, numerals and quantifiers, realizing context-based translation of ambiguous quantifiers. Experimental results show that the proposed method achieves F-measures of 93.23%, 90.15%, 96.55% and 87.58% for the translation of time expressions, numerals, unambiguous quantifiers and ambiguous quantifiers, respectively.
  • Review
    WANG Nan; XU Jin’an; MING Fang; CHEN Yufeng; ZHANG Yujie
    2016, 30(6): 201-207.
    The suffixes of Japanese predicates exhibit complex formations across different voices. Passive and potential predicates are formed with the same suffix derived from the same stem, which causes mistranslations in statistical machine translation. In this paper, a new method is proposed for rule selection among different voices. Maximum entropy models are built to classify passive and potential voice effectively, and the voice features are then integrated into the log-linear translation model. In the Japanese-to-Chinese translation task, large-scale experiments show that our approach improves the translation performance from 41.50 to 42.01 BLEU, and informativeness is 2.71% higher according to the human evaluation results.
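    The voice classification step can be sketched with a logistic regression model, which is the usual implementation of a maximum entropy classifier. The feature dictionaries (nearby case particles, verb stem, subject animacy) and the tiny training set are invented for illustration; they are not the paper's feature templates.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy feature dicts around an ambiguous predicate ending in られる:
# a nearby case particle, the verb stem, and subject animacy.
train_feats = [
    {"particle": "に", "stem": "食べ", "subj_animate": True},    # passive use
    {"particle": "が", "stem": "食べ", "subj_animate": True},    # potential use
    {"particle": "に", "stem": "見",   "subj_animate": False},   # passive use
    {"particle": "が", "stem": "見",   "subj_animate": True},    # potential use
]
labels = ["passive", "potential", "passive", "potential"]

# Logistic regression serves as the maximum-entropy voice classifier.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_feats, labels)
print(model.predict([{"particle": "が", "stem": "読", "subj_animate": True}]))
```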
  • Review
    TANG Wenwu; GUO Yi; XU Yongbin; FANG Xu
    2016, 30(6): 208-214.
    Identifying the default (omitted) objects and attributes of a comment is important in sentiment analysis of e-commerce website reviews. To resolve default comment objects and attributes, this paper proposes an effective identification method based on conditional random fields (CRFs). After applying an emotion dictionary to locate the opinionated comments, we treat the task as a sequence labeling problem and choose lexical and dependency parsing elements as features. The evaluation results show that the proposed method achieves reasonably good accuracy and recall.
  • Review
    SUN Xiao; HE Jiajin; REN Fuji
    2016, 30(6): 215-223.
    In social media there are many ironic or satirical posts, which imply certain emotional tendencies. However, the pragmatic tendency of these special language phenomena is often far from their literal meaning, which challenges text sentiment analysis in social media. This paper studies irony recognition in Chinese social media and constructs a corpus containing irony and satire. It demonstrates the importance of structural and semantic features of irony for its recognition in text. The paper also presents an efficient multi-feature hybrid neural network model, which fuses a convolutional neural network with an LSTM sequential model. The experimental results show that the proposed model is superior to traditional neural network models and the bag-of-words (BOW) model.
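    A minimal PyTorch sketch of one way to fuse a CNN branch and an LSTM branch for sentence classification, in the spirit of the hybrid model above. The vocabulary size, embedding and hidden dimensions, pooling choice and fusion-by-concatenation are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CnnLstmIrony(nn.Module):
    """A minimal CNN + LSTM hybrid: the convolution captures local structural
    cues, the LSTM captures sequential semantics; their pooled outputs are
    concatenated for the final irony/non-irony prediction."""
    def __init__(self, vocab_size=5000, emb=100, hidden=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.fc = nn.Linear(hidden * 2, n_classes)

    def forward(self, x):                      # x: (batch, seq_len) token ids
        e = self.emb(x)                        # (batch, seq, emb)
        c = torch.relu(self.conv(e.transpose(1, 2))).max(dim=2).values
        _, (h, _) = self.lstm(e)               # h: (1, batch, hidden)
        return self.fc(torch.cat([c, h[-1]], dim=1))

model = CnnLstmIrony()
dummy = torch.randint(0, 5000, (4, 20))        # a batch of 4 toy micro-blog posts
print(model(dummy).shape)                      # torch.Size([4, 2])
```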
  • Review
    Huaquecairang; ZHAO Haixing
    2016, 30(6): 224-229.
    This paper proposes a discriminative method of identifying clauses to address the performance degradation caused by Tibetan compound sentences. In this method, the compound sentence is first divided into different syntactic analysis units according to the inherent features of conjunctions. Each clause is then parsed independently. Finally, the whole dependency tree is generated by merging the parses of the clauses. Experimental results show that the method reduces parsing complexity and boosts parsing accuracy to 88.72%.
  • Review
    Hasi; Buyinqiqige
    2016, 30(6): 230-235.
    Mongolian homograph disambiguation is one of the difficulties of Mongolian information processing. This paper puts forward a method of homograph disambiguation based on a Mongolian noun semantic network. Experimental results of the homograph disambiguation are provided.