Journal of Chinese Information Processing

Select

Language Analysis and Calculation

A Study of “X Shenme” and Related Constructions Used for Negation

XIA Xue, ZHAN Weidong

2015, 29(5): 1-9.

Abstract ( ) PDF ( )

Knowledge map

Save

Two different negations are distinguished in this paper—illocutionary negation and propositional negation. Speaker uses illocutionary negation to express his negative attitude, such as blame and dissuasion, towards somebodys certain behaviors. Propositional negation either directly negates the truth of a proposition, or negates the suitable conditions of a proposition, or expresses the meaning that X doesnt reach certain standards. Furthermore, the basic elements of each negation and the relations between these elements are analyzed, and the constraints of the variable X of each constructions of each kind of negations are discussed. The differences and similarities of these constructions are listed in the end.

Select

Language Analysis and Calculation

What Kind of Semantic Knowledge is Neccessary for the Semantic Description and Research of Nouns？

LI Qiang, YUAN Yulin

2015, 29(5): 9-20.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper mainly discussed the semantic description and the research of nouns. Firstly, several main lexical semantics theories (include structuralism semantics, generative semantics, conceptual semantics and natural semantic metalanguage) were introduced and reviewed, and their defects and deficiencies for semantic description were also discussed. Then, the qualia structure in generative lexicon theory was introduced, and its differences with the theories mentioned above and its features were illustrated. Finally, based on the generative lexicon theory, four examples of analysis on nouns using qualia structure (include word default, metaphor meaning generation, affordance sentence, middle construction) were exhibited and its possible application in natural language processing was shown.

Select

Language Analysis and Calculation

Revealing Covert Laws in Language Systems Through Graphs——Algorithms of Winner-Get-All & Winner-More-Loser-Less

CHEN Zhenning, CHEN Zhenyu

2015, 29(5): 20-31.

Abstract ( ) PDF ( )

Knowledge map

Save

We tried to reveal convert laws with quantitative analysis through graphs and designed two generating algorithms of language graphs: Winner-get-all and Winner-more-loser-less, which extend the game theory used by idea-algorithm to none-perfect state. Compared to previous methods, the proposed two algorithms have better generalization capability. Especially, we balance between full and modest generation in the Winner-more-loser-less algorithm. There are two kinds of inductive algorithms to mine mainstream rules and analyze linguistic laws: Min-Subgraphs for accuracy, as well as Max-Subgraphs for coverage. A formula for control degree based on min-subgraphs is put forward to evaluate language systems.

Select

Language Analysis and Calculation

Quantitative Research on the Processing Breakdown in Garden Path: A Computational Linguistic Perspective

DU Jiali, YU Pingfang

2015, 29(5): 31-39.

Abstract ( ) PDF ( )

Knowledge map

Save

This article discusses the confusion quotient (CQ) index in the processing breakdown of the garden path phonomenon. The presence of asymmetric information breakdown could lead to spiral upward trend of decoding which showed the pattern of double negation. The amplitude of potential effects of processing breakdown could be measured through the CQ index. Based on large data corpus statistics and online parser analytic method, we calculate the value of CQ index. CQ duration for the preferred construction lies between (-∞, 1], and for the non-preferred construction, [1,2]. The critical values for the preferred and non-preferred structures are 0.72 and 1.28 respectively, and the ambiguous domain lies in [0.72, 1.28]. It is concluded that the frequency deviation of multi-structures is a fundamental reason to lead to different CQ index. The amplitude of processing breakdown and magnitude of asymmetry information compensation are related to CQ index. It is revealed that the statistics-based CQ index can provide the prospective information for decoding the complex structure of local ambiguity.

Select

Language Analysis and Calculation

Semantic Computing of Spatial Metaphor Based on Deaf Persons Cognition Cases

YAO Dengfeng, JIANG Minghu, Abudoukelimu Abulizi, HOU Renkui,Halidanmu Abudukelimu

2015, 29(5): 39-49.

Abstract ( ) PDF ( )

Knowledge map

Save

The metaphor processing is a challenging issuein natural language processing. From the perspective of psycholinguistics, spatial metaphor is made to perform similar classification and computation. We use multidimensional scaling and clustering method, with the deaf as subjects for two experiments, and showe thatthe deaf use the topographic space and syntactic space for computational implementation of spatial metaphor comprehension. At the same time, influenced by sign language, the deaf’s cognitive subjects of spatial metaphor include signers’ own reference frame, references’ relative coordinates and the sign space saturation, within the boundary of the part of hand or chest. It is also revealed from the experiments that, due to the presence of two kinds of space, spatial metaphor understanding in the deaf brain is leveled, as suggested by the the Sapir Whorf hypothesis, with the structure and representation of spatial metaphor influenced by its interaction between topographic space and syntactic space.

Select

Morphology and Segmentation

An Incremental Learning Scheme for Perceptron Based Chinese Word Segmentation

HAN Bing, LIU Yijia, CHE Wanxiang, LIU Ting

2015, 29(5): 49-55.

Abstract ( ) PDF ( )

Knowledge map

Save

In this paper, we propose an incremental learning scheme for perceptron based Chinese word segmentation. Our method can perform continuous training over a fine tuned source domain model, enabling to deliver model without annotated data and re-training. Experimental results shows the scheme proposed can significantly improve adaptation performance on Chinese word segmentation and achieve comparable performance with traditional method. At the same time, our method can significantly reduce the model size and the training time.

Select

Morphology and Segmentation

Active Learning Based Domain Adaptation for Chinese Word Segmentation

XU Huating, ZHANG Yujie, YANG Xiaohui, SHAN Hua, XU Jinan, CHEN Yufeng

2015, 29(5): 55-63.

Abstract ( ) PDF ( )

Knowledge map

Save

Chinese word segmentation systems trained on annotated corpus of newspaper would drop in performance when faced with a new domain. Since there is no large scale annotated corpus on the target domain, this paper describes a domain adaptation of Chinese word segmentation by active learning. The idea is to select a small amount of data for annotation to bridge the gap from the target domain to the News. The word segmentation model is re-trained by inlduing the newly annotated data. We use the CRF model for the training and a raw corpus of one million sentences on patent description as the target domain. For test data, 300 sentences are randomly selected and manually annotated. The experimental results show that the performances of the Chinese word segmentation system based on our approach are improved on each evaluation metrics.

Select

Morphology and Segmentation

A Study on Semantic Word-Formation of Bi-Character Words for Common Unknown Word Understanding

JI Zhiwei, FENG Minxuan

2015, 29(5): 63-69.

Abstract ( ) PDF ( )

Knowledge map

Save

The approach to investigate the semantic rules in word- formation via the the granularity of the morpheme can help understand natural language .This paper first labeles the sense of the front and back morpheme of the two-character words by referring to the Modern Chinese Dictionary and HowNet. Then we labele the lexicalized meaning between the morphemes from the perspectives of the structure of semantic combination, the distribution of semantic root, the mode of semantic combination and the type of semantic variation. Finally, we combined the morpheme meaning with lexicalization meaning quantitatively to set up a semantic scheme to account for the two-character words . Tested by the two-character words from BBS and the Modern Chinese Dictionary, it reveals some applicationvalue on the understanding of common unknown words.

Select

Language Resources Constrution

Construction of Multi-Domain Chinese Dependency Treebanks and A Study on Factors Influencing the Statistical Parsing

QIU Likun, SHI Linlin, WANG Houfeng

2015, 29(5): 69-76.

Abstract ( ) PDF ( )

Knowledge map

Save

To boost Chinese dependency parsing and analyze factors influencing Chinese dependency parsing, we constructe a large-scale general treebank and several middle-scale treebanks for specific domains. Then, we performe experiments to evaluate the parsing accuracy influenced by the quality, the scale and the domain difference of the dependency treenbank. The results show that both the treebank quality and its scale are positively related to parsing accuracy, and the quality is more influential. The experiments also demonstrate that general treebanks and domain treebanks are complementary, and, whether a general treebank and domain treebank should be used together is dependent on the difference between them.

Select

Language Resources Constrution

Word2vec Based Word Alignment Corpus for the Greater China Region

WANG Mingwen,XU Xiongfei,XU Fan,LI Maoxi

2015, 29(5): 76-84.

Abstract ( ) PDF ( )

Knowledge map

Save

We deal with the linguistic phenomenon that different expressions to the same semantic meaning among the Mainland China, Hong Kong and Taiwan, a.k.a., the greater China region(GRC). Firstly, we automatically crawl 3.2 million GCR parallel sentences from the wikipedia and the news website with simplified and traditional encoding, and then manually annotate 10 000 GCR parallel word alignment corpora with an annotation agreement of more than 95%. Meanwhile, we present a 2-phase GCR word alignment model based on word2vec representation of the GCR words the cosine similarity measure and other post-processing techniquest. Experiment results on the proposed 2 different word alignment corpus demenstrate the effectiveness of our GCR model which significantly outperforms the current GIZA++ and HMM-based models. Furthermore, we generate 90,029 triples from wikipedia with accuracy over 82.66%.

Select

Information Extraction and Text Mining

Monolingual Corpora Based Japanese-Chinese Translation Extraction for Kana Names

WANG Dongming, XU Jinan, CHEN Yufeng, ZHANG Yujie

2015, 29(5): 84-91.

Abstract ( ) PDF ( )

Knowledge map

Save

Named entity translation equivalents play a critical role in cross-language information processing. The traditional method is usually based on large-scale parallel or comparable corpus, which is limited by the size and quality of the corpus resources. In Japanese-Chinese translation, the bilingual corpora resources are relatively scarce: the Chinese Hanzi and Japanese Kanji mapping table is often adopted to deal with Chinese named entity and a SMT model to deal with the Japanese named entities in pure kana. In this paper, we propose a monolingual corpora based approach. Firstly, the conditional random field model is adopted to extract Japanese and Chinese names from monolingual corpus. Then the Japanese-Chinese transliteration rule base is developed by instance based inductive learning in a iterative process employing the feedback learning. Experimental results show that the proposed method is simple and efficient, leverging the severely dependency on bilingual resource by the classical methods.

Select

Information Extraction and Text Mining

Classifying Named Entities on Chinese Wikipedia

XU Zhihao,HUI Haotian,QIAN Longhua,ZHU Qiaoming,

2015, 29(5): 91-98.

Abstract ( ) PDF ( )

Knowledge map

Save

Classifying Wikipedia Entities is of great significance to NLP and machine learning. This paper presents a machine learning based method to classify the Chinese Wikipedia articles. Besides using semi-structured data and non-structured text as basic features, we also extend to use Chinese-oriented features and semantic features in order to improve the classification performance. The experimental results on a manually tagged corpus show that the additional features significantly boost the entity classification performance with the overall F1-measure as high as 96% on the ACE entity type hierarchy and 95% on the extended entity type hierarchy.

Select

Information Extraction and Text Mining

On Text Content Computing within an Topic

LIU Dongming, YANG Erhong

2015, 29(5): 98-104.

Abstract ( ) PDF ( )

Knowledge map

Save

The topic detection can effectively organize the vast information into topics with the unit of text, but end users do not need all the texts on a topic. Instead, they may just demand certain specific content of the topic. To achieve the intelligent push of the relevant content in a topic to the user, it is essential to select the corresponding part of the texts according to the needs of users. This paper compares the contents between the texts in a topic and effectively selects the texts which meets the needs of the user. We redefine the topic and represent the topic and the text according to this definition. Then we design a computation method between the texts and topic based on this representation. Finally, the experiment demonstrates the effectiveness of this approach.

Select

Information Extraction and Text Mining

An Improving Method for Social Media Text Normalization

SONG Yajun, YU Zhonghua, CHEN Li, DING Gejian, LUO Qian

2015, 29(5): 104-112.

Abstract ( ) PDF ( )

Knowledge map

Save

The informal style of social media texts challenges many natural language processing tools, including many keyword-based methods proposed for social media textTherefore, the normalization of the social media text is indispensable. Based on the assumption of context similarity between the lexical variants, we proposed an improved graph-based social media text normalization method by introducing word embedding model to better capture the context similarity. As an unsupervised and language independent method, it can be used to process large-scale social media texts of various languages. Experimental results show that the proposed method outperforms the of previous methods with the best F-score.

Select

Information Extraction and Text Mining

Research on Gauss Weighed Reorganization K-NN

LIU Zuoguo, CHEN Xiaorong

2015, 29(5): 112-117.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper presents a K-NN text clustering algorithm employing uses Gauss Weighed Distance and Cluster Reorganization Mechanism. The concept of Nearest Domain is proposed and Nearest Domain Rules are elaborated. Then Gauss Weighing Algorithm is designed to Quantification samples’ distance and weights. A text is weighed based on the distance from cluster center via Gauss function in order that distances of clusters can be calculated. Further, Cluster Reorganization Mechanism will make a self-adaption to the amount of clusters. Splitting operator separates sparse clusters and adjusts abnormal texts while consolidating operator combines similar ones. Clustering experiment shows that reorganization process effectively improves the accuracy and recall rate and makes result more reasonable by increasing the inner density of clusters.

Select

Information Extraction and Text Mining

Chinese Micro-blog Named Entity Linking Based on Multisource Knowledge

CHEN Wanli, ZAN Hongying,WU Yonggang

2015, 29(5): 117-125.

Abstract ( ) PDF ( )

Knowledge map

Save

Named entity is an important component conveying information in texts, and an accurate understanding of named entities is necessary to ensure a correct analysis of the text information. This paper proposes a Chinese micro-blog entity linking strategy based on multi-resource knowledge under Ranking SVM framework. It combines a dictionary of synonyms, the encyclopedia resources to produce an initial set of candidate entities , then extracts various combinations of featuresfor Ranking SVM to generate the target entity set. The evaluation on data sets of NLP&CC2014 Chinese micro-blog entity linking track shows a micro average accuracy of 89.40%, which is better than the state-of-the-art result.

Select

Mathine Translation

English-Chinese Translation Unit and Translation Model for Discourse-Based Machine Translation

SONG Rou,GE Shili

2015, 29(5): 125-136.

Abstract ( ) PDF ( )

Knowledge map

Save

The primary issue in discourse-based machine translation (MT) is to define the translation unit. Based on English and Chinese linguistic knowledge and English-Chinese translation practice, we propose a double level system of translation units for discourse-based MT, including the basic unit and the compound unit. We further explore the properties of these two types of units and construct a three-step discourse-based MT model: parsing, translating and assembling (PTA model). This paper suggests that the compound unit for Chinese discourse-based MT is the text corresponding to the generalized topic structure and the basic unit is the topic sufficient sentence derived from the stream model of the generalized topic structure; while the compound unit for English is the traditional sentence and the basic unit is the naming-telling clause (NT clause), namely, the clause constructed with the referential component and its description or post-modification component. This paper exhibits the process of English-Chinese translation with an example under the framework of the double level translation unit system and PTA model, and finally outlines a plan for the construction of English-Chinese clause aligned corpus for discourse-based MT.

Select

Mathine Translation

Enhance Automatic Evaluation of Machine Translation by Markov Network Based Paraphrases

WENG Zhen, LI Maoxi, WANG Mingwen

2015, 29(5): 136-143.

Abstract ( ) PDF ( )

Knowledge map

Save

It is a challenge to match the different expressions (words or phrases) which have the same meanings in the automatic evaluation of machine translation. Many researchers proposed to enhance the matches between the words in machine translation and in human references by extracting paraphrases from bilingual parallel corpus or comparable corpus. However, the cost of constructing the bilingual parallel corpus or the comparable corpus is high; furthermore, it is difficult to obtain a large corpus between some language pairs. In this paper, the paraphrases are extracted from the monolingual texts in the target language by constructing the Markov networks of words, and applied to improve the correlation between the results of automatic evaluation and the human judgments of machine translation. The experimental results on WMT14 Metrics task showed that the performances of the proposed approach of extracting paraphrase from monolingual text are comparable to that of extracting paraphrase from bilingual parallel corpus.

Select

Social Computing and Sentiment Analysis

Topic-Oriented Monitoring of Public Sentiment towards Popular Weibo Events ——A Case Study on “Regular ‘Odd-even’ Vehicle Restriction in Beijing”

ZHANG Yu, LI Bing, LIU Chenyue

2015, 29(5): 143-152.

Abstract ( ) PDF ( )

Knowledge map

Save

The monitoring of the public sentiment is a popular issue in the study of social media where a myriad of researches concentrate on the general trend of public sentiment towards certain event. However, few of them has analyzed the public sentiment towards various topics on the event. This paper focuses a topic-oriented sentiment analysis on temporal term. And the Weibo on the event of ‘regular’ odd-even ‘vehicle restriction in Beijing’ is selected as the target of our work. By observing the sentimental trend of the different topics on this event, we attempt to offer feasible suggestions for public sentiment monitoring.

Select

Social Computing and Sentiment Analysis

Polarity Shifting and LSTM Based Recursive Networks for Sentiment Analysis

LIANG Jun, CHAI Yumei, YUAN Huibin, GAO Minglei, ZAN Hongying

2015, 29(5): 152-160.

Abstract ( ) PDF ( )

Knowledge map

Save

The chain-structured long shortterm memory (LSTM) has been shown to be effective in a wide range of tasks such as language modeling, machine translation and speech recognition. Because it cannot storage the structure of hierarchical information language, we extend it to a tree-structure based recursive neural network to capture more syntactic and semantic information, as well as the sentiment polarity shifting. Compared to LSTM, RNN etc, the proposed model achieves a state-of-the-art performance.

Select

Social Computing and Sentiment Analysis

Chinese Subjective Sentence Recognition Based on Fuzzy Inference Machine

SONG Hongwei, SONG Jiaying, FU Guohong

2015, 29(5): 160-167.

Abstract ( ) PDF ( )

Knowledge map

Save

This paper presents a fuzzy inference machine for Chinese subjectivity identification. We first define two fuzzy sets for lexical subjectivity and objectivity, respectively. Then, we apply TF-IDF to acquire the relevant membership functions from the training data. Finally, we define two fuzzy IF-THEN rules and thus build a fuzzy inference machine for Chinese subjective sentence recognition. We conduct two experiments on the NTCIR-6 Chinese opinion data. The experimental results demonstrate the feasibility of the proposed method.

Select

Application of NLP

A Stylistic Analysis of Jin Yongs and Gu Longs Fictions Based on Text Clustering and Classification

XIAO Tianjiu, LIU Ying

2015, 29(5): 167-178.

Abstract ( ) PDF ( )

Knowledge map

Save

Based on the fictions written by Jin Yong and Gu Long, this paper analyzes the sentence fragmentation and text conformity from the perspective of computational stylistics. The twelve texts are clustered using n-gram of words, n-gram of part of speech, n-gram of punctuations and six other features as features. Besides, the principal components analysis and the text classification are applied with eight features. The results of experiments show that there exist great style differences between Jin Yongs and Gu Longs fictions: Jin Yongs fictions are more colloquial than Gu Long’s; Jin Yong use more words and idioms from dialects and slang while the expressions in Gu Longs fictions are more formal. Whats more, there are differences between the two authors’ fictions on the syntactic structures, phrase structures, rhythms, readabilities and the language variation.

Select

Application of NLP

Exploiting Distributed Representation of Words for Better Off-Topic Essay Detection

CHEN Zhipeng, CHEN Wenliang,ZHU Muhua

2015, 29(5): 178-185.

Abstract ( ) PDF ( )

Knowledge map

Save

Similarity measure is the core component of off-topic essays detection. To compute the text similarity, the bag-of-words model is widely used, which represents a text as a vector with each dimension corresponds to a word. To further capture the word semantic information, this paper proposes a new method to compute text similarity: a method exploits word distributed representation. The proposed method combines the traditional bag-of-words model with the word semantic information. For each word in a text, we search for a set of similar words in a text collection, and then extend the text vector with these words. Finally we compute text similarity with the updated text. Experimental results show that our method is more effective than baseline systems.

Select

Application of NLP

Extraction of Power Relationship and Its Corresponding Social Network in The Story of Stone

CHEN Lei, HU Yimin,AI Wei, HU Junfeng,

2015, 29(5): 185-194.

Abstract ( ) PDF ( )

Knowledge map

Save

The study of social status has always been a hot spot in sociolinguistics. In this study, we applied Snowball Algorithm and HITS Algorithm to discover the social relationships in the Chinese novel The Story of the Stone. By locating and weighting “Patterns” and “Tuples” iteratively, we construct a relationship network with social class information. Finally, we generate a min-cost arborescence of the social relationships of 192 main characters in The Story of the Stone with Chu–Liu/Edmonds' algorithm. The generated social relationship reflects not only the intimacy and social influences, but also the hierarchical inequality of people. We regard it as a more objective and authentic reflection of social relationship network in class society.

Select

Application of NLP

A Coprocessor for Out-of-Domain Utterances in Domain Specific Spoken Dialogue System

WANG Jundong, HUANG Peijie, LIN Xianmao, XU Yuhong, LI Kaiyin

2015, 29(5): 194-204.

Abstract ( ) PDF ( )

Knowledge map

Save

The openness, colloquialism and diversity of out-of-domain (OOD) utterances make them difficult to the domain specific spoken dialogue system. This paper tackles this problem by proposing a coprocessor for domain specific dialogue system. Based on the artificial intelligence markup language(AIML), open semantic understanding templates are designed, and understanding template classification is used to address the unmatched OOD utterances. Then the extended finite state machine (EFSM) is adopted to transform the understanding template into answering template and realize the control of the state and information of the dialogue process. The application in mobile phone shopping guide domain shows that the proposed coprocessor can effectively help the Chinese dialogue system to finish the dialogue process and get better user experience.

Select

Other Languages in/around China

A Multi-Strategy Approach to Uyghur Stemming

Sediyegvl Enwer, Xiang Lu, Zong Chengqing, Akbar Pattar, Askar Hamdulla

2015, 29(5): 204-211.

Abstract ( ) PDF ( )

Knowledge map

Save

Uyghur is an agglutinative language with complex morphology, Uyghur words stem segmentation plays an important role in Uyghur language information processing. But so far, the performance of the Uyghur words stem segmentation still has much room for improvement .According to the constraints of Uyghur word formation, we proposed a stem segmentation model for Uyghur which fuses the part of speech feature and context information based on N-gram model. Experimental results show that, the part of speech feature and the context information of stem can increase the performance of Uyghur words stem segmentation significantly with the accuracy reaching 95.19% and 96.60% respectively compared to the baseline system.

Select

Other Languages in/around China

Tibetan POS Tagging Based on Syllable Tagging

LONG Congjun , LIU Huidan, NUO Minghua, WU Jian

2015, 29(5): 211-216.

Abstract ( ) PDF ( )

Knowledge map

Save

A Tibetan corpus is constructed and annotated for the syllable markers, the word boundary markers and the part-of-speech(POS) tags, with texts selected from Tibetan textbooks of Primary and middle school. Then an empirical study reveals that the training data with the multi-level annotation can enhance the effects of POS tagging. Due to the strong relation between the POS tags of words and the tags Tibetan syllables, a method of Tibetan POS tagging by the Tibetan syllables is presented. The results of experiments show that syllable tags can correct certain errors caused in POS tagging.

Select

Survey

A Survey of Chinese Sign Language Processing

YAO Dengfeng , JIANG Minghu, ,Abudoukelimu Abulizi, LI Hanjing,Halidanmu Abudukelimu,XIA Dina

2015, 29(5): 216-228.

Abstract ( ) PDF ( )

Knowledge map

Save

For the computer processing of Chinese sign language, the characteristics of the sign language should be consideredt. This paper discusses the problems related to Chinese sign language information processing and proposes the processing technology according to the domestic and foreign research progress. Based on the lexical and syntactic characteristics of Chinese sign language and the latest research results in foreign Sign Linguistics, this paper puts forward a solution to the processing of Chinese sign language. We suggest that the future study of sign linguistics will rely more on the interdisciplinary study and multi-mode approach, and its progress will promote the technology of information accessibility.

Please choose a citation manager

Content to export

2015 Volume 29 Issue 5 Published: 10 September 2015