Journal of Chinese Information Processing

Language Analysis and Calculation

A Study on Quotation Recognition Based on Sequence Labeling

JIA Honghao, LUO Zhiyong

2019, 33(2): 1-7.

The automatic recognition of inter-sentence quotation relationships is a meaningful issue in discourse analysis, because the quotation relationship between sentences influences the analysis of sentence groups. At present, there are few studies on quotation relationships in natural language processing. This paper makes a preliminary exploration of the relationship between quoted sentences and studies quotation identification with conditional random fields (CRF) and a bidirectional long short-term memory network enhanced with CRF (BLSTM-CRF). It also introduces the governors in the leading sentence into the models. The experimental results show that the CRF model achieves the higher precision, 85.49%, while the BLSTM-based model achieves the higher F-value, 79.60%.

Language Analysis and Calculation

Semantic Annotation Framework for Chinese Logical Complement

ZHANG Kunli, HAN Yingjie, JIA Yuxiang, MU Lingling, SUI Zhifang, ZAN Hongying

2019, 33(2): 8-16.

Logical complement semantics refers to the meaning expressed by elements of negation, degree, tense and aspect, modality and mood that are attached to a basic predicate-centered proposition in a sentence. It is embodied as the semantic constraint relation between logical semantic operators and the predicate. As an effective supplement to the semantic relations expressed by the elements of a basic proposition, logical complement semantics is important for the deep understanding of sentence semantics. This paper proposes a Chinese logical complement semantic annotation framework for deep semantic understanding. Specifically, classification systems and operator dictionaries are constructed for representing negation, degree, tense and aspect, and mood based on existing research results. Annotation rules are established to annotate logical complement semantics for sentences that have already been tagged for basic propositional arguments. Finally, statistics of the annotation results are presented, and the problems encountered in the annotation process are analyzed.

Language Analysis and Calculation

Chinese Chunked-based Heterogeneous Entailment Parser and Boundary Identification

JIN Tianhua, JIANG Shan, YU Dong, ZHAO Meiqian, LIU Lu

2019, 33(2): 17-25.

Recognizing textual entailment (RTE) is a challenging issue in natural language processing. This paper proposes to categorize textual entailment into three types: lexical entailment, chunked-based heterogeneous entailment and common-sense entailment. Focusing on the concept of chunked-based heterogeneous entailment, we further present a chunk annotation standard and a labeled dataset. We then explore a rule-based model and a deep learning model for the automatic detection of chunk entailments. The experimental results show that the deep learning model adopted in this paper can discover the entailment fragments effectively.

Language Analysis and Calculation

On the English Translation of De-construction in Legal Texts — A Case Study on Chinese-English Parallel Corpus of General Principles of the Civil Law

FENG Wenhe, GUO Haifang, YANG Hua

2019, 33(2): 26-35.

This paper identifies and analyzes the English translation of the Chinese De-construction that expresses a conditional relation in legal texts. Quantitative investigations of the English translation of the Chinese De-construction in the legal text of the General Principles of the Civil Law show that: 1. There are more adverbial clauses than attributive clauses (85.40%>14.60%). 2. Finite forms appear more frequently than non-finite forms (87.59%>12.41%). “Present time” accounts for the absolute majority (99.17%) of the finite forms, and prepositional phrases account for the majority (64.71%) of the non-finite forms. 3. “If” ranks top among the adverbial introductory words (86.32%), and “who” among the attributive introductory words (60.00%). This paper suggests that the De-construction in Chinese legal texts is a clause rather than a phrase, and that the word “De” is a discourse connective indicating a conditional relation.

Language Resources Construction

Construction of Chinese Dependency Syntax Treebanks for Multi-domain and Multi-source Texts

GUO Lijuan, PENG Xue, LI Zhenghua, ZHANG Min

2019, 33(2): 34-42.

The existing Chinese dependency treebanks are mainly annotated for canonical texts and give little consideration to web texts such as blogs, Weibo, and WeChat. This paper presents a large-scale treebank annotation effort based on a recently designed annotation guideline and an online annotation system. Altogether 15 part-time annotators are involved, and a strict annotation procedure is applied to guarantee quality. So far, we have annotated about 30,000 Chinese sentences with their dependency syntax trees, including about 10,000 sentences from Taobao headline texts. This paper describes the details of data selection and the annotation workflow. We also analyze the annotation accuracy, the inter-annotator consistency, and the distribution of the annotated data.

Language Resources Construction

Construction of Parallel Corpus of Chinese and Sign Language for ELAN

WU Ruizhu, LI Hanjing, LV Huihua, YAO Dengfeng

2019, 33(2): 43-50.

The construction of a parallel corpus of Chinese and sign language is of significance for machine translation and contrastive language studies. The corpus presented in this paper consists of sign language videos, information about the collectors and annotators, and 14 layers of labeling information (manual or non-manual) created with the multimedia labeling software ELAN. Cosine similarity based on the vector space model (VSM) is adopted to remove duplicates from the corpus. It is also used to check the similarity of the experts' annotations to ensure the quality of the corpus.
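
The VSM-based cosine similarity mentioned in this abstract can be illustrated with a minimal sketch. It assumes whitespace-tokenized sentences, raw term counts as vector weights, and a cut-off of 0.9; the abstract specifies none of these, and the function names `cosine_similarity` and `deduplicate` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(sentences, threshold=0.9):
    """Keep a sentence only if it is not too similar to one already kept.

    The threshold is an assumed value, not taken from the paper.
    """
    kept = []
    for sentence in sentences:
        tokens = sentence.split()
        if all(cosine_similarity(tokens, k.split()) < threshold for k in kept):
            kept.append(sentence)
    return kept
```

The same similarity function could in principle be applied to two annotators' label sequences for one video to flag low-agreement items, although the abstract does not say how the comparison is set up.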

Knowledge Representation and Acquisition

Knowledge Representation Learning for Joint Structural and Textual Embedding Via Attention-based CNN

PENG Min, YAO Yalan, XIE Qianqian, GAO Wang

2019, 33(2): 51-58.

Knowledge representation learning has attracted much attention in natural language processing, with encouraging results on tasks such as entity linking, relation extraction and question answering. However, most existing models use only the structural information of the knowledge graph and cannot handle new entities, or entities with few facts, very well. This paper proposes a joint knowledge representation model that utilizes both entity descriptions and structural information. First, we introduce convolutional neural network models to encode the entity description. Then, we design an attention mechanism to select the valid information in the text. Moreover, we introduce a position vector as supplementary information. Finally, a gating mechanism is applied to integrate the structural and textual information into the joint representation. Experimental results show that our models outperform the baselines on link prediction and triplet classification tasks.
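
As a rough illustration of how a gating mechanism can merge two embeddings of the same entity, the sketch below uses one common element-wise form: a sigmoid gate interpolates between the structural and the textual vector. The abstract does not give the authors' formulation, so this form, the function name `gated_combine` and the parameter `gate_logits` are all assumptions.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_combine(structural, textual, gate_logits):
    """Element-wise gate: g * structural + (1 - g) * textual.

    gate_logits stands in for parameters that would be learned
    during training in a real model (an assumption in this sketch).
    """
    if not (len(structural) == len(textual) == len(gate_logits)):
        raise ValueError("all three vectors must have the same length")
    joint = []
    for s, t, logit in zip(structural, textual, gate_logits):
        g = sigmoid(logit)
        joint.append(g * s + (1.0 - g) * t)
    return joint
```

With all gate logits at zero the gate is 0.5 in every dimension and the joint vector is the plain average of the two inputs; large positive logits favor the structural embedding, large negative ones the textual embedding.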

Ethnic Language Processing and Cross Language Processing

Longest Noun Phrases Detection in Tibetan

LONG Congjun, LIU Huidan, ZHOU Maoke

2019, 33(2): 59-66.

The longest noun phrases carry abundant syntactic and semantic information and correspond to a syntactic component in most cases. By comparing the nature of the different longest noun phrases, this paper defines the longest noun phrase in the Tibetan language from the perspective of the syntactic tree. A total of 6,038 sentences are extracted from a Tibetan treebank, and the structure types, boundary features and frequencies of the longest noun phrases are analyzed. Two approaches, a sequence labeling model and a parsing algorithm, are investigated to detect the longest noun phrases in Tibetan. Experiments show that the sequence labeling approach performs better, yielding 87.14% precision, 84.72% recall and 85.92% F-value.
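
The relation between the three reported figures can be checked with a short sketch. It assumes the F-value is the balanced F-measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta=1.0 gives the balanced F-measure; inputs and output share
    the same unit (here, percentages).
    """
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract above.
f_value = f_measure(87.14, 84.72)
```

Recomputed from the rounded precision and recall, the value comes out at about 85.91, which agrees with the reported 85.92% up to rounding of the inputs.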

Ethnic Language Processing and Cross Language Processing

Tibetan Interrogative Sentences Parsing Based on PCFG

BAN Mabao, CAI Zhijie, LAMA Zhaxi

2019, 33(2): 67-74.

The syntactic analysis of Tibetan interrogative sentences has broad application prospects in areas such as Tibetan question answering systems, search engines, and information extraction and retrieval. By analyzing the features of Tibetan interrogative sentences, this paper classifies them and summarizes the structural features of each type. The PCFG method is then used to parse the Tibetan interrogative sentences. The experiment yields 96.0%, 95.4% and 95.7% in accuracy, recall and F-value, respectively.

Ethnic Language Processing and Cross Language Processing

Neural Network Based Tibetan Speech Synthesis

DOU Gecao, CAI Rangzhuoma, NAN Cuoji, SUAN Taiben

2019, 33(2): 75-80.

Speech synthesis is one of the core technologies of human-computer interaction. With the development of neural networks, neural speech synthesis has attracted more and more attention. After analyzing the structure and spelling rules of Tibetan characters, this paper studies Tibetan speech synthesis by combining a sequence-to-sequence model with an attention mechanism. The experimental results show that this method performs well on Tibetan speech synthesis.

Ethnic Language Processing and Cross Language Processing

Segmentation and Extraction Method for Manchu Words Based on Seam Carving

ZHANG Jing, XU Shuang, HE Jianjun, LI Min, ZHENG Ruirui

2019, 33(2): 81-88.

An important step in Manchu document analysis is the segmentation and extraction of Manchu words from large images of Manchu documents. This paper proposes a new Manchu word segmentation and extraction method based on seam carving. First, the number of text lines is detected by a projection profile matching method, and the lines are then painted. Second, the minimum energy line between adjacent text lines is located by dynamic programming from bottom to top, and the best segmentation lines that do not cut through Manchu word components are determined by constraining the midline areas. Finally, the independent Manchu text columns and Manchu words are extracted according to the segmentation curves. Experimental results show that this method achieves better segmentation and extraction results on Manchu document image datasets.
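
The minimum-energy line search at the core of seam carving can be sketched as a small dynamic program. This is the generic textbook formulation on a toy energy grid, not the authors' implementation: the energy function, the midline constraint and the handling of Manchu text direction are omitted, and the function name `min_energy_seam` is hypothetical.

```python
def min_energy_seam(energy):
    """Find the cheapest connected seam running from the top row to the bottom row.

    energy is a non-empty list of equal-length rows of numbers.
    Returns (total_cost, columns), where columns[r] is the seam's
    column index in row r; adjacent rows differ by at most one column.
    """
    rows, cols = len(energy), len(energy[0])
    cost = [row[:] for row in energy]
    # Accumulate costs from the bottom row upward.
    for r in range(rows - 2, -1, -1):
        for c in range(cols):
            lo, hi = max(0, c - 1), min(cols - 1, c + 1)
            cost[r][c] += min(cost[r + 1][lo:hi + 1])
    # Start at the cheapest cell in the top row and follow the minima down.
    c = min(range(cols), key=lambda j: cost[0][j])
    total = cost[0][c]
    seam = [c]
    for r in range(1, rows):
        lo, hi = max(0, c - 1), min(cols - 1, c + 1)
        c = min(range(lo, hi + 1), key=lambda j: cost[r][j])
        seam.append(c)
    return total, seam
```

In a document image, low energy would typically correspond to background pixels, so the cheapest seam tends to pass through the gap between neighboring text lines rather than through ink.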

Information Extraction and Text Mining

Drug-Drug Interaction Extraction with the Attention Mechanism Over the Dependency

LI Lishuang, QIAN Shuang, ZHOU Anqiao, LIU Yang, GUO Yuankai

2019, 33(2): 89-96.

Drug-Drug Interaction (DDI) extraction is an important issue in biomedical relation extraction. Most existing methods emphasize key information such as the entities and their positions in the sentence. To further exploit sentence structure, this paper proposes a DDI extraction model based on an attention mechanism over the dependency path. The correlation between the shortest dependency path and the sentence is measured to capture useful information. First, the model uses BiGRU networks to learn the semantic and contextual information of the original sentence and of the Shortest Dependency Path (SDP), respectively. Second, the SDP information is incorporated into the original sentence information through the attention mechanism. Finally, the resulting sentence representation is used to classify and predict DDIs. The approach is evaluated on the DDIExtraction 2013 corpus, yielding a micro F-score of 73.72%.

Sentiment Analysis and Social Computing

Analysis and Validation of Network Representation Algorithms

WANG Yan, TANG Jie

2019, 33(2): 97-104.

Network representation learning is a popular topic in social network analysis. This paper verifies existing network representation learning algorithms on network data with different structures. To evaluate the effectiveness, efficiency and application limits of the various algorithms, we use the multi-label classification of network nodes to compare ten algorithms from three categories on eight data sets. The experimental results show that deep learning algorithms such as DeepWalk perform stably and well on various types of networks, while the application of algorithms based on matrix factorization is limited by their high space complexity.

Sentiment Analysis and Social Computing

Attention Enhanced Bi-directional LSTM for Sentiment Analysis

GUAN Pengfei, LI Bao'an, LV Xueqiang, ZHOU Jianshe

2019, 33(2): 105-111.

To deal with sentiment analysis at the sentence level, this paper proposes an attention-enhanced bi-directional LSTM. It employs an attention mechanism to learn the weight distribution of each word's sentiment tendency directly from the word vectors. Tested on the NLPCC 2014 sentiment analysis dataset, the model outperforms other sentence-level sentiment classification models.

Sentiment Analysis and Social Computing

Fine-grained Opinion Mining Based on Feature Representation of Domain Sentiment Lexicon

YU Shengwei, LU Qi, CHEN Wenliang

2019, 33(2): 112-121.

Fine-grained opinion mining aims at detecting sentiment units and determining their sentiment polarity in opinion text. Recent methods are mostly based on sequence labeling models and rarely use the information in sentiment lexicon resources. This paper proposes a fine-grained opinion mining method based on a feature representation of a domain sentiment lexicon: the feature representation is generated from the lexicon and used as input to the sequence labeling model. We build a new sentiment lexicon for the e-commerce domain and design lexicon-based feature representations for CRF and BiLSTM-CRF. Experiments on e-commerce reviews show that the proposed method performs well on both models and outperforms methods based on other lexica.

Sentiment Analysis and Social Computing

Shareholder's Portrait Construction in Stock Market

YU Hualei, RAO Yuan, TANG Caifang, REN Haoran

2019, 33(2): 122-130.

The shareholder portrait provides a new way to quickly understand the real preference characteristics behind shareholders' market behaviors, which is of significance for the investment decisions of external investors. Constructing shareholder portraits is especially meaningful given the abnormal fluctuations in Chinese stock prices caused by the frequent market behavior of the top ten circulating shareholders, who always seem to grasp trading opportunities perfectly. This paper analyzes the investment behavior of the top ten circulating shareholders and constructs the shareholder portrait from two aspects: degree of activeness and preference characteristics. Moreover, the shareholders are further classified as individuals, organizations and funds, and the completed portrait is designed to cover all aspects of these three kinds of shareholder. In addition, some methods of shareholder labeling are put forward, and some issues in dealing with shareholders' characteristics are discussed together with solutions.

NLP Application

Statistics and Analysis of Long Novels by Yu Hua and Mo Yan

TU Mengchun, LIU Ying

2019, 33(2): 131-142.

This article uses long novels by Yu Hua and Mo Yan, five for each author, as the corpus. The lengths of paragraphs, sentences and clauses, together with color words, punctuation, parts of speech, words and n-grams, are selected as the features. Clustering and the Kolmogorov-Smirnov (K-S) test are applied to judge the overall similarity of the two authors, and the Wilcoxon test is adopted to validate the differences in specific characteristics between them. A detailed analysis reveals that Mo Yan employs a larger vocabulary and shows strong emotions, ancient expressions and regionalisms, while Yu Hua adopts a calm and satirical style.
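
The two-sample K-S test mentioned in this abstract compares the empirical distributions of a feature, for example sentence length, in two sets of texts. The sketch below computes only the test statistic, the largest gap between the two empirical cumulative distribution functions; it omits the p-value, and the function name `ks_statistic` is hypothetical. The abstract does not describe how the authors set up their test.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical
    CDFs of the two non-empty samples, a value between 0 and 1.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap
```

A statistic near 0 suggests that the two samples follow similar distributions, while a value near 1 indicates that they barely overlap.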

2019 Volume 33 Issue 2 Published: 25 February 2019