2020 Volume 34 Issue 8 Published: 18 September 2020
  

  • Language Analysis and Calculation
    LIN Xingxing, QIU Xiaofeng, LIU Yang, YU Mengxia, QI Jing, KANG Sichen
    2020, 34(8): 1-9.
    Neural network language models have many applications but offer little interpretation. An important and direct aspect of their interpretability is the association between word embedding vectors and linguistic features. Previous work on interpretability focuses on injecting knowledge into corpus-based word embeddings and on the theoretical analysis of training models, without directly verifying or discussing the correlation between word embedding vectors and linguistic features. In this paper, a pseudo-corpus derived from knowledge bases is applied. Preliminary findings include: 1) it is feasible to inject semantic features into word embedding vectors in a controlled way; 2) the compositionality of word embedding vectors, i.e., an upper concept can be represented by its lower concepts, is observed for the injected linguistic features; 3) the injection of semantic features is reflected in all dimensions of the word embedding vectors.
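    A minimal sketch of how the compositionality finding above could be probed, assuming the gensim library; the tiny pseudo-corpus and the concept words are illustrative placeholders, not the paper's data: train skip-gram vectors on the pseudo-corpus and compare an upper concept's vector with the sum of its lower concepts' vectors.
        # Sketch: probe compositionality of vectors trained on a pseudo-corpus.
        # The toy pseudo-corpus and concept names below are illustrative placeholders.
        import numpy as np
        from gensim.models import Word2Vec

        pseudo_corpus = [
            ["animal", "dog", "bark"],
            ["animal", "cat", "meow"],
            ["dog", "cat", "pet"],
        ] * 200                                   # repeat so the toy model has enough data

        model = Word2Vec(pseudo_corpus, vector_size=50, sg=1, window=3, min_count=1, epochs=20)

        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        upper = model.wv["animal"]                      # upper concept
        lower_sum = model.wv["dog"] + model.wv["cat"]   # sum of lower concepts
        print("cosine(upper, dog + cat):", cosine(upper, lower_sum))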
  • Language Analysis and Calculation
    SUN Qixin, RAO Gaoqi, XUN Endong
    2020, 34(8): 10-22.
    This paper collects a diachronic corpus of Chinese newspapers and periodicals covering the past 144 years, dating back to the late Qing Dynasty. A study on computing word semantic evolution in Chinese is conducted via statistical analysis and distributed word representations. Chinese words with potential semantic evolution are first discovered from the context overlap of content words via TF-IDF, word frequency ratio and other statistical indicators. Then, to align the word embeddings derived from corpora of different time periods, three methods are examined: orthogonal matrix alignment after SGNS training, second-order word vector representation, and SGNS incremental training (which performs best). Finally, word semantic evolution is identified by the diachronic self-similarity of the candidate word and the diachronic similarity of anchor words, with neighboring words used to describe the word meaning during the evolution.
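    A minimal sketch of the orthogonal matrix alignment step mentioned above, assuming two embedding matrices whose rows correspond to the same shared-vocabulary words in two time periods (the matrices here are random placeholders): the orthogonal map comes from an SVD (orthogonal Procrustes), after which a word's diachronic self-similarity is simply the cosine between its aligned vectors.
        # Sketch: align embeddings of an earlier period onto a later one with an
        # orthogonal map (Procrustes via SVD), then measure diachronic self-similarity.
        import numpy as np

        def orthogonal_align(X, Y):
            """Return an orthogonal R minimizing ||XR - Y||_F."""
            U, _, Vt = np.linalg.svd(X.T @ Y)
            return U @ Vt

        rng = np.random.default_rng(0)
        X = rng.normal(size=(5000, 100))   # placeholder embeddings, earlier period
        Y = rng.normal(size=(5000, 100))   # placeholder embeddings, later period
        R = orthogonal_align(X, Y)

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        word_idx = 42                      # index of a candidate word in the shared vocabulary
        print("diachronic self-similarity:", cosine(X[word_idx] @ R, Y[word_idx]))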
  • Language Analysis and Calculation
    KANG Sichen, YU Mengxia, LIU Yang
    2020, 34(8): 23-31.
    Knowledge representation and prediction for Chinese unknown words, including part of speech, word-formation structure and word meaning, is a fundamental issue in computational linguistics. According to the principle of Parallel Circumference, this paper extracts Parallel Conditions from existing semantic word-formation knowledge and identifies candidate unknown words with these word-formation factors. The method applies this linguistic theory to the identification of unknown words, achieving better explanatory ability, convenience and precision. These studies are expected to promote progress in computational lexicography, language research and teaching, and other humanities fields.
  • Language Resources Construction
    ZHANG Chenlin, WANG Mingwen, TAN Yiming, XIAO Wenyan
    2020, 34(8): 32-40.
    Euphemism is an indispensable device of language communication and has always been one of the hottest issues in linguistics. However, this issue is hardly addressed in the natural language processing community. In this paper, a corpus of euphemisms (about 63,000 sentences) is collected and identified manually, with reference to existing dictionaries. According to the dictionaries' definitions and the requirements of related natural language processing work, euphemisms are classified at the semantic level. With the collected corpus and classification, we attempt to identify polysemous euphemisms automatically, achieving an accuracy of 89.71% for simple euphemisms and 74.65% for complex ones.
  • Language Resources Construction
    FENG Luanluan, LI Junhui, LI Peifeng, ZHU Qiaoming
    2020, 34(8): 41-50.
    Massive literature and scientific information on the Internet can supply valuable intelligence. The detection of technologies and terminologies is fundamental for constructing an oriented national defense science (ONDS) technology knowledge base. We analyze the characteristics of military texts and design annotation guidelines for ONDS technologies and terminologies drawn from massive Internet content, guided by a list of military emerging technologies defined in Wikipedia. Based on the guidelines, we conduct a broad-scale corpus annotation process and construct an ONDS technology and terminology corpus covering three genres: news, papers and Wikipedia. We finally annotate 479 articles with 24,487 sentences and 33,756 technologies and terminologies. Meanwhile, we explore the feasibility of model pre-annotation, analyze the distribution of technologies and terminologies across genres, and calculate annotation consistency for the corpus. Experimental results based on the corpus show that the detection of technologies and terminologies achieves a 70.40% F1 score. The work presented in this paper lays a foundation for the detection of ONDS technologies and terminologies.
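    The abstract does not specify its consistency metric; as a hedged illustration only, one common way to calculate annotation consistency for span-style annotation is pairwise exact-span F1, treating one annotator as the reference. The tuples below are hypothetical.
        # Sketch: exact-span agreement (F1) between two annotators.
        # Each annotation is a (sentence_id, start, end, label) tuple; values are illustrative.
        def span_f1(anns_a, anns_b):
            a, b = set(anns_a), set(anns_b)
            tp = len(a & b)
            precision = tp / len(b) if b else 0.0
            recall = tp / len(a) if a else 0.0
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)

        annotator_1 = {(0, 3, 7, "TECH"), (1, 0, 4, "TERM")}
        annotator_2 = {(0, 3, 7, "TECH"), (1, 1, 4, "TERM")}
        print("pairwise span F1:", span_f1(annotator_1, annotator_2))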
  • Information Extraction and Text Mining
    CHEN Jiali, HONG Yu, WANG Jie, ZHANG Jingli, YAO Jianmin
    2020, 34(8): 51-60.
    Sentence-level event detection (ED) is the task of identifying and classifying event triggers. Existing approaches mainly use sentences as the input to a neural classification network and learn deep semantic information of the sentences. Based on the fact that the dependency tree contains rich syntactic structure features for more accurate sentence representation, we use a bidirectional long short-term memory network (Bi-LSTM) to learn semantic information and a graph convolutional network (GCN) to learn dependency information. To concentrate on event-related information and reduce the interference of redundant words, we add self-attention to the Bi-LSTM and the GCN, respectively. Finally, we propose a gated mechanism to dynamically fuse the semantic and dependency information. Experimental results on ACE show that the proposed method reaches F1-scores of 76.3% and 73.9% for trigger identification and event type classification, respectively.
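    A minimal PyTorch sketch of the gated fusion described above, assuming a per-token semantic vector from the Bi-LSTM and a per-token dependency vector from the GCN; the class name and dimensions are illustrative, not the authors' implementation.
        # Sketch: gate that dynamically fuses semantic (Bi-LSTM) and dependency (GCN) features.
        import torch
        import torch.nn as nn

        class GatedFusion(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.gate = nn.Linear(2 * dim, dim)

            def forward(self, h_sem, h_dep):
                # h_sem, h_dep: (batch, seq_len, dim)
                g = torch.sigmoid(self.gate(torch.cat([h_sem, h_dep], dim=-1)))
                return g * h_sem + (1 - g) * h_dep

        fusion = GatedFusion(dim=256)
        h_sem = torch.randn(2, 30, 256)   # Bi-LSTM outputs (placeholder)
        h_dep = torch.randn(2, 30, 256)   # GCN outputs (placeholder)
        fused = fusion(h_sem, h_dep)      # fed to the trigger classifier
        print(fused.shape)                # torch.Size([2, 30, 256])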
  • Information Extraction and Text Mining
    LI Yuan, MA Lei, SHAO Dangguo, YUAN Meiyu, ZHANG Mingfang
    2020, 34(8): 61-69.
    Chinese named entity recognition (NER) in social media is a challenging task. Existing methods based on word-level information or external knowledge are affected by errors in Chinese word segmentation (CWS) and by out-of-vocabulary (OOV) words. This paper proposes a character-based adversarial learning model using positional encoding and multi-attention. The combination of positional encoding and self-attention better captures dependencies within character sequences, while the spatial attention discriminator improves the extraction of external knowledge. Experimental results show that the proposed approach achieves F-scores of 56.79% and 60.62% on the Weibo2015 and Weibo2017 datasets, respectively.
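    A minimal PyTorch sketch of combining positional encoding with self-attention over character embeddings, as in the encoder described above; the adversarial training and the spatial-attention discriminator are omitted, and all sizes are illustrative.
        # Sketch: character embeddings + sinusoidal positional encoding + self-attention.
        import math
        import torch
        import torch.nn as nn

        def sinusoidal_positions(seq_len, dim):
            pos = torch.arange(seq_len).unsqueeze(1).float()
            div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
            pe = torch.zeros(seq_len, dim)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            return pe

        chars = torch.randint(0, 5000, (2, 40))          # character ids (batch, seq_len)
        embed = nn.Embedding(5000, 128)
        attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

        x = embed(chars) + sinusoidal_positions(40, 128)  # add positional information
        out, _ = attn(x, x, x)                            # self-attention over characters
        print(out.shape)                                  # torch.Size([2, 40, 128])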
  • Information Extraction and Text Mining
    CHEN Ru, LU Xianling
    2020, 34(8): 70-77.
    The IDC-HSAN (Iterated Dilated Convolutional Neural Networks and Hierarchical Self-Attention Network) model is constructed for Chinese named entity recognition to deal with the hierarchical text structure and the computational deficiency of RNNs. The model enables parallel computation on GPU and significantly reduces the time cost of LSTM. A hierarchical self-attention mechanism is applied to capture local and global semantic information. In addition, radical information is employed to enrich the embeddings. Experimental results show that this model identifies entities better than classical deep models with the attention mechanism.
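    A minimal PyTorch sketch of the iterated dilated convolution idea that replaces the recurrent encoder and allows parallel computation over the sequence; the dilation schedule, class name and sizes are illustrative assumptions, not the paper's configuration.
        # Sketch: a stack of dilated 1-D convolutions over character embeddings.
        import torch
        import torch.nn as nn

        class DilatedBlock(nn.Module):
            def __init__(self, dim, dilations=(1, 2, 4)):
                super().__init__()
                self.convs = nn.ModuleList(
                    nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d)
                    for d in dilations
                )

            def forward(self, x):
                # x: (batch, seq_len, dim) -> convolve over the sequence dimension
                h = x.transpose(1, 2)
                for conv in self.convs:
                    h = torch.relu(conv(h))
                return h.transpose(1, 2)

        block = DilatedBlock(dim=128)
        x = torch.randn(2, 50, 128)       # character (plus radical) embeddings, placeholder
        print(block(x).shape)             # torch.Size([2, 50, 128])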
  • Question-answering and Dialogue
    WANG Mengyu, YU Dingyao, YAN Rui, HU Wenpeng, ZHAO Dongyan
    2020, 34(8): 78-85.
    The multi-turn dialogue task requires the system to attend to context information while generating fluent answers. Recently, a large number of multi-turn dialogue models based on the HRED (Hierarchical Recurrent Encoder-Decoder) model have been developed, reporting good results on English dialogue datasets such as Movie-DiC. On a high-quality real-world customer-service dialogue corpus released to contestants by Jingdong in 2018, this article investigates the performance of the HRED model and explores possible improvements. It is revealed that combining the attention and ResNet mechanisms with the HRED model achieves significant improvements.
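    A minimal PyTorch sketch of the HRED idea (an utterance-level encoder whose final states feed a context-level encoder), with a ResNet-style skip connection standing in for the reported residual improvement; the attention part and the decoder are omitted, and all sizes are illustrative.
        # Sketch: hierarchical encoding of a multi-turn dialogue (HRED-style),
        # with a residual connection around the context encoder.
        import torch
        import torch.nn as nn

        utt_encoder = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
        ctx_encoder = nn.GRU(input_size=256, hidden_size=256, batch_first=True)

        # 3 turns, each a sequence of 20 token embeddings (batch size 2), placeholder data
        turns = torch.randn(2, 3, 20, 128)

        utt_states = []
        for t in range(turns.size(1)):
            _, h = utt_encoder(turns[:, t])          # h: (1, batch, 256)
            utt_states.append(h.squeeze(0))
        utt_states = torch.stack(utt_states, dim=1)  # (batch, turns, 256)

        ctx_out, _ = ctx_encoder(utt_states)
        ctx_out = ctx_out + utt_states               # ResNet-style skip connection
        print(ctx_out.shape)                         # torch.Size([2, 3, 256]), fed to the decoder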
  • Information Retrieval and Question-answering System
    CAO Junmei, MA Lerong
    2020, 34(8): 86-93.
    Re-ranking retrieved documents is usually required to further improve performance in many information retrieval tasks. In this paper, we apply multi-channel deep convolutional neural networks (CNNs) to listwise learning-to-rank approaches, namely ListCNN. For the multi-modal features extracted from documents, we find that some features are locally correlated and redundant. Accordingly, we employ the modified CNNs to re-extract features so as to boost the performance of classical listwise approaches. Validated on the public LETOR 4.0 datasets, the proposed ListCNN architecture demonstrates superior re-ranking performance compared with other state-of-the-art methods.
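    As a hedged illustration of the listwise setting only (the paper's exact objective is not given here), a ListNet-style top-one probability loss over a list of candidate documents could look as follows; the scoring network is a placeholder rather than the ListCNN architecture.
        # Sketch: ListNet-style top-one probability loss for a list of documents.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def listnet_loss(pred_scores, true_scores):
            # Cross entropy between top-one probability distributions over the list.
            return -(F.softmax(true_scores, dim=-1) *
                     F.log_softmax(pred_scores, dim=-1)).sum(dim=-1).mean()

        scorer = nn.Sequential(nn.Linear(46, 64), nn.ReLU(), nn.Linear(64, 1))

        features = torch.randn(4, 25, 46)             # 4 queries, 25 candidates, 46 LETOR features
        labels = torch.randint(0, 3, (4, 25)).float() # graded relevance labels (placeholder)

        scores = scorer(features).squeeze(-1)         # (4, 25)
        loss = listnet_loss(scores, labels)
        loss.backward()
        print(float(loss))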
  • Sentiment Analysis and Social Computing
    CHEN Bo, XIE Jun, MIAO Duoqian, WANG Yuzhu, XU Xinying
    2020, 34(8): 94-104.
    Rough set theory is a mathematical tool that can greatly reduce the dimensionality and number of text sentiment word features while keeping the text sentiment classification ability unchanged. To address the excessive dimensionality of text sentiment word features and the lack of semantic information in sentiment word feature representations, this article proposes a novel Chinese text sentiment word feature representation method named RS-WvGv. A rough-set decision table is used to model the text sentiment word features of the whole corpus, and the Johnson attribute reduction algorithm is applied to simplify the decision table and obtain the minimal set of text sentiment word feature attributes. Then, based on the word embeddings of all sentiment feature words in the set, the RS-WvGv method is verified experimentally with a logistic regression classifier.
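    A minimal sketch of Johnson's greedy reduction heuristic as it is commonly described for rough sets: repeatedly pick the attribute that discerns the most remaining object pairs. The toy discernibility sets over attributes f1..f4 are illustrative, not the paper's decision table.
        # Sketch: Johnson's greedy heuristic over discernibility sets
        # (one attribute set per object pair that must be discerned).
        def johnson_reduct(discernibility):
            reduct = set()
            remaining = [s for s in discernibility if s]
            while remaining:
                # choose the attribute covering the most remaining pairs
                counts = {}
                for s in remaining:
                    for a in s:
                        counts[a] = counts.get(a, 0) + 1
                best = max(counts, key=counts.get)
                reduct.add(best)
                remaining = [s for s in remaining if best not in s]
            return reduct

        pairs = [{"f1", "f2"}, {"f2", "f3"}, {"f3"}, {"f1", "f4"}]
        print(johnson_reduct(pairs))   # a small covering attribute set, e.g. {'f1', 'f3'}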
  • Sentiment Analysis and Social Computing
    ZHANG Weisheng, WANG Zhongqing, LI Shoushan, ZHOU Guodong
    2020, 34(8): 105-112.
    A correlation usually exists between a speaker's sentiment and act in daily dialogues, which can also be reflected in the dialogue structure. Therefore, we propose a joint model to classify the sentiment and act of each utterance by using the dialogue structure. Moreover, we use the attention mechanism to capture the impact of the dialogue structure on the sentiment of each utterance. Experiments show that the proposed model outperforms state-of-the-art models in both dialogue sentiment classification and act classification.
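    A minimal PyTorch sketch of joint utterance-level classification with a shared dialogue-level encoder and two output heads (sentiment and act); the structural attention described above is reduced to a plain bidirectional GRU here, and all names and sizes are illustrative.
        # Sketch: joint sentiment/act classification over the utterances of one dialogue.
        import torch
        import torch.nn as nn

        class JointClassifier(nn.Module):
            def __init__(self, dim, n_sentiment, n_act):
                super().__init__()
                self.context = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
                self.sentiment_head = nn.Linear(2 * dim, n_sentiment)
                self.act_head = nn.Linear(2 * dim, n_act)

            def forward(self, utt_vectors):
                # utt_vectors: (batch, n_utterances, dim) pre-encoded utterance vectors
                h, _ = self.context(utt_vectors)
                return self.sentiment_head(h), self.act_head(h)

        model = JointClassifier(dim=128, n_sentiment=3, n_act=5)
        utts = torch.randn(2, 8, 128)                        # placeholder utterance vectors
        sent_logits, act_logits = model(utts)
        loss = (nn.functional.cross_entropy(sent_logits.reshape(-1, 3), torch.randint(0, 3, (16,)))
                + nn.functional.cross_entropy(act_logits.reshape(-1, 5), torch.randint(0, 5, (16,))))
        print(sent_logits.shape, act_logits.shape)           # joint training sums the two losses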