2020 Volume 34 Issue 8 Published: 18 September 2020
  

  • Language Analysis and Calculation
    LIN Xingxing, QIU Xiaofeng, LIU Yang, YU Mengxia, QI Jing, KANG Sichen
    2020, 34(8): 1-9.
    Neural network language models have many applications but offer little interpretation. An important and direct aspect of their interpretability is the association between word embedding vectors and linguistic features. Previous work on interpretability focuses on injecting knowledge into corpus-based word embeddings and on the theoretical analysis of training models, without directly verifying or discussing the correlation between word embedding vectors and linguistic features. In this paper, a pseudo-corpus derived from knowledge bases is applied. Preliminary findings include: 1) it is feasible to inject semantic features into word embedding vectors in a controlled way; 2) the compositionality of word embedding vectors, i.e., an upper concept can be represented by its lower concepts, is observed for the injected linguistic features; 3) the injection of semantic features is reflected in all dimensions of the word embedding vectors.
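    A minimal sketch of how the compositionality finding above could be probed, assuming the gensim library; the tiny pseudo-corpus and the concept words are illustrative placeholders, not the paper's data: train skip-gram vectors on the pseudo-corpus and compare an upper concept's vector with the sum of its lower concepts' vectors.
        # Sketch: probe compositionality of vectors trained on a pseudo-corpus.
        # The toy pseudo-corpus and concept names below are illustrative placeholders.
        import numpy as np
        from gensim.models import Word2Vec

        pseudo_corpus = [
            ["animal", "dog", "bark"],
            ["animal", "cat", "meow"],
            ["dog", "cat", "pet"],
        ] * 200                                   # repeat so the toy model has enough data

        model = Word2Vec(pseudo_corpus, vector_size=50, sg=1, window=3, min_count=1, epochs=20)

        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        upper = model.wv["animal"]                      # upper concept
        lower_sum = model.wv["dog"] + model.wv["cat"]   # sum of lower concepts
        print("cosine(upper, dog + cat):", cosine(upper, lower_sum))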
  • Language Analysis and Calculation
    SUN Qixin, RAO Gaoqi, XUN Endong
    2020, 34(8): 10-22.
    This paper collects a diachronic corpus of Chinese newspapers and periodicals covering the past 144 years, dating back to the late Qing Dynasty. A study on computing word semantic evolution in Chinese is conducted via statistical analysis and distributed word representations. Chinese words with potential semantic evolution are first discovered from the context overlap of content words via TF-IDF, word frequency ratio and other statistical indicators. Then, to align the word embeddings derived from corpora of different time periods, three methods are examined: orthogonal matrix alignment after SGNS training, second-order word vector representation, and SGNS incremental training (which performs best). Finally, word semantic evolution is identified by the diachronic self-similarity of the candidate word and the diachronic similarity of anchor words, with neighboring words used to describe the word meaning during the evolution.
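    A minimal sketch of the orthogonal matrix alignment step mentioned above, assuming two embedding matrices whose rows correspond to the same shared-vocabulary words in two time periods (the matrices here are random placeholders): the orthogonal map comes from an SVD (orthogonal Procrustes), after which a word's diachronic self-similarity is simply the cosine between its aligned vectors.
        # Sketch: align embeddings of an earlier period onto a later one with an
        # orthogonal map (Procrustes via SVD), then measure diachronic self-similarity.
        import numpy as np

        def orthogonal_align(X, Y):
            """Return an orthogonal R minimizing ||XR - Y||_F."""
            U, _, Vt = np.linalg.svd(X.T @ Y)
            return U @ Vt

        rng = np.random.default_rng(0)
        X = rng.normal(size=(5000, 100))   # placeholder embeddings, earlier period
        Y = rng.normal(size=(5000, 100))   # placeholder embeddings, later period
        R = orthogonal_align(X, Y)

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        word_idx = 42                      # index of a candidate word in the shared vocabulary
        print("diachronic self-similarity:", cosine(X[word_idx] @ R, Y[word_idx]))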
  • Language Analysis and Calculation
    KANG Sichen, YU Mengxia, LIU Yang
    2020, 34(8): 23-31.
    Knowledge representation and prediction for Chinese unknown words, including part of speech, word-formation structure and word meaning, is a fundamental issue in computational linguistics. According to the principle of Parallel Circumference, this paper extracts Parallel Conditions from existing semantic word-formation knowledge and identifies candidate unknown words with these word-formation factors. The method applies this linguistic theory to the identification of unknown words, achieving better explanatory ability, convenience and precision. These studies are expected to promote progress in computational lexicography, language research and teaching, and other humanities fields.
  • Language Resources Construction
    ZHANG Chenlin, WANG Mingwen, TAN Yiming, XIAO Wenyan
    2020, 34(8): 32-40.
    Euphemism is an indispensable device of language communication and has always been one of the hottest issues in linguistics. However, this issue is hardly addressed in the natural language processing community. In this paper, a corpus of euphemisms (about 63,000 sentences) is collected and identified manually, with reference to existing dictionaries. According to the dictionaries' definitions and the requirements of related natural language processing work, euphemisms are classified at the semantic level. With the collected corpus and classification, we attempt to identify polysemous euphemisms automatically, achieving an accuracy of 89.71% for simple euphemisms and 74.65% for complex ones.
  • Language Resources Construction
    FENG Luanluan, LI Junhui, LI Peifeng, ZHU Qiaoming
    2020, 34(8): 41-50.
    Massive literature and scientific information on the Internet can supply valuable intelligence. The detection of technologies and terminologies is fundamental for constructing an oriented national defense science (ONDS) technology knowledge base. We analyze the characteristics of military texts and design annotation guidelines for ONDS technologies and terminologies drawn from massive Internet content, guided by a list of military emerging technologies defined in Wikipedia. Based on the guidelines, we conduct a broad-scale corpus annotation process and construct an ONDS technology and terminology corpus covering three genres: news, papers and Wikipedia. We finally annotate 479 articles with 24,487 sentences and 33,756 technologies and terminologies. Meanwhile, we explore the feasibility of model pre-annotation, analyze the distribution of technologies and terminologies across genres, and calculate annotation consistency for the corpus. Experimental results based on the corpus show that the detection of technologies and terminologies achieves a 70.40% F1 score. The work presented in this paper lays a foundation for the detection of ONDS technologies and terminologies.
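    The abstract does not specify its consistency metric; as a hedged illustration only, one common way to calculate annotation consistency for span-style annotation is pairwise exact-span F1, treating one annotator as the reference. The tuples below are hypothetical.
        # Sketch: exact-span agreement (F1) between two annotators.
        # Each annotation is a (sentence_id, start, end, label) tuple; values are illustrative.
        def span_f1(anns_a, anns_b):
            a, b = set(anns_a), set(anns_b)
            tp = len(a & b)
            precision = tp / len(b) if b else 0.0
            recall = tp / len(a) if a else 0.0
            if precision + recall == 0:
                return 0.0
            return 2 * precision * recall / (precision + recall)

        annotator_1 = {(0, 3, 7, "TECH"), (1, 0, 4, "TERM")}
        annotator_2 = {(0, 3, 7, "TECH"), (1, 1, 4, "TERM")}
        print("pairwise span F1:", span_f1(annotator_1, annotator_2))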
  • Information Extraction and Text Mining
    CHEN Jiali, HONG Yu, WANG Jie, ZHANG Jingli, YAO Jianmin
    2020, 34(8): 51-60.
    Sentence-level event detection (ED) is the task of identifying and classifying event triggers. Existing approaches mainly use sentences as the input to a neural classification network and learn deep semantic information of the sentences. Based on the fact that the dependency tree contains rich syntactic structure features for more accurate sentence representation, we use a bidirectional long short-term memory network (Bi-LSTM) to learn semantic information and a graph convolutional network (GCN) to learn dependency information. To concentrate on event-related information and reduce the interference of redundant words, we add self-attention to the Bi-LSTM and the GCN, respectively. Finally, we propose a gated mechanism to dynamically fuse the semantic and dependency information. Experimental results on ACE show that the proposed method reaches F1-scores of 76.3% and 73.9% for trigger identification and event type classification, respectively.
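    A minimal PyTorch sketch of the gated fusion described above, assuming a per-token semantic vector from the Bi-LSTM and a per-token dependency vector from the GCN; the class name and dimensions are illustrative, not the authors' implementation.
        # Sketch: gate that dynamically fuses semantic (Bi-LSTM) and dependency (GCN) features.
        import torch
        import torch.nn as nn

        class GatedFusion(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.gate = nn.Linear(2 * dim, dim)

            def forward(self, h_sem, h_dep):
                # h_sem, h_dep: (batch, seq_len, dim)
                g = torch.sigmoid(self.gate(torch.cat([h_sem, h_dep], dim=-1)))
                return g * h_sem + (1 - g) * h_dep

        fusion = GatedFusion(dim=256)
        h_sem = torch.randn(2, 30, 256)   # Bi-LSTM outputs (placeholder)
        h_dep = torch.randn(2, 30, 256)   # GCN outputs (placeholder)
        fused = fusion(h_sem, h_dep)      # fed to the trigger classifier
        print(fused.shape)                # torch.Size([2, 30, 256])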
  • Information Extraction and Text Mining
    LI Yuan, MA Lei, SHAO Dangguo, YUAN Meiyu, ZHANG Mingfang
    2020, 34(8): 61-69.
    Chinese named entity recognition (NER) in social media is a challenging task. Existing methods based on word-level information or external knowledge are affected by errors in Chinese word segmentation (CWS) and by out-of-vocabulary (OOV) words. This paper proposes a character-based adversarial learning model using positional encoding and multi-attention. The combination of positional encoding and self-attention better captures dependencies within character sequences, while the spatial attention discriminator improves the extraction of external knowledge. Experimental results show that the proposed approach achieves F-scores of 56.79% and 60.62% on the Weibo2015 and Weibo2017 datasets, respectively.
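    A minimal PyTorch sketch of combining positional encoding with self-attention over character embeddings, as in the encoder described above; the adversarial training and the spatial-attention discriminator are omitted, and all sizes are illustrative.
        # Sketch: character embeddings + sinusoidal positional encoding + self-attention.
        import math
        import torch
        import torch.nn as nn

        def sinusoidal_positions(seq_len, dim):
            pos = torch.arange(seq_len).unsqueeze(1).float()
            div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
            pe = torch.zeros(seq_len, dim)
            pe[:, 0::2] = torch.sin(pos * div)
            pe[:, 1::2] = torch.cos(pos * div)
            return pe

        chars = torch.randint(0, 5000, (2, 40))          # character ids (batch, seq_len)
        embed = nn.Embedding(5000, 128)
        attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)

        x = embed(chars) + sinusoidal_positions(40, 128)  # add positional information
        out, _ = attn(x, x, x)                            # self-attention over characters
        print(out.shape)                                  # torch.Size([2, 40, 128])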
  • Information Extraction and Text Mining
    CHEN Ru, LU Xianling
    2020, 34(8): 70-77.
    The IDC-HSAN (Iterated Dilated Convolutional Neural Networks and Hierarchical Self-Attention Network) model is constructed for Chinese named entity recognition to deal with the hierarchical text structure and the computational deficiency of RNNs. The model enables parallel computation on GPU and significantly reduces the time cost of LSTM. A hierarchical self-attention mechanism is applied to capture local and global semantic information. In addition, radical information is employed to enrich the embeddings. Experimental results show that this model identifies entities better than classical deep models with the attention mechanism.
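    A minimal PyTorch sketch of the iterated dilated convolution idea that replaces the recurrent encoder and allows parallel computation over the sequence; the dilation schedule, class name and sizes are illustrative assumptions, not the paper's configuration.
        # Sketch: a stack of dilated 1-D convolutions over character embeddings.
        import torch
        import torch.nn as nn

        class DilatedBlock(nn.Module):
            def __init__(self, dim, dilations=(1, 2, 4)):
                super().__init__()
                self.convs = nn.ModuleList(
                    nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d)
                    for d in dilations
                )

            def forward(self, x):
                # x: (batch, seq_len, dim) -> convolve over the sequence dimension
                h = x.transpose(1, 2)
                for conv in self.convs:
                    h = torch.relu(conv(h))
                return h.transpose(1, 2)

        block = DilatedBlock(dim=128)
        x = torch.randn(2, 50, 128)       # character (plus radical) embeddings, placeholder
        print(block(x).shape)             # torch.Size([2, 50, 128])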
  • Question-answering and Dialogue
    WANG Mengyu, YU Dingyao, YAN Rui, HU Wenpeng, ZHAO Dongyan
    2020, 34(8): 78-85.
    The multi-turn dialogue task requires the system to attend to context information while generating fluent answers. Recently, a large number of multi-turn dialogue models based on the HRED (Hierarchical Recurrent Encoder-Decoder) model have been developed, reporting good results on English dialogue datasets such as Movie-DiC. On a high-quality real-world customer-service dialogue corpus released to contestants by Jingdong in 2018, this article investigates the performance of the HRED model and explores possible improvements. It is revealed that combining the attention and ResNet mechanisms with the HRED model achieves significant improvements.
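    A minimal PyTorch sketch of the HRED idea (an utterance-level encoder whose final states feed a context-level encoder), with a ResNet-style skip connection standing in for the reported residual improvement; the attention part and the decoder are omitted, and all sizes are illustrative.
        # Sketch: hierarchical encoding of a multi-turn dialogue (HRED-style),
        # with a residual connection around the context encoder.
        import torch
        import torch.nn as nn

        utt_encoder = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
        ctx_encoder = nn.GRU(input_size=256, hidden_size=256, batch_first=True)

        # 3 turns, each a sequence of 20 token embeddings (batch size 2), placeholder data
        turns = torch.randn(2, 3, 20, 128)

        utt_states = []
        for t in range(turns.size(1)):
            _, h = utt_encoder(turns[:, t])          # h: (1, batch, 256)
            utt_states.append(h.squeeze(0))
        utt_states = torch.stack(utt_states, dim=1)  # (batch, turns, 256)

        ctx_out, _ = ctx_encoder(utt_states)
        ctx_out = ctx_out + utt_states               # ResNet-style skip connection
        print(ctx_out.shape)                         # torch.Size([2, 3, 256]), fed to the decoder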
  • Information Retrieval and Question-answering System
    CAO Junmei, MA Lerong
    2020, 34(8): 86-93.
    Re-ranking retrieved documents is usually required to further improve performance in many information retrieval tasks. In this paper, we apply multi-channel deep convolutional neural networks (CNNs) to listwise learning-to-rank approaches, namely ListCNN. For the multi-modal features extracted from documents, we find that some features are locally correlated and redundant. Accordingly, we employ the modified CNNs to re-extract features so as to boost the performance of classical listwise approaches. Validated on the public LETOR 4.0 datasets, the proposed ListCNN architecture demonstrates superior re-ranking performance compared with other state-of-the-art methods.
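    As a hedged illustration of the listwise setting only (the paper's exact objective is not given here), a ListNet-style top-one probability loss over a list of candidate documents could look as follows; the scoring network is a placeholder rather than the ListCNN architecture.
        # Sketch: ListNet-style top-one probability loss for a list of documents.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        def listnet_loss(pred_scores, true_scores):
            # Cross entropy between top-one probability distributions over the list.
            return -(F.softmax(true_scores, dim=-1) *
                     F.log_softmax(pred_scores, dim=-1)).sum(dim=-1).mean()

        scorer = nn.Sequential(nn.Linear(46, 64), nn.ReLU(), nn.Linear(64, 1))

        features = torch.randn(4, 25, 46)             # 4 queries, 25 candidates, 46 LETOR features
        labels = torch.randint(0, 3, (4, 25)).float() # graded relevance labels (placeholder)

        scores = scorer(features).squeeze(-1)         # (4, 25)
        loss = listnet_loss(scores, labels)
        loss.backward()
        print(float(loss))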
  • Sentiment Analysis and Social Computing
    CHEN Bo, XIE Jun, MIAO Duoqian, WANG Yuzhu, XU Xinying
    2020, 34(8): 94-104.
    Rough set theory is a mathematical tool that can greatly reduce the dimensionality and number of text sentiment word features while keeping the text sentiment classification ability unchanged. To address the excessive dimensionality of text sentiment word features and the lack of semantic information in sentiment word feature representations, this article proposes a novel Chinese text sentiment word feature representation method named RS-WvGv. A rough-set decision table is used to model the text sentiment word features of the whole corpus, and the Johnson attribute reduction algorithm is applied to simplify the decision table and obtain the minimal set of text sentiment word feature attributes. Then, based on the word embeddings of all sentiment feature words in the set, the RS-WvGv method is verified experimentally with a logistic regression classifier.
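    A minimal sketch of Johnson's greedy reduction heuristic as it is commonly described for rough sets: repeatedly pick the attribute that discerns the most remaining object pairs. The toy discernibility sets over attributes f1..f4 are illustrative, not the paper's decision table.
        # Sketch: Johnson's greedy heuristic over discernibility sets
        # (one attribute set per object pair that must be discerned).
        def johnson_reduct(discernibility):
            reduct = set()
            remaining = [s for s in discernibility if s]
            while remaining:
                # choose the attribute covering the most remaining pairs
                counts = {}
                for s in remaining:
                    for a in s:
                        counts[a] = counts.get(a, 0) + 1
                best = max(counts, key=counts.get)
                reduct.add(best)
                remaining = [s for s in remaining if best not in s]
            return reduct

        pairs = [{"f1", "f2"}, {"f2", "f3"}, {"f3"}, {"f1", "f4"}]
        print(johnson_reduct(pairs))   # a small covering attribute set, e.g. {'f1', 'f3'}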
  • Sentiment Analysis and Social Computing
    ZHANG Weisheng, WANG Zhongqing, LI Shoushan, ZHOU Guodong
    2020, 34(8): 105-112.
    A correlation usually exists between a speaker's sentiment and act in daily dialogues, which can also be reflected in the dialogue structure. Therefore, we propose a joint model to classify the sentiment and act of each utterance by using the dialogue structure. Moreover, we use the attention mechanism to capture the impact of the dialogue structure on the sentiment of each utterance. Experiments show that the proposed model outperforms state-of-the-art models in both dialogue sentiment classification and act classification.
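    A minimal PyTorch sketch of joint utterance-level classification with a shared dialogue-level encoder and two output heads (sentiment and act); the structural attention described above is reduced to a plain bidirectional GRU here, and all names and sizes are illustrative.
        # Sketch: joint sentiment/act classification over the utterances of one dialogue.
        import torch
        import torch.nn as nn

        class JointClassifier(nn.Module):
            def __init__(self, dim, n_sentiment, n_act):
                super().__init__()
                self.context = nn.GRU(dim, dim, batch_first=True, bidirectional=True)
                self.sentiment_head = nn.Linear(2 * dim, n_sentiment)
                self.act_head = nn.Linear(2 * dim, n_act)

            def forward(self, utt_vectors):
                # utt_vectors: (batch, n_utterances, dim) pre-encoded utterance vectors
                h, _ = self.context(utt_vectors)
                return self.sentiment_head(h), self.act_head(h)

        model = JointClassifier(dim=128, n_sentiment=3, n_act=5)
        utts = torch.randn(2, 8, 128)                        # placeholder utterance vectors
        sent_logits, act_logits = model(utts)
        loss = (nn.functional.cross_entropy(sent_logits.reshape(-1, 3), torch.randint(0, 3, (16,)))
                + nn.functional.cross_entropy(act_logits.reshape(-1, 5), torch.randint(0, 5, (16,))))
        print(sent_logits.shape, act_logits.shape)           # joint training sums the two losses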