2021 Volume 35 Issue 1 Published: 05 February 2021
  

  • Language Analysis and Calculation
    JIA Yanyan, CHENG Xueqi, FENG Jian
    2021, 35(1): 1-8.
    Discourse parsing is challenged by the long-distance dependency issue. In contrast to the traditional manual feature-engineering strategy, this paper proposes a hierarchical discourse dependency parsing method based on LSTM, which decreases the number of elementary discourse units that the parser has to address at one time. Experimental results on the RST Discourse Treebank show that the proposed method outperforms other deep learning methods combined with certain features.
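The hierarchical idea of limiting how many elementary discourse units are scored at once can be sketched as follows; the two-level grouping and the stub scorer are illustrative assumptions, standing in for the paper's LSTM-based parser.

```python
def parse_group(units, score):
    """Greedily pick a head for a group of elementary discourse units (EDUs):
    the unit the other units most prefer to depend on, under `score`."""
    head = max(units, key=lambda h: sum(score(h, d) for d in units if d != h))
    deps = [(head, d) for d in units if d != head]
    return head, deps

def hierarchical_parse(sentences, score):
    """Two-level parsing: attach EDUs within each sentence first, then attach
    the sentence heads at document level, so the parser never has to score
    all EDUs of the document at one time."""
    edges, sent_heads = [], []
    for units in sentences:
        head, deps = parse_group(units, score)
        edges.extend(deps)
        sent_heads.append(head)
    _, top_deps = parse_group(sent_heads, score)
    edges.extend(top_deps)
    return edges

# Toy scorer: earlier units make better heads (a real system would use an
# LSTM-based scorer trained on the RST Discourse Treebank).
score = lambda h, d: -h
edges = hierarchical_parse([[0, 1, 2], [3, 4], [5, 6, 7]], score)
```

Each group is parsed in isolation, so the number of candidates considered at any step is bounded by the group size rather than the document length.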
  • Language Analysis and Calculation
    ZHAO Ruizhuo, GAO Jinhua, SUN Xiaoqian, XU Li, SHEN Huawei, CHENG Xueqi
    2021, 35(1): 9-16.
    Semantic parsing aims at mapping natural language utterances into machine-interpretable logical forms. Deep neural models such as encoder-decoder models, which require no extensive feature engineering, have been applied to the semantic parsing task with promising results. Existing deep models typically capture the compositional semantics of logical forms in a syntactic way, by designing tree-structured decoders or adding grammar constraints to decoders. In this paper, we propose to generate tree-structured sketches to better capture the compositional semantics of logical forms. Our model first predicts the sketch in a top-down fashion, and then incorporates the sketch to generate the logical form. Experimental results on three datasets show that our model generates sketches more accurately and achieves better performance on the semantic parsing task.
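One common way to define a sketch is to mask low-level constants in the logical form, keeping only the compositional skeleton of nested predicates; the masking rule and the example form below are illustrative assumptions, not the paper's exact sketch grammar.

```python
import re

def to_sketch(logical_form):
    """Abstract a logical form into a sketch by masking quoted constants,
    so only the nesting structure of predicates remains."""
    return re.sub(r"'[^']*'", "#", logical_form)

lf = "answer(state(loc_2(cityid('austin', 'tx'))))"
sketch = to_sketch(lf)
```

A decoder that first predicts the sketch only has to get the predicate skeleton right before committing to the constants.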
  • Language Analysis and Calculation
    WANG Shan, WANG Huizhen
    2021, 35(1): 17-24.
    Vocabulary growth research is based on the type-token ratio (TTR) changes of texts across different periods. This article selects the Reports on the Work of the Chinese Government from 1954 to 2018, analyzes the curves of tokens and types in the texts, and explores the interaction between the vocabulary richness of the reports and government policies. It first conducts Chinese word segmentation on the corpus and then, according to the curve-fitting quality, selects the Heaps model for prediction. Taking China’s Five-Year Plan as the basic time cycle, the difference between the predicted and observed values in each cycle is compared with that of random texts. The study reveals that vocabulary growth over time shows a clear tendency: in periods of deepening reforms and launching new policies, more words are needed to describe the new phenomena, and the observed value exceeds the predicted value. By analyzing the diachronic changes of Chinese texts, this paper provides a reference for the study of Chinese vocabulary growth.
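The Heaps model V(N) = k·N^β can be fitted by linear least squares in log-log space, since log V = log k + β·log N; a minimal sketch, with synthetic measurement points in place of the segmented report texts:

```python
import math

def fit_heaps(tokens, types):
    """Fit Heaps' law V(N) = k * N**beta by least squares in log-log space.

    tokens, types: parallel lists of cumulative token counts N and
    distinct-type counts V measured at several points in the text.
    Returns (k, beta).
    """
    xs = [math.log(n) for n in tokens]
    ys = [math.log(v) for v in types]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    beta = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    k = math.exp(y_mean - beta * x_mean)
    return k, beta

# Synthetic check: data generated from known parameters is recovered.
N = [10, 100, 1000, 10000]
V = [8.0 * n ** 0.55 for n in N]
k, beta = fit_heaps(N, V)
```

The fitted curve then supplies the predicted type counts against which each Five-Year-Plan cycle's observed counts can be compared.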
  • Language Resources Construction
    WANG Xing, SHAN Liqiu, HOU Lei, YU Jifan, CHEN Ji, TAO Mingyang
    2021, 35(1): 25-33.
    A bilingual dictionary is a very important resource in natural language processing. In contrast to the classical methods of extracting bilingual dictionaries from parallel or comparable corpora, methods based on partially bilingual corpora can avoid the scarcity of bilingual corpora by employing news or encyclopedia knowledge. This paper proposes a bilingual dictionary extraction method that exploits the structural characteristics of online encyclopedias. In addition to the text contents, the method designs five extraction processes over the encyclopedia corpus and obtains a total of 969 308 pieces of information (including duplicates). Compared with previous extraction methods, the amount of bilingual information extracted from the online encyclopedia is increased by 170.75%.
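A structural cue that such methods can exploit is that encyclopedia entries often open with the title followed by a foreign-language gloss in parentheses; a minimal sketch of mining bilingual pairs from that pattern (the regular expression and example lines are illustrative assumptions, not the paper's five processes):

```python
import re

# Chinese encyclopedia entries often begin with the title followed by an
# English gloss in (possibly full-width) parentheses, sometimes prefixed
# with a marker like 英语：. This pattern pulls out the bilingual pair.
PAIR = re.compile(r"^(\S+)[（(]\s*(?:英语?[:：])?\s*([A-Za-z][A-Za-z .'-]*)\s*[）)]")

def extract_pairs(lines):
    pairs = []
    for line in lines:
        m = PAIR.match(line)
        if m:
            pairs.append((m.group(1), m.group(2).strip()))
    return pairs

lines = [
    "自然语言处理（英语：Natural Language Processing）是人工智能的分支。",
    "知识图谱（Knowledge Graph）以图的形式描述实体及其关系。",
    "没有英文释义的条目。",
]
pairs = extract_pairs(lines)
```

Because the pattern fires only on entries that actually carry a gloss, the third line contributes nothing rather than a false pair.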
  • Language Resources Construction
    WANG Guirong, RAO Gaoqi, XUN Endong
    2021, 35(1): 34-42,53.
    Word collocation knowledge is essential to both linguistic ontology and natural language processing tasks, among which verb-object collocation is distinguished by its syntactic role, quantity and diversity. This paper constructs a Chinese verb-object collocation knowledge base from a large-scale corpus. It first summarizes the knowledge system of verb-object collocation from the perspective of linguistic ontology, and then formulates 140 queries to retrieve verb-object instances from the BCC corpus. Finally, three million verb-object collocation pairs are obtained after disambiguation.
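Retrieval queries of this kind amount to patterns over POS-tagged text; a toy stand-in for one such query, matching a verb followed by a noun while skipping a bounded number of modifiers (the tag set and gap rule are illustrative assumptions, not the actual BCC query language):

```python
def match_verb_object(tagged, max_gap=1):
    """Retrieve candidate verb-object pairs from a POS-tagged sentence:
    a verb followed by a noun, optionally skipping up to `max_gap`
    intervening modifier tokens."""
    pairs = []
    for i, (word, pos) in enumerate(tagged):
        if pos != "v":
            continue
        for j in range(i + 1, min(i + 2 + max_gap, len(tagged))):
            if tagged[j][1] == "n":
                pairs.append((word, tagged[j][0]))
                break
            if tagged[j][1] not in ("u", "a"):  # only skip particles/adjectives
                break
    return pairs

tagged = [("我", "r"), ("吃", "v"), ("了", "u"), ("苹果", "n"), ("跑", "v"), ("步", "n")]
pairs = match_verb_object(tagged)
```

Real queries would add many more constraints, which is why 140 of them are needed before disambiguation.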
  • Knowledge Representation and Acquisition
    ZHAI Sheping, WANG Shuhuan, SHANG Dingrong, DONG Susu
    2021, 35(1): 43-53.
    Knowledge representation learning aims to represent the entities and relations of a knowledge graph in a continuous low-dimensional vector space. However, most existing models use only the structure information of triples and ignore the entity descriptions, which carry rich semantic information. This paper proposes a joint representation based on entity descriptions (JRED). Specifically, the model introduces a position vector and an attention mechanism to design an Attention_BiLSTM encoder, which dynamically selects the most relevant information from the text descriptions according to different relations. Meanwhile, the paper adopts an adaptive representation method that assigns a different weight to each feature dimension; based on this, the model learns the joint representation of text and structure through a gate mechanism. The model is evaluated on both link prediction and triple classification. Experimental results show clear improvements on various metrics, especially Mean Rank.
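A dimension-wise gate for fusing a structure-based embedding with a description-based embedding can be written as j_i = g_i·s_i + (1 − g_i)·d_i with g_i = sigmoid(w_i); a minimal sketch, with the gate logits standing in for learned parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_joint(structure_vec, text_vec, gate_logits):
    """Per-dimension gated combination of a structure-based embedding and a
    description-based embedding: j_i = g_i * s_i + (1 - g_i) * d_i,
    where each g_i = sigmoid(w_i) is an adaptively learned weight."""
    gates = [sigmoid(w) for w in gate_logits]
    return [g * s + (1.0 - g) * d
            for g, s, d in zip(gates, structure_vec, text_vec)]

# With extreme gate logits the joint vector falls back to one source per
# dimension: the first dimension trusts structure, the second trusts text.
joint = gated_joint([1.0, 2.0], [3.0, 4.0], [100.0, -100.0])
```

Because each dimension carries its own gate, the model can trust structure for some features and the description text for others.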
  • Knowledge Representation and Acquisition
    CHEN Xinyuan, XIE Shengyi, CHEN Qingqiang, LIU Yu
    2021, 35(1): 54-63.
    To enhance knowledge base (KB) completion for complex relations or nodes with high in-degree or out-degree, an algorithm called ATREC (Algorithm based on Transitional Relation Embedding via CNN) is proposed. In this method, entities and relations from triplets are first mapped into low-dimensional vector spaces. After relational fusion, features from different relations are integrated into heads and tails, forming fused triplet representations. These are concatenated with the original representations to form a 6-column, k-dimensional matrix, which serves as the input to a convolutional neural network (CNN). Experiments show that ATREC outperforms several state-of-the-art models, especially when scaling up to relatively larger datasets and on relations with high cardinalities.
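A sketch of how the 6-column input matrix could be assembled, assuming a TransE-style translation (h + r, t − r) for the relational fusion step; ATREC's exact fusion may differ, so treat this as an illustration of the matrix layout only:

```python
def fuse(head, relation, tail):
    """TransE-style fused representations: translate the head by the relation
    vector to get a relation-aware tail view, and vice versa."""
    fused_tail = [h + r for h, r in zip(head, relation)]
    fused_head = [t - r for t, r in zip(tail, relation)]
    return fused_head, fused_tail

def triplet_matrix(head, relation, tail):
    """Concatenate original and fused representations into the 6-column,
    k-row matrix fed to the CNN (columns: h, r, t, h', r, t')."""
    fused_head, fused_tail = fuse(head, relation, tail)
    cols = [head, relation, tail, fused_head, relation, fused_tail]
    k = len(head)
    return [[col[i] for col in cols] for i in range(k)]  # k rows x 6 cols

m = triplet_matrix([1.0, 2.0], [0.5, 0.5], [1.5, 2.5])
```

A convolution over this matrix sees each embedding dimension alongside its original and relation-fused variants, which is what lets the filters pick up relation-specific regularities.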
  • Information Extraction and Text Mining
    JIAO Liying, GUO Yan, LIU Yue, YU Xiaoming, CHENG Xueqi
    2021, 35(1): 64-71.
    Single-document summarization is the process of compressing a document into a short description. For this purpose, this paper proposes a headline generation algorithm for a single document guided by key information. In addition to the first paragraph of the news, as used in mainstream methods, the key information includes sentences with substantive information from the subsequent content, as well as keywords in the news. The algorithm feeds the key information into a sequence model to generate a title, so that the generated headline can cover more of the news content. Experiments show that using key information improves sequence-model-based news headline generation.
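Selecting substantive body sentences to append to the lead paragraph as extra encoder input might look like this; the keyword-overlap scoring rule is an illustrative assumption, not the paper's selection criterion:

```python
def key_sentences(sentences, keywords, top_n=2):
    """Score each body sentence by how many keywords it contains and keep
    the top ones, to serve as extra input alongside the lead paragraph."""
    scored = sorted(
        sentences,
        key=lambda s: sum(1 for w in keywords if w in s),
        reverse=True,
    )
    return scored[:top_n]

body = [
    "the company reported record quarterly revenue",
    "analysts had expected weaker results",
    "revenue growth was driven by cloud revenue",
]
picked = key_sentences(body, keywords=["revenue", "growth"])
```

The selected sentences and keywords are then concatenated with the lead paragraph to form the sequence model's input.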
  • Information Extraction and Text Mining
    WANG Xiaoyue, LI Ru, DUAN Fei
    2021, 35(1): 72-80.
    Mainstream methods for Chinese named entity recognition (NER) adopt recurrent architectures, in particular long short-term memory networks (LSTM). Because of their recurrent nature, the parallel computing capability of GPUs cannot be utilized to its full potential. Although vanilla 1-D convolutions can process texts in parallel, multiple layers must often be stacked to obtain receptive fields large enough to model long-range dependencies, which causes gradient vanishing. We propose to replace vanilla convolutions with the recently proposed dilated convolutions, whose receptive fields can be controlled via a dilation factor. To further strengthen effective information and reduce the negative impact of invalid information, we propose a gating mechanism with residual connections. To enrich textual features, we also fuse the character embeddings with word position information. Validated on the MSRA dataset and the Sina Resume dataset, the results show that, compared with conventional Bi-LSTM-CRF models, the proposed method achieves very promising performance, with a 5x~6x training speedup over RNN architectures.
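A dilated 1-D convolution taps k input positions spaced `dilation` apart, so its receptive field is (k − 1)·dilation + 1 while the parameter count stays at k; a minimal plain-Python sketch of that operation (the gating and residual connections are omitted):

```python
def dilated_conv1d(xs, kernel, dilation=1):
    """1-D dilated convolution without padding: each output taps k input
    positions spaced `dilation` apart, giving a receptive field of
    (k - 1) * dilation + 1 with only k weights."""
    k = len(kernel)
    span = (k - 1) * dilation
    return [
        sum(kernel[j] * xs[i + j * dilation] for j in range(k))
        for i in range(len(xs) - span)
    ]

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
plain = dilated_conv1d(xs, [1.0, 1.0, 1.0], dilation=1)  # receptive field 3
wide = dilated_conv1d(xs, [1.0, 1.0, 1.0], dilation=2)   # receptive field 5
```

Stacking layers with growing dilation factors expands the receptive field exponentially, so far fewer layers are needed than with vanilla convolutions, sidestepping the gradient-vanishing problem.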
  • Information Extraction and Text Mining
    ZHAO Linling, WANG Suge, CHEN Xin, WANG Dian, ZHANG Zhaobin
    2021, 35(1): 81-87.
    Simile is the most common form of metaphor, marked by explicit comparators such as "like" that relate the tenor and the vehicle. To better address Chinese prose reading comprehension in the College Entrance Examination, this paper designs a method for simile recognition and component extraction based on part-of-speech features. Firstly, the vector representation of each word in the sentence is fused with the representation of its part of speech. Then, the fused vectors are input into a BiLSTM model, and the globally optimal annotation sequence is decoded by a CRF. Finally, simile recognition and component extraction results are generated from the annotated sequence. Experimental results show that the proposed method outperforms existing single-task methods on the open dataset.
  • Information Extraction and Text Mining
    LI Xunyu, MAO Cunli, YU Zhengtao, GAO Shengxiang, WANG Zhenhan, ZHANG Yafei
    2021, 35(1): 88-95.
    To collect Chinese-Burmese comparable documents, this paper proposes an acquisition method based on a topic model and bilingual word embeddings, treating cross-language document similarity as cross-language topic similarity measurement. First, we use monolingual LDA topic models to extract Chinese and Burmese topics, respectively, and obtain the corresponding topic distribution representations. Then, we compute the topic words for the Chinese and Burmese documents, and obtain Chinese-Burmese bilingual topic word embeddings by mapping the monolingual word embeddings into a shared semantic space according to a Chinese-Burmese bilingual dictionary. The similarity of Chinese and Burmese documents is finally decided by both the topic embeddings and the bilingual word embeddings. Experimental results show that the F1 obtained by the proposed method is 5.6% higher than that of a baseline using word embeddings alone.
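Combining the two similarity signals might amount to a simple interpolation of cosine similarities in the shared space; a minimal sketch, where the equal mixing weight and the toy vectors are assumptions:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def document_similarity(topic_zh, topic_my, words_zh, words_my, alpha=0.5):
    """Interpolate topic-distribution similarity and bilingual word-embedding
    similarity (both assumed to live in a shared space); alpha balances
    the two signals."""
    return alpha * cosine(topic_zh, topic_my) + (1 - alpha) * cosine(words_zh, words_my)

# Identical topic distributions but orthogonal word vectors:
sim = document_similarity([0.7, 0.3], [0.7, 0.3], [1.0, 0.0], [0.0, 1.0])
```

Document pairs whose combined score clears a threshold would then be kept as comparable.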
  • Sentiment Analysis and Social Computing
    WANG Pengyu, ZHANG Min, MA Weizhi, LIU Yiqun, MA Shaoping
    2021, 35(1): 96-103,112.
    Depression is increasingly becoming an important factor affecting the happiness of modern people. Real-time and effective identification of emotions is of great significance for the discovery and treatment of potential depression patients. This paper analyzes the characteristics of users' life-log data collected by wearable devices. We then apply an ensemble learning model with regression trees as weak learners in three groups of experiments, trained on all data, only the user's own data, and only other users' data, respectively. The experimental results show that the ensemble learning model based on life-log data can effectively identify the emotional state of users. Meanwhile, the results suggest that users' self-cognition is inconsistent, a potential inspiration for psychological depression analysis.
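Ensemble learning with regression trees as weak learners can be illustrated by gradient boosting over single-feature regression stumps; the toy "life-log" feature, labels and hyperparameters below are assumptions, not the paper's setup:

```python
def fit_stump(xs, ys):
    """Best single-threshold regression stump on a 1-D feature (a depth-1
    regression tree), minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x <= t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def boost(xs, ys, rounds=50, lr=0.3):
    """Gradient boosting for squared loss: each stump fits the residuals
    left by the ensemble so far."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(rounds):
        resid = [y - p for y, p in zip(ys, preds)]
        s = fit_stump(xs, resid)
        stumps.append(s)
        preds = [p + lr * s(x) for p, x in zip(preds, xs)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# Toy signal: low daily activity maps to a low mood score.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = boost(xs, ys)
```

Each round shrinks the residual by a constant factor here, so the ensemble converges to the step function the labels describe.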
  • Sentiment Analysis and Social Computing
    CHEN Wei, LIN Xuejian, YIN Zhong
    2021, 35(1): 104-112.
    In recent years, the multi-label classification (MLC) task has received wide attention. Traditional sentiment analysis is treated as single-label supervised learning, ignoring the fact that multiple sentiments may coexist in the same instance. This paper proposes a multi-label sentiment analysis model (Label-CNN_LSTM_Attention, L-CLA) that fuses the labels via a neural network. With Word2Vec embeddings as input, CNN and LSTM are combined: the CNN layer extracts deep word features in the text, and the LSTM layer captures long-term dependencies between words. An attention mechanism assigns higher weights to affective words, and the label correlation matrix is integrated to pad the label feature vector as part of the input. Experimental results show that the L-CLA model achieves a good classification effect on the re-annotated NLP&CC2013 dataset.
  • Sentiment Analysis and Social Computing
    FAN Xiaobing, RAO Yuan, WANG Shuo, LI Ruixiang, LIU Xuhui
    2021, 35(1): 113-124.
    The massive, disorderly and fragmented news data in social networks makes it impossible for people to perceive the details of news events from a multi-dimensional perspective. To address this issue, this paper proposes a named-entity-sensitive method for generating hierarchical news storylines, forming a hierarchical, multi-view account of event development without supervision. Firstly, events are detected by combining event topic information with implicit semantic information. Then, a community detection algorithm based on multi-dimensional semantics divides each event into multiple clusters, with each cluster treated as a sub-event. Finally, the event storyline is constructed from multiple views of the information. Experimental results on a real-world dataset demonstrate that the proposed method outperforms the baseline at each step, with gains of 0.44, 0.11 and 0.50 in acceptability, generality and correctness, respectively.
  • NLP Application
    LIU Daowen, RUAN Tong, ZHANG Chentong, QIU Jiahui, ZHAI Jie, HE Ping, GE Xiaoling
    2021, 35(1): 125-134.
    Clinical department recommendation is a challenging task, since department settings differ among hospitals and the relationships between symptoms and departments are also unclear. In this paper, a weighted knowledge graph is defined and constructed from local EHR data, the ICD (International Classification of Diseases) and online medical websites, to establish the quantitative relationships among symptoms, diseases and departments. The constructed knowledge graph covers 38 hospitals, 6 110 departments, 6 220 symptoms and 60 736 symptom-related diseases. The proposed recommendation system recognizes the symptom words, disease words and body part words in a patient's chief complaint with a BERT entity recognition model. Finally, a weight-based disease prediction algorithm based on multiple symptoms (WBDPMS) is designed to identify the candidate diseases and thus recommend the most suitable hospitals and departments. Experimental results show that the accuracy reaches 0.88.
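The weighted-scoring idea behind such a pipeline might be sketched as follows; the edge weights, disease names and department mapping are hypothetical stand-ins for the constructed knowledge graph, and the exact WBDPMS scoring rule may differ:

```python
# Hypothetical weighted edges: symptom -> {disease: weight}, disease -> department.
SYMPTOM_DISEASE = {
    "cough": {"bronchitis": 0.6, "flu": 0.4},
    "fever": {"flu": 0.7, "bronchitis": 0.2},
    "wheeze": {"bronchitis": 0.8},
}
DISEASE_DEPARTMENT = {"bronchitis": "Respiratory", "flu": "General Medicine"}

def recommend(symptoms):
    """Score each candidate disease by summing the weighted edges from all
    reported symptoms, then recommend the top disease's department."""
    scores = {}
    for s in symptoms:
        for disease, w in SYMPTOM_DISEASE.get(s, {}).items():
            scores[disease] = scores.get(disease, 0.0) + w
    disease = max(scores, key=scores.get)
    return disease, DISEASE_DEPARTMENT[disease]

disease, dept = recommend(["cough", "wheeze"])
```

With the symptoms extracted from the chief complaint by the entity recognizer, the same lookup scales to a graph with thousands of symptoms and departments.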
  • NLP Application
    CAO Yang, CAO Cungen, WANG Shi
    2021, 35(1): 135-142.
    Automatic typo detection is an important research task in natural language processing, with applications in search engines, automated question answering, etc. Although traditional methods achieve relatively high accuracy in recognizing multi-word typos in Chinese text, they generally perform poorly on Chinese single-character errors due to the particularity of such errors. This paper proposes a method to identify Chinese single-character errors using a Transformer network. Firstly, we make full use of a Chinese character confusion set and web pages to build a training corpus of Chinese single-character errors. Secondly, during testing, a sliding window is applied to the sentence to be checked, single-character error detection is performed on the segment in each window, and the recognition results of all windows are aggregated. Experimental results indicate that our method achieves a precision of 83.6% and a recall of 65.7% on an artificial test set, and a precision of 82.8% and a recall of 61.4% on a real test set.
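Aggregating per-window detections over a sliding window might look like this; the window size, voting threshold and mock detector are illustrative assumptions standing in for the Transformer-based detector:

```python
def detect_single_char_errors(sentence, window, detect, threshold=0.5):
    """Slide a fixed-size window over the sentence, run a per-window
    detector, and flag a position if it is marked in at least `threshold`
    of the windows covering it. `detect(segment)` returns offsets of
    suspected single-character errors within the segment."""
    n = len(sentence)
    votes, seen = [0] * n, [0] * n
    for start in range(max(1, n - window + 1)):
        segment = sentence[start:start + window]
        for offset in detect(segment):
            votes[start + offset] += 1
        for i in range(start, min(start + window, n)):
            seen[i] += 1
    return [i for i in range(n) if seen[i] and votes[i] / seen[i] >= threshold]

# Mock detector: flags the character '墨' wherever it appears in a segment.
detect = lambda seg: [i for i, ch in enumerate(seg) if ch == "墨"]
errors = detect_single_char_errors("今天天气很墨好", window=3, detect=detect)
```

Requiring agreement across the overlapping windows suppresses spurious detections that only fire in one local context.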