2020 Volume 34 Issue 6 Published: 15 July 2020
  

  • Language Analysis and Calculation
  • Language Analysis and Calculation
    YU Jingsong, WEI Yi, ZHANG Yongwei, YANG Hao
    2020, 34(6): 1-8.
    All the Chinese characters in ancient Chinese texts are written continuously, without obvious segmentation marks between words. This brings great challenges to text understanding and even cultural inheritance. To deal with word segmentation for ancient Chinese texts, we propose Multi-Stage Iterative Training (MSIT) for unsupervised word segmentation, combining non-parametric Bayesian models with BERT (Bidirectional Encoder Representations from Transformers). It achieves an F1 score of 93.28% on the Zuozhuan (an ancient Chinese history book) dataset. After adding only 500 ground-truth sentences, which can be considered weakly supervised learning, the F1 score reaches 95.55%, outperforming the previous best result, which was trained on 6/7 of the Zuozhuan dataset (about 36,000 ground-truth sentences). When using the same training set, our method achieves an F1 score of 97.40%, a state-of-the-art result. The proposed method not only outperforms traditional sequence labeling algorithms, including the BERT model, but is also shown by experiments to have better generalization ability. The model and related code are available online.
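Both the Bayesian model and the BERT tagger described above rely on the standard reduction of word segmentation to per-character BMES labeling. The sketch below illustrates that interface only; the helper names are illustrative and not from the authors' code.

```python
def words_to_bmes(words):
    """Convert a segmented sentence (list of words) to per-character
    BMES tags: S for a single-character word, B/M/E for the beginning,
    middle, and end of a multi-character word."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def bmes_to_words(chars, tags):
    """Recover word boundaries from a BMES tag sequence."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("S", "E"):
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words
```

The two functions are inverses of each other, which is what lets a segmenter be trained and evaluated purely as a character tagger.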
  • Language Analysis and Calculation
    SUN Kaili, DENG Dunhua, LI Yuan, LI Miao, LI Yang
    2020, 34(6): 9-17,26.
    Compound sentence relation recognition aims to identify the semantic relation between clauses, a key task in the semantic analysis of compound sentences. The task is difficult due to the implicit relations in non-saturated compound sentences. To capture this implicit semantic information, a multi-channel CNN based on an inner-attention mechanism is proposed in this paper. The inner-attention mechanism is built on a Bi-LSTM, which enables it to learn bidirectional semantic features and the associated features between clauses. At the same time, a CNN is used to model the sentence representation and obtain local features. Experimental results on the CCCS and TCT corpora show that the proposed method reaches a macro-F1 score of 85.61% and an average recall of 84.87%, relative improvements of 6.08% and 3.05% over previous results, respectively.
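A minimal NumPy sketch of the kind of inner-attention pooling described above, under the simplifying assumption that each Bi-LSTM time step is scored against a mean-state query (the function name and the query choice are illustrative, not the paper's exact formulation):

```python
import numpy as np

def inner_attention(H):
    """Self-attentive pooling over Bi-LSTM hidden states H (T x d):
    score each time step against a sentence-level mean-state query,
    softmax over time, and return the weighted sum."""
    query = H.mean(axis=0)               # sentence-level query vector (d,)
    scores = H @ query                   # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over time steps
    return weights @ H                   # attended clause representation (d,)
```

In the full model, one such attended representation per clause would feed the multi-channel CNN that extracts local features.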
  • Language Analysis and Calculation
    WANG Shaojing, LIU Pengfei, QIU Xipeng
    2020, 34(6): 18-26.
    Aiming at the problem of assigning multiple sequences of labels to the same sentence, we propose a new sequence graph model. The model captures two main kinds of dependencies: the relationships among words along the time-series dimension, and the dependencies of each word across different tasks. We adopt an LSTM or a Transformer-like structure to model information interactions along the time-series dimension, and use an attention mechanism at each step to model the interactions between different tasks and obtain a better representation of each word. Experimental results show that our model not only achieves better performance on OntoNotes 5.0, but can also recover interpretable structures between different task labels.
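One way to picture the per-step cross-task interaction is scaled dot-product attention among the task-specific representations of a single word. The sketch below is a generic illustration of that idea, not the paper's architecture:

```python
import numpy as np

def cross_task_attention(X):
    """X: (K, d) representations of one word under K tagging tasks.
    Each task representation attends over all K tasks via scaled
    dot-product attention, yielding refined per-task representations."""
    d = X.shape[1]
    scores = X @ X.T / np.sqrt(d)            # (K, K) task-to-task affinities
    scores -= scores.max(axis=1, keepdims=True)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)        # row-wise softmax
    return A @ X                             # (K, d) refined representations
```

The attention matrix `A` is also what would make recovered inter-task structures inspectable.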
  • Language Resources Construction
  • Language Resources Construction
    GE Shili, SONG Rou
    2020, 34(6): 27-35.
    The English-Chinese clause alignment corpus serves the study and application of grammatical structure correspondence between English and Chinese clauses, which is of great significance to linguistic theory and language translation (both human and machine translation). Previous work on grammatical theory and corpora lacks sufficient research on the definitions of the clause and the clause complex, making it theoretically defective and insufficient to support natural language processing applications. This paper first makes theoretical preparations for the construction of an English-Chinese clause alignment corpus. Starting from the theory of the Chinese clause complex put forward in recent years, it defines the concept of component sharing, and further defines the English clause and clause complex based on naming sharing and quotation sharing, which endows the clause and clause complex with integrity and unity. Based on this study, an English-Chinese clause alignment annotation system is designed, covering English NT clause tagging and Chinese translation generation and combination. The corpus annotation shows that, at the clause complex level, the components involved in structural transformation in English-Chinese translation can be limited to English clauses and the related namings and tellings, without involving the internal structure of namings and tellings. The resulting English-Chinese clause-aligned corpus provides research samples for linguistic research, English-Chinese language comparison, and English-Chinese machine translation.
  • Language Resources Construction
    ZHANG Kunli, ZHAO Xu, GUAN Tongfeng, SHANG Baiyu, LI Yumeng, ZAN Hongying
    2020, 34(6): 36-44.
    Medical text is an important data foundation for implementing intelligent healthcare. As semi-structured or unstructured data, medical text needs to be annotated with entities and entity relations, paving the way for text structuring, named entity recognition, and automatic relation extraction. Aiming at the construction of a Chinese medical knowledge graph, a semi-automated entity and relation annotation platform is designed, integrating multiple algorithms for pre-labeling, schedule control, quality control, and data analysis. Entity and relation annotation for the medical knowledge graph is carried out on this platform. The results show that the platform can control the labeling process during the construction of text resources, ensure labeling quality, and improve labeling efficiency.
  • Information Extraction and Text Mining
  • Information Extraction and Text Mining
    REN Ming, XU Guang, WANG Wenxiang
    2020, 34(6): 45-54.
    In order to organize genealogy resources efficiently, it is necessary to extract entities and their relationships from unstructured genealogy text and build a structured representation. The extraction of entities and relationships is often transformed into a sequence tagging task. Given the high density of entities and relationships and the presence of overlapping relations, this paper proposes a conceptual model to guide the extraction. Commonly used deep learning models for sequence tagging are then tested and compared on a real dataset. Experimental results show that BERT-BiLSTM-CRF outperforms the others in terms of precision, recall, and F1 score, and that the proposed method is effective in extracting entities and relationships from genealogy text.
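The CRF layer at the top of a BERT-BiLSTM-CRF tagger decodes the best tag sequence with the Viterbi algorithm. A minimal NumPy sketch of that decoding step (illustrative, not the paper's implementation; emission and transition scores are assumed to come from the trained model):

```python
import numpy as np

def viterbi(emissions, transitions):
    """CRF decoding: emissions (T x L) are per-token label scores from
    the encoder, transitions (L x L) are label-to-label scores.
    Returns the highest-scoring label index sequence."""
    T, L = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, L), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t]  # (L, L)
        back[t] = total.argmax(axis=0)   # best previous label per label
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # follow backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With strongly negative off-diagonal transition scores, the decoder resists label changes even when emissions prefer them, which is exactly what lets a CRF enforce tag-sequence consistency.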
  • Information Extraction and Text Mining
    YANG Yifan, CHEN Wenliang
    2020, 34(6): 55-63.
    At present, the Internet contains a large amount of entity introduction text, which provides a resource basis for the construction of entity knowledge. An alias, as an entity attribute, is an alternative expression of an entity's official name and is of great significance in knowledge graphs. In this paper, introduction texts of tourist attractions are used as the corpus, and an alias annotation strategy is proposed by combining different alias description patterns. Alias extraction can be divided into two subtasks: entity recognition and relation classification. This paper proposes a deep learning based joint model for scenic entity alias extraction that completes the two subtasks simultaneously. Experimental results on the dataset constructed in this paper show that the performance of the joint model is significantly improved compared with the pipelined model.
  • Information Extraction and Text Mining
    MA Jin, YANG Yifan, CHEN Wenliang
    2020, 34(6): 64-72.
    Attribute recognition aims to obtain the attribute values of entities from unstructured text. Extracting person attributes from text usually requires a large amount of annotated data, which is not yet available. To address this issue, we use the Infobox of encyclopedia web pages to construct tuples of person attributes, and then apply distant supervision to obtain large-scale, multi-category annotated datasets for person attributes, avoiding the tedious process of manual annotation. Additionally, we present two models based on CRF and BiLSTM-CRF for person attribute recognition as baseline systems. Experimental results show that BiLSTM-CRF performs better than CRF on this newly built dataset.
  • Information Extraction and Text Mining
    XIAN Yantuan, XIANG Yan, YU Zhengtao, WEN Yonghua, WANG Hongbin, ZHANG Yafei
    2020, 34(6): 73-80,88.
    Text classification is a fundamental issue in natural language processing. Based on prototypical networks, this paper proposes a mean prototype network that integrates the prototype vectors of different time steps through a moving average, and combines it with a simple RNN to form a novel text classification model. The model uses a single-layer RNN to learn the vector representation of a text, and learns category vector representations via the mean prototype network. The distance between the text vector and the prototype vectors is used both to train the model and to predict the text category. Compared with existing neural text classification methods, the model features a shallower architecture, fewer parameters, and the use of inter-sample similarity during training and prediction. The proposed method achieves state-of-the-art results on five benchmark text classification datasets.
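A minimal NumPy sketch of the two ideas named above: averaging per-time-step prototype vectors into one mean prototype per class, and classifying by distance to the nearest prototype (illustrative shapes and names, not the paper's code):

```python
import numpy as np

def mean_prototypes(step_protos):
    """step_protos: (T, C, d) class prototype vectors produced at each
    of T RNN time steps. A simple moving average collapses them into
    one mean prototype per class, shape (C, d)."""
    return step_protos.mean(axis=0)

def classify(x, prototypes):
    """Predict the class whose mean prototype is nearest to x
    in Euclidean distance."""
    dists = np.linalg.norm(prototypes - x, axis=1)
    return int(dists.argmin())
```

Training would push each text vector toward its own class prototype and away from the others, so the same distance drives both learning and prediction.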
  • Machine Reading Comprehension
  • Machine Reading Comprehension
    TAN Hongye, QU Baoxing
    2020, 34(6): 81-88.
    Machine reading comprehension (MRC) requires a machine to read a given passage and then answer relevant questions. A number of datasets and models have been proposed for specific types of questions, without dealing with the diversity of questions in the real world. In this paper, we propose a multi-task reading comprehension model based on BERT. It uses an attention mechanism to obtain multiple representations of questions and passages and then classifies the questions. The model then utilizes the classification results to answer the various types of questions. Experiments on the Chinese public machine reading comprehension dataset CAIL2019-CJRC show that our system achieves better results than all the baseline models.
  • Machine Reading Comprehension
    ZHANG Zhaobin, WANG Suge, CHEN Xin, ZHAO Linling, WANG Dian
    2020, 34(6): 89-96,105.
    In the Chinese reading comprehension section of the college entrance examination, opinion questions are rich in abstract viewpoint expressions. In order to obtain answer information related to a question from the reading material, the abstract words in the question need to be expanded, resulting in an expansion of the opinion question. This paper proposes a question expansion modeling method based on a multi-task hierarchical Long Short-Term Memory network (Multi-HLSTM). First, the reading material and the question are connected via an attention mechanism. At the same time, the two tasks of question prediction and answer prediction are modeled jointly to further expand the question. Finally, the expanded question and the original question are both applied to extract candidate answer sentences. Experimental results on opinion-question reading comprehension datasets from the Chinese college entrance examination and its simulation tests, as well as the description and opinion subsets of DuReader, show that the proposed question expansion model is effective for candidate sentence extraction.
  • Information Retrieval and Question Answering
  • Information Retrieval and Question Answering
    YUAN Tao, NIU Shuzi, LI Huiyuan
    2020, 34(6): 97-105.
    Sequential recommendation attempts to use the historical interaction sequence between users and items to predict the next item a user will interact with. A multi-scale temporal dynamic model for sequential recommendation based on the Clockwork RNN is proposed to address the uncertainty of whether a recommended item depends on the user's long-term global interest, medium-term interest, or short-term local interest. First, a CW-RNN layer is introduced to extract the user's multi-scale temporal interest features from the historical user-item interaction sequence. A CNN convolution over the time-scale dimension is then used to learn the user's interest dependencies at different time scales and to generate a unified interest representation. Finally, a fully connected layer models the interaction between the unified multi-scale user interest representation and the item embedding representations. Experiments are carried out on two public datasets, MovieLens-1M and Amazon Movies and TV. The results show that the proposed model improves accuracy by 3.80% and 8.63%, respectively, compared with the best existing sequential recommendation algorithms.
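The defining trick of the Clockwork RNN is that different parts of the hidden state update at different clock periods, so slow units retain long-term interest while fast units track recent behavior. A minimal single-step sketch (per-unit periods for simplicity; names and shapes are illustrative, not the paper's model):

```python
import numpy as np

def cwrnn_step(h, x, W, U, periods, t):
    """One Clockwork RNN step: hidden unit i updates only when t is a
    multiple of periods[i] (units sharing a period form a module);
    otherwise it keeps its previous value."""
    h_new = np.tanh(W @ h + U @ x)                  # candidate update
    active = np.array([t % p == 0 for p in periods])
    return np.where(active, h_new, h)               # freeze inactive units
```

Iterating this step over the interaction sequence yields hidden units that evolve at different time scales, which is what the CNN over the time-scale dimension then aggregates.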
  • NLP Application
  • NLP Application
    WANG Chencheng, YANG Liner, WANG Yingying, DU Yongping, YANG Erhong
    2020, 34(6): 106-114.
    Grammatical error correction is an important task in natural language processing that has attracted wide attention in recent years. This paper treats grammatical error correction as a translation task that translates wrong text into correct text. We use the Transformer model with multi-head attention as the framework, and propose a dynamic residual structure that dynamically combines the outputs of different neural blocks to better capture semantic information. To address the lack of training corpora, we propose a data augmentation method that generates parallel data by corrupting a monolingual corpus. Experimental results show that the proposed method, based on dynamic residuals and data augmentation, significantly improves error correction performance, achieving the best result on the NLPCC 2018 Chinese grammatical error correction task.
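Corruption-based augmentation pairs each clean monolingual sentence with a synthetically noised copy, giving (wrong, right) parallel data for free. A toy sketch of one such corruption pass (the noise types and rates here are illustrative assumptions, not the paper's recipe):

```python
import random

def corrupt(sentence, rng, p=0.3):
    """Generate a noisy source for GEC training by randomly deleting,
    duplicating, or swapping characters of a clean sentence.
    p is the total per-character corruption probability."""
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < p / 3:                        # delete this character
            pass
        elif r < 2 * p / 3:                  # duplicate it
            out.extend([chars[i], chars[i]])
        elif r < p and i + 1 < len(chars):   # swap with the next one
            out.extend([chars[i + 1], chars[i]])
            i += 1
        else:                                # keep it unchanged
            out.append(chars[i])
        i += 1
    return "".join(out)
```

Each (corrupt(s), s) pair can then be fed to the Transformer exactly like a translation example.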