2020 Volume 34 Issue 12 Published: 20 January 2021
  

  • Survey
    LIU Wei, PENG Xin, LI Chao, WANG Pin, WANG Lihong
    2020, 34(12): 1-8.
    Stance detection aims to identify the attitude (i.e., in favor of, against, or none) towards a given target, such as an event, a product, a policy, a person, or a service. Mining users' stances on social media is important for public opinion monitoring and information recommendation. This paper presents a survey of stance detection: it introduces the concept of stance detection, summarizes various learning-based methods, and describes the data sets. Finally, it discusses future directions for stance detection.
  • Survey
    LIN Wangqun, WANG Miao, WANG Wei, WANG Chongnan, JIN Songchang
    2020, 34(12): 9-16.
    A knowledge graph describes concepts, entities, and their relationships in the form of a semantic network. In this paper, we formally describe the basic concepts and the hierarchical architecture of knowledge graphs. We then review state-of-the-art technologies for information extraction, knowledge fusion, schema, and knowledge management. Finally, we probe into the application of knowledge graphs in the military field, revealing challenges and trends for future development.
  • Language Analysis and Calculation
    ZHANG Yinbing, SONG Jihua, PENG Weiming, GUO Dongdong, SONG Tianbao
    2020, 34(12): 17-29.
    In international Chinese language teaching, quantitative analysis of the comprehensive complexity of Chinese vocabulary benefits tasks such as determining the vocabulary acquisition order for learners of Chinese as a second language and selecting vocabulary when compiling textbooks. Based on an analysis of the Chinese character attributes, general attributes, and statistical attributes of words, this study constructs a quantitative analysis model of Chinese word difficulty based on the analytic hierarchy process (AHP). The validity of the model is verified by comparing its consistency with the vocabulary grading in the existing syllabus. It provides a possible solution for vocabulary grading, text difficulty analysis, text simplification, etc.
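    As a rough illustration of how AHP can turn pairwise judgments into criterion weights, the sketch below uses the row geometric-mean approximation and a consistency index. The pairwise matrix, the function names, and the weighted-sum scoring are assumptions made for illustration; the abstract does not give the paper's actual judgments or scoring formula.

```python
import math

# Hypothetical pairwise comparison matrix (Saaty 1-9 scale) over the three
# attribute groups named in the abstract. The judgments are illustrative only.
CRITERIA = ["character", "general", "statistical"]
PAIRWISE = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]


def ahp_weights(matrix):
    """Approximate the AHP priority vector with the row geometric-mean method."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def consistency_index(matrix, weights):
    """Estimate lambda_max and return CI = (lambda_max - n) / (n - 1)."""
    n = len(matrix)
    lambda_max = sum(
        sum(matrix[i][j] * weights[j] for j in range(n)) / weights[i]
        for i in range(n)
    ) / n
    return (lambda_max - n) / (n - 1)


def word_difficulty(scores, weights):
    """Weighted sum of per-criterion scores (each assumed scaled to [0, 1])."""
    return sum(s * w for s, w in zip(scores, weights))


weights = ahp_weights(PAIRWISE)
ci = consistency_index(PAIRWISE, weights)
```

    A consistency index near zero indicates that the pairwise judgments agree with one another; in practice it is usually divided by a random index to obtain a consistency ratio before the weights are accepted.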
  • Language Analysis and Calculation
    SUN Zhenhua, ZHOU Yi, ZHU Qiaoming, JIANG Feng, LI Peifeng
    2020, 34(12): 30-38.
    Discourse analysis is a hot topic in the field of Natural Language Processing. Discourse nuclearity recognition, a subtask of discourse analysis, focuses on recognizing the main and secondary content of a discourse in order to better understand and grasp its core content. This paper focuses on the task of macro Chinese discourse nuclearity recognition and proposes a recognition method based on discourse topic. The method introduces the semantic interaction between different discourse units, and between each discourse unit and its topic, to identify nuclearity. Moreover, it applies a selection mechanism over the discourse topic to further improve the performance of nuclearity recognition. Experimental results on MCDTB show that the proposed method outperforms the state-of-the-art baselines.
  • Ethnic Language Processing and Cross Language Processing
    WANG Qi, TIAN Mingjie, CUI Rongyi, ZHAO Yahui
    2020, 34(12): 39-47.
    A bilingual topical word embedding model is proposed for the Chinese-Korean cross-lingual text classification task. The model combines a topic model with bilingual word embeddings to reduce the impact of polysemy-induced ambiguity on the accuracy of cross-lingual text classification. First, word embeddings for bilingual words are trained on large-scale word-aligned parallel sentence pairs. Second, the classification dataset is processed and represented with a topic model, yielding topic words in both languages. Finally, the word embeddings of these topic words are fed into a traditional text classifier and a deep learning text classifier. The experimental results show that accuracy reaches 91.76% on the Chinese-Korean cross-lingual text classification task.
  • Ethnic Language Processing and Cross Language Processing
    SECHA Jia, CIZHEN Jiacuo, CAIRANG Jia, HUAGUO Cairang
    2020, 34(12): 48-53,64.
    This paper treats Tibetan character error detection as a classification problem. First, a syllable confusion subset is built from linguistic knowledge, and noise is added to each Tibetan sentence. Then BERT, a deep bidirectional representation model, is applied in the classification model. Two baseline models and test sets from different domains are then constructed. The experimental results show that this method is superior to the two baseline models. With this method, sentence classification accuracy reaches 93.74%, and 83.6% on tests from different fields. At the syllable level, true negatives are 74.53% and false negatives are 2.30%.
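    One plausible way to generate noisy training sentences from a confusion set is sketched below. The confusion entries are placeholder syllables rather than real Tibetan data, and the function name, the single-substitution policy, and the 0/1 labels are assumptions; the abstract does not specify how the noise is actually injected.

```python
import random

# Hypothetical confusion sets: each key maps to syllables it is easily
# confused with. Placeholder strings stand in for Tibetan syllables.
CONFUSION = {
    "ka": ["kha", "ga"],
    "pa": ["pha", "ba"],
}


def add_noise(syllables, confusion, rng):
    """Return (noisy_syllables, label); label 1 means an error was injected."""
    candidates = [i for i, s in enumerate(syllables) if s in confusion]
    if not candidates:
        # Nothing confusable in this sentence, so keep it as a clean example.
        return list(syllables), 0
    i = rng.choice(candidates)
    noisy = list(syllables)
    noisy[i] = rng.choice(confusion[syllables[i]])
    return noisy, 1


original = ["ka", "ma", "pa"]
noisy, label = add_noise(original, CONFUSION, random.Random(0))
```

    Pairs of clean and noisy sentences produced this way could then serve as labeled input to a sentence-level classifier.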
  • Information Extraction and Text Mining
    LI Dan, XU Tong, ZHENG Yi, WANG Zhefeng, CHEN Enhong
    2020, 34(12): 54-64.
    General-purpose named entity recognition fails to capture the features of Chinese characters and of Chinese medical records. In this paper, we integrate BERT into a joint model of bi-directional long short-term memory and conditional random fields for better performance. Considering the distinctive role of radicals in medical entities, we encode radical information into the word vector and then modify the scoring function of the CRF layer. Experiments on two real-world electronic medical record datasets show that the proposed method outperforms the state-of-the-art baseline methods, especially for disease-related named entities.
  • Information Extraction and Text Mining
    REN Quan
    2020, 34(12): 65-72.
    As an extension of the named entity recognition task, fine-grained entity typing aims to assign more fine-grained types to entities according to their mentions and contexts. Because annotating fine-grained types is costly and error-prone, we study fine-grained entity typing with only a small number of samples. This paper first proposes a feature extraction method that extracts entity information at the word level and the character level. Then, combined with a prototype network, the method transforms the multi-class classification task into a single-class classification task and performs fine-grained entity classification by calculating distances from prototypes in a metric space. Tested on the public dataset FIGER (GOLD) under few-shot and zero-shot settings, the proposed method achieves good results. In the few-shot setting, it outperforms the baseline on all metrics; in particular, macro-F1 increases by 2.4%.
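    The distance-to-prototype idea can be sketched with standard prototypical-network inference: average the support embeddings of each type, then pick the nearest prototype. This is a generic sketch, not the paper's single-class formulation; the toy embeddings, type names, and function names are assumptions.

```python
import math


def prototypes(support):
    """Average the support embeddings of each type to get one prototype per type."""
    protos = {}
    for label, vectors in support.items():
        dim = len(vectors[0])
        protos[label] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return protos


def classify(query, protos):
    """Return the type whose prototype is closest to the query (Euclidean)."""
    return min(protos, key=lambda label: math.dist(query, protos[label]))


# Toy 2-D embeddings for two hypothetical entity types.
support = {
    "person": [[0.0, 0.0], [0.0, 2.0]],
    "city": [[10.0, 10.0], [12.0, 10.0]],
}
protos = prototypes(support)
```

    In a real system the vectors would come from a learned encoder, and the distances would typically be passed through a softmax during training.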
  • Question Answering and Dialogue System
    YANG Zhizhuo, LI Chunzhuan, ZHANG Hu, QIAN Yili, LI Ru
    2020, 34(12): 73-81.
    Reading comprehension QA for the Chinese subject of the College Entrance Examination is highly challenging because the questions are more abstract. In addition to question similarity analysis, the extraction of candidate answer sentences should pay more attention to topic and opinion sentences. This paper proposes to extract candidate answer sentences by frame semantic matching and frame semantic relations. By identifying the discourse topic sentences, the topic and opinion sentences related to the questions are generated. Then the top six candidate answers are selected based on the ranking results. In the experiment, the recall of the method on the College Entrance Examination of Beijing over the past twelve years is 68.69%, which verifies the effectiveness of the method.
  • NLP Application
    LIANG Jiannan, SUN Maosong, YI Xiaoyuan
    2020, 34(12): 82-91.
    Chinese classical poetry, with its long history, is one of the representatives of Chinese classical literature and a treasure of Chinese traditional culture. Poetry retrieval compares the content of poems to find those that are similar in semantics and artistic conception, which requires an in-depth understanding of the content and mood of the whole poem. This paper applies a recurrent neural network (RNN) to automatically learn the semantic representation of ancient poems. A variety of methods are designed to automatically calculate the correlation between two poems and, from it, the semantic distance between them, enabling poetry recommendation. The results of automatic and manual evaluation show that the model produces good-quality poetry retrieval results.
  • NLP Application
    JIA Yuxiang, WANG Lu, LIU Pengcheng, WANG Qian, ZHANG Yue, ZAN Hongying
    2020, 34(12): 92-99.
    The novel is a literary genre that centers on character creation, depicting social life through complete plots and specific environmental descriptions. Modeling fictional characters is essential for literary text understanding and literary text mining. In this paper, we construct a large-scale novel corpus and extract characters and their dependency features. We propose a skip-gram based model to train character embeddings, with the character as the target and the dependency features as the contexts. Based on the trained character embeddings, we further investigate the tasks of character similarity computation, character clustering, and character profiling. The experimental results show good performance of the distributed representation of fictional characters on these tasks.
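    To make the target/context setup concrete, the sketch below turns dependency triples into (character, feature) pairs of the kind a skip-gram trainer could consume. The triple layout, the feature string format, and the function name are hypothetical; the abstract does not describe the paper's actual feature encoding.

```python
def character_contexts(triples, characters):
    """Turn (head, relation, dependent) triples into (character, feature) pairs."""
    pairs = []
    for head, rel, dep in triples:
        if dep in characters:
            # The character depends on `head`, e.g. it is the subject of a verb.
            pairs.append((dep, f"{rel}_of:{head}"))
        if head in characters:
            # Something depends on the character, e.g. an adjectival modifier.
            pairs.append((head, f"{rel}:{dep}"))
    return pairs


# Toy parse output; "Alice" is a hypothetical character name.
triples = [
    ("smiled", "nsubj", "Alice"),
    ("Alice", "amod", "young"),
    ("ran", "nsubj", "dog"),
]
pairs = character_contexts(triples, {"Alice"})
```

    Each pair treats the character as the skip-gram target and the dependency feature as its context, so characters that do and are described by similar things end up with similar embeddings.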
  • NLP Application
    ZHANG Zhixing, ZHANG Jiaying, GAO Daqi, RUAN Tong, WANG Jun, HE Ping, YAO Huayan
    2020, 34(12): 100-110.
    On the Shanghai Regional Health Platform, which holds electronic medical record data from 38 tertiary hospitals, the diversity and ambiguity of clinical indicators have seriously affected medical data mining. In this paper, we propose a semi-automatic terminology base construction solution with four steps: schema design, information extraction, knowledge fusion, and knowledge verification. We first build a standard indicator sub-base according to the medical insurance standard provided by the Shanghai Municipal Health Commission. Then we use a BERT-based clinical indicator alignment model to integrate indicators from the 38 hospitals into the standard as synonyms. The constructed terminology base contains 23,495 entities and 47,746 factual triples, with potential applications in medical data cleaning, medical record retrieval, and other tasks. Experiments show that the F1-score of our alignment model reaches 95.78%, and its application in a colorectal cancer data mining task can improve the record up to 94%. In addition, the part of this terminology database related to colorectal cancer has been published at dcazb.ecustnlplab.com.