2020 Volume 34 Issue 1 Published: 16 March 2020
  

  • Language Analysis and Calculation
    YANG Siqin, XU Wenyu, JIANG Minghu, ZHANG Xiaochen
    2020, 34(1): 1-9.
    This study uses event-related potentials (ERPs) to explore how the cognitive processing of phonetic puns differs from that of semantic puns in Chinese. The results show that accuracy for semantic puns is noticeably lower than for phonetic puns and illogical discourses. The EEG data elicited between 300 ms and 900 ms by phonetic puns, semantic puns, and illogical discourses differ significantly; both semantic puns and illogical discourses elicited an N400. EEG mapping shows that the N400 elicited by semantic puns appeared slightly later than the one elicited by illogical discourses. Between 600 ms and 900 ms, phonetic puns elicited a P600. It is concluded that the difference in cognitive processing between phonetic and semantic puns in Chinese tends to be related to the presentation of the puns and their expressive effect.
  • Language Analysis and Calculation
    SONG Zuoyan, SUN Ao
    2020, 34(1): 10-16.
    From the perspective of qualia structure, we analyze the semantic relationships between place constituents and head constituents in place-object compounds. The study shows that place constituents indicate not only existential locations but also places where the objects are commonly used or formed. The implied predicates serve as the telic or agentive roles of the compound nouns and should appear in the paraphrases, so as to present the specific semantic relationships among the constituents of the compounds and to show the naming motivations of the objects. However, the paraphrases of some words lack the relevant predicates. In addition, the study reveals that the telic and agentive qualia play a much more important role in naming objects, and that telic and agentive constructions lie at a higher level in the hierarchy of modifier-head constructions.
  • Language Analysis and Calculation
    WANG Qingjiang, ZHANG Lin
    2020, 34(1): 17-22.
    When a phrase enters a sentence, its syntactic category should change along with its grammatical nature, which depends on the structural requirements of the larger phrase to which it belongs. To explain this phenomenon, Combinatory Categorial Grammar should be augmented with corresponding conversion rules. In set-theoretic terms, Chinese exhibits successive noun-verb-adjective inclusion, from which various overlaps between basic syntactic structures, each involving a change of grammatical nature, can be derived. On the premise that each grammatical nature has definite syntactic functions, category-conversion rules corresponding to these overlaps are proposed, and C2-CCG, a new Combinatory Categorial Grammar with a category-conversion mechanism, is presented. Examples show that this phrase-based formal grammar is promising for explaining such overlaps in Chinese syntax.
  • Language Resources Construction
    LIANG Yuhai, ZHOU Qiang
    2020, 34(1): 23-33.
    The shortage of human dialogue corpora has been a key factor restricting the performance of dialogue generation systems, especially for Chinese. This paper presents the automatic construction of CEDAC, a multi-turn corpus of everyday human conversation with 978 109 pairs of Chinese-English bilingual utterances. To build the corpus, timestamps are used to synchronize English subtitles with the corresponding Chinese subtitles, which yields abundant Chinese-English bilingual subtitles. The bilingual subtitles are then aligned with the utterances in the corresponding English scripts, so that the speaker and scene tags in the scripts can be mapped to each sentence pair in the bilingual subtitles. Experimental results show an accuracy of 97.0% on scene boundary annotations and 91.57% on speaker annotations. The corpus lays a good foundation for further research on automatic speaker annotation of subtitles and on multi-turn dialogue generation systems.
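The timestamp synchronization step might look like the minimal sketch below. The cue layout `(start, end, text)`, the function names, and the `min_ratio` threshold are all assumptions made for illustration; the abstract does not describe the actual matching rule.

```python
from typing import List, Tuple

# A cue is (start_seconds, end_seconds, text). This layout is an assumption.
Cue = Tuple[float, float, str]


def overlap(a: Cue, b: Cue) -> float:
    """Length in seconds of the time interval shared by two cues."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def pair_by_timestamp(en: List[Cue], zh: List[Cue], min_ratio: float = 0.5) -> List[Tuple[str, str]]:
    """Pair each English cue with the Chinese cue it overlaps most in time.

    A pair is kept only if the overlap covers at least `min_ratio` of the
    shorter cue (hypothetical threshold, not taken from the paper).
    """
    pairs = []
    for e in en:
        best = max(zh, key=lambda z: overlap(e, z), default=None)
        if best is None:
            continue
        shorter = min(e[1] - e[0], best[1] - best[0])
        if shorter > 0 and overlap(e, best) / shorter >= min_ratio:
            pairs.append((e[2], best[2]))
    return pairs


# Toy cues, invented for illustration; the last English cue has no counterpart.
en_cues = [(1.0, 2.5, "How are you?"), (3.0, 4.0, "Fine, thanks."), (10.0, 11.0, "Bye.")]
zh_cues = [(1.1, 2.4, "你好吗？"), (3.1, 4.2, "很好，谢谢。")]
pairs = pair_by_timestamp(en_cues, zh_cues)
```

Matching on relative overlap rather than exact start times tolerates the small timing offsets that are common between subtitle files from different sources.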
  • Knowledge Representation and Acquisition
    HONG Wenxing, HU Zhiqiang, WENG Yang, ZHANG Heng, WANG Zhu, GUO Zhixin
    2020, 34(1): 34-44.
    Cognitive intelligence centered on legal knowledge is an important topic for judicial artificial intelligence. This paper proposes an automated approach to constructing knowledge graphs of judicial case facts. Models for entity recognition and relation extraction are built on a pre-trained model. For entity recognition, two entity recognition models based on pre-training are compared. For relation extraction, a multi-task joint semantic relation extraction model incorporating translating embeddings is proposed, so that knowledge representations of case facts are learned while the relation extraction task is completed. On "motor vehicle traffic accident liability dispute" cases, entity recognition improves by 0.36 in F1 score over the baseline model, and relation extraction by 2.37. Using the proposed method, case-fact knowledge graphs are built from a couple of hundred thousand judicial documents, enabling semantic computing for judicial artificial intelligence applications such as case retrieval.
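For readers unfamiliar with translating embeddings (TransE), the sketch below shows the scoring idea: a triple (head, relation, tail) is plausible when head + relation lies close to tail. The toy vectors and function names are invented for illustration, and how the paper couples this objective with relation extraction is not specified in the abstract.

```python
import math
from typing import Sequence


def transe_score(head: Sequence[float], relation: Sequence[float], tail: Sequence[float]) -> float:
    """L2 distance ||h + r - t||; smaller means the triple is more plausible."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(pos: float, neg: float, margin: float = 1.0) -> float:
    """Margin ranking loss for a true triple against a corrupted one."""
    return max(0.0, margin + pos - neg)


# Toy 2-d embeddings, invented for illustration only.
driver = [1.0, 0.0]
drives = [0.0, 1.0]
vehicle = [1.0, 1.0]
pedestrian = [-1.0, 0.0]

good = transe_score(driver, drives, vehicle)     # h + r equals t exactly
bad = transe_score(driver, drives, pedestrian)   # corrupted tail
loss = margin_loss(good, bad)                    # zero once the margin is satisfied
```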
  • Ethnic Language Processing and Cross Language Processing
    WUMAIERJIANG Maimaitiming, GULINIGEER Abuduwaili, MAIHEMUTI Maimaiti, KAHAERJIANG Abiderexiti, TUERGEN Yibulayin
    2020, 34(1): 45-50.
    As a basic task in processing agglutinative languages, word stemming directly influences the performance of other tasks. Existing Uzbek word stemming still relies on rule-based approaches. This paper applies Conditional Random Fields (CRF) and Bidirectional Gated Recurrent Units (Bi-GRU) to the task, with the character as the minimal segmentation unit. The experimental results show that the proposed models, which are based on sequence labeling, significantly improve performance compared with the rule-based method.
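Casting stemming as character-level sequence labeling can be illustrated with the sketch below. The two-tag scheme ("S" for stem characters, "A" for affix characters) and the prefix-stem restriction are assumptions made here for brevity; the abstract does not state the tag set used. A CRF or Bi-GRU tagger would predict the labels that `decode` consumes.

```python
from typing import List


def encode(word: str, stem: str) -> List[str]:
    """Label each character 'S' (stem) or 'A' (affix), assuming the stem is a prefix of the word."""
    if not word.startswith(stem):
        raise ValueError("this sketch only handles stems that are prefixes of the word")
    return ["S"] * len(stem) + ["A"] * (len(word) - len(stem))


def decode(word: str, labels: List[str]) -> str:
    """Recover the stem from per-character labels by keeping the 'S' characters."""
    return "".join(ch for ch, tag in zip(word, labels) if tag == "S")


labels = encode("kitoblar", "kitob")   # Uzbek: kitob "book" + plural suffix -lar
stem = decode("kitoblar", labels)
```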
  • Ethnic Language Processing and Cross Language Processing
    NIU Mijia, FEI Long, GAO Guanglai
    2020, 34(1): 51-57.
    At present, speech database resources for Mongolian speech recognition are relatively scarce. To automatically annotate existing Mongolian audio and its corresponding text, such as TV plays and broadcasts, this paper presents an automatic speech-text alignment method for long Mongolian audio, with the aim of expanding the Mongolian speech database. In the front-end processing stage, noise segments are filtered out using Voice Activity Detection based on a Gaussian Mixture Model. For speech recognition, a Mongolian acoustic model based on Feedforward Sequential Memory Networks is constructed. Finally, based on the Vector Space Model, the hypothesis sequence obtained from speech recognition is matched against the reference phone sequence by a sentence-level Dynamic Time Warping algorithm. Experiments show that automatic speech-text alignment for long Mongolian audio improves by 31.09% compared with the traditional Needleman-Wunsch algorithm.
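The Dynamic Time Warping step can be sketched as follows. This is the textbook recurrence over symbol sequences with a 0/1 local cost, which is a simplification: the paper works at sentence level with vector-space representations, and its actual cost function is not given in the abstract.

```python
from typing import List, Sequence


def dtw_distance(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """Classic DTW cost between two symbol sequences with a 0/1 local cost."""
    n, m = len(hyp), len(ref)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning hyp[:i] with ref[:j]
    cost: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 0.0 if hyp[i - 1] == ref[j - 1] else 1.0
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


same = dtw_distance(["a", "b", "c"], ["a", "b", "c"])
stretched = dtw_distance(["a", "a", "b", "c"], ["a", "b", "c"])  # repeated symbol is absorbed
different = dtw_distance(["a", "b", "c"], ["a", "x", "c"])       # one mismatching symbol
```

Unlike Needleman-Wunsch, which charges a gap penalty for every inserted or deleted symbol, DTW lets one element map to several consecutive elements of the other sequence, so a stretched but otherwise identical sequence costs nothing.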
  • Ethnic Language Processing and Cross Language Processing
    AN Suyala, WANG Siriguleng
    2020, 34(1): 58-62.
    Organization name translation directly affects translation performance. In this study, a Transformer-based neural network model is proposed for the task. Experimental results show that, compared with a traditional phrase-based SMT model and an improved block-based MT model, the Transformer NMT model improves BLEU-4 by 0.039 on the Chinese-Mongolian organization name translation task.
  • Ethnic Language Processing and Cross Language Processing
    SARDAR Parhat, MIJIT Ablimit, ASKAR Hamdulla
    2020, 34(1): 63-70.
    Uyghur is a derivative language in which words are formed by concatenating stems with affixes: the stem is the unit that carries lexical meaning, and the affix provides grammatical function. This paper proposes a Uyghur short text classification technique based on morpheme sequences and LSTM. Robust morpheme segmentation and stem extraction methods are trained on word-morpheme parallel corpora and used to extract stems from web texts. The resulting stem-sequence corpus is fed into the Word2Vec algorithm, and an LSTM is then applied to the stem embeddings in Uyghur short text classification experiments. The experimental results show that the proposed method achieves 95.48% classification accuracy, indicating that for derivative languages like Uyghur, and especially for noisy texts, stem-based classification performs better.
  • Ethnic Language Processing and Cross Language Processing
    GE Haizhu, KONG Fang
    2020, 34(1): 71-79.
    Elementary discourse unit (EDU) recognition is a fundamental task in discourse analysis. From the perspective of discourse cohesion, the theory of discourse topic structure holds that each EDU is closely related to the theme-rheme recognition task. Inspired by this notion, this paper proposes a joint detection method for Chinese elementary discourse units and theme-rheme structure based on multi-task learning. The method applies BiLSTM and Graph Convolutional Networks to represent the sequential and structured topological information of the EDU, and improves the final performance by sharing the parameters of the two models within a multi-task learning framework. The experimental results show that multi-task EDU and theme-rheme detection outperforms the single-task learning models, with F1-scores of up to 91.90% and 85.65%, respectively.
  • Information Extraction and Text Mining
    LIU Ningning, JU Shenggen, XIONG Xi, WANG Jingyan, ZHANG Rui
    2020, 34(1): 80-86,96.
    Drug-drug interaction refers to the inhibition or promotion of one drug by another. To improve the performance of current drug-drug interaction extraction models on long sentences, this paper proposes a capsule network extraction model that incorporates the shortest dependency path. The approach first finds the shortest dependency path between two drugs in the parse of the original sentence, then applies a Bi-LSTM to obtain embeddings of the original sentence and of the shortest dependency path. The embeddings are then fed into the capsule network, in which the dynamic routing mechanism determines the amount of information transmitted and preserves high-level feature information. Experimental results on DDIExtraction2013 show that the proposed model achieves a 1.17% relative increase in F1 compared with the current best approaches.
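Finding the shortest dependency path between two drug mentions can be sketched as a breadth-first search over the parse treated as an undirected graph. The example sentence and its arcs are hand-written for illustration; a real system would obtain them from a dependency parser, and the abstract does not name the one used.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def shortest_dependency_path(edges: List[Tuple[str, str]], source: str, target: str) -> Optional[List[str]]:
    """Breadth-first search over the dependency tree treated as an undirected graph."""
    graph: Dict[str, List[str]] = {}
    for head, dep in edges:
        graph.setdefault(head, []).append(dep)
        graph.setdefault(dep, []).append(head)

    previous: Dict[str, Optional[str]] = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the back-pointers from target to source, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hand-written (head, dependent) arcs for "Aspirin may increase the effect of warfarin".
arcs = [
    ("increase", "Aspirin"),
    ("increase", "may"),
    ("increase", "effect"),
    ("effect", "the"),
    ("effect", "warfarin"),
    ("warfarin", "of"),
]
path = shortest_dependency_path(arcs, "Aspirin", "warfarin")
```

The path drops words such as "may", "the", and "of" and keeps the tokens that connect the two drugs, which is why it helps on long sentences.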
  • Information Extraction and Text Mining
    LI Mingyang, JIANG Jiawei, KONG Fang
    2020, 34(1): 87-96.
    Because of ambiguity, the entity linking task demands a large amount of information. Previous research mainly uses two types of information: the text containing the given mention and an external knowledge base. Two issues remain to be addressed. First, current entity linking models have not benefited from the latest knowledge base, which has a larger scale and wider coverage. Second, the text contains rich information, including the local context of the mention and global information such as the text topic, and the way local and global information are combined can be further improved. For the first problem, an entity candidate extraction approach that considers both text relevance and prior knowledge is proposed to obtain an effective candidate set. For the second problem, a neural network with self-attention and a highway network is proposed to represent both local and global information for entity linking. Experiments on six public entity linking datasets show the effectiveness of the proposed approach. Furthermore, the system achieves state-of-the-art performance using the latest general knowledge base.
  • Information Retrieval and Question Answering
    ZENG Wei, YU Weijie, XU Jun, LAN Yanyan, CHENG Xueqi
    2020, 34(1): 97-105.
    Document ranking is one of the central tasks in a number of IR applications. In recent years, efforts have been made to apply reinforcement learning to learning document ranking models, and a number of methods have been developed. Though preliminary success has been achieved, existing reinforcement learning methods still suffer from the sparseness of relevant documents. In this paper, we propose to involve ground-truth ranking lists in the learning process, yielding a novel imitation-learning-based learning-to-rank algorithm called IR-DAGGER. It uses ranking lists sampled by the expert policy, which enhances learning efficiency while maintaining ranking accuracy. Experimental results on OHSUMED and TREC show that IR-DAGGER outperforms state-of-the-art baselines on relevance ranking and diverse ranking tasks, indicating the effectiveness and efficiency of imitation learning in document ranking.
  • Sentiment Analysis and Social Computing
    WANG Jiancheng, XU Yang, LIU Qiyuan, WU Liangqing, LI Shoushan
    2020, 34(1): 106-112.
    Dialog sentiment analysis aims to classify the sentiment polarity of each utterance in a dialog, and it plays a critical role in analyzing e-commerce customer service data. Unlike sentiment analysis of a single sentence, the sentiment polarity of an utterance in a dialog depends on its context. Recent methods have mainly focused on modeling contextual connections with recurrent neural networks and attention mechanisms, ignoring the characteristics of the dialog as a whole. Adopting a multi-task learning framework, we propose a novel model that detects the dialog topic distribution and the sentiment polarity of each utterance simultaneously. The dialog topic distribution, as a kind of global information, is integrated into each word and utterance representation, so that each word and utterance takes its meaning under particular dialog topics. Experimental results on a real-world e-commerce customer service dialog dataset show that the proposed model makes full use of the dialog topic information and significantly outperforms a baseline model that does not consider the dialog topic in terms of Macro-F1 score.