2018 Volume 32 Issue 12 Published: 17 December 2018
  

  • Survey
    WU Siyuan, CAI Jianyong, YU Dong, JIANG Xin
    2018, 32(12): 1-10.
    The concept of readability was originally proposed by educators to help select suitable reading materials for learners. This paper surveys existing work on automatic text readability measurement and groups the methods into three types: formula-based methods, classification methods, and ranking methods. It also outlines the datasets and extracted features used in the literature. Finally, it discusses future directions for automatic readability research.
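As an illustration of the formula-based family mentioned in this abstract, the sketch below computes the Flesch Reading Ease score, a widely used English readability formula. The abstract does not name the formulas the survey covers, so the choice of this formula is an assumption, and the function name and count-based interface are hypothetical.

```python
def flesch_reading_ease(num_words: int, num_sentences: int, num_syllables: int) -> float:
    """Flesch Reading Ease for English text, computed from raw counts.

    Higher scores indicate easier text. Counting words, sentences, and
    syllables is left to the caller (assumption: counts are already known).
    """
    if num_words <= 0 or num_sentences <= 0:
        raise ValueError("num_words and num_sentences must be positive")
    words_per_sentence = num_words / num_sentences
    syllables_per_word = num_syllables / num_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    # 100 words in 5 sentences with 150 syllables in total
    print(round(flesch_reading_ease(100, 5, 150), 3))
```

Classification and ranking methods replace a fixed formula of this kind with a model learned from labelled or ordered texts.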
  • Language Analysis and Calculation
    YUAN Yulin, LU Dawei
    2018, 32(12): 11-23.
    This paper discusses how semantic resources can assist computers in semantic understanding and commonsense reasoning. First, we point out that human beings live in a world of common sense and meaning, and that artificial intelligence robots must understand the meaning of natural language in order to perform commonsense reasoning. We then briefly summarize the advantages and disadvantages of the knowledge-based and statistical approaches to natural language processing, and argue that neither concepts nor language can be truly understood with statistical methods and that deep learning can hardly account for any knowledge. Using specific cases, the paper shows that the Information Dictionary of Notional Word is equipped with semantic role information and the syntactic configurations of words, which can be employed in knowledge graphs and content computing and can serve to improve artificial intelligence. Because the "Qualia Role" describes the encyclopedic knowledge of nouns, it can be used to answer commonsense questions such as what something is (formal role), what it consists of (constitutive role), what it is made of (material role), how it is created (agentive role), and what it is used for (telic role).
  • Language Analysis and Calculation
    MA Dan, ZHAO Yiyi
    2018, 32(12): 24-30.
    Network-based depiction of language is a major method of linguistic research. This paper presents a semantics-based language network built with Cytoscape. The network includes function words, and both semantic links and syntactic links are established. The results reveal that 1) the semantic network has a larger dimension and average minimum distance; 2) the syntactic network is more hierarchical, with a lower clustering coefficient; and 3) the function words “de”, “and”, and “ge” serve as local centres.
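The two network statistics named in this abstract, the clustering coefficient and the average minimum distance, can be sketched in plain Python as follows. This is not the authors' Cytoscape workflow. The adjacency-dictionary representation and the function names are assumptions made for illustration, and the graph is taken to be undirected and unweighted.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # node -> set of neighbours (undirected)


def average_clustering(adj: Graph) -> float:
    """Mean local clustering coefficient; nodes of degree < 2 count as 0."""
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count edges that exist among the neighbours of this node.
        links = sum(1 for u, v in combinations(neighbours, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_shortest_path(adj: Graph) -> float:
    """Mean BFS distance over all ordered pairs of mutually reachable nodes.

    Assumes at least one pair of connected nodes exists.
    """
    total = 0
    pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


if __name__ == "__main__":
    chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
    print(average_clustering(chain), average_shortest_path(chain))
```

A node whose neighbours are rarely linked to each other, as in a star or a tree, pulls the clustering coefficient down, which is consistent with the abstract's description of the syntactic network as more hierarchical.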
  • Language Analysis and Calculation
    WEN Yuan, SONG Li, WU Taizhong, LI Bin, ZHOU Junsheng, QU Weiguang
    2018, 32(12): 31-40.
    A non-projective structure is one in which the word nodes of the dependency tree are ordered differently from the word sequence of the original sentence. The phenomenon has not been discussed for Chinese, because Chinese dependency corpora have been constructed following only the projectivity principle. In this paper, we construct a Chinese abstract meaning representation (AMR) corpus of 10 149 sentences, of which 31.62% have non-projective structures. We then distinguish three main types of non-projective structure: modal words, topicalization, and component separation. Finally, we provide solutions for handling these structures in AMR parsing.
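For ordinary dependency trees, non-projectivity is commonly tested by looking for crossing arcs. The sketch below shows that standard check together with a corpus-level ratio. It is not the procedure the authors applied to AMR graphs, and the head-list encoding and function names are assumptions made for illustration.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Return True if no two dependency arcs cross.

    heads[i] is the 1-based position of the head of word i + 1;
    0 denotes the artificial root placed before the first word.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for i, (a, b) in enumerate(arcs):
        for c, d in arcs[i + 1:]:
            # Two arcs cross when exactly one endpoint of one arc
            # lies strictly inside the span of the other.
            if a < c < b < d or c < a < d < b:
                return False
    return True


def non_projective_ratio(corpus: List[List[int]]) -> float:
    """Fraction of sentences containing at least one crossing arc."""
    return sum(not is_projective(heads) for heads in corpus) / len(corpus)


if __name__ == "__main__":
    projective = [2, 0, 2]         # word 2 is the root and heads words 1 and 3
    non_projective = [3, 0, 4, 2]  # the arc 3 -> 1 crosses the arc root -> 2
    print(is_projective(projective), is_projective(non_projective))
    print(non_projective_ratio([projective, non_projective]))
```

A statistic such as the 31.62% reported in the abstract is a ratio of this kind, although the exact definition used for AMR structures may differ from the tree-based test shown here.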
  • Language Analysis and Calculation
    LIU Mingtong, ZHANG Yujie, XU Jin’an, CHEN Yufeng
    2018, 32(12): 41-47.
    For neural-network-based dependency parsing, this paper presents a novel transition-based architecture that leverages fused multi-feature encoding. We model the stack states with subtree representations and encode structural dependency subtrees with TreeLSTM. In particular, we propose an LSTM-based technique that encodes the history of parsed dependency arcs and states as global features. Finally, based on the fused multi-feature encoding, we combine the extracted local and global features to make parsing decisions. Experiments on the Chinese Penn TreeBank (CTB5) show that our parser reaches 87.8% unlabeled and 86.8% labeled attachment accuracy with a greedy strategy, which effectively improves neural transition-based dependency parsing.
  • Language Analysis and Calculation
    SUN Cheng, KONG Fang
    2018, 32(12): 48-56.
    As a subtask of discourse analysis, generating a proper discourse structure is critical for discourse comprehension and downstream discourse applications. Based on a Chinese discourse treebank annotated under the connective-driven dependency tree schema, a complete framework for generating Chinese discourse structure is proposed. Statistics on the Chinese discourse corpus are given, along with an evaluation protocol for measuring the performance of discourse parsers. Different distributed representation approaches are also compared for their effectiveness in encoding discourse substructure.
  • Language Resources Construction
    WANG Sibo, WANG Peiyan, ZHANG Guiping
    2018, 32(12): 57-66.
    Semantic knowledge bases are a basic resource for natural language processing. Existing large-scale semantic knowledge bases are mostly generic and lack domain-specific semantic knowledge. This paper proposes a semi-automatic method for constructing a semantic knowledge base of aviation terms based on HowNet. The construction consists of four key processes and yields a total of 2 000 descriptions of term concepts (DEF). Finally, the validity of the method is verified by comparing the term similarities obtained by manual annotation with those computed from the term DEFs.
  • Language Resources Construction
    RUAN Chong, SHI Wenxian, LI Yanhao, WENG Yijia, HU Junfeng
    2018, 32(12): 67-73.
    Paraphrase corpora are fundamental to research on paraphrase phenomena, yet Chinese paraphrase corpora are hardly available in academia. In this paper, we collect multiple Chinese translations of the novel Jane Eyre and obtain roughly 50 000 parallel paraphrase sentences, from which we extract more than 9 000 pairs of lexical paraphrase knowledge. We further modify METEOR, an automatic machine translation evaluation metric, to better evaluate Chinese paraphrase quality, and provide a Chinese paraphrase evaluation dataset. A closed test shows that our mined knowledge is of better quality than that of Tongyici Cilin.
  • Ethnic Language Processing and Cross Language Processing
    XIA Tianci, SUN Yuan
    2018, 32(12): 76-83.
    Extracting entities and the relationships between them from unstructured text is a challenging issue. This paper applies a joint model to Tibetan to perform entity identification and relation extraction at the same time. An end-to-end BiLSTM sequence labelling framework is adopted, and POS information is integrated to enhance performance. It is also demonstrated that character-level processing is more effective for Tibetan than word-level processing. The experimental results show that the method improves accuracy by 30%~40% compared with SVM and LR.
  • Information Extraction and Text Mining
    YE Lei, YU Zhengtao, GAO Shengxiang, LIU Shulong, ZHANG Yafei
    2018, 32(12): 84-91.
    To generate a summary of a news event reported in both Chinese and Vietnamese, a multi-feature fusion method for bilingual news summarization is proposed. It exploits the cross-lingual correlations between sentences in the news texts. First, the method analyzes the degree of co-occurrence of news elements and the similarity between sentences. Then, these two features are integrated into an undirected graph, and a ranking algorithm is used to sort the sentences. Finally, important sentences are selected and redundancy is removed to generate the summary. Experiments on a Chinese-Vietnamese bilingual news archive show that the proposed method achieves good results.
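The graph-ranking step described in this abstract can be sketched with a PageRank-style iteration over a weighted undirected sentence graph. The abstract does not say which ranking algorithm or fusion rule the authors use, so the linear mix of the two features, the `alpha` weight, the damping factor, and both function names are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def combine_features(cooccurrence: Matrix, similarity: Matrix, alpha: float = 0.5) -> Matrix:
    """Hypothetical fusion rule: a linear mix of the two edge features."""
    n = len(similarity)
    return [
        [alpha * cooccurrence[i][j] + (1 - alpha) * similarity[i][j] for j in range(n)]
        for i in range(n)
    ]


def rank_sentences(weights: Matrix, damping: float = 0.85, iterations: int = 50) -> List[float]:
    """PageRank-style scores on a symmetric weight matrix (diagonal ignored)."""
    n = len(weights)
    # Total edge weight attached to each sentence, excluding self-loops.
    strength = [sum(w for j, w in enumerate(row) if j != i) for i, row in enumerate(weights)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / strength[j] * scores[j]
                for j in range(n)
                if j != i and strength[j] > 0
            )
            for i in range(n)
        ]
    return scores


if __name__ == "__main__":
    # Sentence 0 is related to sentences 1 and 2, which are unrelated to each other.
    cooc = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
    sim = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
    print(rank_sentences(combine_features(cooc, sim)))
```

Sentences would then be selected in descending score order, skipping any candidate that is too similar to one already chosen, which corresponds to the redundancy removal mentioned in the abstract.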
  • Information Extraction and Text Mining
    ZHENG Jie, KONG Fang, ZHOU Guodong
    2018, 32(12): 92-99.
    Ellipsis is a common linguistic phenomenon that is especially ubiquitous in short texts such as QA and dialogue. This paper builds a sequence-to-sequence neural network model to identify and recover ellipsis in short texts. Various experiments are conducted on a collected and organized short-text corpus of QA and dialogue, demonstrating the good performance of the proposed model in ellipsis identification and recovery.
  • Information Extraction and Text Mining
    YAN Rong, GAO Guanglai
    2018, 32(12): 100-108.
    This paper proposes a novel pseudo topic analysis approach based on the community structure of the topic network and the relationships between topics. It represents text semantics from the perspective of network structure, which serves as a remedy for existing statistical topic modeling methods.
  • Information Retrieval and Question Answering
    HUANG Qiangjia, HUANG Peijie, LI Yanghui, DU Zefeng
    2018, 32(12): 109-117.
    End-to-end dialogue control for utterances without slot values is a challenging issue. This paper proposes an end-to-end hybrid coding network that combines explicit utterance features with implicit context information to handle utterances without slot information. Specifically, building on the feature representations extracted from the "explicit" dialogue sequence by a convolutional neural network (CNN), the system action classification model is further enriched by constructing and capturing the "implicit" background system context information in the dialogue sequence. Experiments on a task-oriented, restricted-domain Chinese SDS show that, compared with existing methods, the proposed method achieves significant improvements in both per-response accuracy and per-dialog accuracy.
  • Sentiment Analysis and Social Computing
    LIU Jiao, CUI Rongyi, ZHAO Yahui
    2018, 32(12): 118-124.
    A cross-lingual sentiment classification algorithm based on semantic fusion is proposed for product reviews. First, information for the different languages is generated in advance with the open-source tool Word2Vec. Then, an auto-associative memory relationship is proposed to extract cross-lingual document semantics according to the statistical relevance of word vectors between languages. The local perception and weight sharing techniques of convolutional neural networks are applied to amalgamate the complex semantic expressions in the auto-associative memory model, so as to generate phrase features of different lengths. A deep neural network then learns a dense combination of high-level semantic features for all languages, which paves the way for classification predictions. It is demonstrated that, for positive and negative sentiment classification on a cross-lingual sentiment corpus, the proposed model is much more effective than other existing algorithms.
  • Machine Translation
    ZHANG Haoyu, ZHANG Pengfei, LI Zhenzhen, TAN Qingping
    2018, 32(12): 125-131.
    Machine reading comprehension has attracted attention in the field of natural language processing. To address the Chinese machine reading comprehension dataset DuReader, this paper presents an extractive language model called Mixed Model, which combines multiple strategies including a recurrent neural network, paragraph fusion, and a self-attention mechanism. The proposed method achieves a ROUGE-L score of 54.2 and a BLEU-4 score of 49.14 on the DuReader test set.
  • Machine Translation
    HUO Huan, WANG Zhongmeng
    2018, 32(12): 132-142.
    For Chinese machine reading tasks with multi-passage continuous answer spans, this paper proposes a model based on deep hierarchical features that extracts three levels of deep features: in details, in snippets, and in full texts. In this model, words represented by word vectors are encoded in a recurrent layer to obtain the detail features. The snippet features are constructed through several convolution layers and highway layers, and the full-text features are extracted from candidate passages to perform an overall inspection. Finally, these features are used to determine the passage in which the answer is located and the answer span within that passage. In experiments on the 2018 NLP Challenge on Machine Reading Comprehension, the proposed model achieves a ROUGE-L score of 57.55 and a BLEU-4 score of 50.87.
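The two reading-comprehension abstracts above report ROUGE-L, which scores a candidate answer by its longest common subsequence (LCS) with a reference answer. The sketch below shows the sentence-level F-measure form of the metric. The tokenization, the default `beta` of 1.0, and the function names are assumptions, and the official evaluation scripts for these benchmarks may use different settings.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: Sequence[str], reference: Sequence[str], beta: float = 1.0) -> float:
    """LCS-based F-measure; beta > 1 weights recall more heavily than precision."""
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)


if __name__ == "__main__":
    candidate = "the cat sat".split()
    reference = "the cat sat on the mat".split()
    print(round(rouge_l(candidate, reference), 4))
```

For Chinese answers the sequences would typically be characters or segmented words rather than whitespace-separated tokens.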