2018 Volume 32 Issue 10 Published: 15 October 2018
  

  • Survey
    XUE Ping
    2018, 32(10): 1-10.
    Natural language is the most natural means of human communication, but its complexity and ambiguity often hinder effective communication. In modern society, especially in the Information Age, many industrial settings and scientific fields, as well as various forms of human-machine interaction, require information representation and communication that is precise yet natural. These requirements motivated the concept and development of controlled natural languages (CNLs), which aim to achieve an optimal balance between information precision and naturalness to support effective human-to-human communication and human-machine interaction. This paper discusses CNLs, their properties, applications, and computational processing. It uses commercial airplane technical documentation as a use case to show the importance of CNLs, and it discusses their significance for other areas such as artificial intelligence.
  • Survey
    LI Heng, SHEN Huawei, HUANG Wei, CHENG Xueqi
    2018, 32(10): 11-18.
    With the deep integration of the mobile Internet and social networks, social media and applications based on location-based services (LBS) have become increasingly popular, making geo-social networks (GSNs) a major ongoing research topic. Because of the spatio-temporal traits of location-based social networks (LBSNs), visualizing their massive data differs from traditional information visualization and must take geographic characteristics into account. By analyzing the massive data generated by GSNs, this paper reviews spatio-temporal SNS data extraction and massive spatio-temporal information visualization. The survey aims to support convenient, fast, and direct extraction of useful and reliable visual information from massive GSN data.
  • Language Resources Construction
    SUN Daogong, KANG Shiyong
    2018, 32(10): 19-27.
    Based on existing research on verbs, this paper puts forward principles and ideas for developing a verb semantic dictionary, defines and explains the attribute information involved, and describes the overall file structure and each library. An open verb semantic knowledge dictionary is then constructed, covering both lexical meaning and syntactic meaning, including morphologies, word classes, paraphrases, word meanings, semantic fields, syntactic category information, semantic categories, and semantic patterns. The dictionary provides support for ambiguity interpretation, lexical relation research, the syntax-semantics interface, semantic pattern extraction, etc.
  • Language Resources Construction
    GUO Lijuan, LI Zhenghua, PENG Xue, ZHANG Min
    2018, 32(10): 28-35,52.
    Dependency parsing has attracted much attention in the research community, yet there is no public, integrated, and systematic annotation guideline for Chinese dependency treebanks. Considering the special linguistic phenomena in web texts, this paper proposes a new annotation guideline for Chinese dependency treebanks that is adapted to multi-domain and multi-source texts. The guideline aims to depict the syntactic structures of various linguistic phenomena accurately while ensuring annotation consistency. Based on the proposed guideline, we have annotated about 30 000 Chinese sentences with their dependency structures.
  • Machine Translation
    ZHANG Wen, FENG Yang, LIU Qun
    2018, 32(10): 36-44.
    Attention-based neural machine translation models, which use an encoder-decoder framework to model translation as a sequence-to-sequence problem, have become extremely popular. In this paper, we replace the gated recurrent units in the classical encoder and decoder with simple recurrent units (SRUs), and deepen the encoder and decoder by stacking network layers to improve the performance of the neural machine translation model. We conducted experiments on German-English and Uyghur-Chinese translation tasks. Experimental results show that performance improves significantly without additional training time, especially with residual connections.
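    To illustrate why SRUs are attractive for deep stacks, here is a minimal sketch of a single SRU step in the commonly cited formulation (forget gate f, reset gate r, internal state c, highway output h). The function names, weights, and dimensions are hypothetical, the input and hidden sizes are assumed equal so the highway connection is well defined, and this is not the authors' implementation.

```python
import math

def sigmoid(values):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sru_step(x, c_prev, W, Wf, bf, Wr, br):
    """One SRU time step (hypothetical helper).

    Both gates depend only on the current input x, not on the previous
    hidden state, so the matrix products can be computed for all time
    steps at once; only the cheap element-wise update of c is sequential.
    """
    x_tilde = matvec(W, x)
    f = sigmoid([a + b for a, b in zip(matvec(Wf, x), bf)])  # forget gate
    r = sigmoid([a + b for a, b in zip(matvec(Wr, x), br)])  # reset gate
    c = [fi * cp + (1.0 - fi) * xt for fi, cp, xt in zip(f, c_prev, x_tilde)]
    # Highway connection: mix the transformed state with the raw input.
    h = [ri * math.tanh(ci) + (1.0 - ri) * xi for ri, ci, xi in zip(r, c, x)]
    return h, c
```

    Stacking layers then amounts to feeding h of one layer as x of the next; a residual connection would add the layer input to its output.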
  • Ethnic Language Processing and Cross Language Processing
    CAI Zhijie, SUN Maosong, CAI Rangzhuoma
    2018, 32(10): 45-52.
    Complex networks have some or all of the properties of self-organization, self-similarity, attractors, small world, and scale-free. Languages and characters, as the crystallization of human wisdom and civilization, are complex networks formed through long evolution. This paper presents 97 Tibetan character co-occurrence networks derived from 90 passages drawn from 6 representative corpora: Tibetan poems, prose, politics, Buddhism, teaching materials, and spoken language (15 passages per corpus). It analyzes the shortest path length, clustering coefficient, and degree distribution of these networks. Experimental data show that the 97 networks exhibit the small-world effect and the scale-free property, suggesting that Tibetan character co-occurrence networks in general may share these properties.
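    The three measures named in the abstract can be sketched on a toy network as follows. This is an illustrative sketch only: it assumes that two characters are linked when they are adjacent in a passage (the abstract does not state the co-occurrence window), uses Latin letters as stand-ins for Tibetan characters, and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations

def build_cooccurrence(passages):
    """Undirected graph linking characters that appear next to each other."""
    adj = {}
    for text in passages:
        for a, b in zip(text, text[1:]):
            if a == b:
                continue  # ignore self-loops
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj

def clustering_coefficient(adj):
    """Average local clustering; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def average_shortest_path(adj):
    """Mean BFS distance over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0

def degree_distribution(adj):
    """Map each degree to the number of nodes having that degree."""
    counts = {}
    for nbrs in adj.values():
        counts[len(nbrs)] = counts.get(len(nbrs), 0) + 1
    return counts
```

    A small-world network combines a short average path length with a high clustering coefficient, while a scale-free network has a degree distribution that follows a power law; on real corpora the latter would be checked by fitting the distribution on a log-log scale.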
  • Ethnic Language Processing and Cross Language Processing
    JIN Guozhe, CUI Rongyi
    2018, 32(10): 53-58,68.
    Korean POS tagging is the basis of Korean information processing, and its results directly affect Korean natural language processing. First, to solve the problem of inconsistency between the representation morpheme and the original morpheme, this paper proposes a method for recovering the original form of Korean morphemes that integrates Korean Jamo information into a seq2seq model. An LSTM-CRF model is then used to perform Korean spacing and POS tagging. Experimental results show that the method achieves a POS tagging F1-score of 94.75%, outperforming other methods.
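    As background on what "Jamo information" means, a precomposed Hangul syllable can be split into its leading consonant, vowel, and optional trailing consonant with the standard Unicode arithmetic. The sketch below shows only that decomposition; how the paper feeds Jamo into its seq2seq model is not described in the abstract, and the function names are hypothetical.

```python
import unicodedata

# Constants from the Unicode Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = 19 * N_COUNT        # 11172 precomposed syllables

def decompose(text):
    """Return a list of conjoining Jamo; other characters pass through."""
    out = []
    for ch in text:
        idx = ord(ch) - S_BASE
        if 0 <= idx < S_COUNT:
            out.append(chr(L_BASE + idx // N_COUNT))               # leading consonant
            out.append(chr(V_BASE + (idx % N_COUNT) // T_COUNT))   # vowel
            tail = idx % T_COUNT
            if tail:                                               # 0 means no trailing consonant
                out.append(chr(T_BASE + tail))
        else:
            out.append(ch)
    return out

def matches_nfd(text):
    """Cross-check the arithmetic against the standard library's NFD form."""
    return "".join(decompose(text)) == unicodedata.normalize("NFD", text)
```

    For instance, decompose("한") yields the three Jamo U+1112, U+1161, and U+11AB, which a character-level model can consume as separate input symbols.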
  • Information Extraction and Text Mining
    ZHOU Guohua, SONG Jie, YIN Xinchun
    2018, 32(10): 59-68.
    Cost-sensitive learning can efficiently address the class imbalance problem in practical applications. However, when the label information of samples is limited or insufficient, the classification accuracy of a cost-sensitive classifier drops significantly. To address this issue, a novel classification method named locality preserving cost-sensitive Laplacian support vector machine (LPCS-LapSVM) is proposed. LPCS-LapSVM extends the semi-supervised learning framework by combining the ideas of cost-sensitive learning and the local geometry of the data. By taking into account the intrinsic information and the local geometric distribution of samples, LPCS-LapSVM improves the classification performance of the cost-sensitive support vector machine when labeled samples are limited. Experimental results on UCI data sets demonstrate the advantages of the proposed method.
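    The cost-sensitive ingredient alone can be illustrated with a class-weighted hinge loss, in which errors on the minority class are penalized more heavily. This is a generic sketch and not LPCS-LapSVM itself: the Laplacian and locality-preserving terms are omitted, and the function name and cost values are hypothetical.

```python
def cost_sensitive_hinge(scores, labels, cost_pos=5.0, cost_neg=1.0):
    """Mean hinge loss with a per-class misclassification cost.

    scores: real-valued decision values f(x)
    labels: +1 for the (assumed minority) positive class, -1 otherwise
    """
    total = 0.0
    for score, label in zip(scores, labels):
        cost = cost_pos if label == 1 else cost_neg
        total += cost * max(0.0, 1.0 - label * score)
    return total / len(labels)
```

    With cost_pos larger than cost_neg, a margin violation on a positive sample contributes more to the objective, which pushes the decision boundary away from the minority class.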
  • Information Extraction and Text Mining
    DONG Xiaozheng, SONG Rui, HONG Yu, ZHU Fenhong, ZHU Qiaoming
    2018, 32(10): 69-77.
    This paper studies Domain-oriented Headline Classification (DHC) of Chinese news. Previous work defined DHC as a short text classification problem and applied classical classification models and the convolutional neural network (CNN) model to solve it. However, these methods ignore the intrinsic features of the headline, i.e. a compulsive semantic expression based on the condensed text and weakly related terms. To exploit the advantage of RNNs in semantic representation, we apply Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) to DHC, achieving up to 81% F1-score. In addition, we systematically analyze the performance of state-of-the-art neural classification models in order to reveal their common advantages and disadvantages for DHC. By comparing “multi-classification” with “binary-classification”, we observe that existing neural network models fail to exceed 81% F1-score on samples with strong domain ambiguity and weak domain characteristics.
  • Information Extraction and Text Mining
    PENG Zhensheng, GONG Qingge, GAO Zhiqiang, DUAN Yanyu, ZENG Zixian
    2018, 32(10): 78-86.
    To extract news titles automatically from large numbers of complex and nonstandard Web pages, this paper proposes a news title extraction algorithm based on density and text features (TEDT). A corpus decision model is built by combining the text density distribution and the language features of a Web page. The model divides the Web page into a corpus area and a candidate title area; after the corpus is selected, the corresponding key-value weight set is calculated with the TextRank algorithm. An improved similarity calculation method is then applied to extract the news title. Experimental results show that TEDT achieves better accuracy and recall than traditional news title extraction algorithms based on rules and similarity, and that it is effective not only for mainstream news websites but also for complex and nonstandard Web pages.
  • Sentiment Analysis and Social Computing
    ZHAO Liqiang, JIANG Chong, JING Ke
    2018, 32(10): 87-97.
    At present, there is little research on content distribution strategies for Internet novel servers, and there is a lack of scientific and effective criteria for evaluating the popularity of online novels. This paper proposes a way to measure the popularity of online novels, using novels retrieved from Qidian (www.qidian.com). Bayesian networks, the random forest algorithm, and logistic regression are applied to build prediction models, and the random forest performs best with 97.097% accuracy (with an MSE of 0.1128). With this method, the inaccurate deployment of low-hit novels in CDN systems and the resulting user access delay can be alleviated, providing effective guidance for content distribution strategy and improving the content hit rate.
  • Sentiment Analysis and Social Computing
    DENG Wenjun, YUAN Hua, QIAN Yu
    2018, 32(10): 98-108.
    With the rapid development of social media, more and more enterprises use it to release information that has important commercial and research value. However, social media data is noisy, multi-type, and multi-theme, which makes evolution analysis challenging. This paper presents a method for enterprise behavior identification and evolution analysis. First, the enterprise information is identified; then the behavior of the enterprise is analyzed. Finally, marketing suggestions are provided for competing enterprises according to the evolution analysis. Experimental results show that the method of enterprise behavior recognition and evolution analysis is of great value.
  • NLP Application
    XU Linhong, LIN Hongfei, QI Ruihua, YANG Liang
    2018, 32(10): 109-117.
    Advertising language is an indispensable part of advertising communication and condenses the core value of a brand. Based on ancient poetry, this paper proposes a model for generating and evaluating homophonic advertisements via a multi-feature fusion approach. First, a speech template is applied to obtain the candidate advertising phylum, and nine characteristics of the advertising language are calculated along four dimensions: pronunciation, shape, semantics, and sentiment. Then the characteristic matrix of the candidate advertising cluster is obtained. Finally, the high-valued advertising cluster is filtered using an evaluation algorithm based on principal component analysis and weight coefficients. Experimental results show that the proposed method can accurately evaluate the quality of advertising language and is close to the results of manual evaluation.
  • Machine Reading Comprehension
    LIU Kai, LIU Lu, LIU Jing, LV Yajuan, SHE Qiaoqiao, ZHANG Qian, SHI Yingchao
    2018, 32(10): 118-129.
    Machine Reading Comprehension (MRC) is a challenging task in the field of Natural Language Processing (NLP) and Artificial Intelligence (AI). The 2018 NLP Challenge on Machine Reading Comprehension (MRC2018) aims to advance MRC technologies and applications. The challenge releases the largest-scale, open-domain, application-oriented Chinese MRC dataset, provides open-source baseline systems, and adopts improved evaluation metrics. Over one thousand teams registered for the challenge, and the overall performance of the participating systems has improved greatly. This paper presents an overall introduction to MRC2018 and gives a detailed description of the evaluation task settings, evaluation organization, evaluation results, and the corresponding result analysis.
  • Machine Reading Comprehension
    LIANG Xiaobo, REN Feiliang, LIU Yongkang, PAN Lingfeng, HOU Yining, ZHANG Yi, LI Yan
    2018, 32(10): 130-137.
    Machine reading comprehension (MRC) is an important task in natural language processing and artificial intelligence. To improve Chinese multi-document MRC, this paper proposes N-Reader, an end-to-end MRC model based on neural networks. It applies a two-layer self-attention mechanism to encode the input documents so as to utilize both the information from a single document and the similarity information across multiple documents. This paper also proposes a multi-paragraph completion algorithm to preprocess the input documents. This preprocessing method can further recognize semantically related paragraphs among the input documents and contributes to a better answer sequence. In the “2018 NLP Challenge on Machine Reading Comprehension” jointly organized by the Chinese Information Processing Society of China (CIPS), the China Computer Federation (CCF), and Baidu Inc., our model ranks third in a highly competitive field.
  • Script Processing
    GU Shaotong
    2018, 32(10): 138-142.
    Oracle-bone script is a mature writing system used in the Shang dynasty, engraved on tortoise shells and animal bones. It is essentially a plane figure in which strokes and structures are not stable, and many characters look like pictures. It is therefore hard to distinguish clear structures, and the characters are hard to write and remember. Existing coding input methods have few users, low efficiency, and limited use. This paper analyzes the fractal properties of oracle-bone script according to the theory of fractal geometry. On this basis, a 2D rectangular coordinate system is established with its origin at the center of gravity of the glyph, dividing the planar graph of the glyph into four quadrants. Using the principles of fractal geometry, a component description code is formed for each glyph by calculating the fractal dimensions of the glyph and of each quadrant. An oracle-bone character is then identified by matching against a fractal feature library of oracle-bone script. Experimental results show that the fractal geometry scheme is effective for recognizing oracle-bone script.
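    The quadrant-wise description above can be sketched as follows, assuming a glyph is given as a set of integer pixel coordinates and using the standard box-counting estimator of fractal dimension. The abstract does not specify the paper's exact estimator or code format, so the function names, box sizes, and resulting feature vector are hypothetical.

```python
import math

def centroid(pixels):
    """Center of gravity of the glyph's foreground pixels."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)

def quadrants(pixels):
    """Split pixels into four quadrants around the centroid (y axis pointing up)."""
    cx, cy = centroid(pixels)
    quads = [[], [], [], []]
    for x, y in pixels:
        if y >= cy:
            idx = 0 if x >= cx else 1
        else:
            idx = 3 if x >= cx else 2
        quads[idx].append((x, y))
    return quads

def box_counting_dimension(pixels, sizes=(1, 2, 4, 8)):
    """Slope of log(box count) against log(1 / box size), by least squares."""
    if not pixels:
        return 0.0
    xs, ys = [], []
    for size in sizes:
        boxes = {(x // size, y // size) for x, y in pixels}
        xs.append(math.log(1.0 / size))
        ys.append(math.log(len(boxes)))
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    denom = sum((a - mean_x) ** 2 for a in xs)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / denom

def fractal_features(pixels):
    """Whole-glyph dimension followed by the four quadrant dimensions."""
    return [box_counting_dimension(pixels)] + [
        box_counting_dimension(q) for q in quadrants(pixels)
    ]
```

    Recognition would then compare such a feature vector with the entries of a feature library, for example by nearest-neighbour distance; that matching step is left out here.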