2018 Volume 32 Issue 7 Published: 16 July 2018
  

  • Language Analysis and Calculation
    LIU Yang, JI Lixin, HUANG Ruiyang, ZHU Yuhang, LI Xing
    2018, 32(7): 1-10,19.
    Existing methods that map each word to a single vector do not account for polysemy, which can cause ambiguity. Rather than mapping a word to multiple vectors, this paper proposes a method for computing multi-sense word embeddings that 1) fuses a hierarchical attention mechanism with a non-residual encapsulated gated convolution mechanism in the sub-sense layer and the synthetic sense layer of the words in the selected context window, and 2) obtains the synthetic sense embedding of the target word under an asymmetric window in order to predict the target word. On a small-scale corpus, the proposed multi-sense word embedding improves accuracy on the word analogy task by up to 1.42% and improves results on word similarity tasks, including WordSim353, MC, RG, and RW, by 2.11% on average (up to 5.47%). The method also significantly improves language modeling performance compared with other methods that predict target words.
  • Language Analysis and Calculation
    TAN Yongmei, LIU Shuwen, LV Xueqiang
    2018, 32(7): 11-19.
    This paper presents a Chinese textual entailment recognition method based on CNN and BiLSTM. The method uses the CNN and BiLSTM to extract relevant features automatically and then generates an initial result through a fully connected layer. The initial result is further processed by semantic rules to produce the final result. Evaluated on the 2014 RITE-VAL dataset, the method obtains 61.74%, outperforming the top-ranked 61.51% in that evaluation campaign.
  • Language Resources Construction
    REN Lu, YANG Liang, XU Linhong, FAN Xiaochao, DIAO Yufeng, LIN Hongfei
    2018, 32(7): 20-29.
    Jokes, as a national intangible cultural heritage, have long been part of daily life. Familiar as they are to people, they remain one of the challenging research issues in artificial intelligence. The large-scale Chinese joke corpus constructed in this paper provides necessary data resources for subsequent research. This paper summarizes the theoretical basis on which the joke corpus is built, and then gives a detailed introduction to corpus construction, corpus analysis and so on. It also distinguishes jokes from stories, news, proverbs, and Weibo expressions, revealing such joke features as conciseness, episode and emotions. An experiment is carried out to validate jokes against negative samples of equal length.
  • Machine Translation
    CAI Zilong, YANG Mingming, XIONG Deyi
    2018, 32(7): 30-36.
    Neural machine translation performs well on language pairs with large-scale training data, but less satisfactorily on low-resource language pairs. This paper employs data augmentation to expand the training data for low-resource language pairs, which can enhance the generalization ability of neural machine translation. In experiments on the Tibetan-Chinese and Chinese-English language pairs, translation quality improves significantly on both tasks, with a gain of 4.0 BLEU points at a training scale of 100,000 sentence pairs.
  • Machine Translation
    LI Yinqiao, HAN Ambyer, XIAO Tong, BO Le, ZHU Jingbo, ZHANG Li
    2018, 32(7): 37-43.
    Data parallelism aims to reduce the time needed to train a neural language model without changing the network structure. However, the results are unsatisfactory because of frequent data transmission between multiple devices. In this paper, we compare gradient update strategies based on the All-Reduce algorithm and on a sampling-based approach to data transmission. On four NVIDIA TITAN X (Pascal) GPUs, they achieve acceleration rates of 25% and 41%, respectively. We also discuss the applicability of data parallelism and the influence of the hardware connection mode.
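The All-Reduce gradient update mentioned in this abstract can be illustrated with a minimal single-process sketch. It only simulates the communication pattern (sum across workers, then give every worker the same averaged gradient); the function names `all_reduce_mean` and `sgd_step` are hypothetical and are not taken from the paper, and the sampling-based strategy is not modeled here.

```python
from typing import List


def all_reduce_mean(worker_grads: List[List[float]]) -> List[List[float]]:
    """Simulate All-Reduce averaging: every worker ends with the same mean gradient."""
    if not worker_grads:
        return []
    n_workers = len(worker_grads)
    dim = len(worker_grads[0])
    if any(len(g) != dim for g in worker_grads):
        raise ValueError("all gradients must have the same length")
    # Reduce step: element-wise sum of the gradients across workers.
    summed = [sum(g[i] for g in worker_grads) for i in range(dim)]
    mean = [s / n_workers for s in summed]
    # Broadcast step: each worker receives its own copy of the result.
    return [list(mean) for _ in range(n_workers)]


def sgd_step(params: List[float], grad: List[float], lr: float) -> List[float]:
    """Plain SGD update applied identically on every worker after synchronization."""
    return [p - lr * g for p, g in zip(params, grad)]


# Four simulated workers (assumed, to mirror the four GPUs in the abstract),
# each holding a gradient computed on its own mini-batch shard.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
synced = all_reduce_mean(grads)
new_params = sgd_step([1.0, 1.0], synced[0], lr=0.5)
```

Because every worker applies the same averaged gradient, the model replicas stay identical after each step; the cost is that the full gradient has to be exchanged at every update.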
  • Ethnic Language and Cross Language Information Processing
    WEN Zixiao, BAO Feilong, GAO Guanglai, WANG Yonghe, SU Xiangdong
    2018, 32(7): 44-51,57.
    This paper presents a well-functioning information retrieval system for both traditional Mongolian and Cyrillic Mongolian. In web crawling, the MD5 algorithm is applied to improve crawler performance. In preprocessing, Mongolian documents are processed for code conversion, affix analysis and proofreading. The retrieval module is built upon the Vector Space Model. In addition, a module that converts Cyrillic Mongolian to traditional Mongolian is developed to meet the application requirements.
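As a rough illustration of retrieval with the Vector Space Model, the sketch below ranks documents by cosine similarity between TF-IDF vectors. The weighting scheme (raw term frequency times a smoothed IDF), the function names, and the placeholder English tokens are all assumptions made for illustration; the abstract does not specify the weighting, and Mongolian tokenization and affix analysis are out of scope here.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tf_idf_vectors(docs: List[List[str]]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Build sparse TF-IDF vectors for tokenized documents (assumed weighting)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query: List[str], docs: List[List[str]]) -> List[int]:
    """Return document indices sorted by decreasing similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection are ignored.
    q = {t: c * idf[t] for t, c in Counter(query).items() if t in idf}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]


docs = [["horse", "grass"], ["horse", "race", "race"], ["river", "grass"]]
ranking = rank(["race"], docs)
```

Ties are broken by document index so that the ranking is deterministic.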
  • Ethnic Language and Cross Language Information Processing
    Dawel Abilhayer, Nurmamet Yolwas, LIU Yan
    2018, 32(7): 52-57.
    In Kazakh, the acoustic properties of the nine vowels play an important role in speech recognition. This paper investigates the vowel patterns in Kazakh polysyllabic words using the methods of experimental phonetics. For 1062 polysyllabic words selected from the phonetic library, we investigate the vowel patterns in the first, medial and final syllables using the Joos method, and summarize the vowel formant patterns of Kazakh polysyllabic words. The study enriches speech research on and applications of the Kazakh language.
  • Ethnic Language and Cross Language Information Processing
    LhakpaDondrub, Ngodrup, ZU Yiqing, PEI Chunbao
    2018, 32(7): 58-66.
    With the development of Tibetan speech synthesis, polyphonic words have become the main barrier, and the issue has received little attention. Unlike in Mandarin, a dictionary alone is not enough to deal with this issue in Tibetan. Based on the grammatical rules and phonetic features of the Tibetan language and the General Tibetan-Chinese Dictionary, this paper collects 465 polyphonic Tibetan words and conducts a statistical analysis over 372320 sentences. The formation, occurrence, and pronunciation rules of these words are investigated, and a disambiguation method is designed. Experimental results are provided, validating the method as a promising front-end text analysis component for Tibetan speech synthesis systems.
  • Ethnic Language and Cross Language Information Processing
    ZHANG Xiqun, MA Longlong, DUAN Lijuan, LIU Zeyu, WU Jian
    2018, 32(7): 67-73,81.
    The digitization of historical documents has attracted increasing research interest in recent years. Focusing on layout analysis, an essential step in digitizing historical documents, this paper proposes a convolutional denoising auto-encoder approach for historical Tibetan documents. First, the document images are clustered into superpixel blocks. Then, a convolutional autoencoder is used to extract features from these blocks. Finally, the superpixel blocks are classified by an SVM classifier, so that the different parts of a historical Tibetan document are identified. Experiments on a dataset of historical Tibetan documents show that our method can effectively separate the different layout elements of these documents.
  • Information Extraction and Text Mining
    REN Liyuan, XIE Zhenping, LIU Yuan
    2018, 32(7): 74-81.
    Automatic document summarization aims to extract brief and important information from massive texts. To explore novel features for text summarization, a knowledge network is introduced to model document information. Specifically, the key words of documents are viewed as network nodes, and sentences are represented as paths of sequential key words on the knowledge network. A feature model for the penetrability of key words is then proposed, in which the width and depth of penetrability of key words are defined to measure each sentence. A maximum entropy based document summarization model is implemented with the proposed feature, and experiments validate its effectiveness.
  • Information Extraction and Text Mining
    ZHAO Zhehuan, YANG Zhihao, SUN Cong, LIN Hongfei
    2018, 32(7): 82-90.
    Protein-protein interaction extraction can be widely applied in life science research. Most machine learning based methods focus on binary relationship extraction for high precision, while rule-based strategies can extract complex relations (“protein1, relational word, protein2”) but with low recall. This paper proposes a hybrid protein-protein interaction extraction method. In this method, machine learning methods are first applied to recognize protein entities and extract relational protein pairs. Then, syntactic patterns and a dictionary are employed to find the relational words that represent the relationship between the two proteins. This method obtains an F-score of 40.18% on the AIMed corpus, outperforming either of the two methods alone.
  • Information Extraction and Text Mining
    YANG Changqing
    2018, 32(7): 91-98.
    Unsupervised attribute selection algorithms do not consider classification information or the low rank of attributes. To address this issue, this paper proposes an attribute selection algorithm that combines K-means clustering with a low-rank constraint. The algorithm embeds the self-expression method into the framework of the linear regression model. At the same time, K-means clustering is used to generate pseudo-class labels that maximize the class spacing and yield a sparser structure. The algorithm uses the l2,p-norm instead of the traditional l2,1-norm, so that the sparsity of the result can be adjusted flexibly through the parameter p. It is also proved that the algorithm has the characteristics of linear discriminant analysis and that it converges. The experimental results show that the accuracy of the proposed algorithm is 17.04%, 13.95%, 3.6% and 9.39% higher than that of the NFS, LDA, RFS and RSR algorithms, respectively, with the lowest variance in classification accuracy.
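To make the role of the parameter p concrete, the sketch below computes the l2,p-norm of a weight matrix under the commonly used definition, the p-th root of the sum of the row-wise Euclidean norms raised to the power p. It is assumed, not confirmed by the abstract, that the paper uses this definition; the function name `l2p_norm` is hypothetical. With p = 1 the value reduces to the l2,1-norm, and for 0 < p < 1 the quantity is a quasi-norm that pushes more rows toward zero when used as a regularizer.

```python
import math
from typing import List


def l2p_norm(matrix: List[List[float]], p: float) -> float:
    """Return (sum_i ||row_i||_2 ** p) ** (1 / p) for a matrix given as a list of rows."""
    if p <= 0:
        raise ValueError("p must be positive")
    # Euclidean (l2) norm of each row; a zero row marks an unselected attribute.
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in matrix]
    return sum(r ** p for r in row_norms) ** (1.0 / p)


W = [[3.0, 4.0], [6.0, 8.0]]   # row norms are 5 and 10
l21 = l2p_norm(W, 1.0)         # special case: the l2,1-norm
l2half = l2p_norm(W, 0.5)      # smaller p, stronger sparsity pressure
```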
  • Information Retrieval and Question Answering
    SONG Haoyu, ZHANG Weinan, LIU Ting
    2018, 32(7): 99-108,136.
    Sustaining effective multi-turn dialogues is a challenge for open-domain dialogue systems. Current neural dialogue generation models tend to fall into conversational black holes by generating safe responses, without considering future information. Inspired by the global view of reinforcement learning methods, we present an approach that learns a multi-turn dialogue policy with a DQN (deep Q-network). We introduce a deep neural network to evaluate each candidate sentence and choose as the response the sentence with the maximum future reward, instead of the one with the highest generation probability. The results show that our method increases the average number of dialogue turns by 2 in the automatic evaluation and outperforms the baseline model by 45% in the human evaluation.
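The selection rule described in this abstract, choosing the candidate with the largest estimated future reward rather than the largest generation probability, can be sketched as follows. The scoring function `toy_q_value` is a hypothetical stand-in for a trained deep Q-network, and the candidate sentences and probabilities are invented for illustration only.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (response text, generation probability)


def select_by_probability(candidates: List[Candidate]) -> str:
    """Baseline: pick the response the generator considers most likely."""
    return max(candidates, key=lambda c: c[1])[0]


def select_by_future_reward(
    state: str,
    candidates: List[Candidate],
    q_value: Callable[[str, str], float],
) -> str:
    """Pick the response whose estimated future reward Q(state, response) is largest."""
    return max(candidates, key=lambda c: q_value(state, c[0]))[0]


def toy_q_value(state: str, response: str) -> float:
    """Hypothetical Q-function: penalize a generic safe reply, reward longer replies."""
    if response.lower().startswith("i don't know"):
        return -1.0
    return float(len(response.split()))


state = "I am planning a trip."
candidates = [
    ("I don't know.", 0.6),
    ("Which city are you visiting next week?", 0.3),
]
baseline_choice = select_by_probability(candidates)
policy_choice = select_by_future_reward(state, candidates, toy_q_value)
```

In this toy setting the baseline returns the safe reply, while the reward-based rule returns the question that is more likely to keep the conversation going.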
  • Sentiment Analysis and Social Computing
    YE Yongjun, LI Peng, ZHOU Meilin, WAN Yifang, WANG Bin
    2018, 32(7): 109-115.
    In today's microblogging systems such as Twitter and Weibo, finding high-quality users to follow is essential for acquiring information. This paper focuses on the task of high-quality user identification, i.e., given a domain query, returning a list of users ranked by user quality. We divide the task into two sub-problems: user search and user ranking. For user search, we represent users by their tags and propose a similarity-based retrieval approach using the Chinese Wikipedia, which is essentially an extension of the existing ESA (explicit semantic analysis) method. For user ranking, we propose a graph-based ranking method called UBRank, which considers both the quantity and the quality of published posts to measure user importance. Experiments indicate that using the Chinese Wikipedia works better than using other resources such as HowNet, and validate the efficiency and superiority of the ranking method.
  • Sentiment Analysis and Social Computing
    ZHANG Donglei, LIN Youfang, WAN Huaiyu, MA Yudan, LU Jinliang
    2018, 32(7): 116-127.
    Online communities are important platforms for technology enthusiasts and practitioners to exchange and share information. To capture the expertise and interests of each user simultaneously, this paper proposes an Author-Reader-Topic (ART) model based on the fact that a user in a community is both a producer (author) and a consumer (reader) of content. By linking the authors and the readers of documents, the model improves topic clustering and yields more accurate author-topic and reader-topic distributions. We conducted an experimental comparison and analysis on a real data set collected from the CSDN community. The experiments show that the proposed model can effectively discover users' expertise and interests, significantly outperforming existing methods.
  • NLP Application
    WU Yufeng, WU Shengtao, ZHU Tingshao, LIU Hongfei, JIAO Dongdong
    2018, 32(7): 128-136.
    Previous psychological studies of novels have mainly analyzed the characters qualitatively, relying on the researchers' subjective personal experience, which is less stable and systematic than personality analysis of the characters. Using a machine learning model based on a Chinese psychology analysis system and the characters' dialogues in Ordinary World, we obtained big-five personality scores for four characters in the novel. Then, by comparing the predicted scores with documents on the characters' psychological analysis and with descriptions in the novel, the validity of this method was established. The results revealed that the younger characters (Shaoping Sun and Xiaoxia Tian) scored relatively high on Openness, while the older ones (Shaoan Sun and Runye Tian) scored relatively high on Extraversion. In addition, Shaoping Sun and Runye Tian scored relatively high on Conscientiousness, Shaoan Sun and Xiaoxia Tian on Agreeableness, and Shaoan Sun and Shaoping Sun on Neuroticism. The results demonstrate that the personality of novel characters can be analyzed through an objective, systematic and intelligent approach.
  • NLP Application
    JIAO Qingju, GAO Feng, JIN Yuanyuan, XIONG Jing, LIU Yongge
    2018, 32(7): 137-142.
    Predicting the semantics of unknown oracle bone inscriptions is an important topic, and also a bottleneck for historians and computer scientists in oracle bone inscription research. Based on the massive accumulated oracle data, we first define the distance between two oracle characters by modeling large-scale rubbing information, then construct a network of oracle characters, and finally analyze the degree distribution, local link rate, clustering coefficient and modularity of the network. Experiments show that the constructed network has a strong modular structure, reflects the fact that there are more monosyllabic words than polysyllabic words, and captures the semantic units of rubbings.
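Two of the network statistics named in this abstract, the degree distribution and the clustering coefficient, can be computed on a small undirected graph as sketched below. The edge list and the character labels "A" to "D" are placeholders; how edges are derived from the distance between oracle characters is not specified in the abstract and is not modeled here.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected adjacency map, ignoring self-loops."""
    adj: Graph = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_distribution(adj: Graph) -> Dict[int, int]:
    """Map each degree value to the number of nodes having that degree."""
    return dict(Counter(len(nbrs) for nbrs in adj.values()))


def local_clustering(adj: Graph, node: Hashable) -> float:
    """Fraction of pairs of a node's neighbors that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj: Graph) -> float:
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)


# A triangle A-B-C with one pendant node D attached to C.
graph = build_graph([("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")])
degrees = degree_distribution(graph)
avg_cc = average_clustering(graph)
```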