2018 Volume 32 Issue 2 Published: 15 February 2018
  

  • Survey
    WU Lianwei, RAO Yuan, FAN Xiaobing, YANG Hao
    2018, 32(2): 1-11,21.
    There is a large amount of rumor, extreme content and fake news on the Internet, which reduces the quality of information, undermines the credible atmosphere of the Internet, and has serious negative effects on the emergence and development of public opinion. To measure the credibility of information, this paper divides non-credible content into types such as extreme emergency information, extreme online information, network rumors, misinformation, disinformation and spam. These types of content are studied from the following aspects: concept, description of content features, credibility modeling and credibility evaluation, which provides a solid foundation for the analysis and measurement of the credibility of information content in social networks. Finally, we analyze the directions of development in current research on information credibility.
  • Language Analysis and Calculation
    LIU Yang, LIN Zi, KANG Sichen
    2018, 32(2): 12-21.
    Morphemes and word-formation analysis are the starting point for the semantic analysis of Chinese as a parataxis language, and also the key to understanding the meaning of words. This paper presents a novel approach to exploring Chinese Semantic Primitives and using them for word meaning analysis: first, form Synonymous Morpheme Sets, used to denote Morphemic Concepts, based on similarity calculation over Chinese morpheme glosses; then, form the Morphemic Concept Hierarchy, serving as a systematic description of the Chinese Semantic Primitives, following the principles of Generative Lexicon Theory. Built on these, Chinese Semantic Word-formation Analysis has made new progress in both overall design and data mining. These ideas, practices and language resources are also expected to promote applications in the humanities and in computing.
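    As a rough illustration of the first step only, the toy sketch below groups morphemes into synonymous sets when the overlap of their dictionary glosses clears a threshold; the glosses, the Jaccard measure and the threshold are placeholders, not the paper's actual similarity calculation.
    ```python
    # Illustrative sketch only: morphemes are grouped into synonymous sets when the
    # Jaccard overlap of their glosses is high enough (toy glosses and threshold).
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    glosses = {
        "目": "eye; organ of sight",
        "眼": "eye; organ of vision",
        "足": "foot; lower limb",
    }

    sets = []                       # each synonymous morpheme set is a list of morphemes
    for morpheme, gloss in glosses.items():
        for group in sets:
            if jaccard(gloss.split(), glosses[group[0]].split()) > 0.5:
                group.append(morpheme)
                break
        else:
            sets.append([morpheme])

    print(sets)                     # [['目', '眼'], ['足']]
    ```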
  • Language Analysis and Calculation
    FU Jianhui, WANG Shi, CAO Cungen
    2018, 32(2): 22-28,49.
    Metaphor is pervasive in natural language, and metaphor recognition is one of the challenging topics in natural language processing. Existing classification-based metaphor recognition methods suffer from data sparsity, which hurts classification performance. In this paper, we propose a metaphor phrase recognition method that combines classification and clustering to improve performance. The method first clusters the phrases that share source words S, and then uses the clustering results as features for classification. The classifier also performs satisfactorily on phrases whose source words are missing. Experiments show that our method achieves a high recall rate.
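    A minimal sketch of the general combine-clustering-with-classification idea, assuming phrase embeddings are already available; the random vectors, cluster count and classifier below are illustrative stand-ins, not the paper's setup.
    ```python
    # Hypothetical sketch: use phrase clusters as extra features for a metaphor classifier.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Toy stand-ins: 200 phrases, each represented by a 50-dim embedding
    # (in the paper these would come from phrases sharing source words).
    phrase_vecs = rng.normal(size=(200, 50))
    labels = rng.integers(0, 2, size=200)          # 1 = metaphorical, 0 = literal

    # Step 1: cluster the phrases; the cluster id acts as a coarse semantic class.
    kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
    cluster_ids = kmeans.fit_predict(phrase_vecs)

    # Step 2: append a one-hot cluster indicator to the original features.
    one_hot = np.eye(10)[cluster_ids]
    features = np.hstack([phrase_vecs, one_hot])

    # Step 3: train the classifier on the enriched features.
    clf = LogisticRegression(max_iter=1000).fit(features, labels)
    print("train accuracy:", clf.score(features, labels))
    ```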
  • Morphology, Syntax, Semantics Analysis
    JIN Chen, LI Weihua, JI Chen, JIN Xuze, GUO Yanbu
    2018, 32(2): 29-37.
    Chinese word segmentation (CWS) is a fundamental task in Chinese natural language processing (NLP) and substantially affects subsequent NLP tasks. At present, the state-of-the-art solutions are based on classical machine learning models. Recently, the Long Short-Term Memory (LSTM) model has been proposed to solve the long-term dependency problem of classical RNNs, and has already been well adapted to various NLP tasks. For the CWS task, we add a backward LSTM layer on top of the unidirectional classical LSTM to build a Bi-directional Long Short-Term Memory (Bi-LSTM) neural network, and we propose a contribution rate to balance the state matrices of the forward and backward LSTM layers. We design four experiments to demonstrate that our model is reliable and preferable.
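    The sketch below shows one plausible reading of the contribution rate: a scalar alpha weighting the forward against the backward hidden states before tagging. The layer sizes, the fixed alpha and the BMES tag set are assumptions for illustration, not the paper's exact architecture.
    ```python
    # Hypothetical sketch of a Bi-LSTM tagger with a "contribution rate" alpha
    # that balances the forward and backward hidden states (names are illustrative).
    import torch
    import torch.nn as nn

    class BiLSTMSegmenter(nn.Module):
        def __init__(self, vocab_size, emb_dim=100, hidden=128, n_tags=4, alpha=0.5):
            super().__init__()
            self.alpha = alpha                       # contribution rate of the forward layer
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.fwd = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.bwd = nn.LSTM(emb_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_tags)     # BMES tags

        def forward(self, chars):                    # chars: (batch, seq_len) of char ids
            emb = self.embed(chars)
            h_fwd, _ = self.fwd(emb)                 # left-to-right pass
            h_bwd, _ = self.bwd(torch.flip(emb, dims=[1]))
            h_bwd = torch.flip(h_bwd, dims=[1])      # re-align backward states
            h = self.alpha * h_fwd + (1 - self.alpha) * h_bwd
            return self.out(h)                       # per-character tag scores

    scores = BiLSTMSegmenter(vocab_size=5000)(torch.randint(0, 5000, (2, 10)))
    print(scores.shape)                              # torch.Size([2, 10, 4])
    ```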
  • Morphology, Syntax, Semantics Analysis
    WANG Dongsheng, WANG Shi, WANG Weimin, FU Jianhui, ZHU Feng
    2018, 32(2): 38-49.
    Accurate understanding of users' intentions is the key to domain-specific question answering (QA) systems, whereas open-domain QA systems can often exploit data redundancy to improve performance. In this paper, we first propose a new robust constrained semantic grammar, which resolves parsing ambiguities at the word, syntax and semantic levels with the support of a domain ontology. We then employ an efficient matching algorithm to deal with matchings that are inconsistent with the constraints of the grammar rules. Finally, the candidate matchings are ranked based on several features, including the density of matching words, the historical matching accuracy of rules, and matching relatedness and unrelatedness. To verify the validity of the proposed method, we apply it to two domain-specific QA systems of different scales. The experimental results show that the proposed method is effective, with understanding accuracies of 82.4% and 86.2% and MRR values of 91.6% and 93.5%, respectively.
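    Purely as an illustration of the final ranking step, the snippet below scores candidate matchings by a weighted sum of the features named in the abstract; the rule names, feature values and weights are invented, and the paper's actual ranking model may differ.
    ```python
    # Toy ranking of candidate matchings by a weighted feature sum (all values invented).
    WEIGHTS = {"word_density": 0.4, "rule_accuracy": 0.3, "relatedness": 0.2, "unrelatedness": -0.1}

    candidates = [
        {"rule": "flight_price(From, To)", "word_density": 0.8, "rule_accuracy": 0.9,
         "relatedness": 0.7, "unrelatedness": 0.1},
        {"rule": "flight_time(From, To)",  "word_density": 0.6, "rule_accuracy": 0.7,
         "relatedness": 0.5, "unrelatedness": 0.4},
    ]

    def score(candidate):
        return sum(WEIGHTS[name] * candidate[name] for name in WEIGHTS)

    best = max(candidates, key=score)
    print(best["rule"], round(score(best), 3))   # flight_price(From, To) 0.72
    ```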
  • Morphology, Syntax, Semantics Analysis
    WANG Mingxuan, LIU Qun
    2018, 32(2): 50-57.
    We propose a new deep long short-term memory (LSTM) network equipped with an elevator unit (EU) for semantic role labeling (SRL). The EU conducts a linear combination of adjacent layers, which allows information to flow unimpeded across several layers; with the EU, a very deep stacked LSTM of up to 20 layers can be easily optimized. In particular, the connection also contains a gate function to regulate and control the information flow along the stacking direction, so that representations at the appropriate levels are directly guided to the output layer for predicting the corresponding semantic roles. Although the model is quite simple, taking only the raw utterances as input, it yields strong empirical results. Our approach achieves an F1 score of 81.56% on the CoNLL-2005 shared dataset and 82.53% on the CoNLL-2012 shared dataset, outperforming the previous state-of-the-art results by 0.5% and 1.26%, respectively. Remarkably, we obtain a surprising improvement of 2.2% F1 on out-of-domain data compared with the previous state-of-the-art system. The model is simple and easy to implement and parallelize, processing 11.8k tokens per second on a single K40 GPU.
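    A minimal sketch of the gated skip idea described above, assuming the elevator unit can be read as a highway-style gate between a layer's input and its LSTM output; the dimensions and the 20-layer stack are illustrative, and this is not the paper's exact formulation.
    ```python
    # Minimal sketch of the "elevator unit" idea: a gated linear combination of a
    # layer's input and its LSTM output, so information can skip across layers.
    import torch
    import torch.nn as nn

    class ElevatorLSTMLayer(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.lstm = nn.LSTM(dim, dim, batch_first=True)
            self.gate = nn.Linear(dim, dim)

        def forward(self, x):                      # x: (batch, seq_len, dim)
            h, _ = self.lstm(x)
            g = torch.sigmoid(self.gate(x))        # gate regulating the information flow
            return g * h + (1 - g) * x             # gated combination of adjacent layers

    # Stack many layers; the skip path keeps very deep stacks trainable.
    deep_srl_encoder = nn.Sequential(*[ElevatorLSTMLayer(128) for _ in range(20)])
    out = deep_srl_encoder(torch.randn(4, 30, 128))
    print(out.shape)                               # torch.Size([4, 30, 128])
    ```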
  • Language Resources Construction
    ZHANG Dakui, YIN Dechun, TANG Shiping, MAO Yu, FAN Xiaozhong
    2018, 32(2): 58-65.
    With the optimization of Chinese word segmentation algorithms, the performance of a word segmenter depends more and more on the coverage and completeness of the training corpus. Therefore, how to build word segmentation corpora quickly, effectively and automatically has become a pressing issue. This paper explores the valuable natural word segmentation information that is produced when users type Chinese text, which provides a new perspective on building Chinese segmentation training corpora that has rarely been touched in the literature. We show that user-produced word segmentation information can be used to build a segmentation corpus of acceptable quality; moreover, some texts carrying this information from excellent users are very close to the gold-standard segmentation. In this study, we use a classification model and a voting mechanism to find three such excellent users and collect their texts with natural word segmentation information. Experimental results show that these texts can be used to build a segmentation training corpus, which greatly improves the accuracy of the segmenter.
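    The toy sketch below shows one way a voting mechanism of this kind could work: several classifiers judge each user's texts and a user is kept only when the averaged vote is high enough. The vote values, the threshold and the aggregation rule are invented for illustration.
    ```python
    # Toy sketch of "excellent user" selection by voting (all judgements invented).
    from statistics import mean

    # votes[user][text] holds 0/1 judgements from three hypothetical classifiers
    votes = {
        "user_a": [[1, 1, 1], [1, 0, 1], [1, 1, 0]],
        "user_b": [[0, 1, 0], [0, 0, 1], [1, 0, 0]],
    }

    def is_excellent(text_votes, threshold=0.6):
        per_text = [mean(v) for v in text_votes]      # classifier agreement per text
        return mean(per_text) >= threshold            # then vote across the user's texts

    excellent = [u for u, v in votes.items() if is_excellent(v)]
    print(excellent)    # ['user_a']
    ```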
  • Other Languages in/around China
    AN Shuaifei, BI Yude, ZHANG Ting
    2018, 32(2): 66-74,80.
    To automatically divide complex sentences into simple sentences, this paper starts with embedded complex sentences in Korean and analyzes the features of their attributive clauses, including the rules governing their left and right borders and the co-occurrence of their internal constituents. A rule-based Korean attributive clause detection method is then proposed, which lays a solid foundation for improving the efficiency of machine translation and other application systems.
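    A deliberately simplified sketch of a border-rule detector of this kind: the right border is a token carrying an adnominal-ending tag (ETM in Sejong-style tagsets) and the left border is found by scanning back to a clause boundary tag. The tag inventory, the boundary rules and the example sentence are assumptions for illustration, not the paper's rule set.
    ```python
    # Highly simplified border rules for Korean attributive clauses (illustrative tags only).
    BOUNDARY_TAGS = {"SF", "EC"}        # sentence-final punctuation / connective ending

    def find_attributive_clauses(tagged):           # tagged: list of (morpheme, tag)
        clauses = []
        for i, (_, tag) in enumerate(tagged):
            if tag == "ETM":                        # right border: adnominal ending
                left = i
                while left > 0 and tagged[left - 1][1] not in BOUNDARY_TAGS:
                    left -= 1                       # left border: previous boundary
                clauses.append(tagged[left:i + 1])
        return clauses

    sentence = [("내", "NP"), ("가", "JKS"), ("읽", "VV"), ("은", "ETM"),
                ("책", "NNG"), ("은", "JX"), ("재미있", "VA"), ("다", "EF")]
    for clause in find_attributive_clauses(sentence):
        print(" ".join(m for m, _ in clause))       # 내 가 읽 은
    ```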
  • Other Languages in/around China
    YAN Xiaodong, HUANG Tao
    2018, 32(2): 75-80.
    Following emotion-dictionary-based text sentiment analysis for Chinese and English, we manually build a comprehensive and efficient polarity dictionary for Tibetan, including a Basic Dictionary, a Privative Dictionary, a Degree Dictionary and a Turning Dictionary. We then detect polarity phrases constituted by polarity words and their qualifiers, and investigate the influence of adversatives on sentence-level emotion polarity. The proposed polarity-dictionary-based Tibetan text sentiment analysis method is validated in experiments with good results.
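    An illustrative sketch of dictionary-based scoring with negation, degree and adversative handling; the English toy lexicons, weights and reset rules below are placeholders standing in for the Tibetan dictionaries and are not the paper's scoring scheme.
    ```python
    # Dictionary-based polarity scoring with negation, degree and adversative handling
    # (toy English lexicons stand in for the Tibetan dictionaries).
    BASIC = {"good": 1.0, "bad": -1.0}
    PRIVATIVE = {"not"}            # negators flip polarity
    DEGREE = {"very": 2.0}         # degree words scale polarity
    TURNING = {"but"}              # adversatives: the following clause dominates

    def score(tokens):
        total, weight, scale, flip = 0.0, 1.0, 1.0, 1.0
        for tok in tokens:
            if tok in TURNING:
                total *= 0.5       # down-weight what came before the adversative
                weight = 2.0       # and emphasise what follows it
            elif tok in PRIVATIVE:
                flip = -flip
            elif tok in DEGREE:
                scale *= DEGREE[tok]
            elif tok in BASIC:
                total += weight * flip * scale * BASIC[tok]
                flip, scale = 1.0, 1.0   # reset phrase-level modifiers
        return total

    print(score("good but very bad".split()))   # negative overall
    ```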
  • Other Languages in/around China
    Halidanmu Abudukelimu, SUN Maosong, LIU Yang, Abudukelimu Abulizi
    2018, 32(2): 81-86.
    THUUyMorph (Tsinghua University Uyghur Morphology Segmentation Corpus) is an Uyghur corpus with morpheme segmentation annotations. The original texts were downloaded from the Tianshan website in 2016 and cover news, law, daily life, etc. The corpus was built by proofreading the original texts, segmenting the sentences and proofreading them, annotating morpheme segmentation automatically and manually, annotating the phonetic harmony phenomenon manually, and manually correcting the morpheme segmentation and phonetic harmony annotations. The corpus contains 10,596 documents, 69,200 sentences and 89,923 word types, annotated at both the word level and the sentence level. The corpus is available at http://thuuymorph.thunlp.org/.
  • Other Languages in/around China
    FENG Wei, YI Mianzhu, MA Yanzhou
    2018, 32(2): 87-93,101.
    Grapheme-to-phoneme conversion (G2P) plays a very important role in building resources for Russian speech information processing. This paper designs an improved Russian phoneme set based on SAMPA, enabling the transcription results to reflect the stress position and vowel reduction of Russian words. After constructing a 20,000-word Russian pronunciation dictionary according to the new phoneme set, the paper implements a data-driven Russian G2P algorithm, employing Weighted Finite-State Transducers (WFSTs) for alignment, model building and decoding. First, a many-to-many alignment algorithm based on Expectation Maximization is applied to the Russian grapheme and phoneme sequences. Then, a joint n-gram model is trained on the alignment results and converted into a WFST as the pronunciation model. Finally, the pronunciation of an unseen input word is predicted through WFST decoding. In cross-validation experiments, the average word accuracy is 62.9%, and the average phoneme accuracy is 92.2%.
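    As a much-reduced stand-in for the joint n-gram model, the sketch below predicts each phoneme from the previous grapheme-phoneme pair over already-aligned training data; the transliterated toy entries are invented, and the real system compiles such a model into a WFST for decoding rather than using a plain dictionary.
    ```python
    # Simplified joint-bigram pronunciation model over aligned grapheme-phoneme pairs.
    from collections import defaultdict, Counter

    # Two invented aligned entries; "o'" marks a stressed vowel, "a" a reduced one.
    aligned = [
        [("d", "d"), ("o", "a"), ("m", "m")],
        [("d", "d"), ("o", "o'"), ("m", "m"), ("a", "a")],
    ]

    bigram = defaultdict(Counter)
    for seq in aligned:
        prev = ("<s>", "<s>")
        for g, p in seq:
            bigram[(prev, g)][p] += 1                  # count P(phoneme | previous pair, grapheme)
            prev = (g, p)

    def predict(graphemes):
        prev, phones = ("<s>", "<s>"), []
        for g in graphemes:
            candidates = bigram.get((prev, g))
            p = candidates.most_common(1)[0][0] if candidates else g   # back off to identity
            phones.append(p)
            prev = (g, p)
        return phones

    print(predict(list("dom")))   # ['d', 'a', 'm']
    ```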
  • Information Extraction and Text Mining
    LIU Mingtong, ZHANG Yujie, XU Jinan, CHEN Yufeng
    2018, 32(2): 94-101.
    This paper proposes a method for paraphrase pattern acquisition based on deep semantic computing. We design a sentence segmentation method based on statistical features to obtain high-quality relational patterns from a non-paraphrase corpus, and the paraphrase patterns are then detected by means of entity relation patterns. Finally, the patterns are automatically clustered according to their semantic similarity. Experimental results on four types of entity relations show that our method acquires paraphrase patterns with good performance, greater diversity and closer semantic relatedness.
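    A hedged sketch of the final clustering step only: candidate patterns are greedily grouped whenever the cosine similarity of their vectors exceeds a threshold. The pattern strings, the random vectors and the threshold are placeholders, and the paper's semantic similarity measure and clustering procedure may differ.
    ```python
    # Greedy cosine-similarity grouping of candidate paraphrase patterns (toy data).
    import numpy as np

    patterns = ["X was born in Y", "X's birthplace is Y", "X founded Y", "Y was founded by X"]
    rng = np.random.default_rng(1)
    vectors = rng.normal(size=(len(patterns), 50))   # stand-ins for semantic embeddings

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    clusters = []                                    # each cluster is a list of pattern indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) > 0.6:
                cluster.append(i)
                break
        else:
            clusters.append([i])

    print([[patterns[i] for i in c] for c in clusters])
    ```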
  • Information Extraction and Text Mining
    JIANG Wei, JIN Zhong
    2018, 32(2): 102-109,119.
    In text classification, bidirectional recurrent neural networks with word-level attention generate the text representation directly from words, which loses a great deal of information and makes the network hard to train on limited data. In fact, words need to be combined into phrases with clear semantics in context, and the meaning of a text is often determined by several key phrases; therefore, a text representation generated by learning the weights of phrases may be more precise than one learned from words. This paper proposes a novel neural network architecture based on a phrase-level attention mechanism. Specifically, a convolutional layer is added after the word embedding layer to extract representations of N-gram phrases, and the text representation is learned by a bidirectional recurrent neural network with attention over these phrases. We test five kinds of attention mechanism in the experiments. Experimental results show that the resulting NN-PA models improve classification performance on both small- and large-scale datasets and converge faster. Both NN-PA1 and NN-PA2 outperform state-of-the-art models based on deep learning, and NN-PA2 achieves 53.35% accuracy on the five-class task of the Stanford Sentiment Treebank, which is the best result to our knowledge.
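    A rough sketch of the convolution-then-attention pipeline described above, assuming 3-gram phrases, a BiLSTM encoder and a single attention layer; the hyper-parameters and layer choices are illustrative and do not reproduce the paper's five NN-PA variants.
    ```python
    # Phrase-level attention sketch: conv over word embeddings -> BiLSTM -> attention.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PhraseAttentionClassifier(nn.Module):
        def __init__(self, vocab=10000, emb=100, n_filters=128, hidden=64, n_classes=5):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.conv = nn.Conv1d(emb, n_filters, kernel_size=3, padding=1)  # 3-gram phrases
            self.rnn = nn.LSTM(n_filters, hidden, batch_first=True, bidirectional=True)
            self.att = nn.Linear(2 * hidden, 1)
            self.out = nn.Linear(2 * hidden, n_classes)

        def forward(self, words):                           # (batch, seq_len)
            x = self.embed(words).transpose(1, 2)           # (batch, emb, seq_len)
            phrases = F.relu(self.conv(x)).transpose(1, 2)  # (batch, seq_len, n_filters)
            h, _ = self.rnn(phrases)                        # contextual phrase states
            weights = torch.softmax(self.att(h), dim=1)     # attention over phrase positions
            text_vec = (weights * h).sum(dim=1)             # weighted text representation
            return self.out(text_vec)

    logits = PhraseAttentionClassifier()(torch.randint(0, 10000, (2, 40)))
    print(logits.shape)                                     # torch.Size([2, 5])
    ```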
  • Information Extraction and Text Mining
    SHEN Yatian, HUANG Xuanjing, CAO Junkuo
    2018, 32(2): 110-119.
    To extract opinion words and opinion targets, we explore several variants of long short-term memory (LSTM) recurrent neural networks for their joint extraction at the sentence level, and compare our models with previous classical approaches. The experimental results show that the LSTM recurrent neural networks outperform the previous baselines, achieving new state-of-the-art results for the joint extraction of fine-grained opinion words and opinion targets.
  • Sentiment Analysis and Social Computing
    MU Yongli, LI Yang, WANG Suge
    2018, 32(2): 120-128.
    Emotion cause detection, a new and challenging task, plays an important role in the field of text emotion analysis. In this paper, we present an emotion cause detection method based on ensembles of convolutional neural networks. The method combines semantic information through operations such as word embedding, convolution and pooling, and uses the integration of multiple CNNs to reduce the impact of data imbalance on emotion cause detection while avoiding tedious processes such as rule construction, feature extraction and feature dimension reduction. The experimental results show that the proposed method performs well.
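    A schematic sketch of the ensemble idea: several small text CNNs, each of which would be trained on a different balanced resample, have their softmax outputs averaged. The network sizes, the five-member ensemble and the soft-voting rule are assumptions made for illustration.
    ```python
    # Soft-voting ensemble of small text CNNs (sizes and member count are illustrative).
    import torch
    import torch.nn as nn

    class TextCNN(nn.Module):
        def __init__(self, vocab=8000, emb=100, n_filters=64, n_classes=2):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.conv = nn.Conv1d(emb, n_filters, kernel_size=3)
            self.out = nn.Linear(n_filters, n_classes)

        def forward(self, words):                            # (batch, seq_len)
            x = self.embed(words).transpose(1, 2)
            h = torch.relu(self.conv(x)).max(dim=2).values   # max-over-time pooling
            return self.out(h)

    ensemble = [TextCNN() for _ in range(5)]                 # each member would see a resampled set
    clause = torch.randint(0, 8000, (1, 20))
    probs = torch.stack([m(clause).softmax(dim=-1) for m in ensemble]).mean(dim=0)
    print(probs)                                             # averaged "is emotion cause" probabilities
    ```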
  • Sentiment Analysis and Social Computing
    ZHANG Mi, ZHANG Hui, YANG Chunming, LI Bo, ZHAO Xujian
    2018, 32(2): 129-138,146.
    Opinion leaders in online social networks are users with great social influence. Current opinion leader mining methods consider only the topological structure of the social network and the attributes of nodes, neglecting the interaction that takes place during information diffusion. This paper proposes an opinion leader mining model based on the independent cascade model, named EIC (Extended Independent Cascade), which incorporates the network structure, node attributes and user behavior characteristics to build a weighted diffusion network. Experimental results on real data collected from Sina Weibo show that the proposed algorithm is superior to topology-based algorithms in the extended core rate of opinion leaders, without underestimating the scope of influence.
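    A toy sketch of ranking users by simulated spread under an independent cascade on a weighted diffusion network; the tiny graph and edge weights stand in for the combined structure, attribute and behavior scores, and the EIC model itself involves more than this simulation.
    ```python
    # Rank nodes by Monte-Carlo simulated spread under an independent cascade.
    import random
    import networkx as nx

    random.seed(0)
    G = nx.DiGraph()
    G.add_weighted_edges_from([("a", "b", 0.6), ("a", "c", 0.4),
                               ("b", "d", 0.5), ("c", "d", 0.3), ("d", "e", 0.7)])

    def simulate_spread(graph, seed, runs=1000):
        total = 0
        for _ in range(runs):
            active, frontier = {seed}, [seed]
            while frontier:
                nxt = []
                for u in frontier:
                    for v in graph.successors(u):
                        if v not in active and random.random() < graph[u][v]["weight"]:
                            active.add(v)
                            nxt.append(v)
                frontier = nxt
            total += len(active)
        return total / runs

    ranking = sorted(G.nodes, key=lambda n: simulate_spread(G, n), reverse=True)
    print(ranking)    # nodes with the largest expected cascades come first
    ```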
  • NLP Application
    MU Wanqing, LIAO Jian, WANG Suge
    2018, 32(2): 139-146.
    Parallelism is marked by compact structure, well-balanced clauses and strong expressiveness in all kinds of literary forms. In recent years, parallelism appreciation has also appeared as a question type in the Chinese college entrance examination, yet its automatic recognition has rarely been studied. In this paper, based on the similar syntactic structure and content relevance of parallel clauses, we design a method that combines a convolutional neural network with structural similarity to recognize parallelism. We first use word embeddings and part-of-speech vectors as the distributed sentence representation and apply multiple convolution kernels, realizing parallelism recognition based on a convolutional neural network. We then compute similarity over the part-of-speech strings of the clauses, implementing parallelism recognition based on structural similarity. Taking both the semantic relevance and the structural similarity of the sentences into account, we combine the two methods to recognize parallelism. The experimental results show that the proposed method is effective on a literature dataset and on datasets of literary reading materials from the Chinese college entrance examination.
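    A small sketch of the structure-similarity side only: clauses are compared by the edit similarity of their part-of-speech sequences, and a clause group whose pairwise similarities all clear a threshold is flagged as a parallelism candidate. The POS sequences, the SequenceMatcher-based measure and the threshold are placeholders, not the paper's similarity calculation.
    ```python
    # Structural similarity of clauses via their part-of-speech sequences (toy example).
    from difflib import SequenceMatcher
    from itertools import combinations

    def pos_similarity(pos_a, pos_b):
        return SequenceMatcher(None, pos_a, pos_b).ratio()

    def looks_parallel(clause_pos_seqs, threshold=0.7):
        pairs = list(combinations(clause_pos_seqs, 2))
        return all(pos_similarity(a, b) >= threshold for a, b in pairs)

    # Toy POS sequences for three clauses of a would-be parallel sentence.
    clauses = [["n", "v", "u", "a", "n"],
               ["n", "v", "u", "a", "n"],
               ["n", "v", "u", "a", "n"]]
    print(looks_parallel(clauses))   # True: the clause structures match closely
    ```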