2017 Volume 31 Issue 1 Published: 15 February 2017
  

  • Article
    ZHANG Dong; LI Shoushan; WANG Jingjing
    2017, 31(1): 1-7.
    Question classification aims at automatically classifying the types of questions, which is essential to most question answering systems. This paper proposes a method of semi-supervised question classification that jointly learns question and answer representations. It is featured by treating a question and its corresponding answer as a joint context for learning distributed word representations. Specifically, a neural network language model is introduced to learn question and answer representations jointly, so that the question word vectors carry additional information from the answers. Secondly, large numbers of unlabeled questions and answers participate in word vector learning, which strengthens the representation capacity of the question word vectors. Finally, we represent the questions by their word vectors as training samples, adopting a convolutional neural network to construct the question classifier. The experimental results demonstrate that the proposed semi-supervised question classification method with jointly learned representations makes full use of word vectors and unlabeled samples to improve performance, and outperforms other strong semi-supervised methods.
  • Article
    TAN Hongye; ZHAO Honghong; LI Ru
    2017, 31(1): 8-16.
    Reading comprehension is a research focus in natural language processing. In such systems, both answer extraction and sentence fusion are necessary for answering complex questions. This paper focuses on sentence fusion techniques for complex questions, and presents a method considering sentence importance, relevancy to the query, and sentence readability. The method first chooses the parts to be fused based on sentence division and word salience. Then, repeated contents are merged by word alignments. Finally, the sentences are generated via integer linear optimization, which utilizes dependency relations, the language model and word salience. Experiments on reading comprehension datasets from college entrance examinations achieve an F-measure of 82.62%.
  • Article
    YE Lei; GAO Shengxiang; YU Zhengtao; QIN Guangshun; HONG Xudong
    2017, 31(1): 17-22.
    This paper proposes a query expansion method based on an undirected graph of event elements, which utilizes the relevance between news event elements to conduct query expansion and improve news event retrieval. Firstly, we select the elements to be expanded by analyzing the relationship between candidate events and queries. Then, we construct an undirected graph to represent the extracted event elements and the relationships between them, and compute the edge weights through an event vector space. Finally, we compute the weight of each event element from the node weights of the undirected graph model, and expand the event elements according to the computed weights. Experiments on news event query expansion data show that the proposed method is effective for news event retrieval.
  • Article
    CHEN Zhipeng; CHEN Wenliang
    2017, 31(1): 23-30.
    Off-topic detection is important in automated essay scoring systems. Traditional methods compute the similarity between essays and compare it with a fixed threshold to tell whether an essay is off-topic. In fact, the essay score is heavily dependent on the type of topic; e.g., scores for a divergent topic range very differently from those for a non-divergent topic, which prevents a single fixed threshold from identifying off-topic essays reliably. This paper proposes a new off-topic detection method based on the divergence of essays. We study essay divergence and establish a linear regression model between divergence and threshold, so that our method assigns a dynamic threshold to each topic. Experimental results show that our method is more effective than baseline systems.
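The dynamic-threshold idea above can be sketched as a linear fit from topic divergence to threshold. All numbers, variable names, and the regression form below are hypothetical illustrations, not the paper's data or code:

```python
import numpy as np

# Hypothetical per-topic data: divergence scores and the off-topic
# similarity thresholds that worked best for those topics.
divergence = np.array([0.10, 0.25, 0.40, 0.55, 0.70])
best_threshold = np.array([0.62, 0.55, 0.47, 0.40, 0.32])

# Fit threshold = a * divergence + b by ordinary least squares.
A = np.vstack([divergence, np.ones_like(divergence)]).T
a, b = np.linalg.lstsq(A, best_threshold, rcond=None)[0]

def dynamic_threshold(topic_divergence):
    """Predict a topic-specific off-topic threshold for a new topic."""
    return a * topic_divergence + b

def is_off_topic(similarity, topic_divergence):
    """Flag an essay whose topic similarity falls below the dynamic
    threshold, instead of comparing against one fixed cutoff."""
    return similarity < dynamic_threshold(topic_divergence)
```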
  • Article
    LI Lishuang; JIANG Zhenchao; WAN Jia; HUANG Degen
    2017, 31(1): 31-40.
    Protein-Protein Interaction extraction (PPIE) is a significant topic in biomedical text mining. Most current research on PPIE is based on kernels and features. To further boost performance, this paper presents an improved instance representation model integrating word representations and a deep neural network. Meanwhile, the model incorporates feature selection, PCA and different kinds of classifiers, and finds the best combinations for PPI extraction. Experimental results show that the method is significantly better than other state-of-the-art methods on three public PPI corpora, AIMed, BioInfer and HPRD50, achieving F-scores of 70.5%, 82.2% and 80.0%, respectively.
  • Article
    HU Renfen
    2017, 31(1): 41-49.
    This paper discusses the automatic generation of four types of vocabulary test questions: word listening, multi-word selection, word order and single word selection. A knowledge base is built to extract word-level features including pronunciation, senses, grammar, collocations, learners' errors, etc. Sentence analysis modules are also developed for the automatic identification of grammatical constructions and the estimation of sentence difficulty. By selecting proper sentences, target words and distractors, 7263 vocabulary test questions are automatically generated in the experiment. Manual evaluation shows that the generation strategy performs well, with 58% of the questions judged completely reasonable. After slight manual modification, the question acceptance rate increases to 75.7%.
  • Article
    LI Bin; WEN Yuan; BU Lijun; QU Weiguang; XUE Nianwen
    2017, 31(1): 50-57.
    AMR is a new representation of the abstract meaning of a sentence, close to an interlingua. An English AMR corpus including The Little Prince has been released. The major differences between AMR and previous syntactic and semantic representations lie in two aspects: first, AMR uses a graph structure; second, it allows adding concept nodes that are omitted in the sentence. In this paper, we design a Chinese AMR annotation specification and construct the Chinese AMR corpus of The Little Prince, achieving an inter-annotator agreement (Smatch) of 0.83. A bilingual comparison shows that the graph structures of English and Chinese sentences are highly correlated, with about 40% of sentences having graph structure, though the added concept nodes differ. We also discuss AMR's ability to represent the semantic meaning of Chinese sentences, as well as its advantages in cross-language comparison.
  • Article
    YU Dong; ZHAO Yan; WEI Linxuan; XUN Endong
    2017, 31(1): 58-65.
    This paper presents a unified model for matrix-factorization-based word embeddings, and applies it to Chinese-English cross-lingual word embeddings. It proposes a method to determine cross-lingually relevant words on a parallel corpus. Both cross-lingual word co-occurrence and pointwise mutual information serve as pointwise relevance measures in the objective function for learning cross-lingual word embeddings. Experiments are carried out over different objective functions, corpora, and vector dimensions. For the task of cross-lingual document classification, the best model achieves 87.04% accuracy, adopting cross-lingual word co-occurrence as the relevance measure. In contrast, models adopting cross-lingual pointwise mutual information perform better on the cross-lingual word similarity task. Meanwhile, for English word similarity, experimental results show that our methods perform slightly better than English word embeddings trained by state-of-the-art methods.
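As one reading of the pointwise relevance measures mentioned above, cross-lingual PMI can be estimated from co-occurrence in aligned sentence pairs. The toy parallel corpus and the sentence-level counting scheme are our own illustration, not the paper's setup:

```python
import math
from collections import Counter

# Toy parallel corpus: (Chinese sentence tokens, English sentence tokens).
# Words co-occurring in an aligned pair are treated as cross-lingually relevant.
parallel = [
    (["猫", "喜欢", "鱼"], ["cats", "like", "fish"]),
    (["狗", "喜欢", "肉"], ["dogs", "like", "meat"]),
    (["猫", "吃", "鱼"], ["cats", "eat", "fish"]),
]

cooc = Counter()       # cross-lingual co-occurrence counts
zh_count = Counter()   # marginal counts, Chinese side
en_count = Counter()   # marginal counts, English side
pairs = 0

for zh, en in parallel:
    pairs += 1
    for w_zh in set(zh):
        zh_count[w_zh] += 1
        for w_en in set(en):
            cooc[(w_zh, w_en)] += 1
    for w_en in set(en):
        en_count[w_en] += 1

def pmi(w_zh, w_en):
    """Pointwise mutual information between a Chinese and an English word,
    estimated over aligned sentence pairs (assumes a nonzero joint count)."""
    p_joint = cooc[(w_zh, w_en)] / pairs
    p_zh = zh_count[w_zh] / pairs
    p_en = en_count[w_en] / pairs
    return math.log(p_joint / (p_zh * p_en))
```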
  • Article
    SUN Shichang; LIN Hongfei; MENG Jiana; LIU Hongbo
    2017, 31(1): 66-74.
    Transfer learning alleviates the data sparseness issue to some extent, but generalization capacity is still hindered by the negative-transfer problem. To address this issue, we propose an information granulation method for text corpora based on source domain structure. Interval granules are employed to express the influence of the source domain structure on the statistics of the dataset. We further design an Interval Type-2 fuzzy Hidden Markov Model (IHMM) to deal with the interval granules. Experiments on part-of-speech tagging prove that the proposed method avoids negative transfer and improves generalization capacity.
  • Article
    ZANG Jiaojiao; XUN Endong
    2017, 31(1): 75-83.
    This paper studies the automatic recognition of separable words from the perspective of Chinese information processing. It summarizes recognition rules and designs a recognition algorithm considering the separated forms derived from a large-scale corpus. After continuous optimization on a corpus of two billion words, the algorithm achieves 91.6% accuracy. Error analysis reveals that morphemes with strong word-formation ability, incorrect word segmentation and POS tagging, incomplete rules, and errors in the corpus account for most of the mistakes.
  • Article
    HE Baorong; QIU Likun; XU Dekuan
    2017, 31(1): 84-93.
    The ba-sentence is a typical Chinese sentence pattern. This paper proposes a rule-based method for automatic semantic role labeling, with a special focus on ba-sentences. Firstly, we collect a set of ba-sentences from our annotated semantic corpus, including texts from People's Daily, forming a sample gallery of ba-sentences. Then, we manually annotate the valence type of each predicate, and the syntactic and semantic structure type of each ba-sentence. Based on this annotated corpus, we analyze the rules of semantic formation and sum up several rules for semantic role labeling. Finally, we evaluate these rules on a test set, yielding an overall precision of 98.61%.
  • Article
    KANG Sichen; LIU Yang
    2017, 31(1): 94-101.
    Chinese word similarity computing plays an important role in Chinese information processing. Based on the notion of character orientation, Chinese semantic word-formation knowledge, including word POS, word-formation pattern and morphemic concepts, is employed to compute Chinese word similarity. This lexical knowledge representation is simple, intuitive and easy to expand, and the model is straightforward, with as few features and parameters as possible. Experimental results show that the approach is promising on the sampled word pairs; moreover, the similarity values are more in line with human cognition and present a reasonable distribution over the global data.
  • Article
    SUN Yuan; ZHAO Qian
    2017, 31(1): 102-111.
    To discover synchronized topics shared between Tibetan and Chinese social networks, we build an LDA topic model on a Tibetan-Chinese comparable corpus, with word2vec features as input and Gibbs sampling to estimate the model parameters. To align Tibetan and Chinese topics, we calculate the similarity between them according to the text-topic distributions, via a voting method based on cosine distance, Euclidean distance, Hellinger distance and KL distance.
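The distance-voting step can be sketched as below. The four measures are standard, but the voting scheme (a majority vote over each measure's nearest candidate) is only our reading of the abstract, and the topic distributions are toy data:

```python
import math

def cosine_sim(p, q):
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def hellinger(p, q):
    # Hellinger distance between two discrete probability distributions.
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

def kl(p, q, eps=1e-12):
    # Smoothed KL divergence; eps avoids log(0).
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q))

def align_topic(topic, candidates):
    """Each measure votes for its nearest candidate topic distribution;
    the candidate index with the most votes wins."""
    votes = [max(range(len(candidates)),
                 key=lambda i: cosine_sim(topic, candidates[i]))]
    for dist in (euclidean, hellinger, kl):
        votes.append(min(range(len(candidates)),
                         key=lambda i: dist(topic, candidates[i])))
    return max(set(votes), key=votes.count)
```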
  • Article
    ZHENG Yanan; ZHU Jie
    2017, 31(1): 112-117.
    Part-of-speech (POS) tagging is fundamental to Tibetan processing, with wide applications in Tibetan text classification, information retrieval, machine translation and other fields. This paper proposes a Tibetan POS tagging method based on distributed representations. First, the method extends the dictionary by semantic approximation according to the distributed representations. Then POS tagging is completed according to the dictionary and semantic similarity. Experimental results show that this method can expand the dictionary with better results.
  • Article
    BAI Shuangcheng
    2017, 31(1): 118-125.
    The Mongolian language model for text is challenged by characters that share a form but differ in code, owing to the different pronunciations of a character in various contexts. To address this issue for spelling input, this paper adopts a large dictionary with correct pronunciations and trains a statistical spelling model that derives the most likely pronunciation sequence directly from the candidate code sequence. Experiments indicate a more efficient spelling input method, which is also enlightening for “pronunciation-to-word” and “spelling-to-word” conversion.
  • Article
    Merhaba Eset; Azragul; Yusup Abaydulla
    2017, 31(1): 126-132.
    A sentiment vocabulary is essential for sentiment analysis. To deal with the inefficiency of manual acquisition, this paper proposes an extension of features based on the grammatical and contextual characteristics of Uyghur sentiment words. Combined with the TF-IDF measure, our algorithm is proved to effectively improve the recognition of sentiment words.
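The TF-IDF measure mentioned here, used to weight candidate sentiment words, reduces to a standard computation. The tokenized toy corpus below is hypothetical, and this is the textbook weighting rather than the paper's exact variant:

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights for every term in a small corpus of tokenized
    documents: term frequency times log inverse document frequency."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency counts each doc once
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return weights
```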
  • Article
    Alim Murat; Azragul; YANG Yating; LI Xiao
    2017, 31(1): 133-139.
    This paper studies the web-page identification task for Uyghur. It first develops character-encoding conversion rules for non-standard Uyghur characters in web pages. Then, two identification approaches are described: a modified N-gram method (MNG), and a feature vector method utilizing frequent Uyghur words in a VSM. The experimental datasets consist of three different types of Uyghur web pages. The results show that the N-gram based approach performs better in identifying web pages with long texts, as in news sites and forums, while the feature vector approach outperforms it on web pages with short texts. Combining the two methods yields an F1 score above 90% in the experiment.
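A minimal rank-order character N-gram identifier, broadly in the spirit of the N-gram approach described (the profiles, penalty value and toy texts are our own illustration, not the paper's MNG method or datasets):

```python
from collections import Counter

def ngram_profile(text, n=2, top=50):
    """Ranked character n-gram profile of a text: the `top` most
    frequent n-grams, most frequent first."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def out_of_place(profile, reference, max_penalty=50):
    """Rank distance between two profiles; smaller means more similar.
    Grams missing from the reference pay a fixed penalty."""
    pos = {g: i for i, g in enumerate(reference)}
    return sum(abs(i - pos[g]) if g in pos else max_penalty
               for i, g in enumerate(profile))

def identify(text, labeled_profiles):
    """Pick the label whose reference profile is closest to the text's."""
    doc = ngram_profile(text)
    return min(labeled_profiles,
               key=lambda label: out_of_place(doc, labeled_profiles[label]))
```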
  • Article
    LI Yang; GAO Daqi
    2017, 31(1): 140-146.
    Entity similarity is useful in many areas, such as recommendation in e-commerce platforms and patient grouping in healthcare. In our task of calculating entity similarity in a given knowledge graph, the attributes of every entity are provided, and a sample of entity pairs is provided with similarity scores. Therefore, we treat this task as a supervised learning problem, testing SVM, logistic regression, random forest, and learning-to-rank models.
  • Article
    WANG Ruibo; LI Jihong; LI Guochen; YANG Yaowen
    2017, 31(1): 147-154.
    Semantic role identification is an important task in semantic parsing according to Chinese FrameNet. Based on distributed representations of Chinese words, part-of-speech and other symbolic features, we build our semantic role identification model with a multi-feature-integrated neural network architecture. Due to the relatively small training corpus, we adopt dropout regularization to improve the training process. Experimental results indicate that (1) dropout regularization effectively alleviates over-fitting of our model, and (2) the F-measure increases by up to 7%. With further optimization of the learning rate and the pre-trained word embeddings, the final F-measure of our model reaches 70.54%, about 2% higher than the state-of-the-art result.
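The dropout regularization relied on above can be illustrated with a minimal inverted-dropout sketch (a generic illustration, not the authors' network code):

```python
import random

def dropout(activations, p=0.5, training=True):
    """Inverted dropout: during training, zero each unit with probability
    p and rescale the survivors by 1/(1-p), so no change is needed at
    test time. At test time the activations pass through unchanged."""
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if random.random() < keep else 0.0
            for a in activations]
```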
  • Article
    JIA Yuxiang; XU Hongfei; ZAN Hongying
    2017, 31(1): 155-161.
    Selectional preference describes the semantic preference of a predicate for its arguments. It is an important piece of lexical knowledge for the syntactic and semantic analysis of natural languages. Neural network models have achieved state-of-the-art performance in many natural language processing tasks. This paper deploys neural network models for selectional preference acquisition, including a one-hidden-layer feedforward network with pre-trained word vectors and a maxout network. In pseudo-disambiguation experiments on Chinese and English, both neural network models outperform an LDA-based selectional preference acquisition model.
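Pseudo-disambiguation, the evaluation used here, asks a model to score an observed predicate-argument pair above the same predicate with a randomly substituted argument. A scorer-agnostic sketch, with all data and the confounder-sampling details being our own assumptions:

```python
import random

def pseudo_disambiguation_accuracy(pairs, vocab, score, seed=42):
    """Fraction of observed (predicate, argument) pairs that `score`
    ranks above the same predicate paired with a random confounder
    argument drawn from `vocab`."""
    rng = random.Random(seed)
    correct = 0
    for pred, arg in pairs:
        confounder = rng.choice([w for w in vocab if w != arg])
        if score(pred, arg) > score(pred, confounder):
            correct += 1
    return correct / len(pairs)
```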
  • Article
    XIE Jun; HAO Jie; SU Jingqiong; ZOU Xuejun; LI Siyu
    2017, 31(1): 162-168.
    The joint topic and sentiment model aims at efficiently detecting topics and emotions in a given corpus. Faced with the sparsity of short texts and the lack of joint sentiment/topic analysis methods, this paper proposes the Biterm Joint Sentiment Topic Model (BJSTM). A sentiment layer is added to the Biterm Topic Model, forming a three-layer “sentiment-topic-term” Bayesian model. By sampling the sentiment and topic of each biterm, BJSTM depicts the word co-occurrence of the whole corpus and overcomes the sparsity of short texts to some extent. Experimental results show that BJSTM achieves better performance in both sentiment classification and topic extraction.
  • Article
    WU Dongyin; GUI Lin; CHEN Zhao; XU Ruifeng
    2017, 31(1): 169-176.
    Sentiment analysis is an important topic in natural language processing research. Most existing sentiment analysis techniques have difficulty handling domain dependence and sample bias, which restrains the development and application of sentiment analysis. To address these issues, this paper presents a sentiment analysis approach based on deep representation learning and Gaussian process transfer learning. Firstly, distributed representations of text samples are learned with a deep neural network. Next, based on deep Gaussian processes, the approach selects from an additional dataset quality samples whose distribution is similar to the testing dataset, expanding the training dataset. The sentiment classifier trained on the expanded dataset is expected to achieve higher performance. Experimental results on the COAE2014 dataset show that the proposed approach improves sentiment classification performance and alleviates the influence of training sample bias and domain dependence.
  • Article
    Rexidanmu Tuerhongtai; Wushour Silamu; Yierxiati Tuergong
    2017, 31(1): 177-183.
    With the development of the Internet, a large number of online Uyghur texts have appeared, which demands sentiment analysis for various applications. Considering that there is neither enough training data nor a complete sentiment lexicon for Uyghur sentiment analysis, this paper combines a lexicon-based method with a corpus-based method, proposing LCUSCM (Lexicon-based and Corpus-based Uyghur Text Sentiment Classification Model). It first classifies texts using a manually built Uyghur sentiment dictionary, enriching the lexicon incrementally in this process. Then, reliably classified sentences are selected to train a classifier, which refines the results of the first step. The accuracy of the hybrid method is 9.13% higher than the machine learning method alone, and 1.82% higher than the lexicon-based method.
  • Article
    REN Han; FENG Wenhe; LIU Maofu; WAN Jing
    2017, 31(1): 184-191.
    This paper introduces an approach to textual entailment recognition based on language phenomena. The approach adopts a joint classification model for language phenomenon recognition and entailment recognition, learning the two highly related tasks together and avoiding the error propagation of a pipeline strategy. For language phenomenon recognition, 22 specific and 20 general features are employed. To enhance the generalization of the random forest, a feature selection method is adopted when building its trees. Experimental results show that the joint classification model based on random forest recognizes language phenomena and entailment relations effectively.
  • Article
    ZHONG Yu; FEI Dingzhou
    2017, 31(1): 192-204.
    This paper presents a new method to identify personality via dimension reduction with sparse principal component analysis (SPCA). Based on the categories of the Linguistic Inquiry and Word Count (LIWC) dictionary, informal word usage and psychological traits in instant chat are analyzed, and the relation between informal words and personality is described. The Biterm Topic Model (BTM), a psychological distance questionnaire and a Big Five personality questionnaire are used to measure personality and related variables. The informal word dimensions are explained based on the simplified Chinese version of LIWC and cognitive linguistic usage. It is shown that the numbers of load factors obtained by SPCA are more stable than those of traditional principal component analysis (PCA), and the cumulative explained variance is better (24.54% > 23.40%). With respect to the six dimensions, “subjective evaluation” was positively related to agreeableness (r=.16, p=.03<0.05), “casual socializing” was negatively related to agreeableness (r=-.16, p=.03<0.05), while “cognitive pleasure” and gender were significantly positively related (r=.43, p=.00<0.001). These results suggest that SPCA performs better than PCA for dimension reduction in the studied issues.
  • Article
    CHEN Zhenning; CHEN Zhenyu
    2017, 31(1): 205-211.
    Cluster analysis is the task of grouping a set of objects by their associations. The basis of cluster analysis and association analysis is the similarity measure, which often involves absolute similarity with the symmetry property. But most rules found in natural languages are inclined and have asymmetrical forms. We describe the asymmetrical association by a probability entailment parameter, i.e., the conditional probability, to represent asymmetrical associations among features. We then define the Domination Relation, the Tight Relation, the Control Center, and the Midway Island. A clustering strategy based on inclined similarity measures is presented to deal with issues like false isolated points, data sparsity and family iconicity.
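The probability-entailment parameter, i.e. conditional probability as an asymmetric association, reduces to a simple ratio of co-occurrence counts. The counts below are hypothetical, and the "domination" comment is only our reading of the abstract:

```python
def probability_entailment(count_ab, count_a):
    """P(B|A): the fraction of contexts containing feature A that also
    contain feature B. Asymmetric: P(B|A) != P(A|B) in general."""
    return count_ab / count_a

# Hypothetical co-occurrence counts over a corpus of contexts:
count_a, count_b, count_ab = 40, 200, 36

p_b_given_a = probability_entailment(count_ab, count_a)  # 36/40 = 0.9
p_a_given_b = probability_entailment(count_ab, count_b)  # 36/200 = 0.18
# A strongly entails B but not the reverse: the association is inclined,
# which a symmetric similarity measure cannot express.
```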
  • Discourse Annotation and Reasoning
    FENG Wenhe; GUO Haifang; LI Yujing; REN Han
    2017, 31(1): 212-220.
    By labeling the discourse structure of Shishuoxinyu, we study connectives in their explicit and implicit forms, semantic meanings, and usages. It is revealed that: (1) implicit relations (3 346 instances, 81.4%) outnumber explicit relations (786 instances, 18.6%), and only 3 of the 17 relation types (namely hypothesis, selection and concession) are more often explicit; (2) the types and usages of synonymous connectives differ across relations, with a maximum of 36 types for “continuation” and none for “summary-elaboration” and “background”; (3) of the 90 connectives, 55 are monosemous while the remaining 35 are polysemous. It is also found that there are some differences in the use of connectives between Shishuoxinyu and Wenxin Diaolong, both written in the same period.
  • Article
    WANG Jing; YANG Lijiao; JIANG Hongfei; SU Jingjie; FU Jingling
    2017, 31(1): 221-229.
    In the field of teaching Chinese as a second language, vocabulary teaching is very important, in which polysemous words are a challenging issue. After a survey of three classical vocabularies in this field, this paper selects 1 181 polysemous words. An annotation specification is then designed with reference to the Modern Chinese Dictionary (6th Edition). Tagging the 1 181 words as they appear in 197 popular Chinese textbooks yields a word-sense-annotated corpus of over 3.5 million characters. A quantitative study of the 1 181 polysemous words is also made, with an analysis of the distribution of the 4 323 word senses in total.
  • Language Resources Construction
    ZHAN Weidong
    2017, 31(1): 230-238.
    Although construction grammar has already drawn much attention in the field of Chinese linguistics, scholars in natural language processing are rarely concerned with this approach to the syntactic and semantic parsing of Chinese. This paper proposes a linguistic engineering project to build a knowledge database of Chinese constructions. Some key issues in this task are discussed, including the differences between constructions and classical grammar units, the formalism of constructions, and the types and features of constructions.