2018 Volume 32 Issue 5 Published: 15 May 2018
  

  • Survey
    Tuergun Ibrahim, Kahaerjiang Abiderexiti, Aishan Wumaier, Maihemuti Maimaiti
    2018, 32(5): 1-13,21.
    This paper reviews progress in natural language processing for Turkish, Kazakh, and other Central Asian languages that belong to the same language family. First, work on morphological analysis, syntactic analysis, named entity recognition, and machine translation is reviewed. Then language-independent methods for the morphological analysis of agglutinative languages are discussed. Finally, the problems and challenges of Central Asian language processing at home and abroad are summarized, and directions for future study are suggested.
  • Language Analysis and Calculation
    ZHOU Xiaoqiang, WANG Xiaolong, CHEN Qingcai
    2018, 32(5): 14-21.
    Interactive question answering (iQA) is a form of information exchange that is conversational, continuous, and correlated. The relation structure of iQA reflects the contextual associations of the interactive scenario at different linguistic levels. This paper summarizes the dialogue acts and sentence relations of iQA and presents a relation structure taxonomy based on that analysis. To verify the soundness of the taxonomy, we conduct annotation experiments, covering both dialogue acts and contextual sentence relations, on an iQA corpus collected from the real world. We use a Hidden Markov Model (HMM) to analyze the transition patterns of dialogue acts, and we characterize the sentence relation structure through statistical analysis.
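    As a rough illustration of how transition patterns between dialogue acts can be estimated from annotated conversations, the sketch below counts label bigrams and normalizes them into transition probabilities. This is only the supervised, maximum-likelihood estimate of the state-transition part of a Markov model; it is not the authors' implementation, and the act labels (`question`, `answer`, `clarify`) and the function name `transition_probs` are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probs(sequences):
    """Estimate P(next_act | previous_act) from labeled dialogue-act sequences.

    sequences: iterable of lists of dialogue-act labels, one list per dialogue.
    Returns a nested dict: probs[previous_act][next_act] -> probability.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        # Count every adjacent pair (previous act, next act) in the dialogue.
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    probs = {}
    for prev, ctr in counts.items():
        total = sum(ctr.values())
        # Normalize the counts for each previous act into a distribution.
        probs[prev] = {nxt: c / total for nxt, c in ctr.items()}
    return probs


# Hypothetical annotated dialogues (labels are illustrative only).
dialogues = [
    ["question", "answer", "question", "answer"],
    ["question", "clarify", "answer"],
]
table = transition_probs(dialogues)
```

    Acts that never appear as a predecessor simply have no row in the table; a fuller treatment would add smoothing and an explicit start state.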
  • Language Analysis and Calculation
    LIU Zuoguo, CHEN Xiaorong
    2018, 32(5): 22-30.
    This paper presents an Entity-Action Relationship Model (EARM) for text clustering, designed to describe Chinese semantic entities and behaviors. Since Chinese is a non-inflectional language, there is no simple one-to-one relationship between word properties and syntax elements at the surface level. A syntax element recognition mechanism is designed to recognize entities and actions according to word properties and positional characteristics. EARM is then built according to sentence patterns to describe the behaviors and states of entities. Complex sentences, such as nested sentences, are decomposed by action layer and simplified into simple sentences so that Entity-Action Relationships can be mined during syntax analysis. For omission and inversion in the syntax, a recognition mechanism is designed to move entities and reorder sentences by matching inverted sentences against similar sentence patterns. Maximum common subgraphs of syntax trees are introduced to calculate text similarity and perform clustering. Finally, experiments show that EARM is accurate and effective and that the clustering results are reasonable.
  • Language Resources Construction
    ZHANG Yinbing, SONG Jihua, PENG Weiming, ZHAO Yawei, SONG Tianbao
    2018, 32(5): 31-41.
    This paper puts forward a method for converting a phrase structure treebank into a sentence structure treebank. Taking the Tsinghua Constituent Treebank (TCT) as the test corpus, we implement the conversion between the two structures and display both with a scalable visualization system. This study enlarges the scale of the Chinese sentence structure treebank, which will support follow-up research on it.
  • Machine Translation
    Wunier, Suyila, LIU Wanwan, REN Qingdaoerji
    2018, 32(5): 42-48.
    Machine translation based on recurrent neural networks (RNNs) is gradually replacing statistical machine translation, especially between the world's major languages. Because Mongolian corpora are scarce, a method of Mongolian-Chinese machine translation based on a convolutional neural network (CNN) is proposed. When the source corpus is encoded, the pooling layer allows the CNN to capture the semantic relations and information in a sentence according to the characteristics of Mongolian word formation. Experimental results show that the method outperforms RNN-based neural machine translation (NMT) in both translation quality and training speed.
  • Ethnic Language and Cross Language Information Processing
    HUANG Xiaohui, LI Jing
    2018, 32(5): 49-55.
    A recurrent neural network and the connectionist temporal classification (CTC) algorithm are applied to acoustic modeling for Tibetan speech recognition to achieve end-to-end model training. Based on the relationship between the input and output of the acoustic model, a time-domain convolution over the output sequence of the hidden layer is introduced to reduce the time-domain expansion of the network's hidden layers. Experimental results show that the recurrent neural network model achieves better performance in Lhasa Tibetan phoneme recognition than traditional acoustic models based on the Hidden Markov Model, and that the recurrent acoustic model with time-domain convolution trains and decodes more efficiently while maintaining the same recognition performance.
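    For readers unfamiliar with connectionist temporal classification, the sketch below shows the standard rule that maps a frame-level label sequence to an output sequence: merge consecutive repeats, then drop the blank symbol. It illustrates only this output convention under greedy (best-path) decoding, not the acoustic model described in the abstract; the function name `ctc_collapse` and the choice of `0` as the blank index are assumptions.

```python
def ctc_collapse(frame_labels, blank=0):
    """Collapse per-frame labels into an output sequence using the CTC rule.

    frame_labels: list of integer labels, one per acoustic frame
                  (for example, the argmax of the network output per frame).
    blank: index reserved for the CTC blank symbol (assumed to be 0 here).
    """
    output = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            output.append(label)
        previous = label
    return output


# A blank between two identical labels keeps them as separate outputs.
phonemes = ctc_collapse([0, 1, 1, 0, 1, 2, 2, 0])
```

    Because repeats are merged before blanks are removed, the blank is what allows the same phoneme to be emitted twice in a row.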
  • Ethnic Language and Cross Language Information Processing
    QIN Yue, YU Long, TIAN Shengwei, FENG Guanjun, Turgun Ibrahim, Askar Hamdulla, ZHAO Jianguo
    2018, 32(5): 56-64.
    Using deep learning, this paper applies a Stacked Denoising Autoencoder (SDAE) to Uyghur zero pronoun anaphora resolution. First, word embeddings trained on a large-scale unlabeled Uyghur corpus are used as semantic features of candidate antecedents and zero pronouns. Second, based on the characteristics of Uyghur, we extract 14 hand-crafted features for zero pronoun resolution. Experimental results show that, compared with a Stacked Autoencoder (SAE), SVM, and ANN, SDAE increases the F value by 4.450%, 10.032%, and 8.140%, respectively.
  • Ethnic Language and Cross Language Information Processing
    HU Wei, YU Long, TIAN Shengwei, Turgun Ibrahim, FENG Guanjun, Askar Hamdulla
    2018, 32(5): 65-73.
    Accompanying relationships between events are common in the Uyghur language. This paper proposes a method based on a Deep Belief Network (DBN) to identify the accompanying relationship between Uyghur events. Based on the characteristics of the Uyghur language, this paper extracts 12 features from event structure information; it also applies word embeddings to calculate the semantic similarity between the two trigger words. Experiments show that the precision, recall, and F value of the proposed method reach 81.89%, 84.32%, and 82.48%, respectively, outperforming a Support Vector Machine (SVM).
  • Information Extraction and Text Mining
    CUI Tongtong, CUI Rongyi
    2018, 32(5): 74-79.
    The era of networks and big data has enriched the information resources in cyberspace. However, the diversity and rapid growth of data put pressure on the storage and effective use of those resources. This paper presents a text fingerprint extraction method based on latent semantic analysis. The method yields a compressed representation of the data and addresses the lack of semantics in current fingerprint extraction methods. The latent semantic features of documents are obtained using singular value decomposition, which transforms the original document vector space into the corresponding latent semantic space. Then, following the random hyperplane principle, each document in that space is transformed into a binary digital fingerprint, and the difference between fingerprints is measured by Hamming distance. The method is verified by similarity and clustering experiments on academic literature from CNKI. The experimental results show that the method characterizes the semantic information of documents better, with an accurate and effective compressed representation.
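    The pipeline described in this abstract (singular value decomposition, random hyperplanes, Hamming distance) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `lsa_fingerprints` and `hamming`, the parameters `k`, `n_bits`, and `seed`, and the toy term-document matrix are all hypothetical, and term weighting (for example TF-IDF) is left out.

```python
import numpy as np


def lsa_fingerprints(term_doc, k, n_bits, seed=0):
    """Return an (n_docs, n_bits) array of 0/1 fingerprint bits.

    term_doc: (n_terms, n_docs) term-document weight matrix.
    k: number of latent dimensions kept after SVD (assumed parameter).
    n_bits: fingerprint length, i.e. the number of random hyperplanes.
    """
    # Thin SVD: term_doc = U @ diag(s) @ Vt, singular values in descending order.
    _, s, vt = np.linalg.svd(term_doc, full_matrices=False)
    # Coordinates of each document in the k-dimensional latent space (one row per doc).
    docs_latent = (np.diag(s[:k]) @ vt[:k]).T
    # Random hyperplanes through the origin, one per fingerprint bit.
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((k, n_bits))
    # Each bit records on which side of a hyperplane the document lies.
    return (docs_latent @ planes >= 0).astype(np.uint8)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return int(np.count_nonzero(a != b))


# Toy matrix: documents 0 and 1 are identical, document 2 uses different terms.
term_doc = np.array(
    [[2.0, 2.0, 0.0],
     [1.0, 1.0, 0.0],
     [0.0, 0.0, 3.0],
     [0.0, 0.0, 1.0]]
)
fps = lsa_fingerprints(term_doc, k=2, n_bits=16)
same = hamming(fps[0], fps[1])   # identical documents share a fingerprint
other = hamming(fps[0], fps[2])  # unrelated documents typically differ in many bits
```

    With random hyperplanes, the expected fraction of differing bits grows with the angle between two documents in the latent space, which is why Hamming distance can stand in for semantic dissimilarity.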
  • Information Extraction and Text Mining
    GAO Hengde, WANG Zhiqiang, LI Ru
    2018, 32(5): 80-88,96.
    The user's word features obtained from text are the basis for user theme modeling, interest mining, and personalized recommendation. To derive word features for cold-start users who have little text, this paper presents a method that merges the user's trust relations with word correlations. Specifically, we combine the user trust relation matrix, the word correlation matrix, and the feature word matrix via probabilistic matrix factorization. Experimental results on four data sets from Sina microblog and Twitter show that the proposed algorithm achieves better results.
  • Information Retrieval and Question Answering
    XIONG Ling, XU Zengzhuang, WANG Xiaobin, HONG Yu, ZHU Qiaoming
    2018, 32(5): 89-96.
    The goal of Slot Filling (SF) is to extract certain attribute values of a given entity (the query) from a large-scale corpus. Entity search, an important component of SF, retrieves documents that refer to the given entity so that other components can extract attribute values from them. In contrast to existing entity search based on Boolean logic, we propose an entity search model based on cross-document coreference resolution (CDCR). CDCR improves the precision of the retrieval results by filtering out documents that contain no mentions referring to the given entity. To minimize the loss of recall during filtering, we introduce pseudo-relevance feedback to augment the information about the given entity. Experimental results show that our model outperforms the baseline, increasing precision and F1 score by 5.63% and 2.56%, respectively.
  • Sentiment Analysis and Social Computing
    LI Xin, LI Yang, WANG Suge
    2018, 32(5): 97-104.
    In text sentiment analysis, unsupervised clustering methods suffer from low precision. To improve the text similarity measure that is key to clustering, this paper proposes a semantic subspace (RESS) method that addresses the high dimensionality and sparseness of sentiment text representations. It also helps to capture implicit expressions of sentiment. Experimental results show that RESS effectively reduces the features of the data set and generates better results.
  • Sentiment Analysis and Social Computing
    ZHANG Shaowu, SHAO Hua, LIN Hongfei, YANG Liang
    2018, 32(5): 105-113.
    With the explosive growth of networks, online public opinion has become an issue that cannot be neglected. A typical example is the attention paid to the violent incidents in Xinjiang in recent years. To examine the corresponding trends in public opinion, this paper investigates the key words of topics and the changes in both topic strength and topic content. On crawled news about the Xinjiang violence from January 2013 to December 2015, we apply a dynamic topic model (DTM) that generates topics by applying NMF twice. We compare the model with HDP and reveal some of its properties through visual analysis.
  • NLP Application
    LIU Gang, FU Weiping, MA Yingge
    2018, 32(5): 114-127.
    This article constructs a complex network system composed of a micro complex network, a meso complex network, and a macro complex network. The micro, meso, and macro policy networks are built, respectively, with an improved semantics-based method for calculating similarity between policy words, a dependent clause analysis method, and a method based on the Vector Space Model. On the basis of this policy network system, the article extracts the hierarchy of the macro network, clears away the policy fragments in the network, and constructs an ordered policy forest. It then introduces a policy anti-fragmentation method based on the forest. The experimental results indicate that the proposed methods can solve the problem of policy fragmentation effectively.
  • NLP Application
    MA Hongchao, ZHANG Kunli, ZHAO Yueshu, ZAN Hongying, ZHUANG Lei
    2018, 32(5): 128-136.
    Information extraction and assisted diagnosis for obstetric electronic medical records (EMRs) are of great significance for improving the fertility level of the population. Since the admitting diagnosis in the first course record of an EMR is reasoned from the chief complaints, auxiliary examinations, physical examinations, and so on, we treat the diagnostic process as a multi-label classification problem. Features extracted by LDA and the numerical features of medical records are fused into new features by vector merging, and RAkEL, MLkNN, CC, and BP-MLL are used for multi-label classification. The experimental results show that the proposed method can improve assisted diagnosis for Chinese obstetric electronic medical records.
  • NLP Application
    HU Guoping, ZHANG Dan, SU Yu, LI Jia, LIU Qingwen, WANG Rui
    2018, 32(5): 137-146.
    In online learning systems, a fundamental task for offering students better learning services is predicting the knowledge points of questions, i.e., the knowledge concepts or skills a question involves. Existing methods usually rely on human labeling or traditional machine learning; they are either labor intensive or focus only on shallow features without capturing the deep semantic relations between questions and knowledge points. In this paper, we propose an Expertise-enriched Convolutional Neural Network (ECNN) to predict questions' knowledge points. Specifically, we first define and extract question features under the guidance of educational experience. Then we use a convolutional neural network to learn question representations from a deep semantic perspective. After that, considering the relations between questions and expertise priors, we develop an attention-based method for calculating the importance of expertise for questions. Finally, we design an objective function for model learning that constrains both knowledge points and semantics. Extensive experiments on a large-scale dataset demonstrate the effectiveness of the proposed model and its value in application.