2017 Volume 31 Issue 5 Published: 16 October 2017
  

    Language Analysis and Calculation
  • Language Analysis and Calculation
    QIU Likun, ZHAO Hui, YU Shiwen, ZHU Xuefeng
    2017, 31(5): 1-7,20.
    Part-of-speech annotation has attracted extensive attention in Chinese information processing, Chinese grammar studies, and Chinese lexicography. Multiple part-of-speech systems have been proposed, and they differ significantly from one another. So far, little research has systematically compared different large-scale part-of-speech annotations. Based on the part-of-speech annotations in the Dictionary of Contemporary Chinese and the Grammatical Knowledge-Base Dictionary, this paper proposes a mapping algorithm that automatically detects part-of-speech differences between the two dictionaries. Further, we analyze the differences and draw two conclusions: 1) about 83.5% of the part-of-speech annotations are identical; and 2) all the differences can be attributed to three factors: part-of-speech shifting, different part-of-speech annotation standards, and different senses.
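
As a rough illustration of the comparison described in this abstract, the sketch below maps the tags of one dictionary onto the tagset of another and reports the words whose tag sets still disagree. The abstract does not specify the mapping algorithm, so the function name `compare_pos`, the tag names, and the toy entries are all assumptions made for illustration.

```python
def compare_pos(dict_a, dict_b, tag_map):
    """Compare POS tag sets of the words shared by two dictionaries.

    dict_a and dict_b map a word to a set of tags; tag_map (hypothetical)
    translates tags of dictionary A into the tagset of dictionary B.
    Returns (agreement ratio, {word: (mapped A tags, B tags)}).
    """
    shared = sorted(dict_a.keys() & dict_b.keys())
    differences = {}
    for word in shared:
        mapped = {tag_map.get(tag, tag) for tag in dict_a[word]}
        if mapped != set(dict_b[word]):
            differences[word] = (mapped, set(dict_b[word]))
    agreement = 1 - len(differences) / len(shared) if shared else 0.0
    return agreement, differences


# Toy entries with made-up tags; not taken from either dictionary.
dict_a = {"run": {"verb"}, "light": {"noun", "adj"}, "fast": {"adj"}, "only_a": {"noun"}}
dict_b = {"run": {"v"}, "light": {"n"}, "fast": {"a"}}
tag_map = {"verb": "v", "noun": "n", "adj": "a"}

agreement, differences = compare_pos(dict_a, dict_b, tag_map)
```

Here two of the three shared words agree after mapping, and only "light" is flagged as a difference.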
  • Language Analysis and Calculation
    LIU Rui, SUN Bize, LONG Yunfei, WANG Shan
    2017, 31(5): 8-13.
    Building on previous studies of word frequency and frequency rank, this paper analyzes the frequency rank difference (FRD) from the perspective of quantitative lexical analysis. It shows that, for the words common to two texts, the FRDs are distributed symmetrically and cluster around the median. The distribution is “two-tailed”: flat in the middle and curving at both ends. Three lexical levels (middle, downward end, and upward end) are identified from the FRD distributions. The middle lexicon reflects the characteristics the two texts share, while the lexicon at either end reflects each text's distinctive features. These features are linguistically significant in that they reflect the thematic content and stylistic features of the texts.
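
A minimal sketch of how a frequency rank difference could be computed for the words two texts share. The abstract does not give the exact definition, so this assumes that rank 1 is the most frequent word, that ties are broken alphabetically, that ranks are taken over each text's full vocabulary, and that FRD is the rank in the first text minus the rank in the second.

```python
from collections import Counter


def frequency_ranks(tokens):
    """Map each word to its frequency rank (1 = most frequent, ties alphabetical)."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ordered, start=1)}


def rank_differences(tokens_a, tokens_b):
    """FRD of every word common to both texts: rank in A minus rank in B."""
    ranks_a = frequency_ranks(tokens_a)
    ranks_b = frequency_ranks(tokens_b)
    common = ranks_a.keys() & ranks_b.keys()
    return {w: ranks_a[w] - ranks_b[w] for w in common}


text_a = "the cat sat on the mat the cat slept".split()
text_b = "the dog sat on the log and the dog barked on".split()
frd = rank_differences(text_a, text_b)
```

Words with an FRD near zero would fall in the "middle" level, while large positive or negative values would fall in the two ends.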
  • Language Analysis and Calculation
    NIU Changwei, CHENG Bangxiong
    2017, 31(5): 14-20.
    There are at least three interpretations of wh-phrases in Mandarin Chinese: interrogative reference, existential reference, and universal reference. Taking shenme as an example, this paper proposes a rule-based approach to recognizing its interpretation in different syntactic contexts. After its preferred reference in complex syntactic contexts is tested, a semantic recognition model of shenme is built and revised through experiments.
  • Language Analysis and Calculation
    LIN Ziqi, NI Wancheng, ZHAO Meijing, YANG Yiping
    2017, 31(5): 21-31,49.
    This paper analyzes the double-object phrase, a special linguistic phenomenon, from syntactic and semantic perspectives, and presents a semantic expression model for double-object phrases based on the Conceptual Knowledge Tree (CKT). It also proposes a method for analyzing double-object phrases that automatically translates them into this semantic expression model. First, in a top-down manner, the method divides a double-object phrase into three syntactic parts: the double-object verb, the direct object, and the indirect object. Then, in a bottom-up manner, it uses the CKT to perform inference on these three parts and obtain their semantic expressions. An experiment on a dataset of 122 double-object verbs and 209 phrases selected from authoritative literature and grammar dictionaries yields an accuracy of 90.43%.
  • Language Analysis and Calculation
    ZHU Shuqin, PENG Weiming, SONG Jihua, GUO Dongdong
    2017, 31(5): 32-39.
    For the purpose of international Chinese teaching, this paper introduces the sentence-focused Diagrammatic Treebank, which preserves the integrity of sentence pattern structure for grammar teaching. Based on a thorough analysis of the Treebank structures, Chinese sentence pattern instances are summarized from each parse via a hierarchical extraction strategy. The result is a Chinese sentence pattern instance bank consisting of basic sentence patterns and complex sentence patterns. This approach paves the way for developing Chinese sentence pattern instances from a small-scale Treebank and enables the practical application of the Diagrammatic Treebank in international Chinese teaching.
  • Language Analysis and Calculation
    LV Guoying, SU Na, LI Ru, WANG Zhiqiang
    2017, 31(5): 40-49.
    Discourse coherence is an important issue in discourse analysis. Based on the Chinese FrameNet (CFN), this paper presents a coherence description scheme for Chinese discourse. It establishes the relationship between frames and discourse units, and discusses how discourse coherence is achieved through frames and the semantic relationships between them. This provides a description mechanism and a computational basis for discourse coherence. Annotation of 160 articles selected from the People's Daily shows a kappa value above 0.8 for both discourse structure annotation and discourse relation annotation. This indicates that the proposed scheme supports highly consistent manual annotation, which is crucial for larger-scale discourse annotation.
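
For readers unfamiliar with the agreement measure reported here, the sketch below computes Cohen's kappa for two annotators, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The abstract does not say which kappa variant was used, so the choice of Cohen's kappa is an assumption, and the relation labels are invented.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical discourse relation labels for ten discourse units.
annotator_1 = ["cause"] * 5 + ["contrast"] * 5
annotator_2 = ["cause"] * 4 + ["contrast"] * 6
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 9 of 10 items, chance agreement is 0.5, and kappa comes out at about 0.8.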
  • Machine Translation
    LIU Mengyi, YAO Liang, HONG Yu, LIU Hao, YAO Jianmin
    2017, 31(5): 50-58.
    Research on domain adaptation (DA) for statistical machine translation (SMT) aims to dynamically adjust the translation model to ensure balanced and reliable translation quality across domains. Existing research on translation model adaptation has made remarkable progress but neglects the reordering issue. This paper investigates the translation samples in a large-scale source bilingual corpus and finds that 36.17% of the samples exhibit clear word order differences in phrase-level translation pairs. We therefore propose a domain-adaptive reordering model that fuses topic information to capture the reordering differences of phrases under different topic distributions. Experimental results show that translation systems with the adaptive reordering model yield clear performance improvements.
  • Other Language in/around China
    CAI Rangzhuoma, CAI Zhijie
    2017, 31(5): 59-63.
    In a corpus-based text-to-speech system, unit selection directly affects the quality of the synthesized speech. By analyzing the features of the Tibetan language, this paper proposes a hybrid strategy that mixes components, characters, words, and sentences, together with a corpus-based unit selection algorithm for Tibetan speech synthesis. Subjective and objective evaluation results indicate that the algorithms are effective and that both the unit coverage and the synthesized speech reach the expected targets.
  • Other Language in/around China
    WANG Weilan, LU Xiaobao, CAI Zhengqi, SHEN Wentao, FU Ji, CAIKE Zhaxi
    2017, 31(5): 64-73.
    Tibetan-Sanskrit includes more than 500 Tibetan characters and more than 6000 Sanskrit characters. Because the character set is large, collecting online handwriting samples is a large and complex project. We present a method for generating online handwritten character samples for Tibetan-Sanskrit based on component combination. The method has four main parts: (1) determining the Tibetan-Sanskrit character set and component set; (2) obtaining location information for Tibetan-Sanskrit characters; (3) collecting online handwritten samples of the component set; and (4) generating a sample database of the online handwritten Tibetan-Sanskrit character set. This provides character training and test sample sets for online handwritten Tibetan-Sanskrit.
  • Other Language in/around China
    FAN Daoerji, GAO Guanglai, WU Huijuan
    2017, 31(5): 74-80.
    Hidden Markov Models (HMMs) have strong modeling capabilities for sequence data and are widely used in speech recognition and handwriting recognition tasks. HMM-based Mongolian handwriting recognizers require the data to be analyzed sequentially. Given Mongolian word formation and writing style, a Mongolian word consists of graphemes seamlessly connected from top to bottom. Selecting the graphemes and segmenting words into graphemes are preliminary steps for handwriting recognition that have a substantial effect on recognition accuracy. In this paper, drawing on knowledge of syllables and coding, we collect a Mongolian letter set of 1,171 letters. A long grapheme set containing 378 graphemes is then extracted from the letter set by a correlation process and an HMM-based sorting method. A short grapheme set containing 50 shapes is extracted from the long grapheme set by manually decomposing the long graphemes. We present an algorithm that decomposes a word into graphemes through a two-layer mapping. Experimental results show that the short graphemes achieve better performance than the long graphemes.
  • Other Language in/around China
    CUI Rongyi, ZHAO Xue
    2017, 31(5): 81-84,91.
    This paper aims to verify Zipf's law in the Korean language. First, the statistical distribution of two linguistic units, words and letters, is investigated on a massive Korean text corpus. Then the least squares method is used to fit the rank-frequency distribution curve of words in Korean text. Finally, the parameter of Zipf's law is estimated. The experimental results show that the relationship between frequency and rank of both linguistic units follows Zipf's law in Korean.
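
The least squares step can be sketched as an ordinary linear regression in log-log space. Under Zipf's law f(r) = C / r^alpha, so log f = log C - alpha * log r, and the slope of the fitted line gives -alpha. The function name `fit_zipf` and the synthetic data are assumptions for illustration; the abstract does not state which form of the law or which fitting details the authors used.

```python
import math


def fit_zipf(frequencies):
    """Least-squares fit of log f = log C - alpha * log r; returns (alpha, C).

    Expects at least two positive frequencies.
    """
    freqs = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


# Synthetic frequencies that follow f = 1000 / r exactly, so alpha should be 1.
synthetic = [1000 / r for r in range(1, 51)]
alpha, c = fit_zipf(synthetic)
```

On real word counts the fit would be applied to the sorted frequency list of the corpus, and the closeness of alpha to 1 would indicate how well the data follow the law.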
  • Other Language in/around China
    S Loglo
    2017, 31(5): 85-91.
    Automatic identification and annotation of fixed phrases are essential to Mongolian text processing. On the basis of the “Mongolian Fixed Phrase Grammatical Information Dictionary”, this paper designs and implements an algorithm for Mongolian fixed phrase recognition and labeling based on finite state automata and rules. Experiments reveal a recognition rate of more than 90% and an average processing speed of 0.005 milliseconds per word.
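
One common way to realize dictionary-driven phrase recognition with a finite state automaton is a token-level trie scanned with longest-match semantics. The sketch below shows that idea only; it is not the authors' algorithm, it omits their rules, and the English phrases stand in for dictionary entries.

```python
_END = object()  # sentinel key marking that a complete phrase ends at this node


def build_trie(phrases):
    """Build a token-level trie (a simple acyclic automaton) from phrase tuples."""
    root = {}
    for phrase in phrases:
        node = root
        for token in phrase:
            node = node.setdefault(token, {})
        node[_END] = True
    return root


def find_phrases(tokens, trie):
    """Return (start, end) spans of the longest phrase match at each position."""
    spans = []
    i = 0
    while i < len(tokens):
        node, j, last_end = trie, i, None
        while j < len(tokens) and tokens[j] in node:
            node = node[tokens[j]]
            j += 1
            if _END in node:
                last_end = j  # remember the longest complete phrase so far
        if last_end is None:
            i += 1
        else:
            spans.append((i, last_end))
            i = last_end
    return spans


# Placeholder phrases; a real system would load the phrase dictionary instead.
trie = build_trie([("in", "spite", "of"), ("in", "spite"), ("in", "order", "to")])
tokens = "he left in spite of the rain in order".split()
spans = find_phrases(tokens, trie)
```

Each token is followed through the trie at most once per start position, which is in line with the very low per-word cost a finite-state approach aims for.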
  • Other Language in/around China
    TAO Doudou, YU Long, TIAN Shengwei, ZHAO Jianguo, Turgun·Ibrahim, Askar·Hamdulla
    2017, 31(5): 92-98,113.
    Focusing on the Uyghur noun phrase coreference identification task, this paper proposes a Stacked Nonnegative Constrained Autoencoder (SNCAE) for anaphoricity determination based on semantic features. Through an analysis of Uyghur noun phrase phenomena, 15 kinds of semantic features are extracted and then fed into the SNCAE to extract deep semantic features. Finally, a Softmax classifier is used to complete the recognition task. Compared with a Support Vector Machine (SVM), the positive and negative accuracy increase by 8.259% and 4.158%, respectively; compared with the Stacked Autoencoder (SAE), they increase by 1.884% and 1.590%, respectively.
  • Other Language in/around China
    Turdi Tohti, Winira Musajan, Askar Hamdulla
    2017, 31(5): 99-107.
    This paper proposes an improved frequent pattern-growth approach to discover and extract the semantic strings that express key information in a text. It then assigns weights to them via a multi-feature fusion method and selects the most important semantic strings as features to represent the text. Experimental results with K-means clustering show that the text model built from semantic string features is more compact than the one built from word features: it greatly reduces the dimensionality of the feature space and also improves the performance of the clustering algorithm.
  • Other Language in/around China
    Azragul, Azharjan, Yusup Abaydula, Zulkarjan, Mirxat
    2017, 31(5): 108-113.
    This study examines Uyghur stems in junior high school Uyghur mathematics textbooks. It analyzes the basic stems, the new stems, and the high-frequency stems in the textbooks. The results provide reference material for Uyghur language study, Uyghur mathematics teaching, and codification.
  • Information Extraction and Text Mining
    MA Chaoyi, XU Weiran
    2017, 31(5): 114-119.
    Relation extraction is a fundamental task in information extraction, with practical significance for information retrieval, question answering systems, knowledge mapping, etc. Existing relation extraction data sets are for English, contain very limited categories, and neglect sentence-level annotations. This paper constructs a Chinese relation extraction data set using a weakly supervised, semi-automatic method. It first extracts a large number of relation pairs from Wikipedia, then extracts sentences that contain the entity pairs from the Sougou News and Baidu corpora, which completes the weakly supervised sentence extraction. These sentences are then scored by an RNN-based relation extraction system, and the higher-scoring sentences are selected for manual annotation. The Chinese relation extraction data set is completed after this manual annotation.
  • Information Extraction and Text Mining
    WAN Guo, ZHANG Guiping, BAI Yu, ZHU Yaohui
    2017, 31(5): 120-126.
    A topic sentence extraction method for news text is proposed. First, a location feature is derived from the distribution of topic sentences in news text. Then, because the news title is closely related to the theme, the overlap ratio between each sentence and the title is calculated. To best estimate the relevance between the title and a candidate topic sentence, maximum matching on a weighted bipartite graph is applied. Finally, the topic sentence is selected according to the sentence rank score. Experimental results show that the proposed method reaches 75.9% in P@1 and 92.4% in P@3.
  • Information Extraction and Text Mining
    ZHANG Ruqing, GUO Yan, LIU Yue, YU Xiaoming, CHENG Xueqi
    2017, 31(5): 127-137.
    Most existing information extraction methods focus on a specific type of webpage rather than being applicable to all webpages. In this paper, we propose a general framework based on a fusion mechanism to extract the theme information of all webpages. The framework combines an automatic information extraction strategy and a template detection strategy through four steps: template matching, template-based extraction, webpage classification, and automatic extraction. Experiments show that the proposed strategy yields an additional improvement in extraction precision.
  • Information Extraction and Text Mining
    WU Yongliang, ZHAO Shuliang, LI Changjing, WEI Nadi, WANG Ziyan
    2017, 31(5): 138-145.
    Text classification is a fundamental task in text mining. Many text classification algorithms have been presented in the literature, such as KNN, Naive Bayes, Support Vector Machine, and improved variants. The performance of these algorithms depends on the data set, and they have no self-learning function. This paper proposes an effective approach to text classification. Its three key points are: 1) extracting the keywords of category (KWC) from labeled texts based on the TF-IDF approach; 2) classifying unlabeled text by the relevance between category and unlabeled text; and 3) improving performance by updating the KWC during classification. Simulation results show that the new approach raises text classification accuracy to 90%, and up to 95% when the data volume is large enough. The method automatically updates the keywords of category to improve the classifier's accuracy.
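
The three key points can be sketched roughly as follows: score terms per category with a TF-IDF style weight, classify a new text by summing the weights of the category keywords it contains, and then fold the text back into the winning category so its keywords are refreshed. The exact weighting, relevance measure, and update rule are not given in the abstract, so everything below (function names, treating each category as one document, the `top_n` cutoff) is an assumption.

```python
import math
from collections import Counter


def keyword_weights(category_counts, top_n=5):
    """TF-IDF style weights, treating each category as one 'document'."""
    n_cats = len(category_counts)
    doc_freq = Counter()
    for counts in category_counts.values():
        doc_freq.update(counts.keys())
    weights = {}
    for label, counts in category_counts.items():
        total = sum(counts.values())
        scored = {t: (c / total) * math.log(n_cats / doc_freq[t])
                  for t, c in counts.items()}
        top = sorted(scored, key=lambda t: (-scored[t], t))[:top_n]
        # Terms found in every category score zero and are dropped.
        weights[label] = {t: scored[t] for t in top if scored[t] > 0}
    return weights


def classify_and_update(tokens, category_counts, top_n=5):
    """Pick the most relevant category, then add the text to that category."""
    weights = keyword_weights(category_counts, top_n)
    relevance = {label: sum(kw.get(t, 0.0) for t in tokens)
                 for label, kw in weights.items()}
    best = max(sorted(relevance), key=relevance.get)  # ties go to the first label
    category_counts[best].update(tokens)  # self-update step
    return best


category_counts = {
    "sports": Counter("match goal team goal win".split()),
    "finance": Counter("stock market bank stock price".split()),
}
label = classify_and_update("the team scored a late goal".split(), category_counts)
```

A practical version would at least filter stop words before the update, since here words such as "the" are absorbed into the category counts as well.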
  • Information Retrieval and Question Answering
    XIE Xiaohui, WANG Chao, LIU Yiqun, ZHANG Min, MA Shaoping
    2017, 31(5): 146-155.
    With rich media introduced into the search interface, search engine result pages have become heterogeneous and two-dimensional in layout. To address this new challenge to traditional click models, we analyze the result pages of a popular commercial search engine and build a click model based on a deep neural network, aiming to reveal correlations between multimedia information and text information. The framework combines the characteristics of neural networks with the predictive ability of click models. Experiments demonstrate that our framework is a clear improvement over the original click model. However, due to the complexity of multimedia content, even a deep neural network produces quite weak semantic correlations if it relies only on basic characteristics of multimedia results.
  • Information Retrieval and Question Answering
    FAN Yixing, GUO Jiafeng, LAN Yanyan, XU Jun, CHENG Xueqi
    2017, 31(5): 156-162.
    Traditional research on information retrieval focuses on document-level retrieval and neglects sentence-level retrieval, which is of great importance in applications such as mobile search. Assuming that the context of a sentence can provide richer evidence for matching, this paper proposes a context-aware deep sentence matching model (CDSMM). Specifically, the model employs a bi-directional LSTM to capture the interior and exterior information of the sentence; a matching matrix is then constructed from the sentence representation and the query representation; finally, the matching score is obtained through a feed-forward neural network. Experimental results on the WebAP dataset show that our model significantly outperforms state-of-the-art models.
  • Information Retrieval and Question Answering
    JIANG Yu, SONG Xingshen, YANG Yuexiang, JIANG Kun
    2017, 31(5): 163-170.
    Top-k query processing is a popular technique in search engines that returns the most relevant results to the user from massive data. Although Top-k querying significantly improves system performance, its slow-start issue has not been effectively resolved. This paper extracts static Top-k information from the inverted index and then calculates an initial threshold in real time for each specific query. On this basis, it presents a rapid-start Top-k query algorithm for the current state-of-the-art methods MaxScore and WAND. Experimental results show that the proposed approach achieves better performance.
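
One well-known way to obtain a safe starting threshold from static per-term information is sketched below: store the k highest posting scores of every term offline, and at query time take the largest k-th score among the query terms. This is a valid lower bound on the k-th best document score only if term scores are non-negative and a document's score is the sum of its term scores. Whether the paper uses exactly this construction is not stated in the abstract; the toy index and function names are hypothetical.

```python
import heapq


def build_static_topk(index, k):
    """Offline step: keep the k highest posting scores of each term (descending)."""
    return {term: heapq.nlargest(k, postings.values())
            for term, postings in index.items()}


def initial_threshold(query_terms, static_topk, k):
    """Lower bound on the k-th best score of a disjunctive (OR) query."""
    threshold = 0.0
    for term in query_terms:
        scores = static_topk.get(term, [])
        if len(scores) >= k:
            threshold = max(threshold, scores[k - 1])
    return threshold


def exhaustive_topk(query_terms, index, k):
    """Reference answer: score every matching document, return the top-k scores."""
    totals = {}
    for term in query_terms:
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    return heapq.nlargest(k, totals.values())


# Toy inverted index: term -> {doc_id: term score}.
index = {
    "top": {1: 2.0, 2: 1.5, 3: 0.4},
    "query": {2: 1.0, 4: 0.9, 5: 0.2},
}
k = 2
static_topk = build_static_topk(index, k)
threshold = initial_threshold(["top", "query"], static_topk, k)
best_scores = exhaustive_topk(["top", "query"], index, k)
```

A dynamic pruning method such as MaxScore or WAND could then start from this threshold instead of zero, which lets it begin discarding low-scoring documents before k results have been collected.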
  • Information Retrieval and Question Answering
    XU Jian, ZHANG Dong, LI Shoushan, WANG Hongling
    2017, 31(5): 171-177.
    Question classification is a basic task in question answering systems. Previous studies train the question classification model only on a monolingual corpus and suffer from problems such as limited corpus size and short question text. To address these problems, we propose a new approach: a dual-channel LSTM model with bilingual information. First, we extend the Chinese corpus and the English corpus with the corresponding translated corpora. Second, the samples are represented by the word vectors of the question text and of its translation. Finally, we build a question classifier using the dual-channel LSTM model. Experimental results demonstrate that our approach improves question classification performance.
  • Sentiment Analysis and Social Computing
    ZHANG Yangsen, SUN Kuangyi, DU Cuilan, WANG Jian, TONG Lingling
    2017, 31(5): 178-184.
    This paper proposes a cascaded classifier for micro-blog sentiment analysis. The primary classifier is based on an emotion dictionary and a Sina micro-blog emoticon dictionary. The secondary classifier is based on orientation similarity, grouped by several key sentiment words. The third-level classifier is built using Naive Bayes. The micro-blogs are processed by the three classifiers in a pipeline. Experimental results show that the method is effective when compared against the NLPCC2014 micro-blog sentiment evaluation results.
  • Sentiment Analysis and Social Computing
    BAI Ting, WEN Jirong, ZHAO Xin, YANG Bohua
    2017, 31(5): 185-193.
    Long-tail products, despite low individual demand, account for a significant share of total revenue. Analyzing long-tail purchase behaviors is challenging because the small number of purchases makes the data sparse. This paper proposes leveraging online social media information to predict long-tail purchase behaviors. Specifically, we build user profiles from social media information, including status text, following links, and temporal activity distributions, and predict purchases with weighted Multiple Additive Regression Trees (MART). Experiments on data from JingDong and SinaWeibo show the effectiveness of the proposed method, together with several interesting findings.
  • Sentiment Analysis and Social Computing
    ZHAO Mingzhen, LIN Hongfei, XU Bo, HAO Huihui
    2017, 31(5): 194-202.
    With the development of the Internet, social networks have accumulated large amounts of text data about health care. This paper presents an information entropy based method for recognizing potential adverse drug reactions from user comments in health-related social networks. In addition, to recognize the potential adverse drug reactions, it proposes a protein association function based on Word2vec and Skip-gram. Using this function, the paper attempts to detect evidence linking drugs to their potential adverse drug reactions. The results show that the method is promising for providing evidence chains for potential adverse drug reactions.
  • Sentiment Analysis and Social Computing
    GUAN Saiping, JIN Xiaolong, XU Xueke, WU Dayong, JIA Yantao, WANG Yuanzhuo, LIU Yue
    2017, 31(5): 203-214.
    With the rapid development of news websites, the volume of news comments has increased sharply; these comments are very important to public opinion analysis and news comment recommendation. This paper proposes a news comment clustering method, called EWMD-AP, to automatically mine the public's focuses on the news. The method employs Word Mover's Distance (WMD) with enhanced weight vectors to calculate the distances between news comments. It then adopts Affinity Propagation (AP) to cluster the comments, and finally obtains the clusters and their representative comments, which correspond to the focuses of the public. In particular, this paper proposes replacing the traditional word frequency based weight vectors in WMD with enhanced weight vectors, which consist of three components: the importance coefficient of words, the de-contextualization coefficient, and the traditional TF-IDF coefficient. Experimental results on 24 news comment datasets demonstrate that EWMD-AP performs much better than both traditional clustering methods (e.g., K-means and Mean Shift) and state-of-the-art ones (e.g., Density Peaks).
  • Sentiment Analysis and Social Computing
    LI Yumeng, LIAN Xubao, XU Bo, WANG Jian, LIN Hongfei
    2017, 31(5): 215-222.
    “Next basket” recommendation is a crucial task in e-commerce. Traditional algorithms can be divided into sequential recommenders and general recommenders, both of which neglect the impact of implicit feedback behavior and the time sensitivity of users' preferences. This paper proposes a “next basket” recommendation framework based on implicit user feedback. We divide user behaviors into several time windows according to their timestamps, and model user preference along different dimensions for each window. We then use a convolutional neural network to train a classifier. Compared with traditional linear models and tree models on a real dataset, the proposed model improves user satisfaction with the recommender system.