2017 Volume 31 Issue 4 Published: 15 August 2017
  

  • Article
    LIU Lu, KANG Shiyong
    2017, 31(4): 1-8.
    From the perspective of semantic construction, this paper explains the denotation of undirected nouns. It further proposes six types of noun connotation according to qualia structure theory, and tries to interpret the semantic construction of undirected nouns through metonymy, metaphor and metaphtonymy. According to how morpheme sense is transformed into word sense, undirected nouns are further classified into eight types. Based on qualia structure theory, we investigate which qualia role of a morpheme is integrated into the meaning of the whole word. Finally, we summarize the rules of mapping from morpheme sense to word sense, indicating that prev-last metonymy and prev-last metaphor are the most common.
  • Article
    DENG Liping, LUO Zhiyong
    2017, 31(4): 9-19.
    Applying the minimum entropy regularization framework to the supervised CRF model, this paper proposes a semi-supervised CRF model that combines supervised learning on labeled text in the common domain with unsupervised learning on unlabeled text in the target professional domain. The domain adaptation is further improved by introducing a domain dictionary and a tagged corpus. Experiments on a cross-domain segmentation task show that the proposed method outperforms the supervised CRF in terms of OOV recall and F-value.
  • Article
    MA Chuangxin, CHEN Xiaohe
    2017, 31(4): 20-27.
    The language style of literature is the embodiment of the author's mindset using language. For a quantitative analysis of language style, this paper analyzes the word distribution in pre-Qin literature, using eight classic works as the corpus. The power-law distribution is again confirmed. Then the correlation coefficients of the word type grades between the works are calculated. We show that the language style differs not only in the use of common words, but also in the word type grades.
  • Article
    ZHANG Hainan, WU Dayong, LIU Yue, CHENG Xueqi
    2017, 31(4): 28-35.
    Chinese NER is challenged by implicit word boundaries, the lack of capitalization, and the polysemy of a single character in different words. This paper proposes a novel character-word joint encoding method in a deep learning framework for Chinese NER. It reduces the effect of improper word segmentation and a sparse word dictionary in word-only embedding, while improving the results of character-only embedding, which lacks context. Experiments on the 1998 People's Daily corpus demonstrate good results: improvements of at least 1.6%, 8% and 3%, respectively, on the location, person and organization recognition tasks compared with character or word features; and F1 scores of 96.8%, 94.6% and 88.6%, respectively, on the same three tasks when integrated with the part-of-speech feature.
  • Article
    WANG Qiang, DU Quan, XIAO Tong, ZHU Jingbo
    2017, 31(4): 36-43.
    This paper presents a transfer-triangulation method for pivot-based translation between two languages with poor bilingual data. It combines the strengths of the typical transfer method and the triangulation method for pivot-based translation, and decodes pivot phrases to improve the phrase table. Evaluated on a German-Chinese translation task with English as the pivot language, our method achieves significant improvement over baseline pivot-based methods.
  • Article
    MA Bin, CAI Dongfeng, JI Duo, YE Na, WU Chuang
    2017, 31(4): 44-49.
    Traditional interactive machine translation (IMT) focuses on the current source language and the partial translation in the target language, neglecting feedback from translators that could better predict subsequent translations. This paper investigates translation selection clicks and proposes a dynamic word alignment model for the partial translation. Experiments indicate that this method improves word prediction accuracy during the interactive machine translation process.
  • Article
    XUE Zhengshan, ZHANG Dakun, WANG Lina, HAO Jie
    2017, 31(4): 50-56.
    Long sentence segmentation is a valid issue in optimizing the quality of machine translation. This paper proposes a new method for long sentence segmentation during the training process. The method automatically decides the boundary words and their probabilities without manual intervention, which results in semantically more meaningful segmentation. The lengths of the segmented sub-sentences are also balanced across both source and target languages. Experiments on the NIST test sets show an improvement of up to 0.5 BLEU points.
  • Article
    YANG Zhenxin, LI Miao, CHEN Lei, WEI Linyu, CHEN Sheng, SUN Kai
    2017, 31(4): 57-62.
    To deal with the morphological difference between Chinese and Mongolian, this paper proposes a method that adopts Mongolian morphemes as the pivot for Chinese-Mongolian statistical machine translation (SMT). First, we segment Mongolian words into morphemes, achieving a balance in the morphology of the language pair. Then, we treat Mongolian morphemes as the pivot language and construct two new SMT systems: Chinese-Morpheme SMT and Morpheme-Mongolian SMT. New translation knowledge, including a phrase translation table and a reordering model, is introduced for these two SMT systems. Finally, we use multiple decoding paths and multiple features to incorporate the new translation knowledge. Experimental results demonstrate that our method can improve the translation quality significantly.
  • Article
    XIONG Mingming, LIU Yanchao, GUO Jianyi, YU Zhengtao, ZHOU Lanjiang, CHEN Xiuqin
    2017, 31(4): 63-69.
    To deal with the rich cross ambiguities in Vietnamese, this paper adopts the Maximum Entropy approach using selected statistical features, contextual features and internal features of the ambiguity segments. It constructs a Vietnamese dictionary of 174 646 entries, which yields 5 377 cross-ambiguity segments among 25 981 Vietnamese sentences with gold labels. A 5-fold cross-validation experiment shows that the proposed method can achieve an accuracy of 87.86%, which outperforms CRFs.
  • Article
    Turdi Tohti, Winira Musajan, Askar Hamdulla
    2017, 31(4): 70-79.
    A fast Uyghur semantic string extraction method is proposed based on a statistical model and shallow linguistic parsing. It employs a multilayered dynamic indexing structure to build a word index for large-scale text. Combined with Uyghur word association rules, an improved n-gram incremental algorithm is designed for word string extension, trying to capture the credible frequent patterns in the text. The final semantic strings are determined after the structural integrity of each frequent pattern is verified. Experiments on different corpora indicate that this method is feasible and effective.
  • Article
    LI Dongbai, TIAN Shengwei, YU Long, Turgun Ibrahim, FENG Guanjun
    2017, 31(4): 80-88.
    Coreference resolution is a fundamental issue in natural language processing. Drawing on the semantic features of Uyghur, this paper proposes a method of Uyghur pronominal anaphora resolution based on deep learning. The proposed DBN (Deep Belief Nets) learning model is composed of several unsupervised RBM networks and a supervised BP network. The RBM layers preserve as much information as possible when feature vectors are mapped to the next layer. The BP layer classifies the vector output by the last RBM layer. The model can then be used to implement Uyghur pronominal anaphora resolution. Experiments on a Uyghur coreference resolution corpus achieve 83.81% in F-score, 2.88% higher than SVM.
  • Article
    LONG Congjun, LIU Huidan, WU Jian
    2017, 31(4): 89-93.
    “Syllables” of the Tibetan language are very important in vocabulary construction and text information processing, especially for solving the segmentation and annotation of OOVs. This paper proposes to tag the syllables, which can be applied to predict the POS of compound words (especially OOVs) according to the rules of word construction. The paper presents the definition of the Tibetan syllable and outlines the principles of classification and labeling. The training and test texts are selected from Tibetan-language teaching materials for primary and secondary schools, totaling 240K syllables. Experiments reveal a precision of 93.5208% for syllable tagging, upon which an improved accuracy of 94.1967% for POS tagging can be reached. Given gold-standard syllable tagging, the accuracy of POS tagging improves to 97.7754%.
  • Article
    DONG Jun, JIANG Tonghai, Aizimaiti Ainiware, CHENG Li, XU Chun
    2017, 31(4): 94-99.
    This paper describes the special writing rules of the Kazakh letters and , pointing out that the current substitution method does not comply with international or national standards and obstructs Kazakh processing in text sorting, script conversion and speech synthesis. The paper proposes three improvements: 1) representing the four special letters with the combination of themselves and character ; 2) including only isolated forms with in the OpenType font; and 3) identifying the contexts that are not adjacent to the Kazakh letter based on the glyph substitution rule <calt> in the OpenType font. To facilitate the application of these suggestions, the paper describes a set of OpenType glyph substitution rules that is consistent with the improved method.
  • Article
    Turdi Tohti, Winira Musajan, Askar Hamdulla
    2017, 31(4): 100-107.
    This paper proposes an improved frequent pattern-growth approach to discover and extract the semantic strings that express key information in Uyghur texts. The topics are then described by these weighted semantic strings. Based on these features, Uyghur text classification is conducted with a newly designed Jaccard-like similarity measure. Experimental results show that the proposed method achieves performance comparable to two traditional classifiers at a reasonable computational cost.
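The abstract above does not define its Jaccard-like measure. As a rough illustration only, the sketch below uses the standard weighted Jaccard similarity (sum of minimum weights over sum of maximum weights) between two bags of weighted strings. The function name and the sample data are hypothetical, and the paper's own measure may differ.

```python
from typing import Dict


def weighted_jaccard(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Standard weighted Jaccard similarity between two weighted string sets.

    This is a generic stand-in, not the measure designed in the paper.
    """
    keys = set(a) | set(b)
    # A string missing from one side contributes weight 0 on that side.
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical weighted semantic strings for a document and a topic.
doc = {"x": 1.0, "y": 2.0}
topic = {"y": 1.0, "z": 1.0}
score = weighted_jaccard(doc, topic)
```

A classifier built on such a measure would assign a document to the topic with the highest score.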
  • Article
    KE Yonghong, YU Shiwen, SUI Zhifang, SONG Jihua
    2017, 31(4): 108-113.
    The performance and robustness of natural language processing systems depend strongly on annotated corpora. To meet the requirement of large-scale, high-quality corpus annotation, this paper describes an annotation method based on collective intelligence, including the system structure, user capacity evaluation, data selection, task management, collaborative tagging, behavior analysis, quality control, judgement and optimization. Project practice shows that the annotation method based on collective intelligence has significant advantages for natural language processing research projects.
  • Article
    HE Jing, SONG Tianbao, PENG Weiming, ZHU Shuqin, SONG Jihua
    2017, 31(4): 114-121.
    An efficient approach for ancient Chinese treebank construction is proposed, based on a "word or POS" match strategy. To deal with ancient Chinese, which is characterized by short clauses and typical patterns, it divides treebank construction into four steps: 1) candidate match pattern generation; 2) syntactic transformation rule composition; 3) syntactic parsing; 4) manual verification. In addition to minimizing the manual annotation cost of treebank construction, the match patterns obtained during this process can provide data support for ancient Chinese teaching and research.
  • Article
    WANG Jianan, LU Qiang
    2017, 31(4): 122-131.
    Distant supervision for relation extraction is an approach that extracts relations from texts automatically by aligning a database of facts with texts. Most existing solutions are feature-based algorithms with certain defects. In this paper, we propose a pattern-based algorithm for distantly supervised relation extraction using pattern-based vectors. A kernel-based method is used in the algorithm to overcome the problems of feature-based algorithms. The experimental results show that our algorithm can successfully improve the precision of distant supervision for relation extraction.
  • Article
    YE Min, TANG Shiping, NIU Zhendong
    2017, 31(4): 132-137.
    In the framework of the vector space model (VSM), a new PCHI-PTFIDF (promoted CHI-promoted TFIDF) method based on feature selection and weight calculation is proposed. First, the factors of frequency, concentration, dispersion and location are introduced into chi-square based feature selection. Then, the TF-IDF weight is optimized using the length and location factors of text terms. The proposed method can reduce the feature dimensions while retaining better classification ability, and produces a better estimate of the weight distribution. The experimental results show that, compared with the algorithm using the traditional CHI and traditional TFIDF, the PCHI-PTFIDF method achieves a 10% improvement in Macro-F1 on average.
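For reference, the traditional baselines named in the abstract above can be sketched as follows. This shows only the textbook chi-square statistic for a term and a class, and the textbook TF-IDF weight with a natural logarithm. The promoted factors of the paper (frequency, concentration, dispersion, location, length) are not reproduced here, and the function and parameter names are illustrative assumptions.

```python
import math


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Textbook chi-square score of a term for a class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or the class never varies
    return n * (a * d - c * b) ** 2 / denominator


def tf_idf(tf: float, df: int, n_docs: int) -> float:
    """Textbook TF-IDF weight: term frequency times log(N / document frequency)."""
    return tf * math.log(n_docs / df)


# A term that perfectly separates two classes of 10 documents each.
perfect = chi_square(10, 0, 0, 10)
# A term that is independent of the class.
independent = chi_square(5, 5, 5, 5)
```

Feature selection keeps the terms with the highest chi-square scores, and the retained terms are then weighted by TF-IDF.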
  • Article
    LUO Junfan, CHEN Li, YU Zhonghua, DING Gejian, LUO Qian
    2017, 31(4): 138-144.
    To deal with text segmentation for academic paper abstracts, an unsupervised text segmentation algorithm is proposed, which incorporates a constraint on the length distribution derived from the preference for length uniformity across the different discussion aspects (i.e. content blocks) of an abstract. A metric based on information entropy is introduced to measure the uniformity of the length distribution, and the objective function is designed by further combining the inter- and intra-block semantic similarities of content blocks. A standard dynamic programming scheme is employed to determine the best segmentation sequence. Experiments on 8603 abstracts from Medline show an improvement of 3% in accuracy compared with baselines.
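The abstract above does not give the exact form of its entropy-based uniformity metric. One plausible reading, shown here purely as an assumption, is the Shannon entropy of the normalized block lengths divided by its maximum, so that equal-length blocks score 1 and a very uneven split scores close to 0. The function name and the single-block convention are hypothetical.

```python
import math
from typing import Sequence


def length_uniformity(lengths: Sequence[int]) -> float:
    """Normalized entropy of block lengths, in [0, 1].

    Assumed form only: the paper's actual metric may be defined differently.
    """
    total = sum(lengths)
    if len(lengths) < 2 or total == 0:
        return 1.0  # convention chosen here: a single block is trivially uniform
    probs = [n / total for n in lengths if n > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    # log(k) is the maximum entropy for k blocks, reached at equal lengths.
    return entropy / math.log(len(lengths))


even = length_uniformity([10, 10, 10, 10])
uneven = length_uniformity([97, 1, 1, 1])
```

In a dynamic programming segmenter, a term like this could be added to the similarity-based score so that candidate segmentations with balanced blocks are preferred.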
  • Article
    LIU Peng, TENG Jiayu, DING Enjie, MENG Lei
    2017, 31(4): 145-153.
    Due to the sharp increase in internet text, running k-means on such data takes far longer. Some classic parallel architectures, such as Hadoop, have not improved the execution efficiency of k-means, because the frequent iteration in such algorithms is hard to handle efficiently. This paper proposes a parallel k-means algorithm based on Spark. It makes full use of Spark's in-memory RDD computing model to meet the frequent iteration requirement of k-means. Experimental results show that k-means executes much more efficiently in Spark than in Hadoop on the same datasets and in the same computing environments.
  • Article
    YANG Wenjing, QIU Yongqin, LI Sixu, LI Rui, WANG Bin
    2017, 31(4): 154-164.
    Online Event Retrieval is a retrieval task for event queries, which returns important event-related documents from mini-batch data sets iteratively in chronological order. This paper proposes an online event retrieval framework based on two kinds of graphs: an event keyword co-occurrence graph and a bipartite graph incorporating event type. A case study and experiments on two public TREC corpora indicate that our approach improves event retrieval precision significantly (the maximum increase reaches 30% and the average reaches 5.85% in the metric P@10).
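The P@10 metric cited in the abstract above is precision at rank 10: the fraction of the top 10 returned documents that are relevant. A minimal sketch of the general P@k computation follows. The document identifiers are made up for illustration.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the top-k ranked documents that are relevant.

    Divides by k even when fewer than k documents were returned,
    which is the usual convention for P@k.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


# Hypothetical ranking with two relevant documents among four results.
ranking = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d3"}
p_at_4 = precision_at_k(ranking, relevant_docs, k=4)
p_at_10 = precision_at_k(ranking, relevant_docs, k=10)
```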
  • Article
    HU Sha, DOU Zhicheng, WEN Jirong
    2017, 31(4): 165-173.
    Search result diversification re-ranks search results to cover as many user intents as possible in the top ranks. Most intent-aware diversification algorithms use subtopics to diversify results. Focusing on the granularity of subtopics, this paper investigates the performance of diversification algorithms using subtopics of different granularities. Experimental results show that state-of-the-art diversification algorithms work better with fine-grained subtopics.
  • Article
    MAN Tong, SHEN Huawei, HUANG Junming, CHENG Xueqi
    2017, 31(4): 174-183.
    Data sparsity is a challenge for recommender systems. In recent years, the integration of data from different sources has provided a promising direction for solving this issue. However, most existing methods for data integration assume that the representation of a single user/item is the same across different contexts, which blocks the depiction of the distinct characteristics of different contexts. In this paper, we propose a matrix factorization model with a soft constraint, in which the difference between the representations of a single user/item is minimized together with the error function of the matrix factorization model. Experiments on two datasets demonstrate that the proposed model outperforms the state-of-the-art models, especially in the case where the data is sparse in only one resource.
  • Article
    WU Hui, ZHANG Shaowu, LIN Hongfei
    2017, 31(4): 184-190.
    This paper investigates the evaluation of user influence on Sina microblog. Among various factors, a user is considered more influential if his information is disseminated faster and to a larger extent. Compared with traditional methods, both the user's activity level and the quality of posts are taken into consideration. Treating each user as a node in the social network, the final user influence is estimated. Experiments on both a public dataset and a real dataset from Sina microblog show the validity of the method.
  • Article
    CHEN Chang, WEI Jingjing, LIAO Xiangwen, LIN Bogang, CHEN Guolong
    2017, 31(4): 191-198.
    Social media has become a popular platform for sharing and exchanging information. The identification of socially influential users has already been applied in many areas, including recommendation systems, expert finding and social advertising. This paper proposes a constrained tensor factorization method to identify users with high social influence. In the factorization result, the polarity allocation of influence is preserved (i.e. positive, neutral and negative influence). The method fuses the topical similarity of users through a Laplacian matrix, which controls the tensor factorization to approximate the user influence. Experimental results demonstrate that the method outperforms OOLAM, TwitterRank and other methods in terms of ranking accuracy.
  • Article
    LIU Qiang, LI Jingyuan, WANG Yuanzhuo, LIU Yue, REN Yan
    2017, 31(4): 199-207.
    Online media have developed rapidly in the last few years, making user preference prediction a substantial issue for increasing user clicks. Data sparsity in both the user information and the historical behavior records degrades many well-known prediction systems. Based on data from Google users, this paper reveals that users' “likes” on online media converge. In particular, we detect a correlation between a user's “likes” on online media and his profile in the social network, suggesting that the user profile in the social network can predict the user's likes on online media. Based on this correlation, we apply the user's social network description to predict his “likes” on online media, resulting in an improvement of more than 17% in precision compared with algorithms using only the user information from online media.
  • Article
    FU Bo, CHEN Yiheng, SHAO Yanqiu, LIU Ting
    2017, 31(4): 208-215.
    Consumption intent refers to an explicit indication of an immediate or future purchase in a microblog post. For example, a post like “I want to buy a mobile phone” indicates a buying intention. This paper studies the problem of identifying consumption intent in microblogs based on resources naturally annotated by users. Specifically, the proposed method recasts consumption intent recognition as a domain adaptation problem, and presents an approach that utilizes automatically acquired large text corpora for classification. First, we look for a set of common features that generalize across domains, and then we extract high-confidence pseudo-annotated samples. Finally, we pick up useful features specific to the target domain. Experimental results show that the proposed method is effective for consumption intent recognition, achieving 69% and 77% in F-value, respectively. All the adopted features contribute to the performance.
  • Article
    CHEN Yiheng, LI Xueting, WANG Biao, LIU Ting
    2017, 31(4): 216-222.
    The analysis of user influence in social networks is a key research issue in social marketing. This paper focuses on several network-structure-based algorithms for user influence analysis, and conducts a comparative study of their performance.