Journal of Chinese Information Processing

Study on Semantic Structures of Undirected Nouns Based on Qualia Structure Theory

LIU Lu, KANG Shiyong

2017, 31(4): 1-8.

From the perspective of semantic construction, this paper explains the denotation of undirected nouns. It further proposes six types of noun connotation according to qualia structure theory, and tries to interpret the semantic construction of undirected nouns through metonymy, metaphor and metaphtonymy. According to how the morpheme sense is transformed into the word sense, undirected nouns are further classified into eight types. Based on qualia structure theory, we investigate which qualia role of a morpheme is integrated into the meaning of the whole word. Finally, we summarize the rules of mapping from morpheme sense to word sense, indicating that prev-last metonymy and prev-last metaphor are the most common.

Domain Adaptation of Chinese Word Segmentation on Semi-Supervised Conditional Random Fields

DENG Liping, LUO Zhiyong

2017, 31(4): 9-19.

Applying the minimum entropy regularization framework to the supervised CRF model, this paper proposes a semi-supervised CRF model that combines supervised learning on labeled text in the common domain with unsupervised learning on unlabeled text in the target professional domain. The domain adaptation is further improved by introducing a domain dictionary and a tagged corpus. Experiments on a cross-domain segmentation task show that the proposed method outperforms the supervised CRF in terms of OOV recall and F-value.

Word Distribution, Word Type Grades and Style Computing in Literatures

MA Chuangxin, CHEN Xiaohe

2017, 31(4): 20-27.

The language style of literature embodies the author's mindset in using language. For a quantitative analysis of language style, this paper analyzes the word distribution in pre-Qin literature, using eight classic works as the corpus. The power-law distribution is again confirmed. Then the correlation coefficients of the word type grades between the works are calculated. We show that language style differs not only in the use of common words, but also in the word type grades.

Chinese Named Entity Recognition Based on Deep Neural Network

ZHANG Hainan, WU Dayong, LIU Yue, CHENG Xueqi

2017, 31(4): 28-35.

Chinese NER is challenged by implicit word boundaries, the lack of capitalization, and the polysemy of a single character in different words. This paper proposes a novel character-word joint encoding method in a deep learning framework for Chinese NER. It reduces the effect of improper word segmentation and a sparse word dictionary in word-only embedding, while compensating for the missing context in character-only embedding. Experiments on the 1998 People's Daily corpus demonstrate good results: improvements of at least 1.6%, 8% and 3%, respectively, in location, person and organization recognition compared with character or word features alone; and F1 scores of 96.8%, 94.6% and 88.6%, respectively, on location, person and organization recognition when integrated with the part-of-speech feature.

Transfer-Triangulation Method for Pivot-Based Statistical Machine Translation

WANG Qiang, DU Quan, XIAO Tong, ZHU Jingbo

2017, 31(4): 36-43.

This paper presents a transfer-triangulation method for pivot-based translation between two languages with little bilingual data. It combines the strengths of the typical transfer method and the triangulation method for pivot-based translation, and decodes pivot phrases to improve the phrase table. Evaluated on a German-Chinese translation task with English as the pivot language, our method achieves a significant improvement over baseline pivot-based methods.

Interactive Machine Translation by Dynamic Word Alignment

MA Bin, CAI Dongfeng, JI Duo, YE Na, WU Chuang

2017, 31(4): 44-49.

Traditional interactive machine translation (IMT) focuses on the current source sentence and the partial translation in the target language, neglecting translator feedback that could help predict the subsequent translations. This paper investigates translation selection clicks and proposes a dynamic word alignment model for the partial translation. Experiments indicate that this method improves word prediction accuracy during the interactive machine translation process.

An Improved Sentence Segmentation Model for Machine Translation

XUE Zhengshan, ZHANG Dakun, WANG Lina, HAO Jie

2017, 31(4): 50-56.

Long sentence segmentation is a relevant issue in optimizing the quality of machine translation. This paper proposes a new method for long sentence segmentation during the training process. The method automatically decides the boundary words and their probabilities without manual intervention, which results in more semantically meaningful segmentation. In addition, the lengths of the segmented sub-sentences are balanced across the source and target languages. Experiments on the NIST test sets show an improvement of up to 0.5 BLEU points.

A Morpheme-Based Approach for Chinese-Mongolian SMT

YANG Zhenxin, LI Miao, CHEN Lei, WEI Linyu, CHEN Sheng, SUN Kai

2017, 31(4): 57-62.

To deal with the morphological difference between Chinese and Mongolian, this paper proposes using Mongolian morphemes as the pivot for Chinese-Mongolian statistical machine translation (SMT). First, we segment Mongolian words into morphemes, achieving a balance in the morphology of the language pair. Then, we treat Mongolian morphemes as the pivot language and construct two new SMT systems: Chinese-Morpheme SMT and Morpheme-Mongolian SMT. New translation knowledge, including a phrase translation table and a reordering model, is introduced for these two SMT systems. Finally, we use multiple decoding paths and multiple features to incorporate the new translation knowledge. Experimental results demonstrate that our method can improve the translation quality significantly.

Vietnamese Cross Ambiguity Resolution Based on Maximum Entropy Model

XIONG Mingming, LIU Yanchao, GUO Jianyi, YU Zhengtao, ZHOU Lanjiang, CHEN Xiuqin

2017, 31(4): 63-69.

To deal with the rich cross ambiguities in Vietnamese, this paper adopts the Maximum Entropy approach using selected statistical features, contextual features and internal features of the ambiguity segments. It constructs a Vietnamese dictionary of 174 646 entries, which yields 5 377 cross-ambiguity segments among 25 981 Vietnamese sentences with gold-standard labels. A 5-fold cross-validation experiment shows that the proposed method can achieve an accuracy of 87.86%, which outperforms CRFs.

Uyghur Semantic String Extraction Based on Statistical Model and Shallow Linguistic Parsing

Turdi Tohti, Winira Musajan, Askar Hamdulla

2017, 31(4): 70-79.

A fast Uyghur semantic string extraction method is proposed based on a statistical model and shallow linguistic parsing. It employs a multilayered dynamic indexing structure to build a word index for large-scale text. Combined with Uyghur word association rules, an improved n-gram incremental algorithm is designed for word string extension, aiming to capture the credible frequent patterns in the text. The final semantic strings are determined after the structural integrity of each frequent pattern is verified. Experiments on different corpora indicate that this method is feasible and effective.

Deep Learning for Pronominal Anaphora Resolution in Uyghur

LI Dongbai, TIAN Shengwei, YU Long, Turgun Ibrahim, FENG Guanjun

2017, 31(4): 80-88.

Coreference resolution is a fundamental issue in natural language processing. Drawing on the semantic features of Uyghur, this paper proposes a deep-learning method for Uyghur pronominal anaphora resolution. The proposed DBN (Deep Belief Nets) learning model is composed of several unsupervised RBM networks and a supervised BP network. The RBM layers preserve as much information as possible when feature vectors are mapped to the next layer. The BP layer classifies the vector output by the last RBM layer. The model can then be used to perform Uyghur pronominal anaphora resolution. Experiments on a Uyghur coreference resolution corpus achieve an F-score of 83.81%, 2.88% higher than SVM.

Research on Tagging of Tibetan Syllables

LONG Congjun, LIU Huidan, WU Jian

2017, 31(4): 89-93.

“Syllables” in the Tibetan language are very important in vocabulary construction and text information processing, especially for the segmentation and annotation of OOVs. This paper proposes to tag the syllables, which can be applied to predict the POS of compound words (especially OOVs) according to the rules of word construction. This paper presents the definition of the Tibetan syllable and outlines the principles of classification and labeling. The training and test texts are selected from Tibetan-language teaching materials for primary and secondary schools, totaling 240K syllables. Experiments reveal a precision of 93.5208% for syllable tagging, upon which an improved accuracy of 94.1967% for POS tagging can be reached. Given gold-standard syllable tagging, the accuracy of POS tagging improves to 97.7754%.

An Improved Kazakh Letter Representation

DONG Jun, JIANG Tonghai, Aizimaiti Ainiware, CHENG Li, XU Chun

2017, 31(4): 94-99.

This paper describes the special writing rules of the Kazakh letters […] and […], pointing out that the current substitution method does not comply with international or national standards and obstructs Kazakh processing in text sorting, script conversion and speech synthesis. This paper proposes three improvements: 1) representing the four special letters with the combination of themselves and the character […]; 2) including only the isolated forms […] with […] in the OpenType font; and 3) identifying the contexts that are not adjacent to the Kazakh letter based on the glyph substitution rule <calt> in the OpenType font. To facilitate the application of these suggestions, this paper describes a set of glyph substitution rules in the OpenType font that is consistent with the improved method.

Semantic String-Based Topic Similarity Measuring Approach for Uyghur Text Classification

Turdi Tohti, Winira Musajan, Askar Hamdulla

2017, 31(4): 100-107.

This paper proposes an improved frequent pattern-growth approach to discover and extract the semantic strings that express key information in Uyghur texts. The topics are then described by these weighted semantic strings. Based on these features, Uyghur text classification is conducted with a newly designed Jaccard-like similarity measure. Experimental results show that the proposed method achieves performance comparable to two traditional classifiers at a reasonable computational cost.
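
To make the idea of a Jaccard-like measure over weighted semantic strings concrete, here is a minimal sketch. It uses the standard weighted Jaccard similarity as a stand-in; the abstract does not give the paper's actual formula, and the names `weighted_jaccard` and `classify`, as well as the dictionary representation of topics, are assumptions made for illustration only. Weights are assumed to be non-negative.

```python
def weighted_jaccard(topic_a, topic_b):
    """Weighted Jaccard similarity between two {semantic_string: weight} maps.

    This is the textbook weighted Jaccard, used here as an assumed stand-in
    for the paper's "Jaccard-like" measure. Weights must be non-negative.
    """
    keys = set(topic_a) | set(topic_b)
    # Overlap: for each string, the weight that both sides share.
    shared = sum(min(topic_a.get(k, 0.0), topic_b.get(k, 0.0)) for k in keys)
    # Union: for each string, the larger of the two weights.
    total = sum(max(topic_a.get(k, 0.0), topic_b.get(k, 0.0)) for k in keys)
    return shared / total if total > 0 else 0.0


def classify(doc_strings, class_topics):
    """Return the label whose topic description is most similar to the document."""
    return max(class_topics,
               key=lambda label: weighted_jaccard(doc_strings, class_topics[label]))
```

A document is assigned to whichever class topic shares the most weighted strings with it, so no model training is needed beyond building the topic descriptions.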

Research on Corpus Annotation Method Based on Collective Intelligence

KE Yonghong, YU Shiwen, SUI Zhifang, SONG Jihua

2017, 31(4): 108-113.

The performance and robustness of a natural language processing system depend strongly on annotated corpora. To meet the requirement of large-scale, high-quality corpus annotation, this paper describes an annotation method based on collective intelligence, including the system structure, user capacity evaluation, data selection, task management, collaborative tagging, behavior analysis, quality control, judgement and optimization. Project practice shows that the annotation method based on collective intelligence has significant advantages for natural language processing research projects.

An Efficient Approach to Ancient Chinese Treebank Construction Based on “Word or POS” Match

HE Jing, SONG Tianbao, PENG Weiming, ZHU Shuqin, SONG Jihua

2017, 31(4): 114-121.

An efficient approach to ancient Chinese treebank construction is proposed, based on a "word or POS" match strategy. Since ancient Chinese is characterized by short clauses and typical patterns, the approach divides treebank construction into four steps: 1) candidate match pattern generation; 2) syntactic transformation rule composition; 3) syntactic parsing; 4) manual verification. In addition to minimizing the manual annotation cost in treebank construction, the match patterns obtained during this process can provide data support for ancient Chinese teaching and research.

Pattern-Based Distant Supervision for Relation Extraction Algorithm

WANG Jianan, LU Qiang

2017, 31(4): 122-131.

Distant supervision for relation extraction is an approach that extracts relations from texts automatically by aligning a database of facts with the texts. Most existing solutions are feature-based algorithms with certain defects. In this paper, we propose a pattern-based algorithm for distantly supervised relation extraction using pattern-based vectors. A kernel-based method is used in the algorithm to overcome the problems of feature-based algorithms. The experimental results show that our algorithm improves the precision of distant supervision for relation extraction.

An Improved Chinese Text Classification Algorithm Based On Multiple Feature Factors

YE Min, TANG Shiping, NIU Zhendong

2017, 31(4): 132-137.

In the framework of the vector space model (VSM), a new PCHI-PTFIDF (promoted CHI-promoted TFIDF) method based on feature selection and weight calculation is proposed. First, the factors of frequency, concentration, dispersion and location are introduced into chi-square based feature selection. Then, the TF-IDF weight is optimized by the length and location factors of text terms. The proposed method can reduce the feature dimensions while retaining features with better classification ability, and produce a better estimate of the weight distribution. The experimental results show that, compared with the algorithm using the traditional CHI and traditional TFIDF, the PCHI-PTFIDF method achieves a 10% improvement in Macro-F1 on average.
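
For readers unfamiliar with the baseline being improved, the sketch below computes the traditional chi-square (CHI) statistic for one term and one category from document counts. It shows only the standard statistic; the frequency, concentration, dispersion and location factors of PCHI are not specified in the abstract and are not reproduced here. The function name and argument names are assumptions for illustration.

```python
def chi_square(a, b, c, d):
    """Traditional CHI score of a term for a category, from document counts.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence of association
    return n * (a * d - b * c) ** 2 / denom
```

A term that appears only inside one category scores high, while a term spread evenly across categories scores zero, which is why CHI is commonly used to rank and prune features before weighting.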

A Length Distribution Constrained Text Segmentation for Paper Abstracts

LUO Junfan, CHEN Li, YU Zhonghua, DING Gejian, LUO Qian

2017, 31(4): 138-144.

To address text segmentation for academic paper abstracts, an unsupervised text segmentation algorithm is proposed, which incorporates a constraint on the length distribution derived from the preference for uniform length across the different discussion aspects (i.e. content blocks) of an abstract. A metric based on information entropy is introduced into the algorithm to measure the uniformity of the length distribution, and the objective function is designed by further combining the semantic similarities between and within content blocks. A standard dynamic programming scheme is employed to determine the best segmentation sequence. Experiments on 8603 abstracts from Medline show an improvement of 3% in accuracy compared with baselines.
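
One plausible way to measure length uniformity with information entropy is the normalized Shannon entropy of the block lengths, sketched below. This is an assumption about the general shape of such a metric, not the paper's definition; the function name `length_uniformity` and the convention for a single block are chosen here for illustration.

```python
import math


def length_uniformity(lengths):
    """Normalized Shannon entropy of block lengths, in [0, 1].

    1.0 means all blocks have equal length; values near 0 mean one block
    dominates. Zero-length blocks are ignored.
    """
    lengths = [n for n in lengths if n > 0]
    if len(lengths) < 2:
        return 1.0  # a single block is treated as trivially uniform (a convention)
    total = sum(lengths)
    entropy = -sum((n / total) * math.log(n / total) for n in lengths)
    # Dividing by log(k) rescales the entropy of k blocks to the range [0, 1].
    return entropy / math.log(len(lengths))
```

A segmentation objective could add a term like this to the similarity scores so that, other things being equal, candidate segmentations with balanced block lengths are preferred.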

Parallel K-means Algorithm for Massive Texts on Spark

LIU Peng, TENG Jiayu, DING Enjie, MENG Lei

2017, 31(4): 145-153.

Due to the sharp increase in internet texts, running k-means on such data takes considerably longer. Some classic parallel architectures, such as Hadoop, have not improved the execution efficiency of k-means, because the frequent iteration in such algorithms is hard to handle efficiently. This paper proposes a parallel k-means algorithm based on Spark. It makes full use of Spark's in-memory RDD computing model to meet the frequent iteration requirement of k-means. Experimental results show that k-means executes much more efficiently on Spark than on Hadoop on the same datasets and in the same computing environments.
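
The iteration that makes k-means costly on disk-based frameworks can be sketched in plain Python as a map step (assign each point to its nearest centroid) followed by a reduce step (average the points of each cluster). This is not Spark code and not the paper's algorithm; it only illustrates the per-iteration structure that a framework would distribute, and the function names are assumptions. On Spark, the points would typically be kept in a cached RDD so that each pass reuses data already in memory.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])))


def kmeans_step(points, centroids):
    """One k-means iteration written as a map phase and a reduce phase."""
    # Map: tag each point with the index of its nearest centroid.
    assigned = [(nearest(p, centroids), p) for p in points]
    # Reduce: accumulate per-cluster coordinate sums and point counts.
    sums, counts = {}, {}
    for idx, p in assigned:
        previous = sums.get(idx, [0.0] * len(p))
        sums[idx] = [s + x for s, x in zip(previous, p)]
        counts[idx] = counts.get(idx, 0) + 1
    # New centroid is the cluster mean; an empty cluster keeps its old centroid.
    return [[s / counts[i] for s in sums[i]] if i in sums else list(centroids[i])
            for i in range(len(centroids))]


def kmeans(points, centroids, iterations=10):
    """Run a fixed number of iterations; every pass rereads all points."""
    for _ in range(iterations):
        centroids = kmeans_step(points, centroids)
    return centroids
```

Because every iteration scans the full dataset, a system that reloads the data from disk on each pass pays that cost repeatedly, which is the inefficiency the abstract attributes to Hadoop.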

Online Event Retrieval Based on Event Graph

YANG Wenjing, QIU Yongqin, LI Sixu, LI Rui, WANG Bin

2017, 31(4): 154-164.

Online event retrieval is a retrieval task for event queries, which iteratively returns important event-related documents from mini-batch data sets in chronological order. This paper proposes an online event retrieval framework based on two kinds of graphs: an event keyword co-occurrence graph and a bipartite graph incorporating the event type. A case study and experiments on two public TREC corpora indicate that our approach improves event retrieval precision significantly (the maximum increase reaches 30% and the average reaches 5.85% in the P@10 metric).

The Impact of Various Grained Subtopics on Search Result Diversification

HU Sha, DOU Zhicheng, WEN Jirong

2017, 31(4): 165-173.

Search result diversification re-ranks search results to cover as many user intents as possible in the top ranks. Most intent-aware diversification algorithms use subtopics to diversify results. Focusing on the granularity of subtopics, this paper investigates the performance of diversification algorithms using subtopics of different granularities. Experimental results show that state-of-the-art diversification algorithms work better with fine-grained subtopics.

SCMF: A Matrix Factorization Model With Soft

MAN Tong, SHEN Huawei, HUANG Junming, CHENG Xueqi

2017, 31(4): 174-183.

Data sparsity is a challenge for recommender systems. In recent years, the integration of data from different sources has provided a promising direction for solving this issue. However, most existing methods for data integration assume that the representation of a single user/item is the same across different contexts, which prevents them from capturing the distinct characteristics of each context. In this paper, we propose a matrix factorization model with a soft constraint, in which the difference between the representations of a single user/item is minimized together with the error function of the matrix factorization model. Experiments on two datasets demonstrate that the proposed model outperforms the state-of-the-art models, especially in the case where the data is sparse in only one resource.

Evaluation of the User's Influence on Microblog

WU Hui, ZHANG Shaowu, LIN Hongfei

2017, 31(4): 184-190.

This paper investigates the evaluation of user influence on Sina microblog. Among various factors, a user is considered more influential if the user's information is disseminated faster and to a larger extent. In contrast to traditional methods, both the user's activity level and the quality of posts are taken into consideration. Treating each user as a node in the social network, the final user influence is estimated. Experiments on both a public dataset and a real dataset from Sina microblog show the validity of the method.

Social Influence Analysis Considering User Opinion

CHEN Chang, WEI Jingjing, LIAO Xiangwen, LIN Bogang, CHEN Guolong

2017, 31(4): 191-198.

Social media has become a popular platform for sharing and exchanging information. The identification of socially influential users has already been applied in many areas, including recommendation systems, expert finding and social advertising. This paper proposes a constrained tensor factorization method to identify users with high social influence. In the factorization result, the polarity allocation of influence is preserved (i.e. positive, neutral and negative influence). The method fuses the topical similarity of users through a Laplacian matrix, which constrains the tensor factorization to approximate the user influence. Experimental results demonstrate that the method outperforms OOLAM, TwitterRank and other methods in terms of ranking accuracy.

Predicting Social-Network Users' “Likes” on Other Online Media

LIU Qiang, LI Jingyuan, WANG Yuanzhuo, LIU Yue, REN Yan

2017, 31(4): 199-207.

Online media have grown enormously in the last few years, making user preference prediction an important issue for increasing user clicks. Data sparsity in both the user information and the historical behavior records degrades many well-known prediction systems. Based on data from Google users, this paper reveals that users' “likes” on online media converge. In particular, we detect a correlation between a user's “likes” on online media and the user's profile in the social network, suggesting that the social network profile can predict the user's likes on online media. Based on this correlation, we apply the user's social network description to predict the user's “likes” on online media, resulting in an improvement of more than 17% in precision compared with algorithms using only the user information from online media.

Consumption Intent Recognition Based on User Natural Annotation

FU Bo, CHEN Yiheng, SHAO Yanqiu, LIU Ting

2017, 31(4): 208-215.

Consumption intent refers to an explicit indication of an immediate or future purchase in a microblog. For example, a post like “I want to buy a mobile phone” indicates a buying intention. This paper studies the problem of identifying consumption intent in microblogs based on resources naturally annotated by users. Specifically, the proposed method recasts consumption intent recognition as a domain adaptation problem, and presents an approach that uses automatically acquired large text corpora for classification. First, we look for a set of common features that generalize across domains; then we extract high-confidence pseudo-annotated samples. Finally, we pick out useful features specific to the target domain. Experimental results show that the proposed method is effective for consumption intent recognition, achieving 69% and 77% in F-value, respectively. Moreover, all the adopted features contribute to the performance.

Comparison of User Influence Analysis Algorithms Based on Network Structure

CHEN Yiheng, LI Xueting, WANG Biao, LIU Ting

2017, 31(4): 216-222.

The analysis of user influence in social networks is a key research issue in social marketing. This paper focuses on several network-structure-based algorithms for user influence analysis and conducts a comparative study of their performance.

2017 Volume 31 Issue 4 Published: 15 August 2017