2013 Volume 27 Issue 5 Published: 15 October 2013
  

  • Review
    ZHANG Kaixu, ZHOU Changle
    2013, 27(5): 1-8.
    Large-scale unlabeled data contains abundant lexical information for NLP tasks such as Chinese word segmentation and POS tagging. This work extracts high-dimensional distributional lexical information from a large-scale unlabeled Chinese corpus; an auto-encoder then performs unsupervised dimension reduction. The learned low-dimensional representations are used as new lexical features in a joint Chinese word segmentation and POS tagging task. Experiments on the Chinese Treebank 5 corpus show that the additional lexicon features improve performance and outperform features learned by principal component analysis or the k-means algorithm.
    Key words: unsupervised feature learning; Chinese word segmentation; part-of-speech tagging
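As a rough standard-library sketch of the kind of high-dimensional distributional features involved (the feature definition, corpus, and all names below are illustrative, not the paper's): each word can be represented by counts of its left and right neighbours, which a dimension-reduction step such as an auto-encoder would then compress.

```python
from collections import Counter, defaultdict

def distributional_features(corpus, vocab):
    """Count left/right neighbouring tokens for each word of interest.

    Each word ends up with a high-dimensional sparse count vector; the
    paper applies an auto-encoder on top of features of this kind.
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for sentence in corpus:  # sentence: list of tokens
        for i, w in enumerate(sentence):
            if w not in vocab:
                continue
            if i > 0:
                left[w][sentence[i - 1]] += 1
            if i < len(sentence) - 1:
                right[w][sentence[i + 1]] += 1
    return left, right

# Toy two-sentence corpus, tracking one word.
corpus = [["我", "喜欢", "苹果"], ["我", "喜欢", "音乐"]]
left, right = distributional_features(corpus, {"喜欢"})
```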
  • Review
    LAI Siwei, XU Liheng, CHEN Yubo, LIU Kang, ZHAO Jun
    2013, 27(5): 8-15.
    Word segmentation is a fundamental task in Chinese natural language processing, and character-based statistical machine learning is currently the mainstream approach. However, conventional machine learning methods rely heavily on manually designed features, and modifying these features and verifying their effectiveness is labor-intensive. With the rapid development of neural-network-based representation learning, learning features automatically has become realistic. This paper investigates a Chinese word segmentation method based on representation learning. We first learn embedding vectors for Chinese characters from a large corpus in an unsupervised manner, and then apply them to a supervised neural-network-based Chinese word segmenter. Experimental results show that representation learning is effective for Chinese word segmentation. However, due to the limited corpus size, it cannot yet replace conventional machine learning methods based on manually designed features.
    Key words: representation learning; Chinese word segmentation
  • Review
    XU Runhua1, QU Weiguang2, CHEN Xiaohe3, WANG Dongbo4
    2013, 27(5): 15-22.
    Four-character idioms are highly productive and derivative, and the use of the four-character pattern to coin new words in modern Chinese vocabulary is still on the rise. This article focuses on the large number of four-character idioms in word-segmented corpora, analyzing and categorizing them. It then compares the segmentation of four-character idioms both within a single segmented corpus and across different segmented corpora. Finally, using the results of this segmentation comparison as training data for a CRF model, the article investigates the recognition of four-character idioms in corpora. The results show that recognition accuracy for four-character idioms exceeds 93% in both closed and open tests.
    Key words: four-character idioms; word-segmented corpora; segmentation comparison; CRF
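Character-based CRF recognition of this kind usually starts from a per-character tag encoding. The sketch below shows the standard BMES encoding such a model could be trained on; it is a generic illustration, not the paper's actual feature set or training scheme.

```python
def bmes_tags(words):
    """Map a segmented sentence (list of words) to per-character BMES
    tags: B = begin, M = middle, E = end, S = single-character word.
    A four-character idiom kept as one unit yields the pattern B M M E.
    """
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

# "一心一意" (a four-character idiom) segmented as a single word.
tags = bmes_tags(["一心一意", "地", "工作"])
```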
  • Review
    ZHANG Yan1, ZHANG Yang2, SUN Maosong1
    2013, 27(5): 22-29.
    The study of a dialect comprises its phonology, vocabulary, and grammar, and the first step is to identify the dialect vocabulary. To date, the collection of Chinese dialect words has mainly been carried out by experts, which is time-consuming and labor-intensive. With the development of information technology, people communicate widely over the network, and input method data therefore contains a vast amount of vocabulary together with geographical information, which can help discover dialect words automatically. However, there have been very few studies on exploiting input method data to investigate dialects systematically. This paper analyzes the user behavior of Chinese input methods and, on that basis, proposes to discover geographical dialect vocabulary automatically. Specifically, the paper identifies two representative features of dialect words in Chinese input method data and uses different combinations of these features to recognize dialect words. Finally, extensive experiments evaluate the impact of the feature combinations on dialect word recognition.
    Key words: dialect detection; Chinese Pinyin input method; feature combination
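One plausible feature of the kind described can be sketched with the standard library (the scoring rule and all names below are hypothetical, not the paper's): how concentrated a candidate word's input-method usage is in a single region.

```python
from collections import Counter

def region_concentration(usage):
    """Share of a word's occurrences coming from its dominant region.

    `usage` is a list of region labels, one per observed use of the
    word in input method logs. A high score suggests a geographically
    restricted (dialect) word; the threshold and the second feature
    used in the paper are not reproduced here.
    """
    counts = Counter(usage)
    total = sum(counts.values())
    region, top = counts.most_common(1)[0]
    return region, top / total

# Toy usage log: three uses from Guangdong, one from Beijing.
region, score = region_concentration(["广东", "广东", "广东", "北京"])
```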
  • Review
    LI Guangyi, WANG Houfeng
    2013, 27(5): 29-35.
    Named entity recognition and disambiguation is an important topic in natural language understanding. For the task of named entity recognition and disambiguation with a given entity knowledge base, this paper presents a method based on multi-stage clustering. First, we link documents to entity definitions in the knowledge base by two rounds of clustering. Second, we group entities that do not exist in the knowledge base by hierarchical agglomerative clustering. Finally, we recognize ordinary words and adjust the results by k-means clustering. Experiments on the data of the CLP-2012 Chinese personal name disambiguation task show that our system performs well: the F-score on the test data is 86.68%, exceeding the best result of the bake-off by 6.46%.
    Key words: named entity recognition; named entity disambiguation; clustering
  • Review
    ZAN Hongying, ZHANG Jingjie, LOU Xinpo
    2013, 27(5): 35-43.
    Functional words play an important role in modern Chinese: together with word order, they constitute the syntactic means of the language, so they have an important influence on syntactic analysis. Dependency parsing is a research hotspot in natural language processing. To improve the recognition of dependency relations, this paper applies functional word usages to the dependency relation recognition process. Through the study of functional word usages and the analysis of dependency relations in parsing, we find that coordination relations are closely connected with conjunctions. Conjunction usages are therefore incorporated into the recognition of coordination relations to improve performance. The experimental results show that, by considering conjunction usages, the LAS and UAS of coordination relations increase by 3.43% and 2.29%, respectively.
    Key words: functional word usages; dependency parsing; coordination relations
  • Review
    SHI Cui1,2, ZHOU Qiaoli1, ZHANG Guiping1
    2013, 27(5): 43-51.
    Based on a Chinese patent corpus, this paper counts and analyzes the internal and external features of coordination with overt conjunctions (COC) in Chinese patent literature. The internal features investigated include the coordination tag, the internal structure of coordination, and the distribution of part-of-speech (POS) tags. For the external features, the paper counts candidate boundary markers and analyzes the contextual information of coordinate structures in Chinese patent literature.
    Key words: COC; Chinese patent literature; internal features; external features
  • Review
    XIONG Hao1,2, LIU Qun1, LV Yajuan1
    2013, 27(5): 51-60.
    Traditional methods for semantic role labeling (SRL) generally use local features to identify and classify semantic roles, and thus have difficulty capturing labeling inconsistencies. In this paper, we propose a graphical model that reranks the results via a label propagation algorithm. Experimental results on PropBank show that our model significantly improves performance by 2.4 points in F-score and obtains the best results on this data set without any system combination techniques.
    Key words: semantic role labeling; graphical model; reranking
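A minimal sketch of label propagation over a graph of labeling decisions (the graph construction and parameters here are illustrative; the paper's actual model is not reproduced): each node's score is repeatedly smoothed toward the weighted average of its neighbours while retaining part of its initial, locally predicted score.

```python
def propagate(adj, labels, iters=50, alpha=0.8):
    """Iterative label propagation.

    adj    : node -> list of (neighbour, edge_weight)
    labels : node -> initial score from a local classifier
    alpha  : weight kept on the initial score each iteration
    """
    scores = dict(labels)
    for _ in range(iters):
        new = {}
        for node, nbrs in adj.items():
            total = sum(w for _, w in nbrs)
            avg = (sum(scores[n] * w for n, w in nbrs) / total
                   if total else 0.0)
            new[node] = alpha * labels.get(node, 0.0) + (1 - alpha) * avg
        scores = new
    return scores

# Two connected decisions with conflicting local scores.
scores = propagate({"a": [("b", 1.0)], "b": [("a", 1.0)]},
                   {"a": 1.0, "b": 0.0})
```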
  • Review
    CHEN Bo1,2, JI Donghong2, LV Chen2
    2013, 27(5): 60-67.
    Parsing special Chinese sentence patterns is one of the major tasks in Chinese information processing. Current conventional approaches to semantic parsing cannot denote the semantic relatedness between Chinese words and constituents. In this paper, we choose the serial-verb sentence as the research target, propose a semantic annotation approach based on feature structures, and study semantic annotation models for serial-verb sentences. The feature structure model provides a different approach to semantic parsing for Chinese information processing, one that can represent the complicated semantic relations among the subject, predicates, and objects of a serial-verb sentence.
    Key words: feature structure; Chinese serial-verb sentence; semantic annotation; semantic resource
  • Review
    CHEN Tao1,2, XU Ruifeng1, WU Mingfen2, LIU Bin1
    2013, 27(5): 67-75.
    Considering that opinionated sentences often share the same or similar syntactic and semantic expression frameworks, this paper proposes a sentiment analysis approach based on sentiment sentence frameworks. First, we divide sentiment sentence frameworks into three categories and 105 subcategories. A framework extraction method is designed to semi-automatically extract sentiment sentence frameworks from annotated sentiment sentences using dependency, syntactic, and synonym features. The polarity of an input sentence is then determined by classifying its sentiment sentence frameworks. Evaluations on the NLP&CC 2013 micro-blog emotion analysis corpus and the Ren-CECps blog emotion corpus show that the proposed approach achieves better precision than word-based support vector machine classifiers.
    Key words: sentence framework; sentiment classification; syntactic feature; dependency feature
  • Review
    GUO Chong1, WANG Zhenyu2
    2013, 27(5): 75-84.
    This paper defines the concept of a sentiment ontology tree, which organizes evaluation pairs and the hierarchy of product aspects for fine-grained opinion mining, and proposes a method to construct the sentiment ontology tree automatically. We focus on evaluation pair extraction, orientation prediction for evaluation pairs, and aspect aggregation. Experimental results show that our algorithm is both appropriate and efficient.
    Key words: sentiment ontology tree; evaluation pair; orientation prediction; aspect aggregation
  • Review
    LV Yunyun1, LI Yang1, WANG Suge1,2
    2013, 27(5): 84-93.
    Large-scale, high-quality domain training data is an important guarantee for constructing a high-performance classifier, but labeling a large-scale corpus in a domain is expensive. In this paper, we propose a method for identifying Chinese opinion sentences using a small labeled corpus. First, the method uses bootstrapping to expand the small labeled corpus. With the expanded corpus we then train three classifiers based on naive Bayes, support vector machines, and maximum entropy, respectively. Finally, an ensemble classifier is obtained by assigning a set of probability weights to the three trained classifiers. Experimental results indicate that the ensemble classifier is superior to each of the three single classifiers, and that the proposed method achieves results with partially labeled training data comparable to those obtained with fully labeled training data.
    Key words: opinion sentence identification; bootstrapping; ensemble classifier
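The final combination step can be sketched as a convex weighting of the three classifiers' opinion probabilities (the weights and probability values below are invented for illustration; the paper's weight-setting procedure is not reproduced).

```python
def ensemble(probs, weights):
    """Combine per-classifier probabilities of the 'opinion' class with
    a convex set of weights (non-negative, summing to one)."""
    assert all(w >= 0 for w in weights)
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(w * p for w, p in zip(weights, probs))

# Hypothetical outputs of NB, SVM, and MaxEnt on one sentence.
p = ensemble([0.9, 0.6, 0.7], [0.5, 0.3, 0.2])
label = "opinion" if p >= 0.5 else "fact"
```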
  • Review
    LI Sophia Yat Mei1, LI Shoushan1,2, HUANG Churen1, GAO Wei2
    2013, 27(5): 93-100.
    Emotion cause detection is an important task in emotion analysis research. It aims to detect the description of the cause of an emotion. In this study, we model the task as a sequence labeling problem, predicting whether each related sentence belongs to an emotion cause or not. Specifically, we apply a conditional random field (CRF) model with various features, such as basic word features, POS features, context features, and linguistic rule features. Empirical studies demonstrate that these features are effective for the task, especially the context features. Moreover, we find that the sequence labeling model is superior to a classification model when similar features are employed.
    Key words: sequence labeling; emotion cause detection; context feature; linguistic rule features
  • Review
    LI Xia1,2, LIU Jianda2
    2013, 27(5): 100-107.
    There are now a large number of Chinese learners of English in China, and the sheer quantity and difficulty of English writing assessment have become a bottleneck in English teaching and testing, so effective automatic essay scoring algorithms are in great need. In this paper, we first propose a feature selection method that can extract the writing characteristics of Chinese learners effectively and automatically. We then propose an ensemble-learning-based automatic essay scoring algorithm for unbalanced essay data. Classification results on 1,115 university CET-4 and CET-6 essays from CLEC show that our algorithm substantially improves precision, recall, and F-measure compared with classifiers designed for balanced data.
    Key words: automatic essay scoring; unbalanced data classification; multinomial naive Bayes
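For reference, a toy multinomial naive Bayes with add-one smoothing, the generic form of the classifier named in the key words (the features and data below are invented; the paper's feature selection and ensemble scheme are not reproduced).

```python
import math
from collections import Counter, defaultdict

class MultinomialNB:
    """Tiny multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = Counter(labels)          # class counts
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        best, best_lp = None, float("-inf")
        n = sum(self.prior.values())
        v = len(self.vocab)
        for c in self.classes:
            lp = math.log(self.prior[c] / n)
            total = sum(self.word_counts[c].values())
            for w in doc:
                lp += math.log((self.word_counts[c][w] + 1) / (total + v))
            if lp > best_lp:
                best, best_lp = c, lp
        return best

# Toy essay-feature documents labeled with score bands.
nb = MultinomialNB().fit(
    [["good", "essay"], ["bad", "grammar"], ["good", "structure"]],
    ["high", "low", "high"])
```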
  • Review
    LIU Liu1, LI Bin1,2, QU Weiguang3, CHEN Xiaohe1
    2013, 27(5): 107-114.
    The temporal properties of words reveal how words change across particular periods. We divide the pre-Qin era into three periods: pre-Chunqiu, Chunqiu, and Zhanguo, and focus on three kinds of words: those used only in one period, those popular in one period, and those that arose in one period. We also propose methods using a vector space model (VSM) and a naive Bayes classifier to determine the period of a text, and experiment on 25 pre-Qin texts. The naive Bayes classifier performs much better. With the same method we verify that the Liezi was not written in the pre-Qin era.
    Key words: pre-Qin words; period; VSM; naive Bayes classifier
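The VSM variant can be sketched as cosine similarity between a text's bag-of-words vector and per-period vocabulary profiles (the profiles and words below are invented toy data, not the paper's period lexicons).

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counter vectors."""
    num = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return num / (na * nb) if na and nb else 0.0

def date_text(doc, period_profiles):
    """Assign a text to the period whose vocabulary profile is closest
    in the vector space model."""
    vec = Counter(doc)
    return max(period_profiles,
               key=lambda p: cosine(vec, period_profiles[p]))

# Toy period vocabulary profiles.
profiles = {"Chunqiu": Counter(["王", "師", "公"]),
            "Zhanguo": Counter(["兵", "法", "攻"])}
period = date_text(["兵", "法", "兵"], profiles)
```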
  • Review
    KOU Wanqiu, LI Fang
    2013, 27(5): 114-122.
    Traditional topic models represent topics as probability distributions over words, which are difficult to understand and rarely express a consistent meaning. This paper proposes a topic label extraction method based on seed words. The method first extracts topic seed words according to weighting formulas, then uses a bootstrapping algorithm to generate a set of key phrases containing the seed words. Finally, it selects a topic label from the key phrase set according to the integrity and generality of each phrase. Experiments were carried out on two corpora: topic-oriented reports and event-based news reports. The results show that the method works well in extracting a meaningful phrase to represent a topic.
    Key words: topic labelling; seed word extraction; bootstrapping method
  • Review
    HONG Huan, WANG Mingwen, WAN Jianyi, LIAO Yanan
    2013, 27(5): 122-129.
    Query expansion is an effective way to improve retrieval effectiveness. Traditional query expansion methods mostly select expansion terms considering only the relevance of a single query word, without fully considering the relevance among terms, documents, and queries, which limits the effect of expansion. To solve this problem, we first construct a Markov network of the term and document subspaces to extract maximal term cliques and document cliques. We then divide the maximal term cliques into document-dependent and document-independent cliques through the mapping between term and document cliques, and build a Markov network retrieval model based on document clique dependency to perform the initial search. Next, we construct a Markov network of the query subspace from the search results and extract maximal query cliques. Finally, we calculate the probability between documents and queries iteratively and build the final multi-layer Markov network information retrieval model. Experimental results show that our model improves retrieval results.
    Key words: Markov network; query expansion; document dependency; clique; information retrieval
  • Review
    LIU Maofu1,2, LI Yan1,2, JI Donghong3
    2013, 27(5): 129-137.
    To strengthen deep semantic analysis and inference for textual entailment, this paper proposes an event-semantic-feature-based method for Chinese textual entailment recognition. The method generates event graphs from an event-labeled corpus, so that entailment recognition between text pairs is reduced to entailment recognition between event graphs. The event semantic feature is computed from the maximum common subgraph. Combined with surface statistical, lexical semantic, and syntactic features, it is used to classify textual entailment with a support vector machine, yielding a preliminary result; a correction module based on event semantic rules then refines this into the final result. The experimental results show that the event-semantic-feature-based method is effective and efficient for Chinese textual entailment recognition.
    Key words: textual entailment; event semantic feature; maximum common subgraph; support vector machine
  • Review
    ZHOU Huiwei, YANG Huan, HUANG Degen, LI Yao, LI Lishuang
    2013, 27(5): 137-144.
    Hedge scope detection distinguishes factual from uncertain information and can improve the authenticity and reliability of information extraction. It is a difficult task because of its dependence on semantic and syntactic structures. In this paper, we propose a hedge scope detection method based on syntactic structural constraints. First, two decision trees are constructed, on dependency structures and phrase structures respectively, to build the syntactic constraint set. The detection results based on this constraint set are then used as syntactic constraint features in conditional random field (CRF) models. Experiments on the CoNLL-2010 corpus achieve an F-score of 70.28% on gold-standard hedge cues, 4.22% higher than a system with common syntactic features.
    Key words: hedge scope detection; syntactic structural constraints; decision tree; conditional random fields
  • Review
    CHEN Peng1, GUO Jianyi1,2, YU Zhengtao1,2, XIAN Yantuan1,2, YAN Xin1,2, WEI Sichao1
    2013, 27(5): 144-149.
    In feature-based Chinese domain-specific entity relation extraction with kernel-based machine learning methods, different kernel functions lead to different performance. This paper proposes a convex combination kernel function method to address this problem. First, we choose lexical information, phrase syntactic information, and dependency syntactic information as features. Next, we obtain different high-dimensional matrices by mapping with different convex combination kernel functions. Finally, we select the optimal kernel by testing all classification models trained on these matrices with an SVM. We conducted relation extraction experiments on a collection of 600 tourism-domain texts; the results show that the proposed optimal convex combination kernel function effectively improves extraction performance, reaching a best F-score of 62.9.
    Key words: relation extraction; convex combination kernel function; support vector machine
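The core construction can be illustrated directly: a convex combination K = Σ λᵢ·Kᵢ of base kernel matrices, with non-negative weights summing to one, is itself a valid (positive semi-definite) kernel. The matrices and weights below are toy values, not the paper's learned combination.

```python
def convex_combine(kernels, lambdas):
    """Convex combination of square kernel matrices (lists of lists):
    all weights non-negative and summing to one."""
    assert all(l >= 0 for l in lambdas)
    assert abs(sum(lambdas) - 1.0) < 1e-9
    n = len(kernels[0])
    return [[sum(l * K[i][j] for l, K in zip(lambdas, kernels))
             for j in range(n)] for i in range(n)]

K_lex = [[1.0, 0.2], [0.2, 1.0]]  # hypothetical lexical-feature kernel
K_syn = [[1.0, 0.6], [0.6, 1.0]]  # hypothetical syntactic-feature kernel
K = convex_combine([K_lex, K_syn], [0.5, 0.5])
```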
  • Review
    SHI Bei, SUN Le, HAN Xianpei
    2013, 27(5): 149-156.
    An alias of an entity is a different name that refers to the same entity. Traditional alias extraction methods often face two problems: 1) the difficulty of constructing a training corpus; 2) a lack of timeliness. To resolve these problems, this paper proposes a graph-based alias extraction method using query logs. The method uses context information and query-link information, constructs a two-layer graph (comprising a candidate alias layer and a query-link layer), and ranks the aliases using a random walk algorithm. The experimental results show that: 1) our method achieves an accuracy of 71.8%, demonstrating its effectiveness; 2) using query-link information outperforms using context information, and combining the two types of information further improves performance.
    Key words: query log; alias extraction
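The ranking step can be sketched as a random walk with restart over a small two-layer graph (the graph, restart vector, and damping factor below are illustrative, not the paper's actual construction).

```python
def random_walk(adj, restart, alpha=0.85, iters=100):
    """Random walk with restart by power iteration.

    adj     : node -> list of out-neighbours
    restart : node -> restart probability mass
    Mass flows along out-edges; (1 - alpha) of the restart mass is
    re-injected each step. Higher stationary score = better candidate.
    """
    nodes = list(adj)
    score = {n: restart.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - alpha) * restart.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = adj[n]
            if not out:
                continue
            share = alpha * score[n] / len(out)
            for m in out:
                new[m] += share
        score = new
    return score

# Toy two-layer graph: query-link nodes feeding candidate aliases.
adj = {"q1": ["alias1"], "q2": ["alias1", "alias2"],
       "alias1": [], "alias2": []}
score = random_walk(adj, {"q1": 0.5, "q2": 0.5})
```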
  • Review
    ZHAO Jiandong, GAO Guanglai, BAO Feilong
    2013, 27(5): 156-160.
    Research on Mongolian machine translation, syntactic analysis, and semantic analysis is restricted by the scarcity of research on Mongolian automatic part-of-speech (POS) tagging. In view of this, we propose a history-based Mongolian automatic POS tagging method that incorporates a lookahead mechanism into the decision-making process. Experimental results show that the POS tagging accuracy on Mongolian unknown words, known words, and all words is 71.276 6%, 99.148 2%, and 95.301 0%, respectively, demonstrating that our method is well suited to Mongolian automatic POS tagging.
    Key words: history models; learning with lookahead; Mongolian; automatic POS tagging
  • Review
    YU Hongzhi1, LI Yachao1, WANG Kun2, TASHI Lengben1
    2013, 27(5): 160-166.
    Tibetan part-of-speech (POS) tagging is an important problem for Tibetan natural language processing. Based on an analysis of Tibetan scripts and statistical results, this paper studies the fusion of morphological features for Tibetan POS tagging with a maximum entropy model and defines the feature templates. Experimental results show that Tibetan POS tagging with maximum entropy achieves much better results: syllable features increase performance significantly, obtaining an error reduction of 6.4% compared to the baseline.
    Key words: Tibetan; part of speech; maximum entropy; morphological features
  • Review
    HUA Quecairang1,3, JIANG Wenbing2, ZHAO Haixing1, LIU Qun2
    2013, 27(5): 166-173.
    Following dependency syntax theory, this paper presents Tibetan typed dependencies and their hierarchy, and analyzes problems in building a Tibetan dependency treebank. We propose a semi-automatic mode for constructing dependency trees, comprising a word-pair dependency classification model and a dependency edge annotation model with rich feature templates based on Tibetan grammar. We also implemented a visual tool used to build and proofread a treebank of 11 thousand sentences. Experimental results on the baseline system show that dependency recognition accuracy improves by 3%.
    Key words: Tibetan dependency syntax; word-pair dependency classification; Tibetan treebank; Tibetan dependency annotation tool
  • Review
    MI Chenggang1,2, YANG Yating1, ZHOU Xi1, LI Xiao1, YANG Mingzhong3
    2013, 27(5): 173-179.
    There are many out-of-vocabulary words in Uyghur-Chinese machine translation, a large part of which are loan words (including person names, place names, etc.). This paper presents a novel method for recognizing Chinese loan words in Uyghur based on the observation that a loan word is pronounced similarly to its original word. The method first trains on an existing corpus to obtain Uyghur Latinization rules for recognizing Chinese loan words; it then Latinizes Uyghur words according to these rules and Romanizes Chinese words, transforming sound similarity into string similarity, which is easy to quantify. Three models are proposed: a position-related minimum edit distance model, a weighted common subsequence model, and a parameterized fusion of the two. The experimental results show that the fusion model, which considers both global and local string similarity, obtains the best recognition results.
    Key words: loan words; out-of-vocabulary words; pronunciation similarity; string similarity
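The abstract does not specify the exact position weighting; the sketch below assumes a simple linear decay in which edits near the beginning of a word cost more, matching the intuition that transliterations tend to agree on initial sounds. All details of the weighting are assumptions, not the paper's model.

```python
def weighted_edit_distance(a, b, pos_weight=True):
    """Levenshtein DP where edits at earlier positions cost more.

    With pos_weight=False this reduces to plain edit distance with
    unit costs.
    """
    m, n = len(a), len(b)

    def cost(i):  # assumed linear decay: position 0 costs the most
        return 2.0 - i / max(m, n, 1) if pos_weight else 1.0

    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + cost(i - 1)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + cost(j - 1)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else cost(min(i, j) - 1)
            d[i][j] = min(d[i - 1][j] + cost(i - 1),      # deletion
                          d[i][j - 1] + cost(j - 1),      # insertion
                          d[i - 1][j - 1] + sub)          # substitution
    return d[m][n]
```

With this weighting, a mismatch at the start of a Latinized form is penalized more heavily than the same mismatch at the end.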
  • Review
    WANG Haibo1,3, ZU Yiqing2, LITIFU·Tuohuti3
    2013, 27(5): 179-184.
    As a typical agglutinative language, Uyghur has rich suffixes expressing syntax and mood. This paper contrasts two POS tagging methods in Uyghur language processing: one based on word stems, the other based on suffixes. We compute the number, frequency, and coverage of common functional suffix strings in a large corpus in order to judge the feasibility of suffix-string-based POS tagging. We define suffix POS tagging rules based on the theory of Prof. Litip Tohti and label a corpus according to this definition, which is useful not only for Uyghur but also for other Turkic languages with similar suffixes.
    Key words: Uyghur; suffix strings; POS tagging
  • Review
    SU Chen1, ZHANG Yujie1, GUO Zhen1, XU Jin’an1
    2013, 27(5): 184-191.
    In developing a domain-specific Chinese-English machine translation system, the accuracy of Chinese word segmentation on large-scale training corpora often decreases because of unknown words, and the lack of domain-specific annotated corpora prevents supervised learning approaches from adapting. This causes many errors in translation knowledge extraction and therefore seriously affects translation quality. To address this domain adaptation problem, we implement Chinese word segmentation exploiting n-gram statistical features of raw corpora and bilingually motivated segmentation information from parallel corpora, respectively. We further propose a lattice-based method to combine the multiple segmentation results and use a dynamic programming algorithm to obtain the best segmentation. For evaluation, we conducted Chinese word segmentation and Chinese-English machine translation experiments on the data of the NTCIR-10 Chinese-English patent task. The results show that the proposed method improves both the F-measure of Chinese word segmentation and the BLEU score of the Chinese-English statistical machine translation system.
    Key words: Chinese word segmentation; domain adaptation; bilingual motivation; lattice; machine translation
  • Review
    HU Yanan, SHU Jiagen, QIAN Longhua, ZHU Qiaoming
    2013, 27(5): 191-198.
    The scale of the training corpus plays an important role in machine-learning-based semantic relation extraction between named entities; however, corpus annotation is time-consuming and labor-intensive. So that a resource-rich language can help a resource-poor language in semantic relation extraction, we propose an approach that translates relation instances from the source language to the target language via machine translation and then adds them to the target language's training corpus by way of entity alignment. Experiments on the ACE 2005 Chinese and English corpora show that Chinese and English can help each other in relation extraction, and that this help is particularly significant when the target language's training corpus is small.
    Key words: cross-lingual relation extraction; machine translation; entity alignment
  • Review
    CHEN Lei, LI Miao, ZHANG Jian, ZENG Weihui
    2013, 27(5): 198-205.
    Reordering models are significant for reducing word order differences between language pairs in statistical machine translation. Most reordering approaches place high demands on the scale of the parallel corpus. Chinese minority language resources are scarce and difficult to grow substantially in a short time, so current reordering approaches perform poorly in translation between Chinese and minority languages. After analyzing related studies, this paper proposes a source-side reordering method based on a small parallel corpus. Drawing on linguistic knowledge, we analyzed both the corpus and the translations to obtain the verb phrases that evidently affect the word order of translations. We then studied reordering rules for these verb phrases, including manually written rules and automatically extracted rules. Experiments show that our method can improve the performance of state-of-the-art phrase-based translation models.
    Key words: statistical machine translation; reordering; verb phrase; small parallel corpus