2013 Volume 27 Issue 6 Published: 16 December 2013
  

  • Review
    CAO Ziqiang, LI Sujian
    2013, 27(6): 1-6.
    This paper explores Chinese word segmentation without training data, which lays the foundation for language-independent word segmentation systems. Mutual information and the hierarchical Dirichlet process (HDP) are both widely used for unsupervised segmentation (the mutual-information component is sketched below). We combine the two models and improve the sampling algorithm. Excluding punctuation, the F-scores on two test corpora of different sizes are 0.693 and 0.741, which are 5.8% and 3.9% higher than the HDP baseline, respectively. Finally, the model is applied to semi-supervised word segmentation, where its F-score exceeds that of a standard supervised CRF model by 2.6%.
    Key words: HDP; mutual information; unsupervised word segmentation
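    As a rough illustration of the mutual-information component above (the HDP sampler is omitted), a word boundary can be hypothesized wherever the pointwise mutual information (PMI) of two adjacent characters is low. A minimal Python sketch, assuming only a list of raw sentences as input; the function names and threshold are illustrative:

        import math
        from collections import Counter

        def build_segmenter(corpus_lines, threshold=0.0):
            """Insert a boundary between adjacent characters whose PMI falls
            below the threshold (low PMI suggests a word break)."""
            unigrams, bigrams = Counter(), Counter()
            for line in corpus_lines:
                unigrams.update(line)
                bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
            n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

            def pmi(a, b):
                p_ab = bigrams[a + b] / n_bi if bigrams[a + b] else 1e-12
                return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

            def segment(line):
                words, start = [], 0
                for i in range(len(line) - 1):
                    if pmi(line[i], line[i + 1]) < threshold:
                        words.append(line[start:i + 1])
                        start = i + 1
                words.append(line[start:])
                return words

            return segment

        seg = build_segmenter(["这是一个例子", "这是另一个例子"])
        print(seg("这是一个例子"))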
  • Review
    LAU Kam tang1,2, SONG Yan1, XIA Fei3
    2013, 27(6): 6-16.
    In this paper, we present a segmented and part-of-speech (POS) tagged Archaic Chinese corpus along with its construction process, in which automatic segmentation and tagging are followed by manual correction as post-processing. We use both Modern and Archaic Chinese labeled data to train the word segmenter and POS tagger, which are further improved by domain adaptation techniques (one common technique is sketched below) and by adding linguistic and morphological features derived from the characteristics of Archaic Chinese. The experimental results show the effectiveness of our approach; in particular, the domain adaptation techniques and the added features significantly improve POS tagging performance. During manual correction, we categorize the errors resulting from automatic segmentation and POS tagging and investigate their sources. Finally, we give statistics of the resulting corpus on the distributions of words and POS tags. Our work is a preliminary study that can easily be extended to annotating other Archaic Chinese texts, and the resulting corpus is a valuable resource for research on Archaic Chinese.
    Key words: Archaic Chinese corpus; word segmentation; part-of-speech tagging; domain adaptation
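    The abstract does not name the domain adaptation techniques used. As one commonly used possibility (an assumption, not the authors' confirmed method), feature augmentation keeps a shared and a domain-specific copy of every feature, letting the learner separate general weights from Archaic-Chinese-specific ones:

        # Feature augmentation (Daume III, 2007) - a guess at one applicable
        # technique, not necessarily what this paper used.
        def augment(features, domain):
            """Map each feature to a shared copy plus a domain-specific copy."""
            out = {}
            for name, value in features.items():
                out["shared=" + name] = value        # weight shared across domains
                out[domain + "=" + name] = value     # weight private to this domain
            return out

        # e.g. a character feature from Modern (source) vs. Archaic (target) data:
        print(augment({"cur_char=之": 1.0}, "archaic"))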
  • Review
    QIAN Xiaofei1, HOU Min2
    2013, 27(6): 16-23.
    This paper proposes a classifier ensemble method based on linguistic knowledge assessment. It fuses the maximal noun phrase (MNP) recognition results of SVMs and of cascaded CRFs built on a reduction method, using automatically acquired collocations and manually crafted assessment rules, and then applies deterministic rules to the structures each classifier is prone to mislabel. The method improves the recognition of boundary ambiguities involving consecutive verbs and prepositions as well as consecutive nouns. The experiments achieve a precision of 89.30% and a recall of 89.62%; in particular, the F1-score on multi-word MNPs improves by 0.75% over the reduction method alone.
    Key words: maximal noun phrase recognition; linguistic knowledge assessment; classifier ensemble; rules
  • Review
    YUAN Yulin
    2013, 27(6): 23-31.
    This paper discusses the construction of a practical Chinese semantic knowledge system and a corresponding database for computing meaning in Chinese text. A four-step working procedure is proposed: (1) under the principles of Generative Lexicon Theory and Argument Structure Theory, the qualia structure of nouns and the argument structure of verbs and adjectives are described, including both the set of qualia roles or semantic roles and the syntactic constructions the nouns, verbs, and adjectives enter into; (2) the semantic orientation and sentiment polarity of the nouns, verbs, and adjectives are indicated along a 5-point scale; (3) the inference relations among the qualia roles and semantic roles of related nouns, verbs, and adjectives are identified, yielding a lexical network; (4) the entity reference, conceptual relations, and sentiment polarity of the words are then integrated into a multi-level semantic knowledge database (a toy record format is sketched below). Finally, a case study of computing meaning with the help of this multi-level semantic knowledge is presented.
    Key words: semantic description system; semantic knowledge database; qualia structure; argument structure; sentiment polarity; semantic correlation
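    As a purely hypothetical illustration of what one noun record in such a database might look like (the field names are invented, not the authors' schema):

        from dataclasses import dataclass, field

        @dataclass
        class NounEntry:
            lemma: str
            formal: list = field(default_factory=list)        # what kind of thing it is
            constitutive: list = field(default_factory=list)  # what it is made of
            telic: list = field(default_factory=list)         # what it is for
            agentive: list = field(default_factory=list)      # how it comes about
            sentiment: int = 0  # 5-point scale, e.g. -2 (very negative) .. +2 (very positive)

        book = NounEntry(lemma="书", formal=["artifact"], constitutive=["paper", "text"],
                         telic=["read"], agentive=["write", "publish"])
        print(book.telic)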
  • Review
    WAN Fuqiang, WU Yunfang
    2013, 27(6): 31-38.
    Lexical semantic relatedness plays an important role in natural language processing tasks such as information retrieval, word sense disambiguation, automatic text summarization, and spelling correction. In this paper, we employ Wikipedia-based Explicit Semantic Analysis (ESA) to compute semantic relatedness between Chinese words. Based on Chinese Wikipedia, a word is represented as a weighted vector of concepts, so computing the semantic relatedness of two words amounts to comparing their concept vectors (see the sketch below). Furthermore, we add a prior probability factor for each concept and use the linking information among Wikipedia pages to optimize the concept vectors. The experimental results show that the Spearman's rank correlation coefficient between the computed relatedness and human judgments reaches 0.52, significantly outperforming the baseline.
    Key words: semantic relatedness; explicit semantic analysis; Chinese Wikipedia; prior probability; concept vectors
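    A minimal sketch of the ESA comparison step, with tiny hand-made concept vectors standing in for the real TF-IDF-weighted vectors built from Chinese Wikipedia:

        import math

        def cosine(u, v):
            """Cosine of two sparse concept vectors {concept: weight}."""
            dot = sum(w * v.get(c, 0.0) for c, w in u.items())
            nu = math.sqrt(sum(w * w for w in u.values()))
            nv = math.sqrt(sum(w * w for w in v.values()))
            return dot / (nu * nv) if nu and nv else 0.0

        # Toy concept vectors, as an inverted concept index would supply:
        v_bank = {"金融": 0.8, "银行": 0.9, "河流": 0.1}
        v_money = {"金融": 0.7, "货币": 0.9}
        print(cosine(v_bank, v_money))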
  • Review
    CAO Yuan, ZHU Qiaoming, LI Peifeng
    2013, 27(6): 38-45.
    The factuality of an event is the degree of certainty with which the event can be regarded as a fact. In context, this attribute is expressed by specific sentence structures and vocabulary. In this paper, we study the factors that influence Chinese event factuality, then present five kinds of factuality-related event information together with their annotation rules. Finally, we annotate the Movement events in the ACE 2005 Chinese corpus and analyze the results, laying a foundation for many information extraction applications.
    Key words: factuality; corpus; annotation
  • Review
    CHI Zhejie1,2, ZHANG Quan2
    2013, 27(6): 45-51.
    In the Hierarchical Network of Concepts (HNC) theory, domain is one of the main factors of the sentence group unit, and domain determination is an important issue in sentence group extraction. To determine the domain, we propose a method using domain concepts and concept association expressions that counts frequencies, merges concepts, and summarizes concepts in the concept primitive space (the counting core is sketched below). For the politics, economics, and military domains, the experimental results show high performance: the F1 scores reach 90.61%, 90.83%, and 90.99% respectively, which are 7.7%, 12.76%, and 5.01% higher than the results without concept association expressions. The concept-primitive-based method also outperforms a keyword-based method.
    Key words: concept primitives; concept association expressions; domain determination
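    A minimal sketch of the frequency-counting core, with invented primitive inventories standing in for the HNC resources (the merging and summarization steps are omitted):

        from collections import Counter

        DOMAIN_PRIMITIVES = {            # hypothetical primitive inventories
            "politics": {"govern", "state", "policy"},
            "economics": {"trade", "money", "market"},
            "military": {"army", "weapon", "war"},
        }

        def determine_domain(primitives_of_text):
            """Pick the domain whose primitives occur most often in the text."""
            counts = Counter(primitives_of_text)
            scores = {d: sum(counts[p] for p in ps)
                      for d, ps in DOMAIN_PRIMITIVES.items()}
            return max(scores, key=scores.get)

        print(determine_domain(["trade", "market", "money", "state", "trade"]))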
  • Review
    ZHANG Muyu, SONG Yuan, QIN Bing, LIU Ting
    2013, 27(6): 51-58.
    Discourse relation recognition is an important part of discourse analysis. This paper focuses on Chinese discourse relation recognition, covering both explicit and implicit discourse relations. For explicit discourse relation recognition, we propose a statistical method based on discourse connective rules that achieves rather good results. For implicit discourse relation recognition, we combine lexical, syntactic, and semantic features in a supervised model to classify implicit relations. The detailed analysis and experimental results are informative and provide a baseline for future work on this task.
    Key words: Chinese discourse semantic analysis; explicit discourse relation recognition; implicit discourse relation recognition
  • Review
    XIONG Hao1,2, LIU Qun1, LV Yajuan1
    2013, 27(6): 58-69.
    Semantic role labeling (SRL) and coreference resolution (CR) play important roles in natural language processing applications. In this paper, we propose eight rules to jointly learn and infer the two tasks using a Markov logic network. Experimental results on OntoNotes 5.0 show that joint learning with the Markov logic network significantly improves the F-score by 1.6 points on both SRL and CR over the single-task systems.
    Key words: semantic role labeling; coreference resolution; Markov logic network
  • Review
    XIAO Shan1, GUO Tingting2
    2013, 27(6): 69-75.
    Natural language processing (NLP) is one of the most important research areas of artificial intelligence, and the construction of a word semantic knowledge base (WSKB) is an important prerequisite for progress in NLP. Existing synset-based word networks, both in China and abroad, suffer from several problems: loose structure, coarse semantic granularity, a limited range of applications, and so on. Building a multi-dimensional word net based on detailed descriptions of concept features may solve these problems. This type of WSKB uses the synset-lexeme anamorphosis method to analyze the relationships and distinctive features between a basic lexeme and its conceptual variants. Taking interactive speech act verbs as an example, and based on characteristic sense analysis, the paper makes a preliminary exploration of the description of lexical meaning structure and the rules for constructing synsets.
    Key words: synset; evaluated speech act verbs; synset-lexeme anamorphosis method
  • Review
    LI Shoushan1,2, LEE Sophia Yat Mei2, HUANG Chu-Ren2, SU Yan1
    2013, 27(6): 75-82.
    Sentiment analysis has become a hot research topic in natural language processing (NLP), as it is highly valuable for many practical applications and theoretical studies. One basic task in sentiment analysis, the construction of a sentiment lexicon, aims to classify each word as positive, neutral, or negative according to its sentiment orientation. There are two major challenges: 1) Chinese words are highly ambiguous, which makes it hard to compute a word's sentiment orientation; 2) compared with English sentiment analysis, for which several corpora and lexicons exist, resources for constructing Chinese sentiment lexicons remain scarce. In this study, we first use a machine translation system to exploit bilingual (English and Chinese) resources, and then obtain the sentiment orientation of Chinese words with a label propagation algorithm (see the sketch below). Experimental results across four domains demonstrate that the lexicon generated by our approach reaches excellent precision and covers domain information effectively.
    Key words: sentiment analysis; bilingual; sentiment lexicon; label propagation algorithm
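    A minimal sketch of label propagation over a word graph, assuming the graph (e.g., from translation links or co-occurrence) and the seed polarities are already built; all words and weights are toy values:

        def propagate(neighbors, seeds, iterations=20):
            """neighbors: {word: {word: weight}}; seeds: {word: +1.0 or -1.0}.
            Seed scores are clamped; other words average their neighbors."""
            score = {w: seeds.get(w, 0.0) for w in neighbors}
            for _ in range(iterations):
                for w in score:
                    if w in seeds:
                        continue
                    total = sum(neighbors[w].values())
                    if total:
                        score[w] = sum(wt * score[v]
                                       for v, wt in neighbors[w].items()) / total
            return score

        g = {"好": {"优秀": 1.0}, "优秀": {"好": 1.0, "差": 0.2}, "差": {"优秀": 0.2}}
        print(propagate(g, {"好": 1.0, "差": -1.0}))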
  • Review
    XU Ruifeng, ZOU Chengtian, ZHENG Yanzhen, XU Jun, GUI Lin, LIU Bin, WANG Xiaolong
    2013, 27(6): 82-90.
    Existing emotion dictionaries usually annotate the categories and strength of emotion words but lack the capacity to distinguish emotional expression from emotional cognition; meanwhile, annotating word entries directly leads to emotion annotation ambiguities caused by word sense ambiguities. Based on an analysis of how individual emotions arise and migrate, this paper proposes a text emotion computing framework based on a "cognitive stimulation - reflective expression" mechanism. Under this framework, we explore a construction strategy for a new emotion dictionary based on an analysis of the functions and characteristics of emotion words. First, we use the part-of-speech and word sense information provided by HowNet to split each word into multiple entries corresponding to its different parts of speech and senses, reducing annotation ambiguity. Second, for each entry we distinguish emotion expression categories from emotion cognition categories, annotating the emotion categories and their strength values from these two aspects separately, with fine-grained types for both expression and cognition. Finally, a preliminary emotion dictionary of this new type is constructed, with a clear framework, rich emotional knowledge, and low ambiguity.
    Key words: emotion dictionary; emotion cognition; emotion expression; word sense
  • Review
    LI Shoushan1,2, LEE Sophia Yat Mei2, LIU Huanhuan1, HUANG Chu-Ren2
    2013, 27(6): 90-96.
    Emotion classification, a basic task in emotion analysis, has been a hot research issue in the natural language processing community. Previous studies often leverage emotion keywords (e.g., happy, sad) for emotion classification, but some texts contain no emotion keywords and still express emotions; we refer to this as implicit emotion expression. In this paper, we focus on classifying implicit emotion expressions and propose a classification method based on related events, on the view that related events are important indicators of emotion categories. First, we collect sentence groups that contain emotion keywords; then we delete the keywords and regard the remaining context as describing the emotion-related events; finally, we use that context as the feature source for emotion classification (the data-construction step is sketched below). Empirical studies demonstrate that using the context yields good performance for implicit emotion classification, providing a solid basis for further studies.
    Key words: emotion-related events; emotion classification; emotion keywords
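    A minimal sketch of the data-construction step, with a toy keyword lexicon and pre-tokenized input; a sentence containing an emotion keyword yields a (label, context) training item, and the classifier trained on such contexts then labels genuinely keyword-free text:

        EMOTION_LEXICON = {"高兴": "happy", "难过": "sad"}   # toy keyword -> category

        def make_training_item(tokens):
            """Return (emotion label, context features) or None."""
            for i, tok in enumerate(tokens):
                if tok in EMOTION_LEXICON:
                    context = tokens[:i] + tokens[i + 1:]   # drop the keyword itself
                    return EMOTION_LEXICON[tok], context
            return None

        print(make_training_item(["考试", "通过", "了", "，", "很", "高兴"]))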
  • Review
    WANG Zhihao, WANG Zhongqing, LI Shoushan, LI Peifeng, SHI Hanxiao
    2013, 27(6): 96-103.
    Feature selection aims to reduce a high-dimensional feature space so as to simplify the problem and improve the learning method, and existing studies have shown it to be effective in sentiment classification. Unlike previous studies, we investigate feature selection for semi-supervised sentiment classification and propose a novel method based on a bipartite graph. First, we model the relations between documents and words with a bipartite graph. Then, using a small amount of labeled data and the graph, a label propagation algorithm computes the probability of each feature belonging to each sentiment category. Finally, features are selected according to these sentiment probabilities. Experimental results across multiple domains demonstrate that our method performs much better than random feature selection and significantly reduces the dimensionality of the feature vector without any loss in classification performance.
    Key words: sentiment classification; semi-supervised learning; label propagation; bipartite graph; feature selection
  • Review
    HOU Min, TENG Yonglin, CHEN Yuqi
    2013, 27(6): 103-110.
    Opinion phrases, as one kind of opinion factor, are an important aspect of Chinese orientation analysis. They can be classified into five types: "opinion word + opinion word", "modifier + opinion word", "non-opinion word + opinion word", "modifier + non-opinion word", and "non-opinion word + non-opinion word". For each type, a different orientation analysis strategy is applied, combining phrase rules with an opinion phrase lexicon (a toy composition rule is sketched below). Phrase rules are organized into specific rules and common rules, and the opinion phrase lexicon is built under the rule of minimum opinion factors. Experiments show that applying the phrase rules and the opinion phrase lexicon effectively improves the precision of orientation analysis.
    Key words: opinion phrases; sentiment analysis; opinion phrase lexicon; opinion phrase rules; rule of minimum opinion factors
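    A toy sketch of one composition rule for the "modifier + opinion word" type, with tiny invented lexicons: negators flip the opinion word's polarity, intensifiers scale it:

        OPINION = {"好": 1.0, "差": -1.0}
        NEGATORS = {"不", "没"}
        INTENSIFIERS = {"很": 1.5, "非常": 2.0}

        def phrase_orientation(modifier, opinion_word):
            """Polarity score of a two-word phrase; the sign gives the orientation."""
            score = OPINION.get(opinion_word, 0.0)
            if modifier in NEGATORS:
                return -score
            return INTENSIFIERS.get(modifier, 1.0) * score

        print(phrase_orientation("不", "好"))    # -1.0
        print(phrase_orientation("非常", "好"))  # 2.0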
  • Review
    ZHANG Chen, FENG Chong, LIU Quanchao, SHI Chao, HUANG Heyan, ZHOU Haiyun
    2013, 27(6): 110-117.
    Opinions carry important information in texts, and comparative sentences are a common way to express opinions. This paper describes how to recognize comparative sentences in Chinese documents by combining rule-based and statistical methods, and analyzes the performance of these methods. The method first normalizes the corpus and its segmentation results, then obtains broad-coverage candidates using a lexicon-based method together with sentence structure and dependency analysis. A CSR rule extraction algorithm is then designed to extract dependency relations, and a CRF model is used to identify entities and semantic roles. Finally, using an SVM classifier and testing different feature dimensions, the paper finds the optimal feature combination for accurate extraction.
    Key words: comparative sentence; rule; CRF; SVM
  • Review
    HENG Wei, YU Jia, LI Lei, LIU Yongbin
    2013, 27(6): 117-128.
    The effectiveness of hLDA (hierarchical Latent Dirichlet Allocation) for hierarchical topic modeling has been widely validated. To achieve semi-supervised or unsupervised learning, cross-validation or hyperparameter sampling is usually used to determine the actual parameters. However, corpus features, modeling demands, and other factors are uncertain, so parameter adjustment, modeling effectiveness, and efficiency are difficult to achieve in practical applications. This paper builds a unified analytical framework combining Bayesian theory and boundary information, analyzes the key factors in hierarchical topic modeling, gives a series of practical and effective modeling strategies and processes, and finally evaluates the modeling results on the multi-document summarization corpus from ACL MultiLing 2013.
    Key words: hierarchical LDA; hierarchical topic modeling; unified analytical framework
  • Review
    WANG Maolin1, ZI Guangling1, XIONG Wei1, LIN Maocan2
    2013, 27(6): 128-134.
    In this paper, pitch declination in sentences is investigated based on a telephone conversation corpus. It is found that pitch declination occurs in most cases, which has a physiological cause and also serves a demarcative function. In some cases declination does not occur, which is related to semantic strength, focus, and tone. The pitch patterns of statements and questions are also analyzed: compared with statements, questions have a greater pitch range, and the pitch drop between the final two syllables is smallest for yes-no questions without a final particle.
    Key words: spontaneous speech; pitch; declination
  • Review
    FU Xiaoyin, WEI Wei, LU Shixiang, XU Bo
    2013, 27(6): 134-139.
    This paper proposes an effective method for filtering and optimizing the hierarchical phrase-based (HPB) model. After obtaining the original HPB rules with the traditional training method, we generate, by forced alignment, the bilingual derivation trees that represent the source and target sentences, and then extract HPB rules from these derivation trees. Finally, we re-estimate the probabilities of the HPB rules from the extracted rules. The method needs no linguistic knowledge and is suitable for large-scale training corpora. In large-scale Chinese-English translation tasks, it filters out about 50% of the original HPB rules and improves translation performance by 0.8 to 1.2 BLEU points on the test sets compared with the traditional training method.
    Key words: statistical machine translation; hierarchical phrase-based model; forced alignment; model training
  • Review
    YIN Yue, ZHANG Yujie, XU Jinan
    2013, 27(6): 139-144.
    In statistical machine translation systems, the automatically extracted phrase table inevitably contains a large number of erroneous and redundant phrase pairs, which wastes time and space in decoding and affects translation quality. To solve this problem, we propose a phrase table filtering method that introduces virtual contexts to calculate the increment in a phrase pair's language model score. Considering the maximum and minimum increments over the virtual contexts, we design a filtering strategy that re-ranks phrase pairs (see the sketch below). We conducted experiments on the NTCIR-9 Chinese-English data to verify the method. The results show that when the phrase table was reduced to 47% of its original size, translation quality improved slightly; when it was reduced to 30%, only a slight decline occurred. This indicates that the method effectively filters out redundant phrase pairs.
    Key words: phrase-based statistical machine translation; phrase table filtering; virtual context
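    A minimal sketch of the re-rank-and-prune step, assuming each phrase pair already carries a score derived from its virtual-context language model increments; pruning per source phrase and the 47% ratio are illustrative choices, not necessarily the paper's exact strategy:

        from collections import defaultdict

        def filter_phrase_table(entries, keep_ratio=0.47):
            """entries: [(src, tgt, score)]; keep the best keep_ratio per source."""
            by_src = defaultdict(list)
            for src, tgt, score in entries:
                by_src[src].append((score, tgt))
            kept = []
            for src, cands in by_src.items():
                cands.sort(reverse=True)
                n = max(1, int(len(cands) * keep_ratio))
                kept.extend((src, tgt, s) for s, tgt in cands[:n])
            return kept

        table = [("红 苹果", "red apple", 2.3), ("红 苹果", "red apples", 1.9),
                 ("红 苹果", "crimson fruit", 0.4)]
        print(filter_phrase_table(table))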
  • Review
    WANG Xing1, TU Zhaopeng2,3, XIE Jun2, LV Yajuan2, YAO Jianmin1
    2013, 27(6): 144-151.
    A large-scale bilingual corpus is a fundamental resource for building a high-quality statistical machine translation system. However, corpora usually contain considerable noise, which affects translation performance, so filtering noisy sentence pairs is essential. In this paper, we propose a classification-based selection approach to distinguish high-quality bilingual sentences from noisy ones. We first exploit several metrics to find the best and worst sentence pairs in the corpus, then train a classifier with richer features on these pairs and use it to classify the remaining ones. Experimental results show that our approach not only eliminates 40% of the less promising sentence pairs but also significantly improves translation performance, by 0.87 BLEU points over using all sentences.
    Key words: statistical machine translation; bilingual corpus selection
  • Review
    LI Li, LIU Zhiyuan, SUN Maosong
    2013, 27(6): 151-158.
    Automatically extracting phrase-level paraphrases is an important research task in natural language processing (NLP), with applications such as information retrieval, question answering, and document classification. Technical patents, as an important carrier of human knowledge and technology, contain abundant information, so automatically extracting phrase-level paraphrases from Chinese-English parallel patents benefits NLP tasks in technical domains. In this paper, we extract phrase-level paraphrases from Chinese-English parallel patents with a method based on statistical machine translation and use chunk parsing for paraphrase verification. Moreover, to address errors caused by translation ambiguity and poor word alignment, we re-rank the extracted paraphrases by distributional similarity (see the sketch below). In experiments, the SMT-based method achieves a precision of 43.20% on Chinese patents and 43.60% on English patents for the top 500 results; after verification with chunk parsing, the precisions rise to 75.50% and 52.40%, respectively, and the distributional-similarity re-ranking further improves performance significantly.
    Key words: phrase-level paraphrase; statistical machine translation; chunk parsing; distributional similarity
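    A minimal sketch of the distributional-similarity re-ranking, assuming sparse context vectors for each phrase have been built from monolingual text; multiplying the SMT score by the similarity is an illustrative combination:

        import math

        def dist_sim(ctx_a, ctx_b):
            """Cosine similarity of two sparse context vectors {feature: weight}."""
            dot = sum(w * ctx_b.get(f, 0.0) for f, w in ctx_a.items())
            na = math.sqrt(sum(w * w for w in ctx_a.values()))
            nb = math.sqrt(sum(w * w for w in ctx_b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def rerank(candidates, ctx):
            """candidates: [(phrase_a, phrase_b, smt_score)] -> best first."""
            scored = [(a, b, s * dist_sim(ctx[a], ctx[b])) for a, b, s in candidates]
            return sorted(scored, key=lambda t: t[2], reverse=True)

        ctx = {"快速": {"运行": 2.0, "算法": 1.0}, "高速": {"运行": 1.5, "列车": 1.0}}
        print(rerank([("快速", "高速", 0.4)], ctx))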
  • Review
    FENG Wenhe
    2013, 27(6): 158-165.
    A discourse structure parallel corpus is a bilingual corpus annotated with parallel discourse structure information. This paper proposes an alignment and annotation strategy, structural and relational alignment, as the theoretical basis of a Chinese-English discourse structure parallel corpus. The strategy is applied throughout the corpus building process, covering segmental, structural, relational, and central alignment, and yields an operating mode in which alignment and annotation proceed together, combining unit alignment with structural alignment. Together with the corresponding annotation software and the solutions to the main difficulties, the strategy has proved to be an effective way of building a discourse structure parallel corpus.
    Key words: parallel corpus; alignment; discourse structure
  • Review
    LI Lin1,2, LONG Congjun2,3, JIANG Di2
    2013, 27(6): 165-169.
    Tibetan functional chunks describe the skeleton of a sentence and link sentence structure to semantics. In this paper, we propose the primary functional chunks of Tibetan and a functional chunk tag set, and on this basis we present a functional chunk boundary detection algorithm. Experiments on limited-scale data suggest that the algorithm recognizes most boundaries correctly and deserves further study.
    Key words: Tibetan functional chunks; chunk boundary detection; CRFs
  • Review
    BAI Shuangcheng1,2,3, ZHANG Jinsong1, Husile2,3
    2013, 27(6): 169-175.
    Word coding, here meaning the mapping between a word and a sequence of keystrokes, is crucial for an efficient Mongolian input method editor (IME). Using the criteria of candidate duplication and average code length, this paper presents a comparative study of the efficiency of seven coding methods in three classes, and proposes a new syllable-based fuzzy input method (the two criteria are sketched below). Experimental results show that the method is not only easy for users to memorize but also very efficient to use.
    Key words: Mongolian; IME; composition string; fuzzy input
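    A minimal sketch of the two evaluation criteria, over a toy (hypothetical) coding table mapping Mongolian words to keystroke sequences:

        from collections import Counter

        def coding_stats(word_to_code):
            """Average code length, and share of words whose code collides
            with at least one other word's code (candidate duplication)."""
            codes = Counter(word_to_code.values())
            avg_len = sum(len(c) for c in word_to_code.values()) / len(word_to_code)
            dup_rate = sum(n for n in codes.values() if n > 1) / len(word_to_code)
            return avg_len, dup_rate

        table = {"ᠮᠣᠩᠭᠣᠯ": "mongol", "ᠰᠠᠶᠢᠨ": "sain", "ᠰᠠᠢᠨ": "sain"}
        print(coding_stats(table))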
  • Review
    SU Chuanjie, HOU Hongxu1, YANG Ping1,2, YUN Huarui1
    2013, 27(6): 175-180.
    In traditional Mongolian electronic texts encoded in Unicode, spelling errors are very common, and the cost of correcting them manually is extremely high. This paper proposes an automatic spelling correction method for traditional Mongolian based on the statistical machine translation framework, treating spelling correction as a translation task that translates wrong words into correct words. We build the correction model on an improved phrase-based statistical machine translation model and use it to correct raw text. On a test set containing 1 026 correct words and 1 102 wrong words, experimental results show that the method corrects spelling errors quickly and efficiently without special linguistic knowledge; the percentage of correct words in the proofread text reaches 97.55%.
    Key words: Mongolian; spelling check; spelling correction; machine translation
  • Review
    WANG Ling2, DAWA Yidemucao1,2, WU Shouer Silamu1,2
    2013, 27(6): 180-187.
    This paper investigates the similarity between agglutinative languages of the same family (Altaic languages such as Uyghur, Kazakh, Kyrgyz, and Mongolian, spoken across different countries and regions). Cosine similarity is used to measure similarity over parallel texts and over acoustic features extracted from the same sentences spoken by speakers of the different languages. Experimental results show that word-to-word transformation is more feasible when the connection rules between stems and affixes (function words) across languages are learned at the word level with common acoustic models. This avoids the uphill work of building machine translation for resource-deficient languages, such as minority languages in developing countries, and reduces costs.
    Key words: same-family agglutinative languages; parallel text; acoustic and prosodic parameters; F0; similarity
  • Review
    GAO Tingli1, TAO Jianhua1, DAI Hongliang2, LI Ya1
    2013, 27(6): 187-192.
    Word segmentation for Dai text (Daiwen) is the basis of Dai information processing: it underlies Dai input methods, Dai machine translation systems, Dai text information extraction, and other applications. Constrained by the state of Dai corpus resources, Dai natural language processing remains relatively weak. This paper first analyzes the characteristics of written Dai and, on this basis, builds a Dai corpus; it then adapts Chinese word segmentation methods to Dai, combines them with Dai-specific characteristics, and designs a Dai word segmentation system based on sequence annotation (see the sketch below). In experiments, the segmentation system reaches an overall evaluation score of 95.58%.
    Key words: Daiwen; segmentation; CRF; absolute segmentation word
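    A generic sketch of the sequence-annotation formulation (the abstract does not give the paper's actual feature templates): each character receives a B/M/E/S tag, and simple window features per character would feed a CRF toolkit:

        def char_features(sent, i):
            """Window features for character i, for a CRF tagger."""
            prev = sent[i - 1] if i > 0 else "<BOS>"
            nxt = sent[i + 1] if i < len(sent) - 1 else "<EOS>"
            return {"cur": sent[i], "prev": prev, "next": nxt, "prev_cur": prev + sent[i]}

        def bmes_tags(words):
            """Gold tags from a segmented sentence: B/M/E inside words, S alone."""
            tags = []
            for w in words:
                tags.extend("S" if len(w) == 1 else "B" + "M" * (len(w) - 2) + "E")
            return tags

        print(bmes_tags(["我们", "在", "学习"]))   # ['B', 'E', 'S', 'B', 'E']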
  • Review
    ZHAO Ziyu, XU Jin’an, ZHANG Yujie, LIU Jiangming
    2013, 27(6): 192-201.
    Based on a knowledge base we define, this paper presents a Japanese time expression recognition method that combines a knowledge-base-strengthened rule set with a statistical model. Following the TIMEX2 standard's granular classification of time, we progressively expanded and reconstructed the knowledge base according to the characteristics of Japanese time expressions, and then optimized and updated the rule set to increase recognition accuracy. Simultaneously, we fused in a CRF model to enhance the generalization ability of the recognizer. Experimental results show that the F1 value reaches 0.8987 on an open test.
    Key words: knowledge base; rule set; statistical model
  • Review
    LIN Li
    2013, 27(6): 201-209.
    Vietnam is an important neighboring country of China, and large-scale processing of Vietnamese information has become increasingly necessary. Drawing on relevant studies and practices of frame semantic annotation at home and abroad, we built a Vietnamese news corpus. On the basis of word segmentation, part-of-speech tagging, and named entity tagging, we set out to build a Vietnamese FrameNet and made an initial exploration of applying frame semantic annotation to Vietnamese news event extraction.
    Key words: frame semantics; annotation; Vietnamese; news