2015 Volume 29 Issue 6 Published: 15 December 2015
  

  • Review
    YU Jiangde,HU Shunyi,YU Zhengtao
    2015, 29(6): 1-7.
    To integrate multiple sources of information without the error accumulation of the pipeline approach, a unified character-based tagging approach is proposed for Chinese lexical analysis, covering word segmentation, part-of-speech tagging and named entity recognition. Treating Chinese lexical analysis as a character sequence tagging problem, the approach integrates three kinds of information into each character tag: word position, part of speech and named entity. After the tagging process, a maximum entropy model is applied to complete the three subtasks. The closed evaluation is performed on the PKU corpus from Bakeoff 2007, and the results show an F-score of 96.4% on word segmentation, 95.3% on POS tagging and 90.3% on named entity recognition.
    Key words Chinese lexical analysis; maximum entropy model; trinity; character-based tagging
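The combined character tag described in the abstract can be illustrated with a small sketch. This is not the authors' implementation or tag set; the tag format (word position joined with POS and NE labels) is an assumption for illustration only.

```python
# Illustrative sketch (not the paper's exact scheme): encode each character
# with a combined word-position/POS/NE tag, then decode the words back.

def encode(words):
    """words: list of (word, pos, ne) triples -> characters and combined tags."""
    chars, tags = [], []
    for word, pos, ne in words:
        for i, ch in enumerate(word):
            if len(word) == 1:
                position = "S"          # single-character word
            elif i == 0:
                position = "B"          # word-initial character
            elif i == len(word) - 1:
                position = "E"          # word-final character
            else:
                position = "M"          # word-internal character
            chars.append(ch)
            tags.append(f"{position}-{pos}-{ne}")
    return chars, tags

def decode(chars, tags):
    """Recover (word, pos, ne) triples from the combined character tags."""
    words, buf = [], ""
    for ch, tag in zip(chars, tags):
        position, pos, ne = tag.split("-")
        buf += ch
        if position in ("S", "E"):      # word boundary reached
            words.append((buf, pos, ne))
            buf = ""
    return words

triples = [("北京", "ns", "LOC"), ("是", "v", "O"), ("首都", "n", "O")]
chars, tags = encode(triples)
assert decode(chars, tags) == triples
```

A sequence model such as maximum entropy then only has to predict one combined tag per character, which is how the three subtasks are unified.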
       
       
       
  • Review
    SANG Leyuan, HUANG Degen
    2015, 29(6): 8-12.
    This paper proposes a new approach that integrates simple noun phrase information into preposition phrase recognition. We recognize simple noun phrases with a basic CRF model, and filter the phrases with conversion rules to adapt them to the phrase patterns found inside preposition phrases. Then we use the simple noun phrases to merge fragmented segmentation units into complete phrases in our corpus. Finally, we recognize preposition phrases with multilayer CRFs, and use rules to correct the result. The optimized model yields 93.02% precision, 92.95% recall, and 92.99% F-measure, 1.03 points higher than the current best model.
    Key words simple noun phrase recognition; CRF; participle fusion
       
       
       
  • Review
    FENG Wenhe
    2015, 29(6): 13-22.
    Complex sentence relationship analysis is usually based on classification; lacking a unified logic, it faces many divergences. This paper proposes a feature structure to describe the complex sentence relationship: a tuple of [Feature, Value] pairs. It presents a preliminary set of feature structures for Chinese complex sentences and demonstrates them in specific applications. Compared with the classification mechanism, feature structure analysis is reflective, and its determination is accurate and easy, which makes it promising for resource construction and computational research on the deep semantic analysis of complex sentences.
    Key words complex sentence relationship; feature structure; semantic analysis
       
       
       
  • Review
    TANG Gongbo,YU Dong,XUN Endong
    2015, 29(6): 23-29.
    Word sense disambiguation (WSD) is a classical issue in natural language processing. In this paper, we train a language model with the sememe information in HowNet, which can represent word semantics, so as to learn the semantic features of words automatically and improve the efficiency of feature learning. We then represent words by vectors of sememes, using the contexts of polysemous words as features, and disambiguate each polysemous word by computing the cosine similarity between its sense vectors and the feature vector. We choose SENSEVAL-3 as the test set, and achieve 37.7% precision, better than other unsupervised methods on the same test data.
    Key words word embedding; HowNet; WSD; unsupervised methods
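The disambiguation step (cosine similarity between sense vectors and an averaged context vector) can be sketched as follows. The sense labels and all vector values here are invented toy data, not HowNet sememe vectors.

```python
# Toy sketch of the disambiguation step: each sense of a polysemous word is a
# vector in sememe space, the context is averaged into a feature vector, and
# the sense with the highest cosine similarity wins. All vectors are invented.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical sememe-space vectors for two senses of one polyseme.
senses = {
    "bank/finance": [0.9, 0.1, 0.0],
    "bank/river":   [0.0, 0.2, 0.9],
}

def disambiguate(context_vectors):
    # Average the context word vectors into a single feature vector.
    n = len(context_vectors)
    feature = [sum(vec[i] for vec in context_vectors) / n for i in range(3)]
    return max(senses, key=lambda s: cosine(senses[s], feature))

context = [[0.8, 0.2, 0.1], [0.7, 0.0, 0.2]]   # "money"-like context words
assert disambiguate(context) == "bank/finance"
```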
       
       
       
  • Review
    ZHENG Lijuan, SHAO Yanqiu
    2015, 29(6): 30-37.
    Semantic analysis of sentences is essential to language study, and it is also the major bottleneck restricting the large-scale application of language information technology at present. Based on the study of deep semantic analysis methods, we propose a new semantic analysis method, the semantic dependency graph, and construct a corpus of 30,000 sentences. Furthermore, we study the semantic sentence patterns of pure pivotal sentences in the corpus, and construct a system of semantic sentence patterns based on the semantic dependency graph. We also summarize the corresponding relations between sentence patterns and semantic sentence patterns to provide an automatic semantic parsing system with a corresponding knowledge base.
    Key words semantic sentence patterns; semantic analysis; semantic dependency graph; pivotal sentence
       
       
       
  • Review
    ZHANG Nianxin, SONG Zuoyan
    2015, 29(6): 38-45.
    This paper analyzes the qualia modification relationships of disyllabic adjective-noun compounds in Mandarin both quantitatively and qualitatively. It reveals that an adjective morpheme selectively constrains the qualia roles of the noun morpheme. Generally, when the adjective morpheme modifies the formal role or the constitutive role of the noun morpheme, a noun needs to be added in the process of meaning construction; when it modifies the agentive role, the telic role or the conventionalized attribute, a verb needs to be added. Furthermore, both qualia structure and conceptual blending contribute to the meaning construction of adjective-noun compounds. If an adjective morpheme activates more than one qualia role or qualia value, polysemy or ambiguity arises.
    Key words adjective-noun compound; meaning construction; generative lexicon theory; qualia structure; conceptual blending theory
       
       
       
  • Review
    ZHAO Yiyi,LIU Haitao
    2015, 29(6): 46-53.
    Network technology provides a new perspective for linguistics in the age of big data. Network methods are applied to language networks to explore their structural laws and the evolution of their functions. This article reviews the development of complex networks from graph theory and the primary mathematical models of social and language networks, aiming to separate the individual traits of language networks from the general characteristics of complex networks, and to provide references for multi-level studies of language networks.
    Key words language networks; network technology; network evolution; complex network characteristics; graph theory
       
       
       
  • Review
    TAN Xiaoping, YANG Lijiao, SU Jingjie
    2015, 29(6): 54-61.
    Grammar is a key and difficult issue in TCSL (teaching Chinese as a second language). However, knowledge bases and corpora for grammar teaching in TCSL are few, and cannot meet the demands of the development of TCSL. This paper proposes a grammar description framework for TCSL based on the three-plane theory and teaching grammar theory, and completes a grammar knowledge base with 121 grammar points. It then annotates the grammar points in 95,592 sentences, covering 580 basic forms and 233 semantic categories. Finally, the paper discusses the application of the knowledge base and corpus in TCSL.
    Key words grammar points; knowledge base; annotation; corpus; TCSL
       
       
       
  • Review
    HU Renfen, ZHU Qi, YANG Lijiao
    2015, 29(6): 62-68.
    In teaching Chinese as a second language, each text in a textbook has a specific topic. The topic represents the core content of each lesson and is closely related to other linguistic knowledge such as vocabulary and syntax. This paper introduces a hierarchical topic bank with 4 level-1 topics, 23 level-2 topics and 246 level-3 topics. The authors manually labeled 5,457 texts from 197 classic Chinese textbooks based on the topic bank and built a topic corpus of over 120 million sentences. To offer a comprehensive reference on topic information, syntactic constructions and HSK word-level information are also extracted as supplementary knowledge for topic labeling.
    Key words Chinese as a second language; topic; corpus
       
       
       
  • Review
    LI Fajie,YU Zhengtao,GUO Jianyi,LI Ying,ZHOU Lanjiang
    2015, 29(6): 69-74.
    To leverage rich and mature Chinese corpora for building a Vietnamese dependency treebank, this paper presents an approach to Vietnamese dependency treebank construction via a Chinese-Vietnamese bilingual corpus with word alignments. Based on the word alignment information, Chinese dependency parses are mapped into Vietnamese dependency structures. Experimental results show that this approach simplifies the manual collection and annotation of a Vietnamese treebank, saving both manpower and time, and that its accuracy improves significantly over machine learning methods.
    Key words Vietnamese dependency treebank; Chinese dependency parsing; word alignment
       
       
       
  • Review
    SONG Jiaying, HE Yu, FU Guohong
    2015, 29(6): 75-82.
    In this paper we incorporate opinion element normalization into the PolarityRank algorithm and thus propose a semi-supervised approach to Chinese domain-specific sentiment lexicon expansion. We first extract a set of attribute-evaluation pairs from product reviews. To reduce complexity and noise in sentiment lexicon expansion, we exploit the Jaccard coefficient and rules to normalize the extracted product attributes and their associated evaluations, respectively. Finally, we modify the PolarityRank algorithm to automatically recognize domain-specific dynamic polar words outside the original sentiment lexicon. Experimental results on product reviews in the car and mobile-phone domains show that using the expanded domain-specific dynamic polar words helps improve polarity classification performance.
    Key words sentiment analysis; sentiment lexicon expansion; PolarityRank; opinion element normalization
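The Jaccard-based normalization of attribute mentions can be sketched as a greedy character-set clustering. The mentions, the 0.5 threshold, and the greedy strategy are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of the normalization idea: treat each extracted attribute string as a
# set of characters and merge mentions whose Jaccard coefficient reaches a
# threshold. Strings and the 0.5 threshold are illustrative only.

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def normalize(mentions, threshold=0.5):
    """Greedily cluster mentions; the first member names the cluster."""
    clusters = {}                      # canonical form -> list of variants
    for m in mentions:
        for canon in clusters:
            if jaccard(m, canon) >= threshold:
                clusters[canon].append(m)
                break
        else:
            clusters[m] = [m]
    return clusters

mentions = ["屏幕", "屏幕分辨率", "分辨率", "电池", "电池续航"]
clusters = normalize(mentions)
# e.g. "分辨率" joins the "屏幕分辨率" cluster; "电池续航" joins "电池"
```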
       
       
       
  • Review
    ZHOU Huiwei,YANG Huan,ZHANG Jing,KANG Shiyong,HUANG Degen
    2015, 29(6): 83-89.
    Hedges are usually used to express uncertainty and possibility; when authors cannot back up their statements, they use hedges to mark information as uncertain. To avoid extracting uncertain statements as factual information, uncertain information should be distinguished from factual information. However, the lack of Chinese hedge corpora has limited research on Chinese hedges. This paper discusses the categorization of Chinese hedges and introduces the design and construction of a 24,000-sentence Chinese hedge corpus in the biomedical and Wikipedia domains. We calculate agreement rates for the corpus and reveal the domain and genre dependency of hedges. The corpus is of great significance for research on Chinese hedge detection and Chinese information extraction, and also supports linguists studying semantic and pragmatic hedges.
    Key words Chinese hedge; categorization; corpus; agreement analysis
       
       
       
  • Review
    GAO Shengxiang, YU Zhengtao, LONG Wenxu, DING Wei, YAN Chunting
    2015, 29(6): 90-97.
    Aiming at Chinese-Vietnamese bilingual news event storyline analysis, a generative model for event storyline is proposed based on global/local word pairs’ co-occurrence distribution. Firstly, the detected news topic word distribution was used as global words to characterize a global event, Then time, person, place and other event elements in the news segment divided by certain time granularity are used as local words. The are co-occurrence of global and local words is analyzed and used as supervised information, with RCRP algorithm and bilingual aligned words together, which are integrated into a bilingual topic model to get sub-topic distribution under corresponding time slice. Finally, by the sub-topic distribution representing the developing process of an event, a generative model to storyline was constructed. On Chinese-Vietnamese mixed news set crawled from the internet, the comparative experiments of storyline generation are conducted, proving that the proposed bilingual news storyline is model got better effect than the other methods.
    Key words Chinese-Vietnamese; news event storyline; global/local co-occurrence words; sub-topic distribution; bilingual topic model
       
       
       
  • Review
    LV Guoying,SU Na,LI Ru,WANG Zhiqiang,CHAI Qinghua
    2015, 29(6): 98-109.
    Frame semantics is introduced to Chinese discourse analysis, which includes three subtasks: discourse segmentation, discourse structure modeling and discourse relation recognition. First, a Chinese discourse coherence framework and a corresponding corpus are built based on frame semantics. Then two maximum entropy classifiers are applied to recognize the relations between discourse units and the classes of discourse relations, based on lexical features, dependency parse features, syntactic parse features, target features and frame semantic features. Finally, we use the probability of a relation existing between discourse units to generate the discourse structure with a greedy bottom-up method. Experimental results show that frame semantics can segment discourse units effectively, and that frame semantic features improve the performance of discourse structure construction and discourse relation recognition.
    Key words discourse units; discourse structure; discourse relation; greedy bottom-up method
       
       
       
  • Review
    ZHU Shanshan, HONG Yu, DING Siyuan, YAO Jianmin, ZHU Qiaoming
    2015, 29(6): 110-118.
    Implicit discourse relation recognition is an important subtask in discourse analysis. Most existing studies assume a balance between the numbers of positive and negative samples, and employ random under-sampling to keep the training data balanced. In reality, however, the training data is imbalanced, which hurts the recognition of implicit discourse relations. To solve this problem, we propose a novel implicit discourse relation recognition method based on frame semantic vectors. First, we represent each argument as a frame semantic vector using the FrameNet resource, and then mine a number of effective discourse relation samples from external data resources based on this representation. Finally, we add the mined samples to the original training set and run experiments on the extended set. Evaluation on the Penn Discourse Treebank (PDTB) shows that the proposed method performs better than current mainstream imbalanced classification methods.
    Key words implicit discourse recognition; imbalanced data; frame semantic vectors
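The contrast the abstract draws, discarding majority samples versus mining extra minority samples, can be shown with toy counts. All labels, counts and identifiers below are invented for illustration.

```python
# Minimal sketch of the data-balancing contrast the paper addresses: random
# under-sampling discards majority examples, while the paper's alternative
# *adds* externally mined positive examples. Counts are toy values.
import random

random.seed(0)
majority = [("arg_pair_%d" % i, 0) for i in range(90)]        # negatives
minority = [("arg_pair_%d" % i, 1) for i in range(90, 100)]   # positives

# Random under-sampling: shrink the majority class to the minority size.
undersampled = random.sample(majority, len(minority)) + minority

# Expansion instead: suppose extra positive samples were mined externally.
mined = [("mined_pair_%d" % i, 1) for i in range(30)]
expanded = majority + minority + mined

assert len(undersampled) == 20              # balanced, but much smaller
assert sum(y for _, y in expanded) == 40    # larger, less skewed training set
```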
       
       
       
  • Review
    REN Han,SHENG Yaqi,FENG Wenhe,LIU Maofu
    2015, 29(6): 119-126.
    This paper analyzes the defects in current entailment recognition approaches based on classification strategies and proposes a novel approach to recognizing textual entailment based on a knowledge topic model. The assumption is that if two texts have an entailment relation, they should share the same or a similar topic distribution. The approach builds an LDA model to estimate semantic similarities between each text and hypothesis, which provide the evidence for judging the entailment relation. We also employ three knowledge bases to improve the precision of Gibbs sampling. Experiments show that the knowledge topic model improves the performance of textual entailment recognition systems.
    Key words recognizing textual entailment; topic model; entailment classification; inference knowledge
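The core assumption, that an entailing pair shares a topic distribution, amounts to comparing two discrete distributions. A sketch with a symmetric divergence measure follows; the three distributions are invented examples, and the paper's actual similarity estimate may differ.

```python
# Sketch of the similarity test behind the entailment assumption: if text and
# hypothesis share a topic distribution, a divergence measure between the two
# distributions should be small. Distributions here are invented examples.
import math

def jensen_shannon(p, q):
    """Symmetric, bounded divergence between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

text_topics       = [0.70, 0.20, 0.10]
hypothesis_topics = [0.65, 0.25, 0.10]
unrelated_topics  = [0.05, 0.15, 0.80]

close = jensen_shannon(text_topics, hypothesis_topics)
far   = jensen_shannon(text_topics, unrelated_topics)
assert close < far    # similar topics -> lower divergence -> entailment cue
```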
       
       
       
  • Review
    YU Ningyi,AO Gaoqi,XUN Endong
    2015, 29(6): 127-134.
    Information extraction from ancient Chinese benefits language monitoring and corpus construction. This paper treats the tagging of ancient Chinese in a mixed corpus as a short text classification task, and applies both rule-based and statistical methods. For the rule-based methods, the paper considers the effect of function words and constructions in ancient Chinese. For the statistical methods, we conduct experiments with N-gram, Naive Bayes, Maximum Entropy and Decision Tree models. Experiments indicate that the unigram model outperforms the others with an F-value of 0.98. The research also provides evidence for the view of Chinese evolution as a continuum.
    Key words ancient Chinese tagging; text classification; rule-based model; statistics-based model
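The statistical side of this task can be sketched as a tiny character-unigram Naive Bayes classifier. The four training sentences are an illustrative toy set, nothing like the paper's corpus, and the unigram features are the only part taken from the abstract.

```python
# Toy sketch of the statistical method: a character-unigram Naive Bayes model
# separating ancient from modern Chinese short texts. Training data is toy.
import math
from collections import Counter

train = [
    ("学而时习之不亦说乎", "ancient"),
    ("吾日三省吾身", "ancient"),
    ("今天天气很好", "modern"),
    ("我们一起去学习", "modern"),
]

counts = {"ancient": Counter(), "modern": Counter()}
for text, label in train:
    counts[label].update(text)
vocab = set().union(*counts.values())

def classify(text):
    def log_prob(label):
        c, total = counts[label], sum(counts[label].values())
        # Add-one smoothing over the shared character vocabulary.
        return sum(math.log((c[ch] + 1) / (total + len(vocab))) for ch in text)
    return max(counts, key=log_prob)

assert classify("不亦乐乎") == "ancient"
```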
       
       
       
  • Review
    MO Peng, HU Po, HUANG Xiangji, HE Tingting
    2015, 29(6): 135-140.
    Text summarization and keyword extraction are two important research topics in natural language processing (NLP); both generate concise information to describe the gist of a text. Although the two tasks have similar objectives, they are usually studied independently and their association is seldom considered. Based on graph-based ranking methods, some collaborative extraction methods have been proposed that capture the associations between sentences, between words, and between sentences and words, generating both the summary and the keywords in an iteratively reinforced framework. However, most existing models are limited to binary relations between sentences and words, ignoring potentially important higher-order relationships among different text units. In this paper, we propose a new collaborative extraction method based on hypergraphs: sentences are modeled as hyperedges and words as vertices, and the summary and keywords are generated by exploiting the higher-order information between sentences and words in the unified hypergraph. Experiments on the Weibo-oriented Chinese news summarization task of NLPCC 2015 demonstrate that the proposed method is feasible and effective.
    Key words hypergraph; document summarization; keyword extraction; collaborative extraction
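The sentences-as-hyperedges, words-as-vertices construction can be shown with a minimal incidence structure. The degree-based scores below are a crude illustrative proxy, not the ranking algorithm of the paper.

```python
# Sketch of the hypergraph construction: sentences are hyperedges over their
# word vertices, and simple degree-based scores illustrate how co-membership
# (a word shared by many sentences) can rank both kinds of units.
from collections import defaultdict

sentences = [
    ["summarization", "extracts", "sentences"],
    ["keyword", "extraction", "extracts", "words"],
    ["sentences", "contain", "words"],
]

incidence = defaultdict(set)            # word vertex -> hyperedge ids
for edge_id, words in enumerate(sentences):
    for w in words:
        incidence[w].add(edge_id)

# Vertex score: number of hyperedges (sentences) containing the word.
word_score = {w: len(edges) for w, edges in incidence.items()}
# Hyperedge score: total score of its vertices (a crude saliency proxy).
sentence_score = [sum(word_score[w] for w in s) for s in sentences]
```

A real system would iterate these mutually reinforcing scores to convergence instead of taking raw degrees.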
       
       
       
  • Review
    RAO Gaoqi,YU Dong,XUN Endong
    2015, 29(6): 141-149.
    The features of a text are often exhibited by its terms and phrases, and their unsupervised extraction can support various natural language processing tasks. We propose a "cluster-verification" method to obtain a lexicon from raw corpora by combining a latent topic model with natural annotations: topic modeling clusters strings, and natural annotations in the raw corpus filter and optimize the result. The resulting lexicon shows high accuracy and describes the domains and writing styles of texts well. Experiments on six kinds of domain corpora show promising results for classifying their domains and writing styles.
    Key words natural annotation; natural chunk; latent topic model; domain feature; stylistic features
       
       
       
  • Review
    CHEN Jiang,LIU Wei,CHAO Wenhan,WANG Lihong
    2015, 29(6): 150-158.
    Microblog forwarding is an important channel of information dissemination, and forwarding prediction is of great value for analyzing microblog influence and topics. Existing methods of microblog forwarding prediction mostly focus on microblog and user attributes. In this paper, a microblog forwarding prediction method based on hot topics is proposed. We quantitatively analyze the impact of hot content and transmission tendency on users' forwarding behavior, and then introduce hot-topic features such as forwarding interest, forwarding activity and behavior pattern. Finally, we establish a hot-topic oriented microblog forwarding prediction model based on a classification algorithm. Experimental results on real data show that the accuracy of this method reaches 96.6%, a maximum improvement of 12.14%.
    Key words microblog forward; forwarding prediction; hot topic
       
       
       
  • Review
    LIU Longfei, YANG Liang, ZHANG Shaowu, LIN Hongfei
    2015, 29(6): 159-165.
    Chinese micro-blog sentiment analysis aims to discover users' attitudes towards hot events. The task is challenged by heavy noise, many new words and abbreviations, flexible collocations, and the limited context of short texts. This paper explores the feasibility of Chinese micro-blog sentiment analysis with convolutional neural networks (CNN). To avoid task-specific features, character-level and word-level embeddings are adopted. On the COAE 4th task corpus, the character-level CNN achieves a sentiment prediction accuracy (binary positive/negative classification) of 95.42%, slightly better than the word-level CNN's 94.65%. The results show that convolutional neural networks are promising for Chinese micro-blog sentiment analysis.
    Key words deep learning; sentiment analysis; convolutional neural networks; word embedding
       
       
       
  • Review
    JIANG Shengyi,HUANG Weijian,CAI Maoli,WANG Lianxi
    2015, 29(6): 166-171.
    This paper explores a method to build social emotional lexicons from microblogs and applies it to analyzing social emotions in public events. First, small-scale standard emotional lexicons are manually collected as the basic emotional lexicon. Then word2vec, a tool based on deep learning, is used for incremental learning on corpora of social events on microblogs to expand the basic emotional lexicon, and the final lexicon is filtered by HowNet and by experts. The paper then compares emotional analysis based on the generated lexicon with SVM classification, demonstrating a 13.9% increase in average precision and a 1.5% increase in recall. Finally, the proposed method is verified by emotional analysis of different social events with the generated lexicon.
    Key words microblogging; social emotions; lexicon; emotional analysis
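The expansion step, adding embedding-space neighbors of seed emotion words, can be sketched with tiny made-up vectors standing in for word2vec output. The words, vectors and the 0.95 threshold are all illustrative assumptions.

```python
# Sketch of the expansion step: starting from a small seed lexicon, words near
# the seeds in embedding space (here, tiny made-up vectors standing in for
# word2vec output) are added as candidates, then filtered. Values are toy.
import math

embeddings = {
    "高兴": [0.9, 0.1],  "开心": [0.85, 0.15],  # happy-ish words
    "愤怒": [0.1, 0.9],  "生气": [0.15, 0.85],  # angry-ish words
    "桌子": [0.5, 0.5],                          # neutral noun
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def expand(seed, threshold=0.95):
    expanded = set(seed)
    for word, vec in embeddings.items():
        if any(cosine(vec, embeddings[s]) >= threshold for s in seed):
            expanded.add(word)
    return expanded

assert "开心" in expand({"高兴"})     # near neighbor joins the lexicon
assert "桌子" not in expand({"高兴"}) # neutral word stays out
```

In the paper the candidates are additionally filtered by HowNet and by experts; here the threshold alone plays that role.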
       
       
       
  • Review
    CHEN Zhao,XU Ruifeng,GUI Lin,LU Qin
    2015, 29(6): 172-178.
    Recently, classification approaches based on word embeddings and convolutional neural networks have achieved good results in sentiment classification. Such approaches rely mainly on the contextual features of word embeddings, without considering the polarity of the words themselves, and make no use of manually compiled sentiment lexicon resources. To address these problems, this paper proposes a novel sentiment classification method that incorporates an existing sentiment lexicon into convolutional neural networks. The words in a text are abstractly represented using existing sentiment words, convolutional neural networks extract sequence features from the abstracted word embeddings, and the sequence features are then applied to sentiment classification. Evaluation on the Chinese Opinion Analysis Evaluation 2014 dataset shows that the proposed approach outperforms both the convolutional neural network model with word embedding features and Naive Bayes Support Vector Machines.
    Key words convolutional neural networks; sentiment analysis; word sentiment sequence features
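The abstraction step, rewriting tokens into polarity symbols before the CNN sees them, can be shown in a few lines. The three-entry lexicon and the symbol names are illustrative, not the paper's scheme.

```python
# Sketch of the word-abstraction idea: before feeding a sentence to the CNN,
# each word found in a sentiment lexicon is replaced by a polarity symbol, so
# the network sees sentiment *sequences* rather than raw tokens. Toy lexicon.
lexicon = {"喜欢": "POS", "讨厌": "NEG", "不": "NOT"}

def abstract(tokens):
    return [lexicon.get(tok, "OTHER") for tok in tokens]

# "I don't like it" -> a NOT-POS sequence the classifier can generalize over.
assert abstract(["我", "不", "喜欢", "它"]) == ["OTHER", "NOT", "POS", "OTHER"]
```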
       
       
       
  • Review
    CHU Xiaomin, WANG Zhongqing, ZHU Qiaoming, ZHOU Guodong
    2015, 29(6): 179-184.
    Social tags are an important means of organizing information in the Web 2.0 era, and tag recommendation can help users collect, search and share online resources effectively. Previous approaches focus on a single type of textual information, e.g. the summary of a movie. In practice, however, various types of textual information are available: a movie has both summary and comment information, and different types reflect different aspects of the movie. We therefore propose a novel approach that combines summary and comment information to recommend tags, using different ensemble learning approaches to incorporate the two sources. Experimental results show that the proposed approach using multiple types of information outperforms approaches using a single type in tag recommendation tasks.
    Key words natural language processing; social tags; ensemble learning
       
       
       
  • Review
    WANG Mingwen, FU Cuiqin, XU Fan, HONG Huan
    2015, 29(6): 185-192.
    Departing from the traditional bag-of-words model with its term-independence assumption, we present a graph model based on word co-occurrence relationships. Our model describes the distributional differences of terms between subjective and non-subjective sentence sets via term co-occurrence and syntactic information, and integrates an indegree-based term weighting method. Evaluation on a benchmark dataset shows the importance of the term co-occurrence graph model, and that it significantly outperforms the bag-of-words model in subjective sentence identification.
    Key words word co-occurrence; graph model; subjective sentence identification; feature value; supervised learning
       
       
       
  • Review
    ZHAO Mingzhen, CHENG Liangxi, LIN Hongfei
    2015, 29(6): 193-202.
    When mining adverse drug reactions (ADRs) from user comments on healthcare social networks, it is important to recognize novel ADR expressions and normalize them, since people describe adverse reactions in different ways, and new adverse reactions emerge with the launch of new drugs and the diversity of drug users. This paper uses a Conditional Random Field (CRF) model to recognize adverse reaction entities, and proposes a normalization method for the recognized entities. The effectiveness of the mining method is verified by comparing the mined known ADRs with database records, and a list of potential ADRs sorted by comment frequency is obtained. Experimental results indicate that the CRF model can identify both known and novel adverse reaction entities, and that normalization aggregates and merges the entities, which benefits ADR discovery.
    Key words adverse drug reaction; user comment; text mining; entity normalization
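The normalization step, merging surface variants of the same reaction and pooling their frequencies, can be sketched with a generic string-similarity measure. `difflib`'s ratio and the 0.5 threshold stand in for whatever measure the paper actually uses; the terms and counts are illustrative.

```python
# Sketch of the normalization step: surface variants of the same adverse
# reaction are merged when their string similarity is high enough, and their
# comment frequencies are pooled. Measure, threshold and data are assumed.
from difflib import SequenceMatcher

def similar(a, b, threshold=0.5):
    return SequenceMatcher(None, a, b).ratio() >= threshold

def merge(entities):
    merged = {}                         # canonical form -> total frequency
    for name, freq in entities:
        for canon in merged:
            if similar(name, canon):
                merged[canon] += freq
                break
        else:
            merged[name] = freq
    return merged

# "头疼" and "头痛" are variant spellings of "headache"; "恶心" is "nausea".
entities = [("头疼", 5), ("头痛", 3), ("恶心", 4)]
merged = merge(entities)
```

Pooling frequencies this way is what makes the frequency-sorted candidate list in the abstract meaningful.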
       
       
       
  • Review
    LI Yachao, JIANG Jing, JIA Yangji, YU Hongzhi
    2015, 29(6): 203-207.
    TIP-LAS is an open-source toolkit for Tibetan word segmentation and POS tagging. The toolkit implements Tibetan word segmentation based on syllable tagging with a CRF model, and integrates a maximum entropy model with syllable features for Tibetan POS tagging. The system achieves good results in experiments. The source code is shared on the Internet, together with the experimental corpus.
    Key words Tibetan; word segmentation; part of speech tagging; conditional random fields; maximum entropy
       
       
       
  • Review
    Azragul,Alim Murat, Yusup Abaydula
    2015, 29(6): 208-212.
    Modern Uyghur noun stem identification is a fundamental issue in natural language processing. Morphological analysis is first introduced, especially its role in identifying the part of speech of words. The paper then describes the POS scheme of Uyghur, the morphological characteristics of Uyghur nouns, suffix ambiguity and the disambiguation rules. An algorithm for identifying new nouns in modern Uyghur is proposed, including feature selection (features within and between words) and parameter estimation. The experiment is carried out on a corpus of Uyghur physics textbooks for junior and senior middle schools.
    Key words modern Uyghur; morphological analysis; noun stems recognition
       
       
       
  • Review
    Luobsang Karten,YANG Yuanyuan,ZHAO Xiaobing
    2015, 29(6): 213-219.
    Tibetan word segmentation is an essential task in Tibetan language processing. In this paper, a CRF model is trained on a 35.1M manually annotated Tibetan corpus. The CRF segmentation results are then post-processed by rules targeting errors such as mis-segmented non-Tibetan characters, mis-recognized Tibetan adhesion words, mis-segmented stop words, and unregistered words. An open test demonstrates an accuracy of 96.11%, a recall of 96.03%, and an F-score of 96.06%.
    Key words Tibetan; word segmentation; CRFs; knowledge fusion
       
       
       
  • Review
    ZHU Zhen,SUN Yuan
    2015, 29(6): 220-227.
    This paper proposes an SVM and pattern based approach to Tibetan person attribute extraction. The pattern system is built from language rules over Tibetan features with clear semantic information, such as case-auxiliary words and particular verbs. A machine learning approach via SVM is then introduced to build a hierarchical classification strategy. Experimental results indicate a significant improvement in person attribute extraction.
    Key words person attributes extraction; tibetan language processing; SVM; hierarchy classifier