2013 Volume 27 Issue 6 Published: 16 December 2013
  

  • Review
    CAO Ziqiang, LI Sujian
    2013, 27(6): 1-6.
    This paper explores Chinese word segmentation without training data, which lays the foundation for language-independent word segmentation systems. Mutual information and the hierarchical Dirichlet process (HDP) are both widely used for unsupervised segmentation (the mutual-information component is sketched below). We combine the two models and improve the sampling algorithm. Excluding punctuation, the F-scores on two test corpora of different sizes are 0.693 and 0.741, which are 5.8% and 3.9% higher than the HDP baseline, respectively. Finally, the model is applied to semi-supervised word segmentation, where its F-score exceeds that of a standard supervised CRF model by 2.6%.
    Key words: HDP; mutual information; unsupervised word segmentation
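    As a rough illustration of the mutual-information component above (the HDP sampler is omitted), a word boundary can be hypothesized wherever the pointwise mutual information (PMI) of two adjacent characters is low. A minimal Python sketch, assuming only a list of raw sentences as input; the function names and threshold are illustrative:

        import math
        from collections import Counter

        def build_segmenter(corpus_lines, threshold=0.0):
            """Insert a boundary between adjacent characters whose PMI falls
            below the threshold (low PMI suggests a word break)."""
            unigrams, bigrams = Counter(), Counter()
            for line in corpus_lines:
                unigrams.update(line)
                bigrams.update(line[i:i + 2] for i in range(len(line) - 1))
            n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

            def pmi(a, b):
                p_ab = bigrams[a + b] / n_bi if bigrams[a + b] else 1e-12
                return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

            def segment(line):
                words, start = [], 0
                for i in range(len(line) - 1):
                    if pmi(line[i], line[i + 1]) < threshold:
                        words.append(line[start:i + 1])
                        start = i + 1
                words.append(line[start:])
                return words

            return segment

        seg = build_segmenter(["这是一个例子", "这是另一个例子"])
        print(seg("这是一个例子"))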
  • Review
    LAU Kam tang1,2, SONG Yan1, XIA Fei3
    2013, 27(6): 6-16.
    In this paper, we present a segmented and part-of-speech (POS) tagged Archaic Chinese corpus along with its construction process, in which automatic segmentation and tagging are followed by manual correction as post-processing. We use both Modern and Archaic Chinese labeled data to train the word segmenter and POS tagger, which are further improved by domain adaptation techniques (one common technique is sketched below) and by adding linguistic and morphological features derived from the characteristics of Archaic Chinese. The experimental results show the effectiveness of our approach; in particular, the domain adaptation techniques and the added features significantly improve POS tagging performance. During manual correction, we categorize the errors resulting from automatic segmentation and POS tagging and investigate their sources. Finally, we give statistics of the resulting corpus on the distributions of words and POS tags. Our work is a preliminary study that can easily be extended to annotating other Archaic Chinese texts, and the resulting corpus is a valuable resource for research on Archaic Chinese.
    Key words: Archaic Chinese corpus; word segmentation; part-of-speech tagging; domain adaptation
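    The abstract does not name the domain adaptation techniques used. As one commonly used possibility (an assumption, not the authors' confirmed method), feature augmentation keeps a shared and a domain-specific copy of every feature, letting the learner separate general weights from Archaic-Chinese-specific ones:

        # Feature augmentation (Daume III, 2007) - a guess at one applicable
        # technique, not necessarily what this paper used.
        def augment(features, domain):
            """Map each feature to a shared copy plus a domain-specific copy."""
            out = {}
            for name, value in features.items():
                out["shared=" + name] = value        # weight shared across domains
                out[domain + "=" + name] = value     # weight private to this domain
            return out

        # e.g. a character feature from Modern (source) vs. Archaic (target) data:
        print(augment({"cur_char=之": 1.0}, "archaic"))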
  • Review
    QIAN Xiaofei1, HOU Min2
    2013, 27(6): 16-23.
    This paper proposes a classifier ensemble method based on linguistic knowledge assessment. It fuses the maximal noun phrase (MNP) recognition results of SVMs and of cascaded CRFs built on a reduction method, using automatically acquired collocations and manually crafted assessment rules, and then applies deterministic rules to the structures each classifier is prone to mislabel. The method improves the recognition of boundary ambiguities involving consecutive verbs and prepositions as well as consecutive nouns. The experiments achieve a precision of 89.30% and a recall of 89.62%; in particular, the F1-score on multi-word MNPs improves by 0.75% over the reduction method alone.
    Key words: maximal noun phrase recognition; linguistic knowledge assessment; classifier ensemble; rules
  • Review
    YUAN Yulin
    2013, 27(6): 23-31.
    This paper discusses the construction of a practical Chinese semantic knowledge system and a corresponding database for computing meaning in Chinese text. A four-step working procedure is proposed: (1) under the principles of Generative Lexicon Theory and Argument Structure Theory, the qualia structure of nouns and the argument structure of verbs and adjectives are described, including both the set of qualia roles or semantic roles and the syntactic constructions the nouns, verbs, and adjectives enter into; (2) the semantic orientation and sentiment polarity of the nouns, verbs, and adjectives are indicated along a 5-point scale; (3) the inference relations among the qualia roles and semantic roles of related nouns, verbs, and adjectives are identified, yielding a lexical network; (4) the entity reference, conceptual relations, and sentiment polarity of the words are then integrated into a multi-level semantic knowledge database (a toy record format is sketched below). Finally, a case study of computing meaning with the help of this multi-level semantic knowledge is presented.
    Key words: semantic description system; semantic knowledge database; qualia structure; argument structure; sentiment polarity; semantic correlation
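    As a purely hypothetical illustration of what one noun record in such a database might look like (the field names are invented, not the authors' schema):

        from dataclasses import dataclass, field

        @dataclass
        class NounEntry:
            lemma: str
            formal: list = field(default_factory=list)        # what kind of thing it is
            constitutive: list = field(default_factory=list)  # what it is made of
            telic: list = field(default_factory=list)         # what it is for
            agentive: list = field(default_factory=list)      # how it comes about
            sentiment: int = 0  # 5-point scale, e.g. -2 (very negative) .. +2 (very positive)

        book = NounEntry(lemma="书", formal=["artifact"], constitutive=["paper", "text"],
                         telic=["read"], agentive=["write", "publish"])
        print(book.telic)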
  • Review
    WAN Fuqiang, WU Yunfang
    2013, 27(6): 31-38.
    Lexical semantic relatedness plays an important role in natural language processing tasks such as information retrieval, word sense disambiguation, automatic text summarization, and spelling correction. In this paper, we employ Wikipedia-based Explicit Semantic Analysis (ESA) to compute semantic relatedness between Chinese words. Based on Chinese Wikipedia, a word is represented as a weighted vector of concepts, so computing the semantic relatedness of two words amounts to comparing their concept vectors (see the sketch below). Furthermore, we add a prior probability factor for each concept and use the linking information among Wikipedia pages to optimize the concept vectors. The experimental results show that the Spearman's rank correlation coefficient between the computed relatedness and human judgments reaches 0.52, significantly outperforming the baseline.
    Key words: semantic relatedness; explicit semantic analysis; Chinese Wikipedia; prior probability; concept vectors
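    A minimal sketch of the ESA comparison step, with tiny hand-made concept vectors standing in for the real TF-IDF-weighted vectors built from Chinese Wikipedia:

        import math

        def cosine(u, v):
            """Cosine of two sparse concept vectors {concept: weight}."""
            dot = sum(w * v.get(c, 0.0) for c, w in u.items())
            nu = math.sqrt(sum(w * w for w in u.values()))
            nv = math.sqrt(sum(w * w for w in v.values()))
            return dot / (nu * nv) if nu and nv else 0.0

        # Toy concept vectors, as an inverted concept index would supply:
        v_bank = {"金融": 0.8, "银行": 0.9, "河流": 0.1}
        v_money = {"金融": 0.7, "货币": 0.9}
        print(cosine(v_bank, v_money))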
  • Review
    CAO Yuan, ZHU Qiaoming, LI Peifeng
    2013, 27(6): 38-45.
    The factuality of an event is the degree of certainty with which the event can be regarded as a fact. In context, this attribute is expressed by specific sentence structures and vocabulary. In this paper, we study the factors that influence Chinese event factuality, then present five kinds of factuality-related event information together with their annotation rules. Finally, we annotate the Movement events in the ACE 2005 Chinese corpus and analyze the results, laying a foundation for many information extraction applications.
    Key words: factuality; corpus; annotation
  • Review
    CHI Zhejie1,2, ZHANG Quan2
    2013, 27(6): 45-51.
    In the Hierarchical Network of Concepts (HNC) theory, domain is one of the main factors of the sentence group unit, and domain determination is an important issue in sentence group extraction. To determine the domain, we propose a method using domain concepts and concept association expressions that counts frequencies, merges concepts, and summarizes concepts in the concept primitive space (the counting core is sketched below). For the politics, economics, and military domains, the experimental results show high performance: the F1 scores reach 90.61%, 90.83%, and 90.99% respectively, which are 7.7%, 12.76%, and 5.01% higher than the results without concept association expressions. The concept-primitive-based method also outperforms a keyword-based method.
    Key words: concept primitives; concept association expressions; domain determination
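    A minimal sketch of the frequency-counting core, with invented primitive inventories standing in for the HNC resources (the merging and summarization steps are omitted):

        from collections import Counter

        DOMAIN_PRIMITIVES = {            # hypothetical primitive inventories
            "politics": {"govern", "state", "policy"},
            "economics": {"trade", "money", "market"},
            "military": {"army", "weapon", "war"},
        }

        def determine_domain(primitives_of_text):
            """Pick the domain whose primitives occur most often in the text."""
            counts = Counter(primitives_of_text)
            scores = {d: sum(counts[p] for p in ps)
                      for d, ps in DOMAIN_PRIMITIVES.items()}
            return max(scores, key=scores.get)

        print(determine_domain(["trade", "market", "money", "state", "trade"]))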
  • Review
    ZHANG Muyu, SONG Yuan, QIN Bing, LIU Ting
    2013, 27(6): 51-58.
    Discourse relation recognition is an important part of discourse analysis. This paper focuses on Chinese discourse relation recognition, covering both explicit and implicit discourse relations. For explicit discourse relation recognition, we propose a statistical method based on discourse connective rules that achieves rather good results. For implicit discourse relation recognition, we combine lexical, syntactic, and semantic features in a supervised model to classify implicit relations. The detailed analysis and experimental results are informative and provide a baseline for future work on this task.
    Key words: Chinese discourse semantic analysis; explicit discourse relation recognition; implicit discourse relation recognition
  • Review
    XIONG Hao1,2, LIU Qun1, LV Yajuan1
    2013, 27(6): 58-69.
    Semantic role labeling (SRL) and coreference resolution (CR) play important roles in natural language processing applications. In this paper, we propose eight rules to jointly learn and infer the two tasks using a Markov logic network. Experimental results on OntoNotes 5.0 show that joint learning with the Markov logic network significantly improves the F-score by 1.6 points on both SRL and CR over the single-task systems.
    Key words: semantic role labeling; coreference resolution; Markov logic network
  • Review
    XIAO Shan1, GUO Tingting2
    2013, 27(6): 69-75.
    Natural language processing (NLP) is one of the most important research areas of artificial intelligence, and the construction of a word semantic knowledge base (WSKB) is an important prerequisite for progress in NLP. Existing synset-based word networks, both in China and abroad, suffer from several problems: loose structure, coarse semantic granularity, a limited range of applications, and so on. Building a multi-dimensional word net based on detailed descriptions of concept features may solve these problems. This type of WSKB uses the synset-lexeme anamorphosis method to analyze the relationships and distinctive features between a basic lexeme and its conceptual variants. Taking interactive speech act verbs as an example, and based on characteristic sense analysis, the paper makes a preliminary exploration of the description of lexical meaning structure and the rules for constructing synsets.
    Key words: synset; evaluated speech act verbs; synset-lexeme anamorphosis method
  • Review
    LI Shoushan1,2, LEE Sophia Yat Mei2, HUANG Chu-Ren2, SU Yan1
    2013, 27(6): 75-82.
    Sentiment analysis has become a hot research topic in natural language processing (NLP), as it is highly valuable for many practical applications and theoretical studies. One basic task in sentiment analysis, the construction of a sentiment lexicon, aims to classify each word as positive, neutral, or negative according to its sentiment orientation. There are two major challenges: 1) Chinese words are highly ambiguous, which makes it hard to compute a word's sentiment orientation; 2) compared with English sentiment analysis, for which several corpora and lexicons exist, resources for constructing Chinese sentiment lexicons remain scarce. In this study, we first use a machine translation system to exploit bilingual (English and Chinese) resources, and then obtain the sentiment orientation of Chinese words with a label propagation algorithm (see the sketch below). Experimental results across four domains demonstrate that the lexicon generated by our approach reaches excellent precision and covers domain information effectively.
    Key words: sentiment analysis; bilingual; sentiment lexicon; label propagation algorithm
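    A minimal sketch of label propagation over a word graph, assuming the graph (e.g., from translation links or co-occurrence) and the seed polarities are already built; all words and weights are toy values:

        def propagate(neighbors, seeds, iterations=20):
            """neighbors: {word: {word: weight}}; seeds: {word: +1.0 or -1.0}.
            Seed scores are clamped; other words average their neighbors."""
            score = {w: seeds.get(w, 0.0) for w in neighbors}
            for _ in range(iterations):
                for w in score:
                    if w in seeds:
                        continue
                    total = sum(neighbors[w].values())
                    if total:
                        score[w] = sum(wt * score[v]
                                       for v, wt in neighbors[w].items()) / total
            return score

        g = {"好": {"优秀": 1.0}, "优秀": {"好": 1.0, "差": 0.2}, "差": {"优秀": 0.2}}
        print(propagate(g, {"好": 1.0, "差": -1.0}))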
  • Review
    XU Ruifeng, ZOU Chengtian, ZHENG Yanzhen, XU Jun, GUI Lin, LIU Bin, WANG Xiaolong
    2013, 27(6): 82-90.
    Existing emotion dictionaries usually annotate the categories and strength of emotion words but lack the capacity to distinguish emotional expression from emotional cognition; meanwhile, annotating word entries directly leads to emotion annotation ambiguities caused by word sense ambiguities. Based on an analysis of how individual emotions arise and migrate, this paper proposes a text emotion computing framework based on a "cognitive stimulation - reflective expression" mechanism. Under this framework, we explore a construction strategy for a new emotion dictionary based on an analysis of the functions and characteristics of emotion words. First, we use the part-of-speech and word sense information provided by HowNet to split each word into multiple entries corresponding to its different parts of speech and senses, reducing annotation ambiguity. Second, for each entry we distinguish emotion expression categories from emotion cognition categories, annotating the emotion categories and their strength values from these two aspects separately, with fine-grained types for both expression and cognition. Finally, a preliminary emotion dictionary of this new type is constructed, with a clear framework, rich emotional knowledge, and low ambiguity.
    Key words: emotion dictionary; emotion cognition; emotion expression; word sense
  • Review
    LI Shoushan1,2, LEE Sophia Yat Mei2, LIU Huanhuan1, HUANG Chu-Ren2
    2013, 27(6): 90-96.
    Emotion classification, a basic task in emotion analysis, has been a hot research issue in the natural language processing community. Previous studies often leverage emotion keywords (e.g., happy, sad) for emotion classification, but some texts contain no emotion keywords and still express emotions; we refer to this as implicit emotion expression. In this paper, we focus on classifying implicit emotion expressions and propose a classification method based on related events, on the view that related events are important indicators of emotion categories. First, we collect sentence groups that contain emotion keywords; then we delete the keywords and regard the remaining context as describing the emotion-related events; finally, we use that context as the feature source for emotion classification (the data-construction step is sketched below). Empirical studies demonstrate that using the context yields good performance for implicit emotion classification, providing a solid basis for further studies.
    Key words: emotion-related events; emotion classification; emotion keywords
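    A minimal sketch of the data-construction step, with a toy keyword lexicon and pre-tokenized input; a sentence containing an emotion keyword yields a (label, context) training item, and the classifier trained on such contexts then labels genuinely keyword-free text:

        EMOTION_LEXICON = {"高兴": "happy", "难过": "sad"}   # toy keyword -> category

        def make_training_item(tokens):
            """Return (emotion label, context features) or None."""
            for i, tok in enumerate(tokens):
                if tok in EMOTION_LEXICON:
                    context = tokens[:i] + tokens[i + 1:]   # drop the keyword itself
                    return EMOTION_LEXICON[tok], context
            return None

        print(make_training_item(["考试", "通过", "了", "，", "很", "高兴"]))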
  • Review
    WANG Zhihao, WANG Zhongqing, LI Shoushan, LI Peifeng, SHI Hanxiao
    2013, 27(6): 96-103.
    Feature selection aims to reduce a high-dimensional feature space so as to simplify the problem and improve the learning method, and existing studies have shown it to be effective in sentiment classification. Unlike previous studies, we investigate feature selection for semi-supervised sentiment classification and propose a novel method based on a bipartite graph. First, we model the relations between documents and words with a bipartite graph. Then, using a small amount of labeled data and the graph, a label propagation algorithm computes the probability of each feature belonging to each sentiment category. Finally, features are selected according to these sentiment probabilities. Experimental results across multiple domains demonstrate that our method performs much better than random feature selection and significantly reduces the dimensionality of the feature vector without any loss in classification performance.
    Key words: sentiment classification; semi-supervised learning; label propagation; bipartite graph; feature selection
  • Review
    HOU Min, TENG Yonglin, CHEN Yuqi
    2013, 27(6): 103-110.
    Opinion phrases, as one kind of opinion factor, are an important aspect of Chinese orientation analysis. They can be classified into five types: "opinion word + opinion word", "modifier + opinion word", "non-opinion word + opinion word", "modifier + non-opinion word", and "non-opinion word + non-opinion word". For each type, a different orientation analysis strategy is applied, combining phrase rules with an opinion phrase lexicon (a toy composition rule is sketched below). Phrase rules are organized into specific rules and common rules, and the opinion phrase lexicon is built under the rule of minimum opinion factors. Experiments show that applying the phrase rules and the opinion phrase lexicon effectively improves the precision of orientation analysis.
    Key words: opinion phrases; sentiment analysis; opinion phrase lexicon; opinion phrase rules; rule of minimum opinion factors
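    A toy sketch of one composition rule for the "modifier + opinion word" type, with tiny invented lexicons: negators flip the opinion word's polarity, intensifiers scale it:

        OPINION = {"好": 1.0, "差": -1.0}
        NEGATORS = {"不", "没"}
        INTENSIFIERS = {"很": 1.5, "非常": 2.0}

        def phrase_orientation(modifier, opinion_word):
            """Polarity score of a two-word phrase; the sign gives the orientation."""
            score = OPINION.get(opinion_word, 0.0)
            if modifier in NEGATORS:
                return -score
            return INTENSIFIERS.get(modifier, 1.0) * score

        print(phrase_orientation("不", "好"))    # -1.0
        print(phrase_orientation("非常", "好"))  # 2.0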
  • Review
    ZHANG Chen, FENG Chong, LIU Quanchao, SHI Chao, HUANG Heyan, ZHOU Haiyun
    2013, 27(6): 110-117.
    Opinions carry important information in texts, and comparative sentences are a common way to express opinions. This paper describes how to recognize comparative sentences in Chinese documents by combining rule-based and statistical methods, and analyzes the performance of these methods. The method first normalizes the corpus and its segmentation results, then obtains broad-coverage candidates using a lexicon-based method together with sentence structure and dependency analysis. A CSR rule extraction algorithm is then designed to extract dependency relations, and a CRF model is used to identify entities and semantic roles. Finally, using an SVM classifier and testing different feature dimensions, the paper finds the optimal feature combination for accurate extraction.
    Key words: comparative sentence; rule; CRF; SVM
  • Review
    HENG Wei, YU Jia, LI Lei, LIU Yongbin
    2013, 27(6): 117-128.
    The effectiveness of hLDA (hierarchical Latent Dirichlet Allocation) for hierarchical topic modeling has been widely validated. To achieve semi-supervised or unsupervised learning, cross-validation or hyperparameter sampling is usually used to determine the actual parameters. However, corpus features, modeling demands, and other factors are uncertain, so parameter adjustment, modeling effectiveness, and efficiency are difficult to achieve in practical applications. This paper builds a unified analytical framework combining Bayesian theory and boundary information, analyzes the key factors in hierarchical topic modeling, gives a series of practical and effective modeling strategies and processes, and finally evaluates the modeling results on the multi-document summarization corpus from ACL MultiLing 2013.
    Key words: hierarchical LDA; hierarchical topic modeling; unified analytical framework
  • Review
    WANG Maolin1, ZI Guangling1, XIONG Wei1, LIN Maocan2
    2013, 27(6): 128-134.
    In this paper, pitch declination in sentences is investigated based on a telephone conversation corpus. It is found that pitch declination occurs in most cases, which has a physiological cause and also serves a demarcative function. In some cases declination does not occur, which is related to semantic strength, focus, and tone. The pitch patterns of statements and questions are also analyzed: compared with statements, questions have a greater pitch range, and the pitch drop between the final two syllables is smallest for yes-no questions without a final particle.
    Key words: spontaneous speech; pitch; declination
  • Review
    FU Xiaoyin, WEI Wei, LU Shixiang, XU Bo
    2013, 27(6): 134-139.
    This paper proposes an effective method for filtering and optimizing the hierarchical phrase-based (HPB) model. After obtaining the original HPB rules with the traditional training method, we generate, by forced alignment, the bilingual derivation trees that represent the source and target sentences, and then extract HPB rules from these derivation trees. Finally, we re-estimate the probabilities of the HPB rules from the extracted rules. The method needs no linguistic knowledge and is suitable for large-scale training corpora. In large-scale Chinese-English translation tasks, it filters out about 50% of the original HPB rules and improves translation performance by 0.8 to 1.2 BLEU points on the test sets compared with the traditional training method.
    Key words: statistical machine translation; hierarchical phrase-based model; forced alignment; model training
  • Review
    YIN Yue, ZHANG Yujie, XU Jinan
    2013, 27(6): 139-144.
    In statistical machine translation systems, the automatically extracted phrase table inevitably contains a large number of erroneous and redundant phrase pairs, which wastes time and space in decoding and affects translation quality. To solve this problem, we propose a phrase table filtering method that introduces virtual contexts to calculate the increment in a phrase pair's language model score. Considering the maximum and minimum increments over the virtual contexts, we design a filtering strategy that re-ranks phrase pairs (see the sketch below). We conducted experiments on the NTCIR-9 Chinese-English data to verify the method. The results show that when the phrase table was reduced to 47% of its original size, translation quality improved slightly; when it was reduced to 30%, only a slight decline occurred. This indicates that the method effectively filters out redundant phrase pairs.
    Key words: phrase-based statistical machine translation; phrase table filtering; virtual context
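    A minimal sketch of the re-rank-and-prune step, assuming each phrase pair already carries a score derived from its virtual-context language model increments; pruning per source phrase and the 47% ratio are illustrative choices, not necessarily the paper's exact strategy:

        from collections import defaultdict

        def filter_phrase_table(entries, keep_ratio=0.47):
            """entries: [(src, tgt, score)]; keep the best keep_ratio per source."""
            by_src = defaultdict(list)
            for src, tgt, score in entries:
                by_src[src].append((score, tgt))
            kept = []
            for src, cands in by_src.items():
                cands.sort(reverse=True)
                n = max(1, int(len(cands) * keep_ratio))
                kept.extend((src, tgt, s) for s, tgt in cands[:n])
            return kept

        table = [("红 苹果", "red apple", 2.3), ("红 苹果", "red apples", 1.9),
                 ("红 苹果", "crimson fruit", 0.4)]
        print(filter_phrase_table(table))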
  • Review
    WANG Xing1, TU Zhaopeng2,3, XIE Jun2, LV Yajuan2, YAO Jianmin1
    2013, 27(6): 144-151.
    A large-scale bilingual corpus is a fundamental resource for building a high-quality statistical machine translation system. However, corpora usually contain considerable noise, which affects translation performance, so filtering noisy sentence pairs is essential. In this paper, we propose a classification-based selection approach to distinguish high-quality bilingual sentences from noisy ones. We first exploit several metrics to find the best and worst sentence pairs in the corpus, then train a classifier with richer features on these pairs and use it to classify the remaining ones. Experimental results show that our approach not only eliminates 40% of the less promising sentence pairs but also significantly improves translation performance, by 0.87 BLEU points over using all sentences.
    Key words: statistical machine translation; bilingual corpus selection
  • Review
    LI Li, LIU Zhiyuan, SUN Maosong
    2013, 27(6): 151-158.
    Automatically extracting phrase-level paraphrases is an important research task in natural language processing (NLP), with applications such as information retrieval, question answering, and document classification. Technical patents, as an important carrier of human knowledge and technology, contain abundant information, so automatically extracting phrase-level paraphrases from Chinese-English parallel patents benefits NLP tasks in technical domains. In this paper, we extract phrase-level paraphrases from Chinese-English parallel patents with a method based on statistical machine translation and use chunk parsing for paraphrase verification. Moreover, to address errors caused by translation ambiguity and poor word alignment, we re-rank the extracted paraphrases by distributional similarity (see the sketch below). In experiments, the SMT-based method achieves a precision of 43.20% on Chinese patents and 43.60% on English patents for the top 500 results; after verification with chunk parsing, the precisions rise to 75.50% and 52.40%, respectively, and the distributional-similarity re-ranking further improves performance significantly.
    Key words: phrase-level paraphrase; statistical machine translation; chunk parsing; distributional similarity
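    A minimal sketch of the distributional-similarity re-ranking, assuming sparse context vectors for each phrase have been built from monolingual text; multiplying the SMT score by the similarity is an illustrative combination:

        import math

        def dist_sim(ctx_a, ctx_b):
            """Cosine similarity of two sparse context vectors {feature: weight}."""
            dot = sum(w * ctx_b.get(f, 0.0) for f, w in ctx_a.items())
            na = math.sqrt(sum(w * w for w in ctx_a.values()))
            nb = math.sqrt(sum(w * w for w in ctx_b.values()))
            return dot / (na * nb) if na and nb else 0.0

        def rerank(candidates, ctx):
            """candidates: [(phrase_a, phrase_b, smt_score)] -> best first."""
            scored = [(a, b, s * dist_sim(ctx[a], ctx[b])) for a, b, s in candidates]
            return sorted(scored, key=lambda t: t[2], reverse=True)

        ctx = {"快速": {"运行": 2.0, "算法": 1.0}, "高速": {"运行": 1.5, "列车": 1.0}}
        print(rerank([("快速", "高速", 0.4)], ctx))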
  • Review
    FENG Wenhe
    2013, 27(6): 158-165.
    A discourse structure parallel corpus is a bilingual corpus annotated with parallel discourse structure information. This paper proposes an alignment and annotation strategy, structural and relational alignment, as the theoretical basis of a Chinese-English discourse structure parallel corpus. The strategy is applied throughout the corpus building process, covering segmental, structural, relational, and central alignment, and yields an operating mode in which alignment and annotation proceed together, combining unit alignment with structural alignment. Together with the corresponding annotation software and the solutions to the main difficulties, the strategy has proved to be an effective way of building a discourse structure parallel corpus.
    Key words: parallel corpus; alignment; discourse structure
  • Review
    LI Lin1,2, LONG Congjun2,3, JIANG Di2
    2013, 27(6): 165-169.
    Tibetan functional chunks describe the skeleton of a sentence and link sentence structure to semantics. In this paper, we propose the primary functional chunks of Tibetan and a functional chunk tag set, and on this basis we present a functional chunk boundary detection algorithm. Experiments on limited-scale data suggest that the algorithm recognizes most boundaries correctly and deserves further study.
    Key words: Tibetan functional chunks; chunk boundary detection; CRFs
  • Review
    BAI Shuangcheng1,2,3, ZHANG Jinsong1, Husile2,3
    2013, 27(6): 169-175.
    Word coding, here meaning the mapping between a word and a sequence of keystrokes, is crucial for an efficient Mongolian input method editor (IME). Using the criteria of candidate duplication and average code length, this paper presents a comparative study of the efficiency of seven coding methods in three classes, and proposes a new syllable-based fuzzy input method (the two criteria are sketched below). Experimental results show that the method is not only easy for users to memorize but also very efficient to use.
    Key words: Mongolian; IME; composition string; fuzzy input
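    A minimal sketch of the two evaluation criteria, over a toy (hypothetical) coding table mapping Mongolian words to keystroke sequences:

        from collections import Counter

        def coding_stats(word_to_code):
            """Average code length, and share of words whose code collides
            with at least one other word's code (candidate duplication)."""
            codes = Counter(word_to_code.values())
            avg_len = sum(len(c) for c in word_to_code.values()) / len(word_to_code)
            dup_rate = sum(n for n in codes.values() if n > 1) / len(word_to_code)
            return avg_len, dup_rate

        table = {"ᠮᠣᠩᠭᠣᠯ": "mongol", "ᠰᠠᠶᠢᠨ": "sain", "ᠰᠠᠢᠨ": "sain"}
        print(coding_stats(table))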
  • Review
    SU Chuanjie, HOU Hongxu1, YANG Ping1,2, YUN Huarui1
    2013, 27(6): 175-180.
    In traditional Mongolian electronic texts encoded in Unicode, spelling errors are very common, and the cost of correcting them manually is extremely high. This paper proposes an automatic spelling correction method for traditional Mongolian based on the statistical machine translation framework, treating spelling correction as a translation task that translates wrong words into correct words. We build the correction model on an improved phrase-based statistical machine translation model and use it to correct raw text. On a test set containing 1 026 correct words and 1 102 wrong words, experimental results show that the method corrects spelling errors quickly and efficiently without special linguistic knowledge; the percentage of correct words in the proofread text reaches 97.55%.
    Key words: Mongolian; spelling check; spelling correction; machine translation
  • Review
    WANG Ling2, DAWA Yidemucao1,2, WU Shouer Silamu1,2
    2013, 27(6): 180-187.
    This paper investigates the similarity between agglutinative languages of the same family (Altaic languages such as Uyghur, Kazakh, Kyrgyz, and Mongolian, spoken across different countries and regions). Cosine similarity is used to measure similarity over parallel texts and over acoustic features extracted from the same sentences spoken by speakers of the different languages. Experimental results show that word-to-word transformation is more feasible when the connection rules between stems and affixes (function words) across languages are learned at the word level with common acoustic models. This avoids the uphill work of building machine translation for resource-deficient languages, such as minority languages in developing countries, and reduces costs.
    Key words: same-family agglutinative languages; parallel text; acoustic and prosodic parameters; F0; similarity
  • Review
    GAO Tingli1, TAO Jianhua1, DAI Hongliang2, LI Ya1
    2013, 27(6): 187-192.
    Word segmentation for Dai text (Daiwen) is the basis of Dai information processing: it underlies Dai input methods, Dai machine translation systems, Dai text information extraction, and other applications. Constrained by the state of Dai corpus resources, Dai natural language processing remains relatively weak. This paper first analyzes the characteristics of written Dai and, on this basis, builds a Dai corpus; it then adapts Chinese word segmentation methods to Dai, combines them with Dai-specific characteristics, and designs a Dai word segmentation system based on sequence annotation (see the sketch below). In experiments, the segmentation system reaches an overall evaluation score of 95.58%.
    Key words: Daiwen; segmentation; CRF; absolute segmentation word
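    A generic sketch of the sequence-annotation formulation (the abstract does not give the paper's actual feature templates): each character receives a B/M/E/S tag, and simple window features per character would feed a CRF toolkit:

        def char_features(sent, i):
            """Window features for character i, for a CRF tagger."""
            prev = sent[i - 1] if i > 0 else "<BOS>"
            nxt = sent[i + 1] if i < len(sent) - 1 else "<EOS>"
            return {"cur": sent[i], "prev": prev, "next": nxt, "prev_cur": prev + sent[i]}

        def bmes_tags(words):
            """Gold tags from a segmented sentence: B/M/E inside words, S alone."""
            tags = []
            for w in words:
                tags.extend("S" if len(w) == 1 else "B" + "M" * (len(w) - 2) + "E")
            return tags

        print(bmes_tags(["我们", "在", "学习"]))   # ['B', 'E', 'S', 'B', 'E']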
  • Review
    ZHAO Ziyu, XU Jin’an, ZHANG Yujie, LIU Jiangming
    2013, 27(6): 192-201.
    Based on a knowledge base we define, this paper presents a Japanese time expression recognition method that combines a knowledge-base-strengthened rule set with a statistical model. Following the TIMEX2 standard's granular classification of time, we progressively expanded and reconstructed the knowledge base according to the characteristics of Japanese time expressions, and then optimized and updated the rule set to increase recognition accuracy. Simultaneously, we fused in a CRF model to enhance the generalization ability of the recognizer. Experimental results show that the F1 value reaches 0.8987 on an open test.
    Key words: knowledge base; rule set; statistical model
  • Review
    LIN Li
    2013, 27(6): 201-209.
    Vietnam is an important neighboring country of China, and large-scale processing of Vietnamese information has become increasingly necessary. Drawing on relevant studies and practices of frame semantic annotation at home and abroad, we built a Vietnamese news corpus. On the basis of word segmentation, part-of-speech tagging, and named entity tagging, we set out to build a Vietnamese FrameNet and made an initial exploration of applying frame semantic annotation to Vietnamese news event extraction.
    Key words: frame semantics; annotation; Vietnamese; news