2013 Volume 27 Issue 1 Published: 15 February 2013
  

  • Review
    SHI Jing1, WU Yunfang1, QIU Likun2, LV Xueqiang3
    2013, 27(1): 1-7.
    Automatic acquisition of similar words is one of the most crucial problems in natural language processing tasks, e.g. query expansion in information retrieval, pattern identification in machine translation, parsing, and word sense disambiguation (WSD). This paper focuses on Chinese semantic similarity computation based on large corpora, investigating context feature weighting, vector similarity measures, the window context vs. the dependency context, and newspaper corpora vs. web corpora. Our experiments show that, on the web corpus, the window-based context combined with the PMI weighting function and the cosine measure gives the best semantic similarity results.
    Key words: semantic similarity; context; weight function; dependency relation
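    As a rough illustration of the pipeline this abstract describes (window-based context vectors, a PMI weighting function, cosine similarity), the following sketch builds PMI-weighted context vectors from a toy corpus; the window size, corpus, and PMI estimate here are illustrative, not the paper's settings.

```python
import math
from collections import Counter

def context_vectors(sentences, window=2):
    """Build PMI-weighted context vectors from window co-occurrence counts."""
    cooc, word_count, total = Counter(), Counter(), 0
    for sent in sentences:
        for i, w in enumerate(sent):
            word_count[w] += 1
            total += 1
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if i != j:
                    cooc[(w, sent[j])] += 1
    vectors = {}
    for (w, c), n in cooc.items():
        # Simple pointwise mutual information estimate; keep positive values only.
        pmi = math.log(n * total / (word_count[w] * word_count[c]))
        if pmi > 0:
            vectors.setdefault(w, {})[c] = pmi
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse (dict-based) vectors."""
    num = sum(u[k] * v[k] for k in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0
```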
  • Review
    WANG Shi1, CAO Cungen1, PEI Yajun3, XIA Fei1,2
    2013, 27(1): 7-15.
    The word similarity measure plays a basic role in many NLP applications. In this paper, we propose a novel and practical method for this purpose with acceptable precision. Guided by the classic distributional hypothesis that “similar words occur in similar contexts”, we suggest that collocations in two-word noun phrases can serve as better contexts than adjacent words, because the former are more semantically related. Using automatically built large-scale noun phrases, we first construct tf-idf weighted word vectors containing direct and indirect collocations, and then take their cosine distances as the desired semantic similarities. To compare with related approaches, we manually design a benchmark test set, on which the proposed method achieves correlation coefficients of 0.703, 0.509, and 0.700 on nouns, verbs, and adjectives, respectively, at 100% coverage.
    Key words: semantic similarity; word collocation; similarity benchmark set
  • Review
    XU Hua, LIU Dandan, QIAN Longhua, ZHOU Guodong
    2013, 27(1): 15-21.
    The context-based approach is currently popular for constructing bilingual lexicons from comparable bilingual corpora. Specifically, the dependency context model extracts context features from a sentence's dependency tree. This model improves the performance of bilingual lexicon construction, since dependency relationships better capture the co-occurrence relationships between words. Following this line, this paper further proposes a dependency relationship mapping model, which constructs a bilingual lexicon by mapping dependency context words, dependency relationship types, and directions simultaneously. Experiments on the FBIS corpus show that our approach significantly outperforms a state-of-the-art system in bilingual lexicon construction in both the Chinese-English and English-Chinese directions, justifying the effectiveness of the dependency relationship mapping model.
    Key words: bilingual lexicon construction; dependency context model; dependency relationship mapping
  • Review
    TANG Wei, HONG Yu, FENG Yanhui, YAO Jianmin, ZHU Qiaoming
    2013, 27(1): 21-30.
    Representing products by their attributes and attribute values improves the effectiveness of many applications, such as demand forecasting, product recommendation, and product supplier selection. In this paper, we propose a novel pattern-based method to extract the “attribute-value” pairs of products from structured or semi-structured Web pages. The approach contains four key components: 1) acquire domain-specific attributes from the titles of Web pages in the same domain; 2) refine text nodes based on default delimiters; 3) collect seed “attribute-value” pairs based on the domain-specific attributes; 4) construct high-quality patterns by combining page-specific layout information and character information. The experimental corpus is collected from two domains: digital cameras and mobile phones. Experiments show the proposed method achieves 94.68% precision and 90.57% recall.
    Key words: product “attribute-value” relation extraction; web data mining; template construction
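    Step 3 of the approach, collecting seed “attribute-value” pairs by splitting text nodes on default delimiters and matching domain-specific attributes, can be sketched as follows; the delimiter set and attribute list are hypothetical placeholders, not the paper's learned resources.

```python
import re

# Hypothetical delimiters and domain attributes; the paper acquires these
# from page titles and layout, which is not reproduced here.
DELIMITERS = r"[:：]"  # ASCII and full-width colons
DOMAIN_ATTRIBUTES = {"resolution", "weight", "battery"}

def extract_seed_pairs(text_nodes):
    """Collect seed (attribute, value) pairs from delimiter-split text nodes."""
    pairs = []
    for node in text_nodes:
        parts = re.split(DELIMITERS, node, maxsplit=1)
        if len(parts) == 2:
            attr, value = (p.strip() for p in parts)
            if attr.lower() in DOMAIN_ATTRIBUTES and value:
                pairs.append((attr.lower(), value))
    return pairs
```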
  • Review
    XIAO Sheng1,2, HE Yanxiang1
    2013, 27(1): 30-39.
    This paper develops an event type recognition approach based on hypergraphs to avoid the impact of the independence assumption in the vector space model. The approach first represents the multiple ordering relationships among event elements by an event hypergraph. It then displays the attributes and structure of one event (or a series of events) in various dimensions through an event hypergraph model integrating a type component and a dimension component. Finally, it computes event similarity from these attributes and structure. The experimental results show that this approach raises recognition performance to an F-score of 83.0%, better than SVM or maximum entropy (ME) classifiers based on the vector space model.
    Key words: event extraction; event type recognition; hypergraph; directed hypergraph; event hypergraph model; event similarity
  • Review
    LI Peng1,2, WANG Bin1,JIN Wei3
    2013, 27(1): 39-47.
    User-generated social annotations provide extra information for describing document contents and, intuitively, are useful for improving search effectiveness. This paper presents a novel retrieval method utilizing the categorization property of social annotations. Specifically, social annotations are modeled as categories to estimate a topic model, which is then used for language model smoothing. This method reduces the impact of annotation sparsity and effectively exploits the implications of the annotations to improve retrieval performance. Experiments are carried out on synthetic datasets constructed from the TREC evaluation conference. The results demonstrate the effectiveness of the proposed method, which significantly outperforms an LDA-based baseline and other social-tag-based retrieval methods.
    Key words: social annotation; tag; language model; topic model
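    Topic-model smoothing of a document language model, the general technique this abstract relies on, can be illustrated with a simple Jelinek-Mercer-style interpolation; this is a generic sketch, not the paper's exact estimator, and the mixing weights `lam` and `mu` are arbitrary.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, topic_model, collection, lam=0.5, mu=0.3):
    """Score a query by a document LM interpolated with a topic model
    (e.g. estimated from annotation categories) and a collection model."""
    doc_counts, doc_len = Counter(doc), len(doc)
    coll_counts, coll_len = Counter(collection), len(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len
        p_topic = topic_model.get(term, 0.0)      # P(term | topic of the doc)
        p_coll = coll_counts[term] / coll_len
        p = lam * p_doc + mu * p_topic + (1 - lam - mu) * p_coll
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

    The topic component gives sparse query terms a nonzero probability even when they appear in neither the document nor the collection sample, which is the smoothing effect described above.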
  • Review
    FU Xianghua, LIU Guo, GUO Yanyan, GUO Wubiao
    2013, 27(1): 47-56.
    Weblogs are an important medium for people to express their personal opinions and sentiment, and generally involve several topics or implied public opinions. Existing sentiment analysis research on such user-generated content mostly works at the document level rather than at finer granularities. This paper proposes a novel method based on the LDA topic model and the HowNet lexicon to determine the sentiment orientation of blogs with multi-aspect topics. The method first trains the LDA topic model on a data corpus. It then identifies and segments topics with the trained model, taking a sliding window as the basic processing unit, so that the topics of paragraphs can be identified. Finally, it conducts sentiment analysis on the topic paragraphs with the HowNet lexicon. The method can simultaneously identify multi-aspect topics and the sentiment orientation of these topics. Experimental results show that this approach not only obtains good topic partitioning results, but also helps to improve sentiment analysis accuracy.
    Key words: multi-aspect sentiment analysis; blog sentiment analysis; LDA topic model; HowNet lexicon
  • Review
    LIAO Xiangwen1, XU Hongbo2, SUN Le3, YAO Tianfang4
    2013, 27(1): 56-64.
    Opinion mining is a hot topic of natural language processing. To promote Chinese opinion mining research, the Technical Committee of Information Retrieval in Chinese Information Processing of China held the third Chinese Opinion Analysis Evaluation Conference (COAE2011). The conference focuses on the influence of domain and context on Chinese opinion analysis. This paper presents the construction of the COAE2011 corpus and how the corpus works in the evaluation: it first introduces the course of corpus construction, such as the distribution of domains and media; it then discusses in detail the tagging criteria and methods for the corpus; finally, the impact of domain and context on Chinese opinion mining is evaluated based on the results of submitted runs. The COAE2011 corpus provides strong support for Chinese opinion analysis.
    Key words: Chinese information processing; opinion analysis; opinion corpus; text coding initiative
  • Review
    DUAN Nan1, LI Mu2, ZHOU Ming1, 2
    2013, 27(1): 64-72.
    This paper presents a comparative analysis of the various consensus decoding methods that have appeared in recent years for statistical machine translation (SMT). Based on the different ways of using translation hypotheses generated by single or multiple SMT systems, we classify current consensus decoding methods into two categories: hypothesis reranking-based consensus decoding and hypothesis reconstruction-based consensus decoding. After reviewing the most representative work in each category, we perform Chinese-to-English machine translation experiments on large-scale data sets to compare the methods listed in this paper. The future development prospects of consensus decoding are discussed as well.
    Key words: natural language processing; statistical machine translation; consensus decoding; minimum Bayes-risk decoding; system combination
  • Review
    LIU Yingying, LUO Senlin, FENG Yang, HAN Lei, CHEN Gong, WANG Qian
    2013, 27(1): 72-81.
    Sentential semantic structure analysis is an important issue in Chinese semantic analysis. Based on Modern Chinese Semantics, this paper establishes a hierarchical Chinese sentential semantic structure model, defines the annotation standard and the tagset, and thus constructs a Chinese corpus of sentential semantic structure: BFS-CTC (Beijing Forest Studio - Chinese Tagged Corpus). All sentences in this corpus are tagged at the lexical, syntactic and whole sentential semantic structure levels, making it easy to analyze the relation between syntax and semantics. The core of BFS-CTC consists of four banks: the original sentence bank (OSB), the lexical tagged bank (LTB), the syntax tagged bank (STB) and the semantic structure tagged bank (SSTB). The more than 10,000 sentences in the current version come from news texts, covering the six major sentence types in Chinese.
    Key words: natural language processing; semantic analysis; sentential semantic structure; corpus
  • Review
    ZHAO Jianjun1,2, YANG Yufang2, LV Shinan3
    2013, 27(1): 81-86.
    Based on the annotation of sentence focus in 30 narrative texts of Mandarin Chinese by 20 annotators, a statistical analysis is conducted to explore the distribution pattern of sentence focus from the perspectives of lexicon and semantics. The results show that focus words account for about one fifth of the content words, and that adjectives are more likely to become the focus than words of other parts of speech. From the perspective of semantic roles, the patient has the highest probability of being focalized, followed by the peripheral arguments, with the agent and predicate words coming last.
    Key words: narrative discourse; focus distribution; semantic roles
  • Review
    LI Lishuang, Liu Yang, HUANG Degen
    2013, 27(1): 86-93.
    Protein-protein interaction (PPI) extraction is important in biomedical information extraction for its high application value. This paper applies the support vector machine (SVM) to extract PPIs, specifically with an ensemble kernel combining a polynomial kernel and a convolution tree kernel. To address the pruning of a complete syntactic parse tree, which contains too much noise, we examine the influence of different pruning strategies on the experimental results, comparing the complete tree, the minimum complete tree, the minimum tree and the shortest path enclosed tree, and find the last to be the best choice. On the basis of the shortest path enclosed tree, we propose a dynamically extended tree with better results than the other parse trees. Finally, we use the ensemble kernel to extract PPIs on the AIMed corpus with 10-fold cross-validation, with precision, recall and F-score reaching 82.40%, 51.30% and 63.23%, respectively.
    Key words: PPI; SVM; convolution tree kernel; ensemble kernel; pruning strategies
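    An ensemble kernel of the kind described, combining a polynomial kernel with a convolution tree kernel, is commonly formed as a weighted sum of the two kernel matrices (a weighted sum of valid kernels is itself a valid kernel). The sketch below assumes precomputed matrices and an illustrative weight; the paper's actual combination scheme and weight are not reproduced here.

```python
def ensemble_kernel(k_poly, k_tree, alpha=0.6):
    """Convex combination of a polynomial kernel matrix and a convolution
    tree kernel matrix, entry by entry; the weight alpha is illustrative."""
    return [[alpha * p + (1 - alpha) * t for p, t in zip(row_p, row_t)]
            for row_p, row_t in zip(k_poly, k_tree)]
```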
  • Review
    CHENG Nanchang1,2, HOU Min3
    2013, 27(1): 93-98.
    Sorting out the “Homonyms Glossary of Chinese Dialect” is fundamental to dialect investigation. This paper describes the design and development of an automatic generation system for the “Homonyms Glossary of Chinese Dialect”. Once the sorting standards for the vowels, consonants and tones are set by the user, the sorting of the input dialect glossary can be started. The sorting method used in our software is a quadruple loop over vowels, consonants, tones and all words of the input glossary, which finally generates the vertical homonyms glossary. Experimental evidence shows that the system fully meets the practical needs of dialect investigation.
    Key words: homonyms glossary of Chinese dialect; vertical homonyms glossary; automatic generation; design principles
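    The quadruple loop described above (vowels × consonants × tones × words) can be sketched as follows; the entry format and the example orderings are illustrative, not the system's actual data model.

```python
def homonym_glossary(entries, vowel_order, consonant_order, tone_order):
    """Group dialect syllables into a vertical homonym glossary, ordered by
    the user-specified rankings. Each entry is (word, vowel, consonant, tone)."""
    glossary = []
    # Quadruple loop -- vowels x consonants x tones x entries -- mirroring
    # the generation order described in the abstract.
    for v in vowel_order:
        for c in consonant_order:
            for t in tone_order:
                homonyms = [w for (w, vv, cc, tt) in entries
                            if (vv, cc, tt) == (v, c, t)]
                if homonyms:
                    glossary.append(((v, c, t), homonyms))
    return glossary
```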
  • Review
    YAN Ke1, WEI Si2, DAI Lirong1
    2013, 27(1): 98-108.
    The traditional approach uses only standard-pronounced speech data to build acoustic models, which makes automatic pronunciation evaluation systems perform poorly on accented speech because the training and test conditions are mismatched. To deal with this problem, this paper presents a novel algorithm that utilizes both standard and accented speech data to optimize the acoustic model by minimizing the root mean square error between the manual and the machine scores. Experiments on a live Putonghua (PSC) database of 3 685 recordings (498 for test and 3 187 for training) show that the evaluation acoustic models generated by the proposed method are significantly better than those from traditional approaches.
    Key words: computer assisted language learning; discriminative training; PSC; pronunciation quality evaluation
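    The objective named in this abstract, the root mean square error between manual and machine scores, can be sketched as follows, together with a closed-form least-squares rescaling of machine scores as one elementary way to reduce it. This is a generic illustration, not the paper's discriminative training procedure, and the example scores are invented.

```python
import math

def rmse(machine, manual):
    """Root mean square error between machine and manual scores --
    the quantity the training objective seeks to minimize."""
    n = len(machine)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(machine, manual)) / n)

def calibrate(machine, manual):
    """Closed-form least-squares fit of a linear rescaling a*x + b of the
    machine scores against the manual scores (ordinary least squares)."""
    n = len(machine)
    mx, my = sum(machine) / n, sum(manual) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(machine, manual))
    var = sum((x - mx) ** 2 for x in machine)
    a = cov / var if var else 0.0
    b = my - a * mx
    return a, b
```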
  • Review
    WANG Zhen1,2, LIU Huidan1,2, WU Jian1
    2013, 27(1): 108-115.
    The new national standard GB 25914-2010 provides a practical standard for Mongolian shaping rules. However, no Mongolian shaping engines or OpenType fonts currently follow the new standard. To display Traditional Mongolian correctly on the computer, this paper proposes a new Mongolian shaping model, which not only complies with the new standard but is also effective and general. With this model, the paper implements Mongolian shaping engines in both Qt4 (on Linux KDE) and Pango (on Linux GNOME). Experiments show that the engines display Mongolian text correctly. Moreover, some applications on the GNOME platform, such as Firefox, can shape Mongolian correctly, which suggests that our model will facilitate Mongolian operating systems based on the GNOME or KDE desktop environments.
    Key words: national standard; Mongolian shaping engine; OpenType font; Qt4; Pango
  • Review
    ZHAO Weina1,2, YU Xin2, LIU Huidan2,3, LI Lin1,4, WANG Lei5, WU Jian2
    2013, 27(1): 115-120.
    Due to the special features of the Tibetan punctuation system, sentence boundary detection (SBD) is one of the most significant tasks in Tibetan text processing. This work focuses on detecting modern Tibetan sentence boundaries marked by sentence-final auxiliary words, and proposes a Tibetan SBD algorithm.
    Key words: sentence boundary detection; Tibetan sentence boundary detection; Tibetan information processing; Chinese information processing
  • Review
    CHEN Xiaorong, YANG Hanyue, ZHENG Gaoshan, HUANG Qian
    2013, 27(1): 120-129.
    This paper presents the design of a Unicode-based encoding for Shui nationality characters and establishes a TrueType font for them. Specifically, a coding method based on stroke shape is proposed, in which a Shui character is coded by an ordered sequence of three strokes taken from the character's corners. Finally, the corresponding keyboard input system is implemented with the IMM-IME mechanism in Windows.
    Key words: Shui nationality character; Unicode; font; input method