2011 Volume 25 Issue 2 Published: 15 April 2011
  

    Review
  • Review
    YANG Min, CHANG Baobao
    2011, 25(2): 3-9.
    In research on semantic role labeling (SRL), one important method adopted by many researchers is to cast the task as a classification problem by selecting features and then applying different kinds of classifiers. While almost all research based on this kind of supervised learning has been done on the same corpus, the Penn Proposition Bank, here we test the same method on a new corpus, the Peking University Chinese NetBank, with the goal of determining whether the widely used features depend strongly on the corpus. The experiments show that the method and the features perform well on the new corpus. Compared with PropBank, some features play crucial roles in classification on the new corpus.
    Key words: semantic role labeling; Peking University Chinese NetBank; sequence labeling
  • Review
    LI Fang1, HE Tingting1,2
    2011, 25(2): 9-15.
    In this paper, we design a variety of summary modes for query-directed multi-document summarization, so that diversified results can satisfy individual requirements. First, the documents are represented as two-layer complex networks whose nodes describe texts and paragraphs respectively, and a network community discovery algorithm is used to cluster the texts and paragraphs. Then, on the basis of the network structure of the documents, we design four summary modes in addition to the traditional summary mode, together with a summary element extraction strategy. The modes are document summary, general summary, partial summary, global summary and detailed summary. Guided by sub-topic clues, users can browse information in a logical sequence of their own choosing.
    Key words: query-directed multi-document summarization; sub-topic discovering; multi-mode summary
  • Review
    DING Xiao, SONG Fan, QIN Bing, LIU Ting
    2011, 25(2): 15-21.
    Event extraction is an important research issue in information extraction. This paper focuses on the music domain and describes a method based on trigger clustering for event type discovery. We then propose a method based on the filtering of keywords and triggers for event type recognition. For event argument recognition, we propose a method based on a maximum entropy model. Evaluations on our corpus give final F-scores of 82.82% and 75.79% for type recognition and argument recognition, respectively.
    Key words: event extraction; event type detection; event type recognition; event argument recognition
  • Review
    LIU Bing, QIAN Longhua, XU Hua, ZHOU Guodong
    2011, 25(2): 21-27.
    Kernel-based PPI (Protein-Protein Interaction) extraction systems can achieve better performance because of their capability to capture structural information, but at the expense of high computational complexity. This paper investigates the combination of diverse lexical, syntactic and particularly dependency information for feature-based protein-protein interaction extraction using SVMs. Experimental evaluation on multiple PPI corpora reveals that dependency information as well as base phrase chunking information is very effective for feature-based PPI extraction. Particularly, our method achieves a promising performance of 54.7 in F-measure on the AIMed corpus, surpassing other state-of-the-art feature-based ones.
    Key words: PPI extraction; SVM; dependency information
  • Review
    WANG Jin1,2, WANG Huizhen1,2, ZHANG Li1,2
    2011, 25(2): 27-32.
    In this paper, we present a text representation method that uses Wikipedia categories as text features. The method maps each word of a text to a Wikipedia category, which enhances the representational power of the features and reduces the dimensionality of the text vector. A clustering-based approach is presented to address the limited coverage of Wikipedia categories by mapping unknown words into predefined categories. A text classification system is then developed that uses these learned Wikipedia categories as text features. The experimental results show that text representation based on Wikipedia categories clearly reduces dimensionality, achieving a 5.14% improvement in F1 over the BOW-based method when 700 features are used for text classification.
    Key words: text classification; text representation; Wikipedia category
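    The word-to-category mapping described in this abstract can be illustrated with a minimal sketch. The lookup table `word_to_category`, the category list and the function name `category_vector` are hypothetical and are not taken from the paper; words outside the table are simply skipped here, whereas the paper maps unknown words through clustering.

```python
from collections import Counter

def category_vector(tokens, word_to_category, categories):
    """Count how often each category is hit by the tokens of one text.

    Tokens missing from the (hypothetical) lookup table are ignored.
    """
    counts = Counter(word_to_category[t] for t in tokens if t in word_to_category)
    return [counts[c] for c in categories]

# Toy lookup table, invented for illustration only.
word_to_category = {"striker": "Football", "goalkeeper": "Football", "violin": "Music"}
categories = ["Football", "Music"]
tokens = ["the", "striker", "beat", "the", "goalkeeper"]

vector = category_vector(tokens, word_to_category, categories)
print(vector)
```

    The vector has one dimension per category rather than one per distinct word, which is the source of the dimension reduction the abstract mentions.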
  • Review
    LI Guohua, ZAN Hongying
    2011, 25(2): 32-38.
    Most methods for title extraction from HTML documents are based on the structure of the HTML document or on the features of its tags. They do not consider the correlation between the title and the content. This paper proposes a similarity-based method of title extraction from HTML documents, which makes full use of the correlation between the title and the main body. The similarity between units is calculated and adjusted by the HITS algorithm. The “real title” is then extracted in a series of steps. Experimental results show that this method performs well on “nonstandard HTML documents” and generalizes well to “standard HTML documents”.
    Key words: title extraction; similarity; Web information retrieval
  • Review
    ZHU Xiaofei1,2,3, GUO Jiafeng1, CHENG Xueqi1, DU Pan1
    2011, 25(2): 38-44.
    To address the problems of relevance measurement and redundancy in traditional query recommendation approaches, we propose a novel query recommendation approach based on Manifold Ranking. This approach exploits the intrinsic global manifold structure to capture the relevance among queries, and effectively avoids the deficiency of relevance measurement in traditional approaches when dealing with high-dimensional query data. It also reduces redundancy by boosting representative queries in the structure. Experiments on a large-scale query log of a commercial search engine show that query recommendation using Manifold Ranking is superior to both the traditional approach and the existing Hitting-time Ranking approach.
    Key words: query recommendation; manifold ranking; click-through data
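    As a rough illustration of the ranking step, the sketch below runs the standard manifold ranking iteration f ← αSf + (1 − α)y on a small affinity matrix, where S is the symmetrically normalized affinity. The function name, the toy affinities and the value of α are assumptions made for this example; it does not reproduce the paper's affinity construction from click-through data or its redundancy handling.

```python
def manifold_rank(affinity, seed_scores, alpha=0.85, iterations=200):
    """Propagate seed scores over a symmetric affinity matrix.

    Uses S = D^-1/2 W D^-1/2 and iterates f <- alpha * S f + (1 - alpha) * y.
    """
    n = len(affinity)
    # Zero the diagonal so that no item reinforces itself.
    w = [[0.0 if i == j else float(affinity[i][j]) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in w]
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if degree[i] > 0 and degree[j] > 0:
                s[i][j] = w[i][j] / (degree[i] * degree[j]) ** 0.5
    f = [float(v) for v in seed_scores]
    for _ in range(iterations):
        f = [alpha * sum(s[i][j] * f[j] for j in range(n)) + (1 - alpha) * seed_scores[i]
             for i in range(n)]
    return f

# Toy chain of four queries: q0 is the user's query, q1..q3 are candidates.
affinity = [
    [0.0, 0.9, 0.0, 0.0],
    [0.9, 0.0, 0.8, 0.0],
    [0.0, 0.8, 0.0, 0.1],
    [0.0, 0.0, 0.1, 0.0],
]
scores = manifold_rank(affinity, seed_scores=[1, 0, 0, 0])
print(scores)
```

    Candidates closer to the seed along strong edges receive higher scores, so q1 outranks q2, which outranks q3, even though q2 and q3 have no direct edge to the seed query.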
  • Review
    XU Danqing, LIU Yiqun, CEN Rongwei, MA Shaoping, RU Liyun, YANG Lei
    2011, 25(2): 44-49.
    Unlike alphabetic languages, Chinese requires input software to transform Pinyin strings into characters. Input software therefore plays an important role in human-computer interaction (HCI) for Chinese users. In research on Chinese input methods, it is important to examine user behavior in order to improve dictionary construction, algorithms, interaction design and performance evaluation. However, such work is lacking because the corresponding behavior data are difficult to collect. With the help of the company behind a widely used Chinese input software product, we collected, with users' agreement, input logs containing 410 million input strings. Analyzing these logs, we focus on the following behavior features: input string length distribution, character/word/phrase selection in different kinds of application software, and the adoption of abbreviations. The conclusions help us better understand users' input behavior and suggest possible ways to improve input software design.
    Key words: Chinese input software; user behavior; log analysis
  • Review
    LI Xianhua, YU Miao, SU Jinsong, LV Yajuan
    2011, 25(2): 49-55.
    This paper proposes different machine translation approaches to translating bibliographic information, such as person names, addresses, organization names and company names according to their different features. With dictionary and translation rules, most of them can be translated properly. For name translation, we design Pinyin conversion and Kana conversion methods. For address translation, organization name translation and company name translation, we propose a procedure which includes splitting, translating and reordering. Experiments show that these approaches achieve good results.
    Key words: bibliographic information; machine translation; person name translation; address translation; organization name translation
  • Review
    TU Zhaopeng, LIU Qun, LIN Shouxun
    2011, 25(2): 55-61.
    Long distance reordering is a key problem in statistical machine translation (SMT). The hierarchical phrase-based model offers one way to address this problem by using hierarchical rules that can characterize both local and long distance reordering. However, extracting long distance reordering rules with the traditional algorithm incurs a heavy cost in decoding time and memory. We propose a new algorithm that extracts long distance reordering rules under an extra dependency restriction. Our experiments show that our method achieves a 0.74 point improvement in BLEU score.
    Key words: statistical machine translation; hierarchical phrase-based model; long distance reordering; dependency restriction
  • Review
    LI Xiang1, XU Jin’an2, JIANG Wenbin1, LV Yajuan1, LIU Qun1
    2011, 25(2): 61-66.
    The demand for statistical machine translation (SMT) on mobile terminals is increasing, but processors without a floating point unit (FPU) restrict translation speed. This paper proposes an approach that converts floating point operations to fixed point operations in the decoder of an SMT system on mobile terminals, increasing translation speed on processors without an FPU. Experiments on a PC and a mobile terminal show that, while the approach preserves translation quality, it is 135.6% faster than floating point operations emulated by the compiler. The approach can therefore efficiently increase the translation speed of SMT systems on mobile terminals with weak floating point capability.
    Key words: statistical machine translation; fixed point; mobile terminal
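    The idea of replacing floating point with fixed point arithmetic can be sketched as follows. The Q16.16 layout (16 fractional bits), the helper names and the toy feature values are assumptions made for illustration; the abstract does not state the paper's number format or which decoder quantities are converted.

```python
import math

FRACTION_BITS = 16            # assumed Q16.16 layout, not taken from the paper
SCALE = 1 << FRACTION_BITS

def to_fixed(value):
    """Quantize a float to the nearest fixed point integer (done offline)."""
    return int(round(value * SCALE))

def to_float(fixed):
    """Convert back to float, used here only to inspect the result."""
    return fixed / SCALE

def fixed_mul(a, b):
    """Multiply two fixed point numbers and rescale the double-width product.

    The right shift floors, so each product is off by at most one unit
    in the last fractional bit.
    """
    return (a * b) >> FRACTION_BITS

def score_hypothesis(log_features, weights):
    """Weighted sum of log-domain feature values using integer operations only."""
    total = 0
    for feature, weight in zip(log_features, weights):
        total += fixed_mul(feature, weight)
    return total

# Toy log-linear score with two features.
log_features = [math.log(0.5), math.log(0.25)]
weights = [1.0, 0.5]

fixed_score = score_hypothesis([to_fixed(v) for v in log_features],
                               [to_fixed(w) for w in weights])
float_score = sum(v * w for v, w in zip(log_features, weights))
print(to_float(fixed_score), float_score)
```

    In this sketch the model values would be quantized once ahead of time, so that scoring on the device needs only integer multiplication, shifts and addition.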
  • Review
    XIONG Hao, LIU Yang, LIU Qun
    2011, 25(2): 66-72.
    Previous work on tree-based models treats rules as strings and matches them using string matching algorithms. However, the performance of tree-based models depends largely on the parsing results, and for some languages the precision of current parsers is still far from the state of the art. As a result, two rules that differ in a single tag because of a parsing error cannot be matched. Under an exact matching strategy, the set of available rules is therefore scarce, especially in tree-to-tree models, whose performance is still unacceptable. In this paper, we present a tree kernel based fuzzy matching algorithm that computes the similarity between different rules. Experimental results on the NIST 2005 Chinese-to-English test set show that our system achieves an absolute improvement of 1.3% in terms of BLEU score over the string matching system. Furthermore, when using the packed forest, our method still gets a relative improvement of 0.7 BLEU score.
    Key words: tree kernel; tree-to-string model; statistical machine translation; fuzzy matching
  • Review
    YAO Shujie1,2, XIAO Tong1,2, ZHU Jingbo1,2
    2011, 25(2): 72-78.
    In statistical machine translation (SMT), effective selection of training data can generally reduce the burden of system training and decoding. To address this issue, we propose a framework that selects a small portion of the whole training data set for SMT by considering both coverage and sentence pair quality. Experimental results on the CWMT2008 Chinese-to-English MT task show that our framework is effective in selecting a subset from a large training data set. Even when trained on only the 20% of the data selected by our framework, the SMT system achieves performance comparable with the baseline system trained on all the data.
    Key words: sentence pair quality evaluation; coverage; statistical machine translation; linear sentence pair quality evaluation model; training data selection
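    One simple way to combine coverage and sentence pair quality is a greedy selection, sketched below. This is an assumed illustration rather than the paper's framework: the function name, the use of unseen source words as the coverage measure, the equal weights and the precomputed `quality` scores are all hypothetical.

```python
def select_training_pairs(pairs, quality, budget, coverage_weight=1.0, quality_weight=1.0):
    """Greedily pick `budget` sentence pairs.

    Each step rewards pairs that add unseen source words (coverage)
    and pairs with a high, externally supplied quality score.
    """
    covered = set()
    remaining = set(range(len(pairs)))
    chosen = []
    while remaining and len(chosen) < budget:
        def gain(index):
            source_words = set(pairs[index][0].split())
            new_words = len(source_words - covered)
            return coverage_weight * new_words + quality_weight * quality[index]
        # Sorting makes ties resolve to the lowest index.
        best = max(sorted(remaining), key=gain)
        chosen.append(best)
        remaining.remove(best)
        covered |= set(pairs[best][0].split())
    return chosen

# Toy corpus of (source, target) pairs with invented quality scores.
pairs = [("a b c", "x y z"), ("a b", "x y"), ("d e", "u v"), ("a b c", "x y z")]
quality = [0.9, 0.8, 0.7, 0.2]

chosen = select_training_pairs(pairs, quality, budget=2)
print(chosen)
```

    Here the second pick is the pair that introduces new words, not the higher-quality pair whose words are already covered, which shows how the two criteria trade off.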
  • Review
    SUN Meng1,2, YAO Jianmin2, LV Yajuan1, JIANG Wenbin1, LIU Qun1
    2011, 25(2): 78-83.
    This paper presents an improved feature extraction algorithm for the maximum entropy based phrase reordering model. The algorithm extracts more accurate feature information for phrase reordering, particularly the features of inverted phrases. It solves the problem of unevenly distributed feature information and increases the rate of correct translation. Using BLEU as the metric on Chinese-to-English translation, the proposed algorithm obtains a relative improvement of 0.65% over the baseline system.
    Key words: maximum entropy; feature extraction; statistical machine translation; reordering model
  • Review
    YANG Jiang1, HOU Min2, WANG Ning1
    2011, 25(2): 83-89.
    We put forward an approach to recognizing sentiment polarity in Chinese reviews based on the shallow text structure represented by topic sentiment sentences. Considering the features of reviews, we identify the topic of a review using an n-gram matching approach. To extract topic sentiment sentences, we compute the semantic similarity between a candidate sentence and the identified topic, and at the same time determine whether the sentence is subjective. A certain number of these sentences are selected as representatives according to their semantic similarity to the topic. The average value of the representative topic sentiment sentences is calculated and taken as the sentiment polarity of the review. Experimental results show that the proposed method is feasible and can achieve relatively high precision.
    Key words: shallow text structure; topic sentiment sentence; review; sentiment orientation analysis; sentiment
  • Review
    SONG Xiaolei1, WANG Suge1,2, LI Hongxia3, LI Deyu1,2
    2011, 25(2): 89-94.
    This paper proposes two methods for determining the sentiment orientation of a word based on Probabilistic Latent Semantic Analysis (PLSA). In the first method, the similarity matrix between target words and paradigm words is obtained by PLSA, and the polarity of each target word is then determined by voting. In the second method, we obtain semantic clusters of the target words by PLSA, and the polarity of a target word is then determined by a synonym-based method. The advantage of both methods is that they work well without any external knowledge resources.
    Key words: probabilistic latent semantic analysis; sparse data; semantic clustering; sentiment orientation
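    The voting step of the first method might look like the sketch below. The similarities are supplied by hand here rather than computed with PLSA, and the specific rule (a majority vote among the k most similar paradigm words) is an assumption, since the abstract says only that polarity is determined by voting. All names and values are hypothetical.

```python
def vote_polarity(similarities, paradigm_polarity, k=3):
    """Vote among the k paradigm words most similar to the target word.

    similarities: {paradigm_word: similarity to the target word}
    paradigm_polarity: {paradigm_word: +1 or -1}
    """
    nearest = sorted(similarities, key=similarities.get, reverse=True)[:k]
    ballot = sum(paradigm_polarity[word] for word in nearest)
    if ballot > 0:
        return "positive"
    if ballot < 0:
        return "negative"
    return "neutral"

# Invented paradigm words and similarity values for one target word.
paradigm_polarity = {"good": 1, "excellent": 1, "bad": -1, "poor": -1}
similarities = {"good": 0.8, "excellent": 0.9, "bad": 0.1, "poor": 0.2}

label = vote_polarity(similarities, paradigm_polarity)
print(label)
```

    With k = 3 the ballot is cast by "excellent", "good" and "poor", so the two positive votes outweigh the single negative one.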
  • Review
    LI Tingyu, GE Zhengrong, YAO Tianfang
    2011, 25(2): 94-99.
    With the development of search engines, the demand for question answering (QA) systems has become pressing. The first step for a QA system in handling a question is to judge whether it is an opinion question or a non-opinion question, and then to classify opinion questions further. This paper first analyzes the current state of QA systems and question classification and summarizes some open problems. We then analyze questions in three layers. In the semantic layer, we extract three key words. In the syntactic layer, we classify the question into one of five typical question types using rules. In the domain layer, we judge the domain by the number of related websites returned by a search engine. Finally, based on these three layers, we develop a system for experiments and further analysis. The experimental results show that the proposed three-layer scheme is very useful for the classification of opinion questions.
    Key words: opinion question; QA system; question classification; natural language processing
  • Review
    JIA Yuxiang1,2, YU Shiwen2
    2011, 25(2): 99-105.
    Metaphor, an expression that describes one thing in terms of another, is pervasive in natural languages, so metaphor processing is indispensable for natural language understanding. This paper proposes a lexicon-based method to recognize nominal metaphor, one of the basic metaphor types. Semantic distance from TongYiCiCiLin and semantic relations in HowNet are combined for metaphor recognition, and the association between metaphor and lexicon-based semantic information is also discussed.
    Key words: nominal metaphor; metaphor recognition; lexicon; semantic distance; semantic relation
  • Review
    NUO Minghua1,2, ZHANG Liqiang1, LIU Huidan1,2, WU Jian1, DING Zhiming1
    2011, 25(2): 105-111.
    This paper describes a method for extracting phrase pairs from a domain-specific Chinese-Tibetan bilingual corpus of laws, regulations and official documents. Widely used phrase extraction methods depend heavily on word alignment results or on additional resources such as part-of-speech tagging or syntactic analysis. Given the scarcity of such resources for Tibetan at present, this paper proposes a two-phase method for extracting Chinese-Tibetan phrase pairs. The first step extracts Chinese phrases (multi-word chunks) using Nagao's Algorithm and the Substring Reduction Algorithm. The second step extracts the candidate Tibetan translation for each translation-ready Chinese phrase. For this step, the paper proposes the Tibetan words sequence intersection algorithm (TIA) to extract Tibetan phrases. TIA works well on both 1-1 and 1-n (continuous or discontinuous) translations of Tibetan phrases.
    Key words: Chinese Tibetan phrase extraction; Tibetan information processing; Chinese information processing
  • Review
    JIANG Di1,3, LIU Huidan2, WU Bing3,4
    2011, 25(2): 111-117.
    This paper discusses a type of IPA input method and its software system, illustrated with an applied system called Landie IPA (the Blue Butterfly IPA Input System). The coding principle is to classify IPA symbols by their graphic forms; a well-designed classification guides users in their typing operations. The input technology supports continuous typing with high-frequency precedence, and the graphic design helps arrange multilingual layouts.
    Key words: IPA; input code; dynamic keyboard; continuous input; font design
  • Review
    ARZUGUL·Xerip1,2
    2011, 25(2): 117-122.
    Drawing on mathematical logic, this paper gives a formalized description of the “SUBS+NP” structure from four aspects: syntactic patterns, arguments and argument structure, semantic relationships between verb and noun, and logical expression. The results show that, in the Uyghur “SUBS+NP” structure, whether the verb takes the “” type gerund form or the participle form is determined by the various semantic relationships between verb and noun. Moreover, the semantic relationship between verb and noun is subordinate to syntactic coding and expresses different semantic content through the two syntactic forms.
    Key words: “SUBS+NP” structure; formalize; computational linguistics
  • Review
    ZHANG Haijun1,2, SHI Shumin3, DING Xiyuan2, HUANG Heyan3
    2011, 25(2): 122-129.
    Constructing a candidate word set based on repeats is an important approach to Chinese Unknown Words Identification (UWI). Two kinds of strategies are used to extract repeats: character-based and Chinese Word Segmentation (CWS) based. In this paper, extensive comparative experiments are carried out on these two strategies, and a quantitative omission model for CWS-based candidate unknown words is presented to evaluate the omission of unknown words. The studies show a good correlation between the experimental results and the model's predictions. Based on a discussion of the quantitative model, a reliable conclusion about Chinese UWI under the two strategies is reached, which offers a useful reference for follow-up research on UWI.
    Key words: unknown words identification; repeats; CRF; Chinese word segmentation
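    The character-based strategy can be pictured with a minimal sketch that collects every character n-gram occurring at least twice. The function name, the length limits and the toy sentence are assumptions for illustration; the paper's actual extraction procedure, filtering and omission model are not reproduced here.

```python
from collections import Counter

def extract_repeats(text, min_length=2, max_length=4, min_count=2):
    """Return character n-grams that occur at least `min_count` times.

    Occurrences are counted at every start position, so overlapping
    matches are included; punctuation is not treated specially.
    """
    counts = Counter()
    for length in range(min_length, max_length + 1):
        for start in range(len(text) - length + 1):
            counts[text[start:start + length]] += 1
    return {gram: n for gram, n in counts.items() if n >= min_count}

# Toy input in which only one bigram repeats.
text = "微博很火微博用户很多"
repeats = extract_repeats(text)
print(repeats)
```

    A CWS-based variant would count repeated sequences of segmented words instead of raw characters, so a segmentation error inside an unknown word can keep it out of the candidate set. That omission is what the abstract's quantitative model evaluates.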