2007 Volume 21 Issue 5 Published: 15 October 2007
  

  • Review
    LI Yu-mei, CHEN Xiao, JIANG Zi-xia, YI Jiang-yan, JIN Guang-jin, HUANG Chang-ning
    2007, 21(5): 1-7.
    This paper proposes three complements to existing Chinese word segmentation guidelines, which are essential for building high-quality segmented Chinese corpora: tagging rules for named entities (person, location and organization names), tagging rules for factoids (date, time, percentage, etc.), and disambiguation rules. Named entities and factoids are treated as segmentation units in many corpora, yet earlier segmentation guidelines seldom define the disambiguation problem; moreover, people often have different intuitions about ambiguous strings, so the guidelines need to address them explicitly. Our practice shows that specifying such segmentation rules helps decrease errors and inconsistencies in annotated corpora.
  • Review
    ZHAO Hai, Chunyu Kit
    2007, 21(5): 8-13.
    Research on automatic Chinese word segmentation has advanced rapidly in recent years, especially since the First International Chinese Word Segmentation Bakeoff in 2003; in particular, character-based tagging has achieved great success in this field. In this paper, we generalize this method to subsequence-based tagging, aiming to find longer tagging units through a reliable algorithm. We propose a two-step framework: in the first step, an iterative maximum matching filtering algorithm obtains an effective subsequence lexicon; in the second step, a bi-lexicon based maximum matching algorithm identifies the subsequence units. The effectiveness of this approach is verified in our experiments on two closed-test data sets from Bakeoff-2005.
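The maximum matching routine at the heart of the first step can be sketched as follows. This is a minimal greedy forward matcher, not the paper's full iterative filtering algorithm (which would apply such matching repeatedly against an evolving lexicon); the function name and toy lexicon are ours.

```python
def fmm_segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the longest
    lexicon entry that matches; fall back to a single character."""
    result, i = [], 0
    while i < len(text):
        for span in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + span]
            if span == 1 or cand in lexicon:
                result.append(cand)
                i += span
                break
    return result
```

For example, with the lexicon {"中国", "人民", "银行"}, the string "中国人民银行" is segmented into its three two-character entries.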
  • Review
    WANG Si-li, WANG Bin
    2007, 21(5): 14-17.
    In this paper, two statistical measures, the Coupling Degree of Double Characters (CDDC) and the Difference of t-test (DT), are applied to overlapping ambiguity resolution in Chinese word segmentation. First, all possible overlapping ambiguities are detected using the segmentation dictionary; then a simple linear combination of CDDC and DT is used for ambiguity resolution. The experimental results show that our method performs better than the combination of the Mutual Information of Double Characters and DT, which previous work had shown to be a very effective method for overlapping ambiguity resolution.
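The combination step can be sketched as a weighted sum. The weight λ and the sign convention (positive score favors the first segmentation) are illustrative assumptions; the paper's actual CDDC and DT formulas are not reproduced here.

```python
def combined_score(cddc, dt, lam=0.5):
    """Linear combination of the two measures for one ambiguous string."""
    return lam * cddc + (1 - lam) * dt

def choose_segmentation(seg_a, seg_b, cddc, dt, lam=0.5):
    # Hypothetical sign convention: a positive combined score favors seg_a.
    return seg_a if combined_score(cddc, dt, lam) > 0 else seg_b
```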
  • Review
    ZHOU Qiang, ZHAO Ying-ze
    2007, 21(5): 18-24.
    Chinese functional chunks are defined as a series of non-overlapping, non-nested skeleton segments of a sentence, representing the implicit grammatical relations between sentence-level predicates and their arguments. In this paper, we propose two statistical models for parsing the four main functional chunks of a sentence. In the chunk boundary detection model, we build SVM-based sub-models for detecting SP (subject-predicate) and PO (predicate-object) boundaries. In the sequence labeling model, we formulate chunking as a sequence labeling problem and base our model on the CRF algorithm. By introducing some revision rules, we build a combined parsing model that integrates the advantages of both statistical models and achieves best F-scores of 82.93%, 86.58%, 78.46% and 86.64% for subject, predicate, object and adverbial functional chunks, respectively. Experimental results show that complex clauses and serial verb structures are the main recognition difficulties.
  • Review
    DUAN Xiang-yu, ZHAO Jun, XU Bo
    2007, 21(5): 25-30.
    Action-based dependency parsing, also known as deterministic dependency parsing, is often regarded as an efficient parsing algorithm, though its accuracy has been somewhat lower than the best results reported by more complex parsing models. In this paper, we compare action-based dependency parsers with more complex methods, such as generative and discriminative parsers, on the standard Penn Chinese Treebank data set. The results show that, for Chinese dependency parsing, action-based parsers outperform generative and discriminative parsers. Furthermore, we propose two models for the modeling of parsing actions in action-based Chinese dependency parsing, taking the original action-based parsers as baselines. Both models outperform the baselines while maintaining the same time complexity, and our best result improves considerably on the baseline.
  • Review
    ZHANG Liang, CHEN Jia-jun
    2007, 21(5): 31-35.
    Given a large number of correctly parsed examples in which both the parsing procedure and the parsing result are recorded, syntactic parsing can be carried out by searching for a similar example or fragment and matching similar language structures and analyses in the examples. This embodies the assumption that human language perception and production work with representations of concrete language experiences rather than abstract grammar rules. In this paper, we propose a new parsing technique based on syntactic pattern matching: we extract syntactic patterns from a large-scale treebank and build, in advance, a library of syntactic patterns/sub-patterns with their corresponding reduction procedures. Parsing is then performed by pattern matching and partial pattern transformation. The experiments show satisfying parsing results and very fast execution, averaging 0.46 s per sentence (CPU: Intel Core Duo 2.8 GHz; memory: 1 GB).
  • Review
    HE Rui-fang, QIN Bing, LIU Ting, PAN Yue-qun, LI Sheng
    2007, 21(5): 36-40.
    Recognizing time expressions is the foundation of their normalization, and its performance directly influences the robustness of the normalization. This paper proposes a new method for recognizing the extents of time expressions based on dependency parsing and error-driven learning. Starting from the time trigger word (the syntactic head of the dependency relation), it uses Chinese dependency parsing to recognize the extents of time expressions, and then applies transformation-based error-driven learning, which automatically acquires and modifies rules, to improve performance; applying the learned rules brings a 3.5% increase. Finally, F1 = 76.38% and F1 = 76.57% are obtained on the closed and open test sets, respectively.
  • Review
    LIU Song-bin, DU Yun-cheng, SHI Shui-cai
    2007, 21(5): 41-45.
    This paper proposes a method of computing PageRank based on transfer matrix decomposition. Based on the PageRank random-surfer model, the method decomposes the Markov state transfer matrix so that memory cost, computational complexity and I/O demands are reduced drastically. Experiments on 17 million Web pages containing 280 million links show that each iteration completes within 30 seconds and that peak memory demand is 585 MB, indicating that the method meets the demands of engineering applications.
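For reference, the plain power-iteration computation that the paper's matrix decomposition is designed to make tractable at scale can be sketched as follows. This is the textbook random-surfer iteration, not the paper's decomposition method.

```python
def pagerank(links, d=0.85, iters=50):
    """Textbook power iteration on the random-surfer model.
    links: {page: [pages it links to]}; dangling mass is spread uniformly."""
    pages = set(links) | {q for outs in links.values() for q in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        # every page gets the teleportation share, then link mass is added
        new = {p: (1 - d) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            share = d * rank[p] / (len(outs) or n)
            for q in (outs or pages):  # dangling page: jump uniformly
                new[q] += share
        rank = new
    return rank
```

On a toy graph where two pages both link to a third, the third page ends up with the highest rank, and the ranks always sum to 1.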
  • Review
    HU Yi, LU Ru-zhan, LIU Hui
    2007, 21(5): 46-50.
    Dependency analysis between concepts is usually one of the keys to improving the performance of an information retrieval system. In this paper, we explore a bootstrapping method to automatically extract semantic patterns for identifying the “(geographical) is-part-of”, “(entity) function” and “(motion) object” relations between concepts in context, and develop a system named SPG (Semantic Pattern Getter). Our contributions are: (1) introducing a bi-sequence alignment algorithm from bioinformatics to generate candidate patterns, and (2) defining a new metric for evaluating pattern confidence. For the automatic recognition of the three relations, experiments show that the pattern set generated by SPG achieves higher precision and coverage than DIPRE.
  • Review
    WANG Gen, ZHAO Jun
    2007, 21(5): 51-55.
    This paper proposes a new method called Multi-redundant-labeled CRFs and applies it to sentence-level sentiment analysis. The method not only solves ordinal regression problems effectively, but also obtains a globally optimal result over multiple cascaded subtasks by merging subjective/objective classification, polarity classification and sentiment strength rating into an integrated model, with each subtask maintaining its own feature types. Experiments on sentence sentiment classification show better performance than standard CRFs, validating the effectiveness of the method. Additionally, the method theoretically provides a way to solve ordinal regression problems for algorithms trained by maximum likelihood estimation.
  • Review
    ZHANG Xi-juan, WANG Hui-zhen, ZHU Jing-bo
    2007, 21(5): 56-60.
    In text classification tasks, well-known feature selection methods such as information gain adopt a conditional independence assumption between features. However, this assumption results in serious redundancy among the selected features. To alleviate redundancy within the selected feature subset, this paper proposes a feature selection method based on the minimal redundancy principle (MRP), in which correlations between different features are considered during the selection process, so that a feature subset with less redundancy can be built. Experimental results show that the MRP method improves the effectiveness of feature selection and, in most cases, yields better text classification performance.
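A minimal sketch of the greedy selection such a principle suggests. The per-feature relevance scores (e.g., information gain) and the pairwise correlation function are assumed to be given, and the relevance-minus-average-redundancy criterion is our illustrative variant, not necessarily the paper's exact formula.

```python
def select_features(relevance, corr, k):
    """Greedy selection: at each step pick the feature with the highest
    relevance minus its average correlation to the features already chosen.
    relevance: {feature: score}; corr(f, g): correlation in [0, 1]."""
    selected = []
    candidates = set(relevance)
    while candidates and len(selected) < k:
        def score(f):
            if not selected:
                return relevance[f]
            redundancy = sum(corr(f, g) for g in selected) / len(selected)
            return relevance[f] - redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With two highly correlated strong features and one weaker independent one, the criterion picks the strongest feature and then the independent one, skipping the redundant near-duplicate.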
  • Review
    ZHENG Ya-bin, LIU Zhi-yuan, SUN Mao-song
    2007, 21(5): 61-67.
    We report experiments on song lyrics based on natural language processing techniques. Lyrics play an important role in the semantics of songs, so lyric analysis can complement acoustic methods. We investigate the lyrics corpus in terms of Zipf's Law, using both characters and words as units, and confirm the validity of Zipf's Law in this corpus. We also find sets of mutually similar lyrics by means of the vector space model. Moreover, we discuss how to use the time annotations for further analysis: detecting repetition within songs, identifying rhythms, retrieving songs, and so on. Preliminary experiments show the effectiveness of the proposed methods.
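The Zipf check can be sketched directly: under Zipf's Law, frequency is roughly proportional to 1/rank, so the product rank × frequency should stay roughly constant across the top ranks. The function and its output format are ours, not the paper's.

```python
from collections import Counter

def rank_frequency(units):
    """Rank-frequency profile of a sequence of units (characters or words).
    Returns (rank, frequency, rank * frequency) triples; under Zipf's Law
    the third component is roughly constant for the top ranks."""
    freqs = sorted(Counter(units).values(), reverse=True)
    return [(rank, f, rank * f) for rank, f in enumerate(freqs, 1)]
```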
  • Review
    YANG Yu-hang, ZHAO Tie-jun, ZHENG De-quan, YU Hao
    2007, 21(5): 68-72.
    This paper proposes a method of ranking bloggers based on link analysis, which exploits the characteristics of blogs and reduces the influence of link spamming. The method also makes it more convenient for users to read blogs and supplies a new methodology for information retrieval in the blogosphere. To ensure the reliability of the ranking results, several indicators of blogger importance are given, and the rankings produced by the proposed method are compared with those based on these indicators. Finally, correlation analysis shows consistency between the proposed method and the evaluation indicators.
  • Review
    YAO Tian-fang, LOU De-cheng
    2007, 21(5): 73-79.
    This paper presents how to identify the topics in a Chinese sentence, as well as the relations between the topics and the sentimental descriptive terms, and how to compute the sentiment orientation (polarity) of the topics. We extract the topics and their attributes from a sentence with the help of a domain ontology, then identify the relations between topics and sentimental descriptive terms based on parsing results, and finally determine the polarity of each topic in the sentence. The experiment shows that the F-measure of the improved SBV polarity transfer algorithm for identifying topics and their polarity reaches 72.41% against a manually annotated corpus serving as the gold standard, 7.6% and 2.09% higher than the F-measures of the original SBV and VOB polarity transfer algorithms, respectively. The proposed improved SBV polarity transfer algorithm is therefore reasonable and effective.
  • Review
    WANG Jin, CHEN Qun-xiu
    2007, 21(5): 80-86.
    There are various phrase ambiguities in Chinese, and it is difficult to determine the correct syntactic structure of a Chinese sentence with only part-of-speech information. Based on observation of ambiguous phrases, this paper first analyzes the problems of determining ambiguous boundaries and ambiguous structural relations in Chinese phrases, points out seven types of phrase ambiguity, and then identifies four types of collocation information that are vital for processing ambiguous phrases. A disambiguation algorithm using both semantic and collocation knowledge is proposed accordingly. Experimental results on 887 ambiguous phrases show that the algorithm increases disambiguation accuracy from 82.3% to 87.18%.
  • Review
    WEI Wei, DU Jin-hua, XU Bo
    2007, 21(5): 87-90.
    This paper describes a hierarchical chunking-phrase based (HCPB) statistical translation model. The model not only complies with a formal synchronous context-free grammar but also learns partial parsing knowledge using CRFs (Conditional Random Fields), and can therefore be seen as combining fundamental ideas from both syntax-based and phrase-based translation. The decoder of the HCPB MT system is based on the chart/CKY algorithm and effectively integrates an N-gram language model. In our benchmark evaluation on Chinese-English spoken language translation, the method achieves higher BLEU and NIST scores on the IWSLT 2005 task.
  • Review
    HE Yan-qing, ZHOU Yu, ZONG Cheng-qing, WANG Xia
    2007, 21(5): 91-95.
    Phrase translation pair extraction is one of the key techniques in phrase-based statistical machine translation. Och's phrase extraction method depends heavily on word alignments, so only phrase pairs that are fully consistent with the word alignments are extracted. This paper proposes a method of phrase pair extraction with a flexible scale, which can extract phrase alignments that Och's method cannot obtain. The flexible scale is based on two features: POS and dictionary information. Our experiments show that our method significantly outperforms Och's.
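The consistency constraint that the paper relaxes can be sketched as follows: a standard extractor keeps a phrase pair only if no alignment link connects a word inside the pair to a word outside it. This is the baseline the abstract contrasts against, not the paper's flexible-scale method; variable names are ours.

```python
def extract_consistent_pairs(align, src_len, max_phrase=4):
    """Extract phrase pairs fully consistent with a word alignment.
    align: set of (src_pos, tgt_pos) links.  Returns (i1, i2, j1, j2)
    spans where source words i1..i2 align with target words j1..j2."""
    pairs = set()
    for i1 in range(src_len):
        for i2 in range(i1, min(src_len, i1 + max_phrase)):
            tgt = [j for (i, j) in align if i1 <= i <= i2]
            if not tgt:
                continue
            j1, j2 = min(tgt), max(tgt)
            # consistency: every link into [j1, j2] must start inside [i1, i2]
            if all(i1 <= i <= i2 for (i, j) in align if j1 <= j <= j2):
                pairs.add((i1, i2, j1, j2))
    return pairs
```

With a monotone two-word alignment {(0,0), (1,1)}, both single-word pairs and the full two-word pair are consistent and extracted.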
  • Review
    HAO Xiao-yan, LIU Wei, LI Ru, LIU Kai-ying
    2007, 21(5): 96-100.
    The Chinese FrameNet project is producing a lexicon of Chinese for both human use and NLP applications, based on the principles of Fillmore's Frame Semantics. It comprises two parts. One part is the Chinese FrameNet database (CFN), which contains a frame bank, a sentence bank and a lexical unit bank. The other part is a suite of software tools related to the CFN, including the database management system and a Web-based demonstration system. This paper gives a brief introduction to the description systems of these two parts.
  • Review
    WU Hong-lin, LIU Shao-ming, YU Ge
    2007, 21(5): 101-106.
    This paper proposes a word alignment model that matches words by maximum matching on a weighted bipartite graph and measures word similarity in terms of morphological similarity, semantic distance, part of speech and co-occurrence. Experiments on Chinese-Japanese word alignment show that this model can partly solve some problems of existing word alignment methods, such as unknown words, synonyms and global optimization. In the experiments, the F-score of our method is 80%, better than the 72% F-score of GIZA++.
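The matching step can be sketched by brute-force enumeration, which is exact for short sentences (a production system would use the Hungarian algorithm instead). The weight matrix stands in for the paper's combined similarity of the four cues; the equal-length assumption and function name are ours.

```python
from itertools import permutations

def align_words(weight):
    """Exact maximum-weight bipartite matching by enumeration.
    weight[i][j]: combined similarity of source word i and target word j
    (e.g. a mix of morphological, semantic, POS and co-occurrence cues).
    Assumes equal-length word lists; returns (src, tgt) index pairs."""
    n = len(weight)
    best_w, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        w = sum(weight[i][perm[i]] for i in range(n))
        if w > best_w:
            best_w, best_perm = w, perm
    return list(enumerate(best_perm))
```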
  • Review
    ZAN Hong-ying, ZHANG Kun-li, CHAI Yu-mei, YU Shi-wen
    2007, 21(5): 107-111.
    The functional words of modern Chinese play complex syntactic roles, and each has strong individual characteristics and distinct usages. To date, studies of modern Chinese functional words have mostly been oriented toward human readers; such descriptions are inevitably subjective and vague, and thus cannot be applied directly to natural language processing. Building on current research on functional words, this paper addresses the construction of a knowledge base of modern Chinese functional words suited to computational applications, laying the foundation for the automatic identification of functional word usages in modern Chinese.
  • Review
    LI Wei-gang, LIU Ting, LI Sheng
    2007, 21(5): 112-117.
    In this paper a novel method based on a bilingual corpus is proposed to extract phrasal paraphrase examples, focusing on paraphrases of ambiguous phrases. Taking a bilingual phrase pair as input, all candidate paraphrases are extracted from a word-aligned bilingual corpus. A bi-directional model is designed to acquire confident paraphrases according to the coherence between the candidate phrases and the input phrases. The experimental results show an overall precision of about 60%.
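A pivot-style sketch of the underlying idea: candidate paraphrases come from translating a phrase into the other language and back through the aligned corpus, scored by round-trip probability. The probability tables and the summing scheme here are illustrative assumptions; the paper's bi-directional coherence model is not reproduced.

```python
def pivot_paraphrases(phrase, fwd, rev):
    """Pivot paraphrasing: translate `phrase` into the other language via
    fwd ({src: {pivot: prob}}), then back via rev ({pivot: {src: prob}}).
    Candidates are scored by summed round-trip probability."""
    scores = {}
    for pivot, p1 in fwd.get(phrase, {}).items():
        for cand, p2 in rev.get(pivot, {}).items():
            if cand != phrase:  # the phrase itself is not its own paraphrase
                scores[cand] = scores.get(cand, 0.0) + p1 * p2
    return sorted(scores.items(), key=lambda kv: -kv[1])
```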
  • Review
    JI Tie-liang, SUN Wei-wei, SUI Zhi-fang
    2007, 21(5): 118-125.
    The subcategorization of verbs is an essential issue and plays an important role in syntactic parsing, semantic role labeling, etc. A sufficient set of subcategorization frame types is critical for subcategorization acquisition. While a set of subcategorization frame types has been agreed upon for English, no standard set has been achieved for Chinese verbs. In this paper we apply a semi-supervised method for subcategorization frame type acquisition that combines linguistic theory with a statistical algorithm: we first create a set of seed subcategorization patterns according to linguistic theory, and then apply a semi-supervised machine learning method to analyze the corpus and extend the seeds. Compared with a purely corpus-based subcategorization frame type acquisition method, our method gains better precision and coverage.
  • Review
    ZHANG Hai-lei, CAO Fei-fei, CHEN Wen-liang, REN Fei-liang, WANG Hui-zhen, ZHU Jing-bo
    2007, 21(5): 126-130.
    The purpose of Entity Mention Detection (EMD) is to recognize all mentions of entities in a document, involving the recognition of named entities, nominal mentions, pronominal coreference, etc. In this paper, we propose an approach to Chinese entity mention detection that integrates multi-level features into the Conditional Random Fields (CRFs) framework. The features used include characters, phonetic symbols, lexical words and part-of-speech tags, named entities, and frequency statistics. All EMD subtasks are integrated into a three-stage pipeline in which three different CRF classifiers label different attributes sequentially in a predefined order. The system described here is our submission to the NIST ACE07 EMD evaluation, where it achieved rank-2 performance.
  • Review
    ZHANG Rui-peng, SONG Rou
    2007, 21(5): 131-135.
    The scope of negative words, including the simple negator “不” and compound negators such as “从不”, “很不”, “不能” and “不会”, has received much attention in linguistics. However, most research focuses on negation scope within a single sentence. In this paper, we identify formal rules governing negation scope in compound sentences, which will benefit the automatic syntactic analysis of long sentences as well as machine translation.