2015 Volume 29 Issue 1 Published: 10 January 2015
  

  • Survey
    SONG Yang, WANG Houfeng
    2015, 29(1): 1-12.
    Coreference resolution has long been recognized as a challenging issue by NLP researchers. Over the past twenty years, many kinds of advanced NLP techniques have been applied to this problem, and some of them have achieved significant improvements. In this paper, we first introduce some basic concepts and formalize the issue. Then we summarize the different research strategies adopted by researchers in recent decades, highlighting feature engineering, which lies at the core of coreference resolution. Finally, we describe recent evaluations for this task and discuss some key issues and future prospects.
  • Survey
    WU Fengwen
    2015, 29(1): 13-18.
    The study of Chinese compound sentences is essential to Chinese information processing. This paper summarizes past research on compound sentences, including compound sentence modeling, relation marker recognition, structure recognition, compound sentence parsing, and corpus construction. It also outlines the prospects and possible research trends for future studies.
  • Language Analysis and Language Resources Construction
    XIONG Dan, LU Qin, LUO Fengzhu, SHI Dingxu, ZHAO Tiancheng
    2015, 29(1): 19-27.
    Personal names and terms of address are important parts of named entities, and their recognition is an essential issue in natural language processing. This paper presents a classification and annotation scheme for personal names and terms of address from the perspective of named entity recognition and information extraction, applied to a corpus of four Chinese classical novels. Personal names and terms of address are categorized into simple and compound types, with the compound type further divided into four subtypes: fixed expressions, appositive constructions, subordinate constructions of affiliation, and other subordinate constructions. This paper also presents a comparative analysis of these types and of the characteristics of the four novels, based on full statistics over the annotated corpus.
  • Language Analysis and Language Resources Construction
    DU Jiali, YU Pingfang
    2015, 29(1): 28-37.
    This paper discusses the data structure of the garden path phenomenon (GPP). The data structure of GPP is a cognitive tree-like structure, as opposed to other structures such as the word set structure in the pre-grammar condition, the linear grammatical structure in syntactic understanding, and the ambiguous map-like structure in semantically matched multiple cognition. The distinctive structural features of GPP include: (1) in early understanding, the data structure of GPP shows a linear feature; (2) in medium-term understanding, the semantic trigger point brings the breakdown of the original model, and the data structure of GPP is a word set structure; (3) in late understanding, processing breakdown results in backtracking, and GPP finally creates a tree-like data structure; (4) the dynamic understanding of GPP is the integration of these structures excluding the map-like one, and the activation of the semantic trigger point brings additional cognitive load. The difference between the tree-like data structure of GPP and the map-like data structure of ambiguity reflects the dissimilarity between these two syntactic phenomena from the perspective of data structure, providing theoretical support for computational linguistics in interpreting GPP.
  • Language Analysis and Language Resources Construction
    ZHAO Jianjun ,YANG Xiaohong, YANG Yufang
    2015, 29(1): 38-43.
    Based on 30 narrative texts of Mandarin Chinese with sentence focus annotated by 20 subjects, a statistical analysis is conducted to examine the influence of discourse rhetorical structure on focus distribution. The results show that about 30% of the sentences in the narrative discourse have no focus. It is further revealed that nuclearity has a remarkable influence on focus distribution: about 80% of the nucleus sentences have focus, but only 60% of the satellite sentences do. Sentences at the highest level of the hierarchy have less focus. The narrative discourses consist of ten main rhetorical relations, among which the conjunction and elaboration relations have the most sentences with focus, and the attribution relation has the fewest.
  • Language Analysis and Language Resources Construction
    JIA Suimin, LEI Lili, HU Mingsheng
    2015, 29(1): 44-48.
    Automatically identifying the relation words of compound sentences is a fundamental issue in the field of Chinese information processing. This paper describes a rule-based method for the automatic identification of compound sentence relation words. To construct the rules, 12 features are summarized from the corpus. A matching algorithm is then described to obtain the candidate relation word sequence. Finally, the context of the relation words is matched against the rules. Experimental results show that this method achieves an accuracy of 70.9%.
  • Language Analysis and Language Resources Construction
    SUN Ruixin
    2015, 29(1): 49-56.
    The vowel in a nasal coda syllable becomes nasalized; the issue is how to measure the degree of nasalization. This paper puts forward a method based on the bandwidth of formants and the duration of the nasalized part of the vowel, after a deep acoustic analysis of the speech sound. We find that the nasalization degree of vowels in alveolar nasal syllables (0.410) is less than that of vowels in velar nasal syllables (0.718). The highest degree lies in the high vowels, which are easily nasalized.
  • Language Analysis and Language Resources Construction
    ZHAO Zhiwei, QIAN Longhua, ZHOU Guodong
    2015, 29(1): 57-66.
    Cross-Document Coreference (CDC) resolution is an important step in information integration and information fusion, so a CDC corpus is indispensable for the research and evaluation of CDC resolution. Given that no Chinese CDC corpus oriented to information extraction is publicly available, this paper describes how to build a CDC corpus based on the ACE2005 Chinese corpus via automatic generation and manual annotation, covering all the ACE entity types. The corpus is made publicly available to advance research on Chinese CDC resolution. In addition, this paper analyzes the types and characteristics of CDC in Chinese text and proposes two metrics, “variation perplexity” and “ambiguity perplexity”, to evaluate the difficulty of Chinese CDC resolution, providing insights for further CDC research.
  • Machine Translation
    YU Jingsong, WANG Huilin, WU Shenglan
    2015, 29(1): 67-74.
    Automatic bilingual chunk alignment has important applications in machine translation, computer aided translation, and other fields. In this paper, a chunk partition scoring method is proposed based on the degree of adhesion and the degree of relaxation, so that the chunk partitions of the source language and the target language benefit each other. A novel bilingual chunk alignment algorithm is then proposed. Compared with previous work, this algorithm does not require pre-existing bilingual chunk partitions; instead, the chunk partition score is calculated dynamically during the alignment search. For this approach, precision is considered far more important than recall.
  • Machine Translation
    WANG Huilan, ZHANG Keliang
    2015, 29(1): 75-81.
    Aimed at applications in machine translation, this paper studies the construction of a Chinese Sentence-Category Dependency Treebank (CSCDT) based on the theory of Hierarchical Network of Concepts (HNC). The conceptual category tagset and the sentence-category relation tagset for the treebank are presented, together with an example tree from CSCDT. Compared with other Chinese treebanks, this paper discusses two advantages of CSCDT. In addition, translation templates from sentence-category dependency subtrees to strings are defined to construct a translation template library for Chinese-English machine translation.
  • Information Extraction and Text Mining
    LI Lishuang, WANG Yiwen, HUANG Degen
    2015, 29(1): 82-87.
    A term extraction system based on information entropy and word frequency distribution variety is presented. Information entropy measures the integrality of terms, while word frequency distribution variety measures the domain relevance of terms. Incorporating simple linguistic rules as an additional filter, the automatic term extraction system integrates information entropy into the word frequency distribution variety formula. A preliminary experiment on a corpus in the automotive domain shows a precision of 73.7% when 1,300 terms are extracted. The results show that the proposed approach can effectively recognize low-frequency terms, and that the recognized terms have good integrality.
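The entropy-based integrality idea above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the toy corpus and the `neighbor_entropy` helper are hypothetical. A candidate whose left and right neighbors are both highly varied (high entropy on both sides) is more likely to be a complete term rather than a fragment of a longer one.

```python
import math
from collections import Counter

def neighbor_entropy(tokens, term):
    """Entropy of the left/right neighbor distributions of `term` (a token list).

    High entropy on both sides suggests the candidate is a complete term:
    its boundaries are not dominated by a single fixed collocation."""
    left, right = Counter(), Counter()
    n = len(term)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == term:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + n < len(tokens):
                right[tokens[i + n]] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return entropy(left), entropy(right)
```

A filter would keep candidates whose minimum of the two entropies exceeds a threshold, then apply the frequency-distribution and rule-based filters described in the abstract.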
  • Information Extraction and Text Mining
    XIA Fei, CAO Xinyu, FU Jianhui, WANG Shi, CAO Cungen
    2015, 29(1): 88-96.
    Automatic discovery of part-whole relations from the Web is a fundamental but critical problem in knowledge engineering. This paper proposes a graph-based method for extracting part-whole relations from the Web. First, we download snippets from Google using part-whole query patterns, and then build a graph by extracting word pairs in coordinate structures from these snippets, with the co-occurring words as nodes and the frequency counts as edge weights. A hierarchical clustering method is used to cluster the correct parts, optimized by five edge-weight adjustments: reducing the weight of comma edges, cutting low-frequency edges, enlarging the weight of edges in loops, enlarging the weight of edges whose two nodes share the same suffix, and enlarging the weight of edges whose two nodes share the same prefix. Experimental results show that the five adjustments increase recall substantially.
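The graph construction and two of the five weight adjustments (down-weighting comma edges, cutting low-frequency edges) can be sketched as below. This is a minimal illustration under assumed data structures; the function names, the weighting factor, and the threshold are hypothetical, not the paper's settings.

```python
from collections import defaultdict

def build_cooccurrence_graph(coordinate_pairs):
    """Undirected graph: nodes are words seen in coordinate structures,
    edge weight is how often the pair co-occurred."""
    graph = defaultdict(lambda: defaultdict(int))
    for a, b in coordinate_pairs:
        graph[a][b] += 1
        graph[b][a] += 1
    return graph

def adjust_weights(graph, comma_pairs=(), min_weight=2, comma_factor=0.5):
    """Apply two of the five adjustments: reduce the weight of edges that
    were only joined by a comma, then cut edges below a frequency threshold."""
    adjusted = {}
    for a, nbrs in graph.items():
        for b, w in nbrs.items():
            if (a, b) in comma_pairs or (b, a) in comma_pairs:
                w *= comma_factor
            if w >= min_weight:
                adjusted.setdefault(a, {})[b] = w
    return adjusted
```

Hierarchical clustering over the adjusted graph would then group the candidate parts; the loop-, suffix-, and prefix-based enlargements follow the same edge-rewriting pattern.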
  • Information Extraction and Text Mining
    GUO Shaohua , GUO Yan, LI Haiyan, LIU Yue, ZHANG Jin, CHENG Xueqi
    2015, 29(1): 97-103.
    An extensible framework for web key information extraction is presented in this paper. This framework combines automatic information extraction algorithms and template detection algorithms, essentially improving the precision and efficiency of extraction. Some key parts of the framework can be replaced as required, giving it excellent extensibility. Furthermore, this paper also describes an orthogonal filter algorithm, which improves the precision of template generation. The experiments provide positive results for this method.
  • Information Extraction and Text Mining
    GAO Jiawei, LIANG Jiye ,LIU Yanglei,LI Ru
    2015, 29(1): 104-110.
    Multi-label learning deals with the ambiguity problem in which a single sample is associated with multiple concept labels simultaneously, and semi-supervised multi-label learning has become a new research direction in recent years. To further exploit the information in unlabeled samples, a semi-supervised multi-label learning algorithm based on Tri-training (MKSMLT) is proposed. It adopts the ML-kNN algorithm to obtain more labeled samples, then employs the Tri-training algorithm, which uses three classifiers to label the unlabeled samples. Experimental results illustrate that the proposed algorithm can effectively improve classification performance.
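The core Tri-training step, in which each classifier receives the unlabeled samples its two peers agree on, can be sketched as follows. This is a generic illustration of the Tri-training scheme, not the MKSMLT algorithm itself: classifiers are represented as plain prediction functions, whereas the paper uses ML-kNN base learners and multi-label outputs.

```python
def tri_training_label(classifiers, unlabeled):
    """One Tri-training labeling round.

    For each of the three classifiers, collect the unlabeled samples on
    which the other two agree; those samples, with the agreed label,
    become extra training data for that classifier."""
    extra = {i: [] for i in range(3)}
    for x in unlabeled:
        preds = [clf(x) for clf in classifiers]
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            if preds[j] == preds[k]:
                extra[i].append((x, preds[j]))
    return extra
```

In a full implementation this round alternates with retraining each classifier on its enlarged set until no classifier changes.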
  • Information Extraction and Text Mining
    ZHOU Hong, LIU Jinling , WANG Xingong
    2015, 29(1): 111-117.
    In recent years, short text information flows have emerged in some public media. For this kind of data, a retrospective topic identification model is presented with an improved weight estimation. It employs the Bayesian Information Criterion (BIC) in clustering to improve clustering accuracy. By dividing time segments and removing isolated information points, the efficiency of the algorithm is further improved. The experimental results show that this method achieves good accuracy and efficiency in topic detection over short text information flows.
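Using BIC to pick a clustering can be illustrated with the common Gaussian approximation below. This exact formula is an assumption for illustration, not necessarily the one the paper uses: BIC trades residual error against model complexity, so adding clusters is only rewarded when the error reduction outweighs the penalty.

```python
import math

def bic_score(n_points, sse, n_params):
    """Schwarz BIC under a spherical Gaussian noise assumption:
    n * ln(SSE / n) + k * ln(n); lower is better.

    More clusters lower the within-cluster SSE but pay a k * ln(n)
    complexity penalty, so BIC selects a parsimonious cluster count."""
    return n_points * math.log(sse / n_points) + n_params * math.log(n_points)
```

A topic detector would evaluate this score for each candidate number of clusters and keep the minimum.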
  • Information Extraction and Text Mining
    CHEN Qian, GUO Xin, WANG Suge, ZHANG Hu
    2015, 29(1): 118-125.
    Topic detection has been widely used in text mining and NLP, and its basis is topic structure modeling. In this paper, we propose a semantic hierarchical topic structure model to describe multi-granularity topic structure. The model utilizes the characteristics of a domain ontology, mapping each concept in the ontology to a topic. The concepts in the concept list are represented as leaf nodes of a topic tree, and the nodes in each layer can be treated as a multinomial mixture distribution over the nodes of the lower layer. This structure is easily adapted to the multi-granularity topic structure of real-world text streams. Experiments show that the model reflects the rich multi-granularity semantic features of topics.
  • Information Extraction and Text Mining
    SHEN Yuanfu, SHEN Yuewu
    2015, 29(1): 126-132.
    In this paper, we propose a Mix-grams method to improve the online SVM filter for spam filtering. Though the online SVM classifier achieves high performance in online spam filtering, its computational cost is remarkable compared to other methods such as logistic regression. We therefore propose a type-based n-gram extraction method to reduce the feature dimension of the online SVM filter. Experimental results demonstrate that the method improves filtering performance and reduces the computational cost of the online SVM filter.
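A type-based n-gram extractor might look like the sketch below. The abstract does not specify the Mix-grams details, so this is only one plausible reading, with hypothetical names: drawing character n-grams from the set of word types (unique words) instead of the raw character stream, which shrinks the feature space the SVM must handle.

```python
def type_based_ngrams(text, n=4):
    """Character n-grams drawn from word types (unique whitespace tokens)
    rather than from every position of the raw text, reducing the
    feature dimension: repeated words contribute their n-grams once."""
    types = set(text.split())
    feats = set()
    for w in types:
        if len(w) <= n:
            feats.add(w)  # short types kept whole
        else:
            feats.update(w[i:i + n] for i in range(len(w) - n + 1))
    return feats
```

The resulting feature set would feed a standard online SVM update in place of full position-based n-grams.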
  • Information Extraction and Text Mining
    FEI Wenbin, TANG Xianghong, WANG Jing, LIN Xinjian
    2015, 29(1): 133-138.
    To avoid permanent changes to text content caused by watermark embedding, this paper proposes a reversible watermarking algorithm for Chinese text documents based on prediction error expansion, inspired by reversible watermarking for images. Taking the sentence as the unit, the algorithm selects the words to be replaced according to their context collocation degree, and then realizes the embedding through prediction error expansion and a chaotic sequence. Results show that the algorithm not only offers high security but also extracts the watermark effectively while exactly restoring the original text.
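The prediction-error-expansion principle borrowed from image watermarking can be shown on a plain integer sequence. This sketch is not the paper's text-replacement scheme; it only demonstrates why the embedding is exactly reversible: each error against a predictor (here simply the previous marked value) is doubled to open a slot for one bit.

```python
def pee_embed(values, bits):
    """Embed one bit per value (after the first) via prediction error
    expansion; the previous marked value serves as the predictor."""
    marked = [values[0]]                 # first value kept as the seed
    for x, b in zip(values[1:], bits):
        p = marked[-1]
        e = x - p                        # prediction error
        marked.append(p + 2 * e + b)     # expanded error carries bit b
    return marked

def pee_extract(marked):
    """Recover both the embedded bits and the exact original values."""
    values, bits = [marked[0]], []
    for i in range(1, len(marked)):
        p = marked[i - 1]
        e2 = marked[i] - p               # expanded error 2e + b
        b = e2 % 2
        bits.append(b)
        values.append(p + (e2 - b) // 2)
    return values, bits
```

Because extraction inverts every step exactly, the cover data is restored bit-for-bit, which is the property the text algorithm needs.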
  • Syntactic, Semantic Analysis and Social Computation
    ZHAO Guorong,WANG Wenjian
    2015, 29(1): 139-145.
    Chinese syntax has complex structure and high-dimensional features, and the best known Chinese parsing performance is still inferior to that of Western languages. To improve the efficiency and accuracy of Chinese parsing, we propose an L2-norm soft margin optimization approach for structural support vector machines (structural SVMs). By constructing the structural function ψ(x,y), the input information of the syntactic tree can be mapped well. Since Chinese syntax exhibits strong correlations, we use the father node of phrase structure trees to enrich the structural information of ψ(x,y). Experimental results on the PCTB benchmark dataset demonstrate that the proposed approach is effective and efficient compared with classical structural SVMs and the Berkeley Parser.
  • Syntactic, Semantic Analysis and Social Computation
    WEI Chuyuan, ZHAN Qiang, FAN Xiaozhong, MAO Yu, ZHANG Dakui
    2015, 29(1): 146-154.
    Understanding complex questions is a challenging issue in question answering systems. For complex questions containing event (action) information, this paper presents a question semantic representation (QSR) model based on semantic chunks. The semantic components of a complex question are labeled abstractly as the question focus, the question topic, and the question event. A semantic structure of the question event is then created to represent its semantic information, including the question focus chunk, the question topic chunk, and the question event chunk. To map an interrogative sentence into this representation, a Conditional Random Fields model is adopted for automatic semantic labeling. The results show that the automatic semantic labeling achieves good performance.
  • Syntactic, Semantic Analysis and Social Computation
    Odbal, WANG Zengfu
    2015, 29(1): 155-162.
    This paper treats phrase-level sentiment analysis as a sequence annotation problem and proposes an extension of conditional random fields, YACRFs, to annotate the sentiment orientation of phrases. In contrast to previous work focusing on linear-chain CRFs, which correspond to finite-state machines with efficient exact inference algorithms, we wish to label sequence data in multiple interacting ways, for example performing word-based semantic orientation tagging and phrase-level sentiment analysis simultaneously, increasing joint accuracy by sharing information between them. The proposed model incorporates the word emotional orientation analysis and the phrase analysis through features of polarity words, phrase rule templates, and part-of-speech characteristics. Experiments show that the proposed model performs best, with an accuracy of 81.07%. When the results are applied to sentence-level sentiment analysis, it again achieves the best accuracy, 94.30%.
  • Syntactic, Semantic Analysis and Social Computation
    ZHANG Sheng, LI Fang
    2015, 29(1): 163-169.
    As a new medium, microblogging has been playing an indispensable role in people's lives. To extract sentiment information from microblogs, this paper introduces a two-stage CRF model and an iterative two-stage CRF model. The two-stage CRF model reaches an F-score of 0.505 on the COAE2014 evaluation data, and the iterative two-stage CRF model reaches an F-score of 0.513 through improved recall.
  • Other Language in/around China
    LIU Huidan, NUO Minghua, MA Longlong, WU Jian, HE Yeping
    2015, 29(1): 170-177.
    Based on link analysis and Tibetan encoding detection, this paper mines Tibetan text resources on the internet with a crawler and analyzes the distribution of Tibetan text. Statistical data show that more than 50% of inland Tibetan web sites are hosted by organizations in Qinghai province, and about 87% of the web pages belong to 31 large web sites. People prefer Unicode over legacy encodings for their new web pages. It is practical to extract Tibetan text from the pages using natural tag information, such as HTML elements, column information, and punctuation. The text can be used to build raw corpora, text classification corpora, internet word/phrase corpora, and so on. Word frequency statistics and language models can also be derived, and some bilingual corpora can be extracted as well.
  • Other Language in/around China
    BAO Feilong, GAO Guanglai, BAO Yulai
    2015, 29(1): 178-182.
    To deal with out-of-vocabulary (OOV) detection in a Mongolian spoken term detection system, this paper proposes a Mongolian spoken term detection method based on a phoneme confusion network. The confidence measure is improved by incorporating a phoneme confusion matrix. Experimental results show that the method achieves satisfactory performance on Mongolian OOV term detection, with a 6% improvement in precision and a 2.69% improvement in recall.
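One way a confusion matrix can soften a confidence measure is sketched below. The abstract does not give the exact formula, so this is a hypothetical illustration: each hypothesized phone is credited not only with its own posterior but also with posterior mass from phones the matrix says are commonly confused with it, and the term score is the geometric mean.

```python
import math

def term_confidence(hyp_phones, frame_posteriors, confusion):
    """Confusion-smoothed term confidence.

    For each hypothesized phone p, sum over candidate phones q the
    posterior of q weighted by confusion[q][p] (probability that q is
    realized/recognized as p). Return the geometric mean over phones."""
    scores = []
    for p, post in zip(hyp_phones, frame_posteriors):
        s = sum(confusion.get(q, {}).get(p, 0.0) * pr for q, pr in post.items())
        scores.append(s)
    return math.exp(sum(math.log(s) for s in scores) / len(scores))
```

With an identity confusion matrix this reduces to the plain posterior-based confidence; off-diagonal mass is what rescues OOV terms whose phones were decoded as close neighbors.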
  • Other Language in/around China
    CHEN Xinyi, XIA Jianhua, DU Yuxiang, WAN Fucheng, YU Hongzhi
    2015, 29(1): 183-190.
    This paper analyzes the degree distribution of the Tibetan web community and reveals the defects of the maximum degree-first search algorithm. It proposes a more efficient bisection degree search algorithm, as well as a hybrid strategy combining maximum degree and bisection degree search. Following the community division principle, this paper designs and implements the search algorithm for the Tibetan web community. The results show that the proposed methods outperform other search algorithms in terms of average search steps and average query informativeness.
  • Other Language in/around China
    Bianba Wangdui, Drolkar, DONG Zhicheng, WU Qiang, WANG Longye
    2015, 29(1): 191-196.
    In this paper, a sorting algorithm for contemporary Tibetan syllables is presented using the Cartesian product, on the basis of a definition of Tibetan component priority. The method conforms to Tibetan morphology and syntax. Finally, all grammar rules related to the Tibetan syllable ‘’ are tested, proving that the algorithm meets the demands of contemporary Tibetan dictionaries.
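The component-priority idea can be sketched as follows. This is an illustration only, with Latin placeholder components and invented priority values, and it sorts directly by priority tuples rather than enumerating the Cartesian product of component slots as the paper does; the ordering produced is the same either way.

```python
def sort_syllables(syllables, priority):
    """Sort syllables, given as tuples of components (e.g. root letter,
    vowel sign), by the priority assigned to each component; components
    missing from the table sort first."""
    return sorted(
        syllables,
        key=lambda syl: tuple(priority.get(c, -1) for c in syl),
    )
```

The priority table encodes the dictionary order of each component slot, so tuple comparison reproduces the dictionary order of whole syllables.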
  • Other Language in/around China
    SHI Jianguo ,HOU Hongxu, BAO Feilong
    2015, 29(1): 197-202.
    Slavic Mongolian, also known as Cyrillic Mongolian or New Mongolian, is the daily language of Mongolia. This paper explores Slavic Mongolian word segmentation by combining a dictionary with rules. We first preprocess with the dictionary the words that are high-frequency or inconsistent with the rules, then handle the remaining words with rules to generate n-best candidates for the final decision. Combining the two methods takes advantage of both and achieves excellent performance on Slavic Mongolian word segmentation.