2014 Volume 28 Issue 4 Published: 10 April 2014
  

  • Survey
    YANG Bo, CAI Dongfeng, YANG Hua
    2014, 28(4): 1-11.
    Extracting useful information automatically from large-scale unstructured text has been a long-standing goal of NLP and AI, and open information extraction is now widely pursued for effective web information acquisition. Open information extraction can be divided into binary and n-ary entity relation extraction according to the number of arguments involved. Following this division, this paper analyses several typical methods for open relation extraction together with their defects. Most current methods still perform only shallow semantic processing and hardly consider implicit relations. We therefore believe that joint inference strategies, such as Markov logic and ontology-structure-based inference, can take advantage of multiple features. This combination opens up a promising prospect for inferring fine-grained and complete information in open information extraction.
  • Survey
    HOU Luying, JIA Yuan
    2014, 28(4): 12-20.
    Focus is a general topic in linguistic study. Along with the development of experimental phonetics and psycholinguistics, research on the prosodic expression and cognitive processing of focus has developed rapidly. The studies mainly address three aspects: the phonetic and phonological expression of focus, the relationship between focus and accent, and the processing and brain mechanisms of focus and prosody in sentence comprehension. This paper reviews the methodology and results of related studies and discusses the existing problems, with the aim of shedding light on further research.
  • Language Analysis and Generation
    XIAO Yonglei, LIU Shenghua, LIU Yue, CHENG Xueqi, ZHAO Wenjing, REN Yan, WANG Yuping
    2014, 28(4): 21-28.
    With the emergence of social media services, large amounts of short text such as tweets and reviews are generated every day. Mining these data is attracting growing interest from both industry and academia, and such data have already become an important source of information for marketing, stock prediction, etc. However, mining short text is non-trivial because the text is extremely sparse and lacks context. We therefore propose to enrich short text content by automatically identifying semantically related concepts in open knowledge bases such as Wikipedia. First, through linkable pruning, concept linking and disambiguation, important n-grams in tweets are linked to their related Wikipedia concepts. Second, NMF (non-negative matrix factorization) is used to factorize the concept-document matrix to obtain the semantic neighbors of concepts, and the tweets are then expanded with these related concepts. Experiments on the tweet collection from TREC 2011 and Wikipedia 2011 show that our approach is effective.
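    As an illustration of the NMF step described above, the following minimal Python sketch factorizes a toy concept-document matrix with the standard multiplicative update rules and ranks concepts by the cosine similarity of their latent vectors. The toy matrix `V`, the rank `k=2`, the update rule and the cosine ranking are all assumptions made for illustration; the abstract does not specify the authors' actual configuration.

```python
import math
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return math.sqrt(sum((v[i][j] - wh[i][j]) ** 2
                         for i in range(len(v)) for j in range(len(v[0]))))


def nmf(v, k, iters=200, seed=0, eps=1e-9):
    """Factorize V (concepts x documents) into W (concepts x k) and H (k x documents)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def semantic_neighbors(w, i):
    """Indices of the other concepts, most similar to concept i first."""
    scores = [(cosine(w[i], w[j]), j) for j in range(len(w)) if j != i]
    return [j for _, j in sorted(scores, reverse=True)]


# Hypothetical toy matrix: rows are concepts, columns are documents.
V = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 3, 2],
     [0, 0, 2, 3]]
W, H = nmf(V, k=2)
print(semantic_neighbors(W, 0))
```

    Ranking by the rows of `W` is only one plausible way to read "semantic neighbors" off the factorization; a real system would likely use a sparse matrix library rather than nested lists.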
  • Language Analysis and Generation
    JIANG Zhipeng, GUAN Yi, DONG Xishuang
    2014, 28(4): 29-36.
    Hierarchical parsing is a simple and fast method of full syntactic analysis, which can be decomposed into three stages: POS tagging, chunking and parse tree construction. In this paper, chunking is further divided into base chunking and complex chunking, and a conditional random field model is adopted for sequence labeling instead of a maximum entropy model. Since error accumulation is a particularly serious problem in hierarchical parsing, this paper presents a simple and practical method for error prediction and collaborative correction, which passes the errors predicted in one layer on to the next layer and combines the prediction scores of the two layers to correct errors collaboratively. The experimental results show that hierarchical parsing with error correction achieves almost the same parsing precision as mainstream Chinese parsers.
  • Language Resources Construction
    QIU Zhaowen, WU Xia, CHEN Haiyan
    2014, 28(4): 37-42.
    To improve the management of web animation materials, a semantic annotation algorithm based on the fusion of text and visual features is proposed. The context information of the animation material is first extracted, including its title, page caption, URL and ALT features. Candidate textual keywords are then extracted using the WordNet semantic dictionary, and the annotation words are filtered by their correlation with the visual features. Finally, a semantic network is built over the textual keywords and visual features to achieve automatic annotation. Experiments show that the proposed algorithm can be used effectively to extract semantic information from web animation material.
  • Language Resources Construction
    YANG Tianxin, PENG Weiming, SONG Jihua
    2014, 28(4): 43-49.
    This paper presents a human-computer interactive graphical syntax tagging system based on the Sentence Pattern Structure. It is designed to directly support treebank construction and in-depth research based on the Sentence Pattern Structure. With the constraints of the sentence pattern system and the support of a lexical knowledge database, the hierarchy and word type tags of the results are normalized effectively, so that the consistency and quality of the syntactic annotation can be ensured to a certain extent. The paper illustrates how the system was built and the experience gained with it from a practical perspective.
  • Language Resources Construction
    ZHANG Xiaomei, LI Ru, WANG Bin, WU Di, GAO Junjie
    2014, 28(4): 50-57.
    Existing subjective/objective classification of micro-blogs suffers from high feature redundancy and fails to exploit the complementarity among feature selection methods. To address these issues, this study proposes a feature fusion approach to the subjective/objective classification of micro-blogs. To obtain more effective features, the study combines a variety of feature selection methods and uses a feature fusion algorithm to select and fuse basic features, including word features, content features and micro-blog features. Experimental results on Sina micro-blog data show that the feature fusion algorithm achieves better performance than the best single feature selection method.
  • Information Extraction and Text Mining
    WANG Pingze, CAO Cungen, WANG Shi
    2014, 28(4): 58-67.
    Attributes are a special type of knowledge used to describe and identify concepts, and attribute names are proper nouns that express attributes. This paper presents a prefix- and suffix-based method for extracting attributes iteratively from Web pages. Each iteration consists of two phases: (1) selecting a set of appropriate attribute prefixes and suffixes from the existing attribute seeds and generating lexico-syntactic patterns to extract candidate attributes from Web pages; (2) using a similarity-based model to validate the candidate attributes and expand the existing set of seed attributes. We propose a group of validation models and compare the advantages and disadvantages of each. We evaluate our method on a group of concepts in the geographic class and the business class. Comprehensive experiments show that average precisions of 92.9% and 90.7% are obtained, respectively, and that the original set of seed attributes is expanded nearly 100-fold.
  • Information Extraction and Text Mining
    YU Ru, ZHU Chaoyang, HUANG Mingxuan
    2014, 28(4): 68-75.
    The all-weighted data model is characterized by item weights that are distributed within each transaction record and vary from one transaction record to another. Existing algorithms for mining weighted negative association rules cannot be applied to the all-weighted data model. In this paper, a novel algorithm for mining all-weighted positive and negative association rules is presented for application to education data. The algorithm uses a probability ratio instead of the traditional confidence and adopts a "support-probability ratio-interest" framework to evaluate positive and negative all-weighted association rules. Experiments on real educational data and text data show that the proposed algorithm is more effective and more reasonable than existing algorithms for mining positive and negative association rules.
  • Information Extraction and Text Mining
    CHANG Tianshu, LIN Hongfei
    2014, 28(4): 76-83.
    The number of Wikipedia articles and contributors is growing at a very fast pace. As a result, some Wikipedia articles have been written by up to thousands of authors with conflicting opinions. This paper aims to identify controversial articles in Wikipedia. Building on traditional methods, it draws clues from the edit history pages in Wikipedia and takes into account the contributors of each article to compute controversy scores. We also introduce a new, intuitive evaluation method in addition to the PRF and NDCG evaluation metrics. Experiments on 16,745 Wikipedia articles show that our methods perform much better than the baseline models.
  • Information Extraction and Text Mining
    XU Xueke, TAN Songbo, LIU Yue, CHENG Xueqi
    2014, 28(4): 84-91.
    In this paper, we consider the problem of aspect-based extractive opinion summarization of online reviews. In addition to extracting aspect-relevant opinions as most existing approaches do, we propose to further consider the requirements of informativeness, salience and diversity in order to generate a high-quality summary. We propose a unified summary extraction framework based on manifold ranking with sink points, which addresses the three requirements in a single ranking process. Experiments with restaurant reviews show the reasonableness of the proposed requirements and the effectiveness of the proposed approach.
  • Information Extraction and Text Mining
    DAI Min, WANG Rongyang, LI Shoushan, ZHU Zhu, ZHOU Guodong
    2014, 28(4): 92-97.
    Opinion target extraction is an important sub-task of sentiment analysis. This paper employs a supervised model based on Conditional Random Fields (CRFs) to extract English opinion targets. To better capture the relationship between opinion targets and opinion expressions, we add syntactic features extracted from parse trees. Two different data sets are used to evaluate the proposed approach. The experimental results demonstrate that the syntactic features are effective and significantly improve the recall of opinion target extraction.
  • Information Retrieval and Social Computing
    JI Zongcheng, WANG Bin
    2014, 28(4): 98-103.
    Community Question Answering (CQA) services have been building up large archives of question-answer pairs, which are organized into a hierarchy of categories. To reuse these invaluable historical question-answer pairs, it is essential to develop effective Question Retrieval (QR) models. In this paper, we propose a novel approach based on the category prior of questions within the language modeling framework to improve QR performance. Specifically, a new Language Model based on the category prior is proposed, which views the Leaf Category Language Model as the Dirichlet hyper-parameter that weights the parameters of the unigram Language Model. The approach has a solid mathematical foundation. Experiments conducted on a large-scale real-world CQA dataset from Yahoo! Answers show that our proposed method significantly outperforms previous work, which merely combines the category information with the unigram Language Model linearly.
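    To make the idea of a category prior concrete, here is a minimal Python sketch of Dirichlet-prior smoothing in which the prior mean comes from the language model of the question's leaf category. The abstract gives no formula, so the estimate `(count + mu * P(w | category)) / (length + mu)`, the value of `mu`, the `floor` probability for unseen words and the toy data are all assumptions for illustration rather than the authors' model.

```python
import math
from collections import Counter


def unigram_lm(tokens):
    """Maximum-likelihood unigram model of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def smoothed_prob(word, question_tokens, category_lm, mu=10.0, floor=1e-6):
    """P(word | question), smoothed towards the leaf-category model.

    mu is the Dirichlet prior strength; floor is a hypothetical back-off
    probability for words the category model has never seen.
    """
    counts = Counter(question_tokens)
    prior = category_lm.get(word, floor)
    return (counts[word] + mu * prior) / (len(question_tokens) + mu)


def query_log_likelihood(query_tokens, question_tokens, category_lm, mu=10.0):
    """Log-likelihood of the query under the smoothed question model."""
    return sum(math.log(smoothed_prob(w, question_tokens, category_lm, mu))
               for w in query_tokens)


# Hypothetical leaf-category models and archived questions.
travel_lm = unigram_lm(["cheap", "flights", "airline", "hotel"])
laptop_lm = unigram_lm(["laptop", "cpu", "ram", "screen"])

query = ["cheap", "flights"]
score_travel = query_log_likelihood(query, ["best", "airline", "deals"], travel_lm)
score_laptop = query_log_likelihood(query, ["best", "laptop", "deals"], laptop_lm)
print(score_travel > score_laptop)
```

    In this toy case neither archived question contains the query words, so the ranking is decided entirely by the category prior, which is the effect the abstract describes.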
  • Information Retrieval and Social Computing
    CHEN Weipeng, FU Ruiji, HU Yi, QIN Bing, LIU Ting
    2014, 28(4): 104-110.
    Long queries are complex queries submitted by users. Current search engines, which are good at keyword matching, often return very limited results if all the words in a long query are matched as keywords. In this paper, we attempt to improve the retrieval results by using the association between words to delete the words that offer little information. Our experiments use two kinds of evaluation, "machine-oriented" and "user-oriented". In the "machine-oriented" evaluation, the highlight ratio and the number of related documents returned are considered. In the "user-oriented" evaluation, the retrieval results are evaluated by a human judge. The experimental results show that our method can significantly improve the quantity and quality of search results.
  • Information Retrieval and Social Computing
    ZHANG Chao, KONG Fang, ZHOU Guodong
    2014, 28(4): 111-116.
    Interactive Question Answering (IQA), a hot research topic in the area of QA, interacts with users to process a series of their questions as if in conversation with them. This paper systematically explores anaphoricity determination for coreference resolution in IQA. Corpus statistics show the distribution of anaphoricity, and experiments on the TREC QA question set using rule-based and flat feature-based methods show the performance of anaphoricity determination for coreference resolution in IQA. Based on the characteristics of IQA, two flat features concerning proper nouns are proposed. Experimental results show that the methods and the proposed features are effective.
  • Information Retrieval and Social Computing
    YAN Jun, LIU Wenfei, LIN Hongfei
    2014, 28(4): 117-122.
    Music recommendation is currently a hot topic for many music sites, radio stations and other music media. To address this issue, we take social tags as the main resource for recommendation and map the tags into three semantic spaces: genre, emotion and context. We then calculate the similarity of users and tracks in each space. Finally, we merge the similarities in the three spaces with different methods to recommend suitable tracks to users. The experiments show that the recommendation method that merges similarities from different spaces achieves good results.
  • Information Retrieval and Social Computing
    LIU Quanchao, HUANG Heyan, FENG Chong
    2014, 28(4): 123-131.
    Public opinion analysis of micro-blog posts is a new trend, in which identifying the sentiment orientation of micro-blog topics is a hot issue. According to the content features and the various relations of Chinese micro-blog posts, we construct dictionaries of sentiment words, internet slang and emoticons, respectively. We then implement sentiment analysis algorithms based on phrase paths and on multiple features of the sentiment orientation of micro-blog topics. Using micro-blog forwards and comments, we take a further step to optimize the multi-feature algorithm. According to the experimental results, the Precision and F-measure reach 85.3% and 79.4%, respectively.
  • Minority Language Information Processing
    Wanmezhaxi, Nimazhaxi
    2014, 28(4): 132-139.
    This paper analyses Tibetan word formation rules, syntactic structures, adjacent parts of speech, the patterns of suffix characters, and the usage of case-auxiliary words. Focusing on the processing of out-of-vocabulary words, abbreviations and overlapping ambiguities, three methods are proposed: the re-combination method, the exclusion-restoration method and the POS rule method, respectively. Experiments on a 1M Tibetan corpus of literature, poetry, medicine and news indicate that the precisions of these methods are 99.84%, 99.95% and 92.02%, respectively.
  • Minority Language Information Processing
    Turdi Tohti, Akbar Pattar, Askar Hamdulla
    2014, 28(4): 140-144.
    In text classification based on machine learning, traditional Uyghur segmentation shows obvious deficiencies and limitations. This paper uses another automatic Uyghur word segmentation method, named dme-TS. This method no longer uses the inter-word space as a natural delimiter, but uses a combined statistic (dme) to estimate the agglutinative strength between two adjacent Uyghur words, taking positions with weak dme as segmentation points. The experimental results show that dme-TS reduces the dimension of the feature space and at the same time effectively improves the classification performance of the traditional algorithm that uses words as features.
  • Text Information Processing
    QIU Mingfeng
    2014, 28(4): 145-152.
    After analysing the current state of calligraphy font databases, this paper introduces the creation of calligraphy typehead, the binarization of paper calligraphy typehead, the image segmentation of digital calligraphy, and the description and encoding of curve lineament. It also discusses the design of an OpenType-based calligraphy font database with the addition of scripts for continuous strokes, methodicalness, styles, etc. The calligraphy font database is finally generated by employing font editing software and script tools. The proposed method provides a solution to the shortage of calligraphy fonts in web pages and mobile browsing, contributing to the inheritance and development of China's traditional culture of calligraphy in the field of electronic information exchange.
  • Text Information Processing
    WU Xie, LU Yuping, WANG Minggui
    2014, 28(4): 153-158.
    To study the ancient Yi language on the basis of ancient Yi books, it is essential to deal with the character set of ancient Yi at the very beginning. This work also lays the foundations for the development of information technology for the ancient Yi language. To achieve this goal, it is necessary to collect as many ancient Yi characters as possible and to adopt the comments and suggestions of Yi experts. After detailed identification, sorting and selection, variants of ancient Yi characters are excluded, and standards for ancient Yi are suggested for its vocabulary, font, pronunciation and order.