2012 Volume 26 Issue 6 Published: 14 December 2012
  

    Review
  • Review
    YE Na, ZHANG Guiping, HAN Yadong, CAI Dongfeng
    2012, 26(6): 1-11.
    Because raw machine translation output is often impractical to use directly, computer-assisted translation has become an active research topic and performs well in practice, substantially improving translation productivity. As computer-assisted translation projects grow in scale, it has become common to organize multiple spatially distributed users to complete a single translation task. This new translation mode is called collaborative translation. This paper surveys computer-assisted translation and collaborative translation. First, focusing on the three key problems of aided translation generation, post-editing, and feedback learning, it introduces the major approaches and the latest research progress in computer-assisted translation. It then discusses the connections and differences between computer-assisted translation and collaborative translation, elaborating on the major challenges in collaborative translation and the current solutions. Finally, it outlines the prospects for future research on collaborative translation.
    Key words: computer-assisted translation; collaborative translation; user; aiding translation; post-editing
  • Review
    CHEN Chen1, ZHAO Tiejun1, LI Sheng1, YANG Muyun1, QI Haoliang2
    2012, 26(6): 11-19.
    Most previous studies of personalized search apply personalization to all queries, and few have asked which queries can actually benefit from it. In this paper, we mine linguistic knowledge from a large-scale human knowledge base to predict a query's potential for personalization. The acquired linguistic knowledge includes conceptual terms, ambiguous terms, and synonymous terms, which are used to design corresponding features for predictive models. The knowledge mined from Wikipedia alleviates the data sparseness of query logs. Experimental results indicate that the approach is effective and feasible.
    Key words: query potential for personalization; linguistic knowledge; query logs
  • Review
    MA Hongyuan1,2, WANG Bing1
    2012, 26(6): 19-27.
    Query results caching and prefetching are crucial to the efficiency of Web search engines. This paper presents a novel approach to query results caching and prefetching based on user characteristics. We analyze query logs from a well-known Web search engine, design a query results prediction model for prefetching, and partition the cache according to the characteristics of the users. We then evaluate the approach on a real, large-scale query log covering two months, comparing it with traditional methods and theoretical upper bounds. Experimental results show that the approach achieves an increase of 3.03% to 4.17% for all requests compared with state-of-the-art methods, and an increase of 20.52% to 28.2% for requests from the special group of users who contribute most to Web search engines.
    Key words: query results cache; user characteristics; performance optimization
  • Review
    WEN Kunmei, XU Shuai, LI Ruixuan, GU Xiwu, LI Yuhua
    2012, 26(6): 27-38.
    Microblog is a new kind of social network that emerged in the Web 2.0 era. It lets users post simply and quickly, anytime and anywhere, in an interactive form. These features have made microblogs boom on the Internet since 2006, when the Obvious company of the United States launched Twitter, the world's first microblog service. This paper first introduces the state-of-the-art research on Twitter, including 1) feature analysis of microblog social networks, e.g. the structure of the microblog user network, the analysis of user influence, and the mechanics of data diffusion in the information network; 2) semantic analysis, i.e. emotional semantic analysis of microblogs; 3) related applications of microblogs, e.g. event monitoring and warning, security, privacy, and real-time search. We then summarize the research on Chinese microblogs, including feature and knowledge discovery in Chinese microblogs and the differences between English and Chinese microblogs. Finally, we discuss open problems for future research on Chinese microblogs.
    Key words: Twitter; Chinese microblog; information process
  • Review
    LV Shaohua, YANG Liang, LIN Hongfei
    2012, 26(6): 38-45.
    Sentiment classification aims to determine the orientation of a review. Much work has been done on opinion mining within a specific domain, and the results show that supervised methods perform well. However, models built this way perform poorly when applied directly to heterogeneous domains. Cross-domain sentiment classification has therefore received increasing attention: it predicts the opinion of unlabeled reviews in one domain by making use of labeled text from another domain. For this purpose, this paper proposes a SimRank-based algorithm that connects the source domain and the target domain through the words they have in common, building a latent emotional space with the help of a sentiment dictionary. This enables an SVM model trained on the labeled source domain to predict the orientation of reviews in the target domain. Experimental results demonstrate the validity of this method.
    Key words: cross domain; sentiment classification; SimRank; SVM
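The abstract above names SimRank but does not give the graph construction used in the paper. As a rough illustration only, the following Python sketch implements the generic SimRank recurrence on a small directed graph. The function name `simrank`, the `in_neighbors` input format, the decay value, and the toy word graph are all assumptions made for this example, not details from the paper.

```python
from itertools import product
from typing import Dict, List, Tuple


def simrank(in_neighbors: Dict[str, List[str]],
            decay: float = 0.8,
            iterations: int = 10) -> Dict[Tuple[str, str], float]:
    """Generic SimRank: two nodes are similar if their in-neighbors are similar.

    in_neighbors maps every node to the list of nodes pointing at it.
    Every node that appears as a neighbor must also appear as a key.
    """
    nodes = list(in_neighbors)
    # Initial similarity: 1 for identical nodes, 0 otherwise.
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for u, v in product(nodes, repeat=2):
            if u == v:
                new_sim[(u, v)] = 1.0
                continue
            in_u, in_v = in_neighbors[u], in_neighbors[v]
            if not in_u or not in_v:
                # A node without in-neighbors is similar only to itself.
                new_sim[(u, v)] = 0.0
                continue
            total = sum(sim[(x, y)] for x in in_u for y in in_v)
            new_sim[(u, v)] = decay * total / (len(in_u) * len(in_v))
        sim = new_sim
    return sim
```

In this toy setting, two domain-specific words that are both pointed at by the same shared word receive a similarity equal to the decay factor, which is the intuition behind linking two domains through their common vocabulary.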
  • Review
    LI Jing, LIN Hongfei, LI Ruimin
    2012, 26(6): 45-51.
    Music emotion tag prediction plays an important role in music sentiment analysis. This paper presents a music emotion tag prediction algorithm based on a sentiment vector space model (s-VSM). First, we extract emotion words to build the sentiment vector space model. Then we use an SVM classifier to generate training samples and to obtain the collection of songs that shares the same main emotion category as the music to be predicted. Finally, by finding the k nearest songs, we obtain the emotion tags for recommendation. Experimental results show that s-VSM and the feature reduction method based on “emotional words-emotional label” co-occurrence perform better than the traditional word-based vector space model in mood classification. Moreover, predicting emotion tags from the classification result effectively prevents “main mood drift” in the music, thus achieving better tag prediction accuracy than the k-nearest neighbors method.
    Key words: tag prediction; feature reduction; mood classification; sentiment vector space model
  • Review
    ZHANG Ning, KONG Fang, LI Peifeng, ZHOU Guodong, ZHU Qiaoming
    2012, 26(6): 51-59.
    In event anaphora resolution, the antecedent of the anaphor is an event and the anaphor is a noun phrase. The two belong to different semantic categories, so most of the features used in entity anaphora resolution are not appropriate for event anaphora resolution. This paper proposes a machine learning framework for event pronoun resolution and describes instance creation and feature selection in detail. Experimental results on the OntoNotes 3.0 corpus show that the framework achieves a good F-measure.
    Key words: event pronoun resolution; machine learning approach; instances creation; features selection
  • Review
    LIU Weiwei, JI Lixin, LI Shaomei, HE Zanyuan
    2012, 26(6): 59-65.
    The conventional NAPs training method of projection matrix requires laborious parameter tuning process over the training corpus with information labels. It cannot remove all unwanted information and result in loss of desirable information. To tackle these problems, a discriminating weighted nuisance attribute projection (DWNAP) is proposed in this paper. DWNAP quantitatively estimates the source of nuisance based on the normalized scatter of the given languages eigenvalues for discriminating weighting in training of projection matrix. Experiments on Chinese, Japanese and English show the advantage of the proposed DWNAP, with a relative reduction in the equal error rate (EER) for about 7.51% compared with the traditional NAP.
    Key wordslanguage recognition; mismatch compensation; discriminating weighted nuisance attribute projection(DWNAP); nuisance attribute projection(NAP)
  • Review
    ZHANG Kunli, ZHAO Dan, ZAN Hongying, CHAI Yumei
    2012, 26(6): 65-72.
    Adverbs in modern Chinese play complex syntactic roles, and their usages are highly individual. Much research has therefore focused on adverb usage. In this paper, we introduce a triune knowledge base (usage dictionary, usage rules, and usage corpus) of contemporary Chinese adverbs that we have built. Based on this knowledge base, we first design usage rules to label adverb usages in the People's Daily corpus automatically, achieving an accuracy of 84.86%. We then adopt a statistical strategy to label the usages of common adverbs with different feature templates, context window sizes, and models. Experiments show that the statistical methods produce preferable results for the automatic recognition of adverb usages.
    Key words: automatic recognition of adverb usage; adverb usage rule; conditional random fields; maximum entropy; support vector machine
  • Review
    ZAN Hongying, ZHOU Lijuan, ZHANG Kunli
    2012, 26(6): 72-79.
    Conjunctions connect words, phrases, clauses, sentences, and even sentence groups. A conjunction phrase consists of the words or phrases connected by conjunctions, and such phrases vary in length and in the relations they express. Drawing on the conjunction usages in the functional word usage knowledge base, the paper formulates a rule-based method for recognizing conjunction structure phrases. The paper also uses conditional random fields to build a statistical model for conjunction phrase recognition based on conjunction usage. Results indicate that the statistical method performs better than the rule-based method, and that conjunction usage is beneficial to conjunction phrase recognition.
    Key words: conjunction phrase; conjunction usages; conditional random fields
  • Review
    YUE Dapeng1, RAO Lan2, WANG Ting1
    2012, 26(6): 79-85.
    Multi-document summarization, which aims to minimize unnecessary reading time, is of great value nowadays. Since news today is usually organized by topic, this paper takes advantage of that organization and proposes a topic-based multi-document summarization method employing MMR. The method uses the key words of the topic description as the basis for sentence scoring, together with traditional features such as sentence position. Experimental results on the TDT4 corpus indicate that the proposed method performs better than two baseline systems, especially at a compression ratio of 5%.
    Key words: automatic summarization; topic; natural language process; news
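The abstract above combines topic key words with MMR (maximal marginal relevance). The following Python sketch shows one way such a selection loop could look. It is a simplified illustration, not the paper's method: the Jaccard similarity, the whitespace tokenizer, the weight `lam`, and the function name `mmr_select` are all assumptions, and sentence position is left out.

```python
from typing import Iterable, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenizer (an assumption for this sketch)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two token sets; 0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def mmr_select(sentences: List[str],
               topic_keywords: Iterable[str],
               k: int,
               lam: float = 0.7) -> List[str]:
    """Greedily pick k sentences that match the topic but differ from each other."""
    topic = {w.lower() for w in topic_keywords}
    tokens = [tokenize(s) for s in sentences]
    selected: List[int] = []
    candidates = list(range(len(sentences)))

    def score(i: int) -> float:
        relevance = jaccard(tokens[i], topic)
        # Redundancy is the highest similarity to any sentence already chosen.
        redundancy = max((jaccard(tokens[i], tokens[j]) for j in selected),
                         default=0.0)
        return lam * relevance - (1.0 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [sentences[i] for i in selected]
```

A larger `lam` favors sentences that match the topic description, while a smaller one penalizes overlap with sentences already in the summary more heavily.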
  • Review
    LU Rong, ZHANG Yang, YANG Qing
    2012, 26(6): 85-91.
    The trend of a news event in a social network can be tracked by the moving average of its frequency of occurrence. This paper presents a novel method that tracks the shorter-term and longer-term trends of news using moving average lines over two time windows of different sizes. The momentum of the news trend is defined as the difference between the shorter moving average and the longer moving average. When the momentum is positive, the news is more likely to get hotter, and vice versa. Moreover, the change in the momentum value itself provides an even earlier indicator of the news trend. Experimental results show that the proposed method is simple and effective in predicting the news trend promptly and accurately.
    Key words: social network; news trend predicting; moving average
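The momentum defined in the abstract above (shorter moving average minus longer moving average) can be sketched in a few lines of Python. This is a minimal illustration, assuming a trailing simple moving average over per-time-slot mention counts. The function names, the window sizes, and the use of `None` for slots without a full window are choices made for this example, not details from the paper.

```python
from typing import List, Optional, Sequence


def moving_average(counts: Sequence[float], window: int) -> List[Optional[float]]:
    """Trailing simple moving average; None until a full window is available."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages: List[Optional[float]] = []
    running = 0.0
    for i, count in enumerate(counts):
        running += count
        if i >= window:
            running -= counts[i - window]  # drop the value leaving the window
        averages.append(running / window if i >= window - 1 else None)
    return averages


def momentum(counts: Sequence[float],
             short_window: int,
             long_window: int) -> List[Optional[float]]:
    """Short moving average minus long moving average for each time slot."""
    if short_window >= long_window:
        raise ValueError("short_window must be smaller than long_window")
    short_ma = moving_average(counts, short_window)
    long_ma = moving_average(counts, long_window)
    return [None if long_value is None else short_value - long_value
            for short_value, long_value in zip(short_ma, long_ma)]
```

For the counts `[1, 1, 1, 1, 5, 9]` with windows 2 and 4, the momentum is undefined for the first three slots and then takes the values 0.0, 1.0 and 3.0. The positive and growing values reflect a topic that is heating up.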
  • Review
    YUAN Jipeng1,2, ZHANG Jin1, GUO Yan1, DAI Yuan1, LI Jing3
    2012, 26(6): 91-98.
    Analyzing netizen importance enables us to distinguish important users from common ones, which provides a basis for analyzing communities of important users. After a thorough study of the factors affecting netizen importance, we propose a netizen importance evaluation model (NI) based on an index system. The model takes into account netizens' features in information publishing and interaction relations, and employs the AHP method to obtain the weights of the indexes. In our experiment on Twitter data, the results show that the model identifies more valuable network users.
    Key words: netizen importance; netizen modeling; netizen evaluation index
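The abstract above uses AHP (the analytic hierarchy process) to weight the indexes but does not list the indexes or their pairwise judgments. The Python sketch below shows how weights could be derived from a pairwise comparison matrix using the row geometric mean, a common approximation of the principal eigenvector that is exact when the matrix is perfectly consistent. The function name `ahp_weights` and the example matrix are hypothetical.

```python
import math
from typing import List, Sequence


def ahp_weights(pairwise: Sequence[Sequence[float]]) -> List[float]:
    """Approximate AHP priority weights with the row geometric mean method.

    pairwise[i][j] states how much more important index i is than index j,
    so pairwise[j][i] is expected to be its reciprocal.
    """
    n = len(pairwise)
    if n == 0 or any(len(row) != n for row in pairwise):
        raise ValueError("pairwise comparison matrix must be square and non-empty")
    geometric_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geometric_means)
    # Normalize so that the weights sum to 1.
    return [g / total for g in geometric_means]
```

For instance, with three hypothetical indexes where the first is judged twice as important as the second and six times as important as the third, the matrix `[[1, 2, 6], [1/2, 1, 3], [1/6, 1/3, 1]]` yields weights of approximately 0.6, 0.3 and 0.1.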
  • Review
    XUE Ran, MA Jun, HAN Xiaohui, CHEN Zhumin
    2012, 26(6): 98-109.
    This paper proposes a method based on aging theory to detect hot events in Flickr data. For each Flickr photo, visual words are first extracted and then combined with the content of the attached text to form a document. An LDA model is trained to predict the topic distribution of each document, which serves as the final vector representation of the document. An improved single-pass clustering algorithm, which takes the geographic information of a photo into account, is then proposed to detect events. Aging theory is then used to model the life cycles of the sequentially detected events, determining the energy value of each event in each time slot. Finally, hot events in a specific time slot are detected by ranking the events by their energy value. Experimental results on real Flickr data show that the proposed approach outperforms traditional event detection methods in terms of precision, recall, and F1 value.
    Key words: event detection; visual words; geographic information; LDA; aging theory
  • Review
    NI Yaoqun1,2,3, CAO Peng1,2, XU Hongbo1, TANG Huifeng3, CHENG Xueqi1
    2012, 26(6): 109-116.
    Distinguishing the Uyghur language from similar Arabic-script languages such as Arabic, Kazakh, and Kirgiz is an indispensable task in Uyghur information processing. The paper builds an n-gram based Uyghur language discrimination model over an optimized Uyghur character encoding scheme, achieving an accuracy of over 98%. The analysis reveals that the misclassified texts are concentrated in forum posts and microblogs because of their extremely short length (often only a few words). The paper therefore examines all common substrings among tokens appearing in web texts of the four languages and investigates the minimum string length required to determine the language of a string.
    Key words: Arabic-script Uyghur; language detection; longest common substring
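The key words above mention the longest common substring, which the paper uses when comparing tokens across languages. As a generic illustration only, the Python sketch below finds the longest common substring of two tokens with the standard dynamic programming approach. The function name and the Latin-script example tokens are placeholders chosen for readability, not data from the paper.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    Uses dynamic programming with two rows: cur[j] holds the length of the
    common suffix of a[:i] and b[:j]. If several substrings tie, the one
    that ends earliest in a is returned.
    """
    best_length = 0
    best_end = 0  # end position (exclusive) of the best match within a
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                current[j] = previous[j - 1] + 1
                if current[j] > best_length:
                    best_length = current[j]
                    best_end = i
        previous = current
    return a[best_end - best_length:best_end]
```

Tokens from different languages that share a long substring cannot be told apart from that substring alone, which is why the minimum string length needed for a reliable decision matters for short forum posts and microblogs.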
  • Review
    WANG Chao, ZHU Tong, LIU Yiqun, MA Shaoping
    2012, 26(6): 116-121.
    Query ambiguity, a sub-topic of query intent classification, has attracted increasing research attention. Most studies focus on the content-based ambiguity of search queries and neglect ambiguity in the type of user need. In this paper, a classification model for query needs type is proposed based on a study of such ambiguity. We build a system of user needs types using a web directory and test it on a large commercial search engine log. This article contributes a practical way to classify query needs type, which helps search engines improve by adjusting their ranking algorithms according to different user needs.
    Key words: query ambiguity; query intent classification; needs type
  • Review
    MA Bin, HONG Yu, LU Jianjiang, YAO Jianmin, ZHU Qiaoming
    2012, 26(6): 121-129.
    Microblog is a novel model of individual publication on the Internet that makes significantly more information open and interactive. Using topic detection techniques to classify and organize microblog texts by topic helps users access the information that interests them in a dynamic environment. To deal with short, semi-structured, context-dependent microblog texts, we propose a thread-based two-stage clustering method. In the first phase, the temporal-author-topic (TAT) model is applied to clean the thread, namely to filter out noisy microblog texts. In the second phase, the microblog texts within each thread are merged to form thread texts for global topic detection. Experimental results show that the approach achieves good performance, with an F-measure of 31.2%.
    Key words: microblog texts; topic detection; TAT model; thread information; LDA feature selection