2017 Volume 31 Issue 2 Published: 15 April 2017
  

  • Article
    LI Hongzheng, JIN Yaohong
    2017, 31(2): 1-10.
    As an important type of phrase, prepositional phrases (PPs) are widely distributed in Chinese. Proper identification of PPs therefore has a positive and important impact on various tasks and applications in the field of natural language processing. This paper surveys recent studies on identifying Chinese PPs and discusses them in detail from several perspectives: research objects, experimental evaluation, and research methods. It concludes by summarizing several features of research on Chinese PP identification and offering suggestions for future work.
  • Article
    ZHAO Qingqing, SONG Zuoyan
    2017, 31(2): 11-17.
    Based on the qualia structure and semantic types proposed in Generative Lexicon Theory (GLT), we annotated disyllabic metaphorical noun-noun compounds in Mandarin Chinese to explore the qualia roles involved in the process of metaphor and the correlation between the semantic type of the constituent morphemes and that of the compounds. We found that the formal role is the one most frequently involved in the process of metaphor, which is motivated by the nature of human cognition. In addition, natural types tend to involve the constitutive role, while artifactual types are likely to involve the telic role. Moreover, the semantic type of the morphemes can predict the semantic type of the compounds.
  • Article
    SONG Rou, GE Shili, SHANG Ying, LU Dawei
    2017, 31(2): 18-24.
    In text information processing, the clause is regarded as the basic unit and the sentence as the compound unit. So far, the lack of operational definitions for these two concepts has hindered the development of Chinese information processing. This research roughly defines the sentence as a Sufficient Generalized Topic Structure and, on that basis, defines the clause as a Topic Sufficient Clause. Both definitions are put forward with linguistic and cognitive foundations.
  • Article
    LIU Hao, HONG Yu, YAO Liang, LIU Le, YAO Jianmin, ZHOU Guodong
    2017, 31(2): 25-35.
    Identifying and locating domain-specific bilingual websites is a crucial step in Web-based bilingual resource construction. However, the quality of sentence pairs varies among bilingual websites. In contrast to existing methods, which focus only on sentence-internal features, we explore the origin information of sentence pairs to identify and filter out low-quality pairs. We hypothesize that if a website is authoritative in the target domain, it tends to contain more high-quality sentence pairs. We therefore propose a HITS-based optimization method for mining domain-specific bilingual sentence pairs. In this method, we first construct a directed graph model based on the link information among the websites. Second, we propose a HITS-based method for evaluating the authority of websites. Finally, we extract sentence pairs only from the authoritative websites and use them to enlarge the training set of our machine translation system. In experiments on the education domain, our system achieves an improvement of 0.44% in BLEU score compared with the existing method. A further proposed GHITS method achieves an additional improvement of 0.40% in BLEU score.
  • Article
    XIAN Yantuan, YU Zhengtao, HONG Xudong, ZHANG Lei, GUO Jianyi
    2017, 31(2): 36-41.
    A collaborative entity disambiguation method based on weighted feature overlap relatedness is proposed in this paper. The method makes use of weighted feature overlap relatedness to compute the similarity between entity names. We define several different similarity formulas for computing the entity similarity matrix, and then use the affinity propagation clustering algorithm to obtain the disambiguation results. Evaluation on the CLP-2012 corpus shows that our method achieves competitive performance, attaining 84.01% precision, 87.75% recall and 85.65% F-score.
  • Article
    ZHANG Yapeng, YE Na, CAI Dongfeng
    2017, 31(2): 42-48.
    In many domains, the performance of fully automatic machine translation is still not satisfactory. To obtain error-free translations, human translators need to post-edit the output of automatic translation systems. Under the framework of interactive machine translation, the translation system and the translator work collaboratively. The translator validates the longest correct prefix in the translation provided by the system, and the system predicts the suffix to complete the sentence. On the basis of the phrase-based translation model, this paper builds an interactive machine translation system. Considering the characteristics of interactive machine translation, syntactic subtree information is used to guide the extension of translation hypotheses. Experiments show that this method can effectively reduce the interaction time between the human and the computer.
  • Article
    LIU Ying, CAO Xiang
    2017, 31(2): 49-54.
    We propose a method for translating English names into Chinese using a search engine. The method makes use of supporting words, co-occurrence rules of English and Chinese names, transliteration similarity and translation probability. First, the translation candidates of English names are obtained by means of the search engine. We use name tagging results, supporting words, co-occurrence rules of English-Chinese names and syllable length to obtain translation candidates from an online corpus. Supporting words help to retrieve more relevant names. Co-occurrence rules and syllable length make the translations of an English name follow the regularities of co-occurrence and transliteration. Then the translation candidates are ranked according to transliteration similarity and translation probability. English names are mostly translated according to their pronunciation, and transliteration similarity helps to judge how similar the pronunciations are. We use the translation probability to estimate statistically how likely two words are to be translations of each other. The experimental results show that supporting words, co-occurrence rules, transliteration similarity and translation probability all help to improve the precision of name translation.
  • Article
    LIU Shuangjun, JIN Xiaofeng, CUI Rongyi
    2017, 31(2): 55-60.
    This paper presents a pitch-based method for automatically recognizing the Korean dialects of China, the Republic of Korea and the DPRK. First, the shifted delta coefficients of pitch are extracted as feature parameters because of their strong discriminability. Second, a layered SVM algorithm and a voting mechanism are adopted to obtain the optimal classification result. Experimental results show that the recognition rate of the proposed method is better than that of the conventional method based on shifted delta cepstral coefficients.
  • Article
    LIU Huidan, HONG Jinling, NUO Minghua, WU Jian
    2017, 31(2): 61-70.
    A large-scale Tibetan text corpus is built, which includes 4.27 million sentences in 190 thousand documents, totaling 93 million syllables. Predefined rules are applied to check for spelling errors, detecting altogether 9,700 misspelt syllable types among the 20,743 types of Tibetan syllables that occur in the corpus (46.7628%). At the token level, however, the corpus is of very high quality, with only 27,427 misspelt syllables, roughly 0.0308% of the total 93 million syllable tokens. Further analysis shows that there are four main causes of these spelling errors: extra vowel sign(s); absence of a syllable delimiter or sentence delimiter; characters which can be written in different forms; and similar characters.
  • Article
    LI Yachao, JIA Yangji, JIANG Jing, HE Xiangzhen, YU Hongzhi
    2017, 31(2): 71-75.
  • Article
    LIU Wei, WANG Xu, ZHANG Yujia, LIU Zongtian
    2017, 31(2): 76-85.
    An event-based text corpus is the foundation for research on the detection, representation, reasoning and exploitation of events in the Semantic Web. This paper proposes an automatic annotation method for event-based texts in order to construct a large-scale emergency news corpus. First, this paper presents an event structure model as the event-based knowledge unit. Second, on the basis of text processing by LTP, we apply PrefixSpan to mine the rules of event elements from a small-scale available corpus. Third, in combination with a customized dictionary of event elements, the denoters are expanded by Tongyici Cilin (Extended). In the experiment, the automatic annotation method is compared with the manual tagging method and Stanford CoreNLP NER, showing that this method can effectively improve the efficiency of event-based text annotation.
  • Article
    LIU Bingyang, WU Dayong, LIU Xinran, CHENG Xueqi
    2017, 31(2): 86-91.
    The supervised character sequence labeling model is a popular method for the Chinese named entity recognition (NER) task. In practice, it is found to suffer from word boundary errors, which account for roughly 47.5% of all errors. This paper incorporates global word boundary features into an averaged perceptron model. Experiments indicate that the F value for recognizing person names, location names and organization names is improved by 0.04, and that the proportion of boundary errors among all errors is reduced.
  • Article
    WANG Dandan, CHEN Qingcai, WANG Xiaolong, TANG Buzhou
    2017, 31(2): 92-98.
    Macro feature extraction is a typical class of feature extraction methods for text categorization. These methods fall into two categories: supervised macro feature extraction and unsupervised macro feature extraction. In this paper, we study the effect of fusing the two categories of macro features, both of which have been shown to benefit text categorization. In particular, two types of supervised macro features and three types of unsupervised macro features are taken into account. Experiments conducted on three corpora, including two public corpora (i.e., Reuters-21578 and 20-Newsgroup) and one automatically constructed corpus, show that the fusion of supervised and unsupervised macro features is more effective than using either of them individually.
  • Article
    ZHAO Jingsheng, ZHANG Li, ZHU Qiaoming, ZHOU Guodong
    2017, 31(2): 99-106.
    Using natural language processing and complex network analysis, the social networks in Chinese literature are extracted and analyzed. Taking the “Romance of the Three Kingdoms” as an example, this paper extracts the social networks, with the characters of the novel as nodes, the connections between characters as edges, and the number of co-occurrences of the characters as edge weights. The social networks are then analyzed for node degree distribution, centrality, clustering characteristics, etc. The results show that the characters in Chinese literature exhibit obvious small-world properties and a limited power-law distribution. In the “Romance of the Three Kingdoms”, the distribution of characters also shows clear community characteristics, as well as versatility and diversity.
  • Article
    QIU Peiyuan, ZHANG Hengcai, YU Li, LU Feng
    2017, 31(2): 107-116.
    Microblog messages usually contain a great deal of real-time traffic information, which can complement sensor-based traffic information collection technologies. In this paper, we propose an automatic event labeling method to extract traffic information from microblog messages. Specifically, we apply spatial relation identification between geographic entities to event extraction in order to determine the spatial elements in traffic event messages. First, a conditional random field model is used to label the event roles in the message texts. Second, the relations between the roles and the relations between the elements are tagged by SVM models. Experiments on Sina microblogs show that the precision and recall of the proposed approach both exceed 90%, outperforming the well-known pattern matching method.
  • Article
    CHEN Yadong, HONG Yu, WANG Xiaobin, YANG Xuerong, YAO Jianmin, ZHU Qiaoming
    2017, 31(2): 117-125.
    Event extraction aims at detecting certain specified types of events that are mentioned in the source language data. Existing methods based on supervised learning often suffer from data sparseness and imbalanced distribution, producing low recall as a result. In this paper, we investigate frame semantic knowledge to improve event extraction. Taking the frame type as a general feature and mapping frames to events, we combine the event recognition model with the frame recognition model for a joint decision. Experiments show that, compared with the previous event recognition model, this method achieves a 6.44% (5.74%) gain in recall and a 1.45% (0.83%) gain in F1 for the task of trigger (event) identification.
  • Article
    LIU Meiyan, HUANG Gaijuan
    2017, 31(2): 126-131.
    In this paper, we design a content-security-oriented filtering model. The model adopts a two-tier filtering mode: the first tier is subject-based text filtering, and the second tier is tendency-based filtering of the relevant texts. With the sentence as the basic processing unit, the model adopts dependency parsing to obtain the semantic framework. By combining the HowNet-based semantic orientation with the semantic framework, harmful information can be identified and filtered with better accuracy and efficiency.
  • Article
    XU Anying, JI Zongcheng, WANG Bin
    2017, 31(2): 132-138.
    In recent years, with the popularity of the Internet and the explosive growth of knowledge, community question answering websites have accumulated a large number of users and a large amount of content, including a large amount of low-quality text, which makes it much harder for users to retrieve correct answers. Most existing work on answer quality prediction in community question answering uses the pointwise method to train a classification model. However, different questions have different levels of difficulty and thus place different requirements on their answers. In addition, some answer features cannot be easily characterized by the pointwise method. Therefore, this paper uses the pairwise method to predict answer quality. Moreover, previous work has shown that the number of answers to a question is useless, or even redundant, for predicting answer quality in community question answering, and the same conclusion holds for the time difference factor. This paper combines these two features into one new feature. Experimental results show that the new feature significantly improves prediction performance.
  • Article
    LIU Cheng, SHA Ying, JIANG Bo, GUO Li
    2017, 31(2): 139-145.
    Various types of accounts exist in social networks, including normal individual users, the online water army, zombie fans, official organizations, and so on. We define individual accounts whose behavior exhibits organizational characteristics as implicit organizations. Operated by a team, an implicit organization account shows no individual behavior pattern but instead follows the pattern of an official organization. Effectively discovering implicit organizations is important for analyzing how public opinion trends spread in social networks, for advertising recommendation, and so on. Taking Sina Weibo data as an example, this paper investigates the classification of individuals and implicit organizations. We manually labeled a total of 583 accounts and summarized 22 related features to build a Naive Bayes model and a decision tree model. Experiments demonstrate effective identification of implicit organizations, with 86.4% precision.
  • Article
    LI Hui, MA Xiaoping, SHI Jun, ZHONG Zhaoman, CAI Hong
    2017, 31(2): 146-153.
    Due to the rapid growth of microblogs, bloggers have difficulty locating the microblogs they are interested in. To deal with this information overload, various approaches including message filtering, recommendation and search have been investigated. Focusing on recommending bloggers or microblog posts through the trust model and social relationships, this paper applies the LDA topic model and matrix factorization to infer the topic distribution of microblogs and user interests. According to the experimental results, the proposed method can effectively address personalized microblog recommendation.
  • Article
    ZHUAN Yue, XIONG Jinhua, CHENG Xueqi
    2017, 31(2): 154-162.
    To take full advantage of users social characteristics and address the diversity of tag recommendation, we present a method for user tag recommendation, aiming to combine users social characteristics and the diversity of tag recommendation. We use topic model to get a users potential semantic topics from his tweets, and then cluster the users followed by this user, i.e. using the potential semantic topics to divide the users into different areas. Each area can reflect the interest that attracts the user to follow. We select several representative tags by sorting the tags in the area based on TF-IDF. Then, we combine and sort different areas of representative tags to get top-K tags for recommendation. Experiment shows that our approach not only can recommend diversity tags but also reflect the users interest and hobbies.
  • Article
    XIAO Liumingjing, ZHOU Zhi, ZOU Xiaojun, HU Junfeng
    2017, 31(2): 163-168.
    With the growing number of manuscripts, reviewer assignment has become an increasingly laborious task for conference organizers, journal editors and grant administrators. In developing computer-aided reviewer assignment for this purpose, measuring the relevance between manuscripts and reviewers is a key issue. This paper presents a relevance measurement method based on domain ontology. The method includes keyword extraction from the manuscript, domain ontology mining, and manuscript-reviewer relevance measurement based on the network flow algorithm. Preliminary experiments show that this method performs well in the task of domain assignment of NSFC proposals and outperforms a method based on string similarity.
  • Article
    REN Juwei, YANG Liang, WU Xiaofang, LIN Yuan, LIN Hongfei
    2017, 31(2): 169-178.
    Microblogs form a large and complicated public opinion platform on the Internet. In this paper, we demonstrate how microblogs can be used to predict real-world public sentiment trends around events. First, considering the special properties of microblogs, namely the absence of context and the sparseness of features, we use the hyponymy relationship between words to perform semantic extension for each microblog. Second, with the help of semantic features and affective commonsense knowledge, we decide the sentiment of each microblog by constructing a double-layer text classifier. Finally, the public sentiment trend of each event is predicted through time series sentiment analysis of microblogs. The experimental results show that our sentiment analysis method performs better than state-of-the-art classification methods. In addition, the sentiment trends of events are largely consistent with the development of the real-world situation.
  • Article
    XU Shuaishuai, DAI Xinyu, HUANG Shujian, CHEN Jiajun
    2017, 31(2): 179-186.
    Valuable microblog comments can be presented to readers or supplied to tasks such as public opinion analysis and text mining. To detect such valuable comments, this paper presents an unsupervised comment analysis method. First, we use a search engine to expand the microblog text. Second, we use a correlation measure to obtain the most valuable comments and the least valuable comments, respectively. Finally, we build a comment classification model to assess comment value. The experimental results show that our method performs well on the task of valuable comment recognition.
  • Article
    ZHAO Yanyan, QIN Bing, SHI Qiuhui, LIU Ting
    2017, 31(2): 187-193.
    The rapid development of social media, such as microblogs, brings a wealth of information as well as challenges for sentiment analysis. The limited size of Chinese sentiment lexicons is one critical factor affecting the performance of sentiment analysis. This paper proposes a simple statistical method to mine large numbers of sentiment words and phrases from microblogs and construct a large-scale lexicon of 100,000 words/phrases. We apply this large-scale lexicon to Chinese microblog sentiment classification, and the results confirm a clear performance improvement.
  • Article
    PENG Min, XI Junjie, DAI Xinyuan, HE Yanxiang
    2017, 31(2): 194-203.
    Collaborative filtering achieves personalized recommendation based on the similarity between items or users. However, data sparseness affects the calculation of similarity, leading to low recommendation accuracy. Most traditional recommendation algorithms consider only the rating matrix between users and items, while ignoring the item reviews written by users, which offer valuable information about users' preferences for different attributes of the items. In this paper, we propose a novel recommendation algorithm, called SACF (sentiment analysis collaborative filtering), which considers the impact of review texts on the prediction of the final score of items. By incorporating the LDA topic model, SACF can extract K latent attribute aspects of the items and compute user similarity according to the sentiment tendency in these attribute aspects. Our experimental results on a Jingdong review dataset demonstrate that the proposed method not only alleviates the problem of data sparseness in the collaborative filtering scheme, but also improves recommendation accuracy.
  • Article
    MA Chunping, CHEN Wenliang
    2017, 31(2): 204-211.
    Recommender systems are widely used on e-commerce websites. Traditional recommendation algorithms, e.g. collaborative filtering, predict the degree of a user's preference for an item based on the user's scoring history. With the development of the Internet, e-commerce websites pay more attention to user interactions, which leads to a great deal of user-generated content such as comments, geographic locations and social relationships. Compared with user ratings, user comments express opinions on different facets of the item. By taking full advantage of user-generated content, user preferences can be further discovered. In this paper, we propose an approach that uses word embeddings to analyze review comments and design a novel system to predict the scores. Empirical experiments on a large review dataset show that the proposed approach can effectively improve the precision of the recommender system.
  • Article
    XU Sukui, DAI Lirong, WEI Si, LIU Qingfeng, GAO Qianyong
    2017, 31(2): 212-219.
    Two methods under the deep neural network acoustic modeling framework are proposed to improve the estimation of posterior probability for pronunciation evaluation of freely spoken speech: 1) the posterior probability is re-estimated from more accurate recognition results, obtained by employing an RNN language model to re-score the N-best candidates produced by the first decoding pass; 2) the influence of dialect on the posterior probability is taken into account by incorporating likelihood scores produced by dialect-clustered nodes added to the deep neural network acoustic model, which is re-trained in a multilingual style. Experimental results show that these methods increase the correlation between posterior probabilities and human scores by 3.5% and 1.0% respectively, and that combining the two methods achieves a 4.9% increase. In a real evaluation task, a 2.2% absolute improvement is observed in the correlation between machine scores and human scores.
  • Article
    PIAO Mingji, CUI Rongyi
    2017, 31(2): 220-225.
    A PCA-based character-level script identification method is proposed to identify Korean, Chinese and English scripts in an image. First, the space of eigenvectors is constructed using PCA, and the segmented character is reconstructed by projecting it into this space. Second, the relative entropy of the vertical and horizontal histograms between the original and the reconstructed image is calculated. Finally, the script is identified according to the Euclidean distance and relative entropy between the original and the reconstructed image. The experimental results show that the proposed method achieves 99.78% accuracy under fully correct wrong segmentation, which successfully addresses the script identification problem in Korean, Chinese and English multi-lingual document images.
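
As a rough illustration of the comparison step described in the last abstract above (PIAO Mingji, CUI Rongyi), the sketch below computes the Euclidean distance and the relative entropy (Kullback-Leibler divergence) of the row and column projection histograms between an original character image and a reconstructed one. It is a minimal sketch under stated assumptions, not the authors' implementation: the PCA reconstruction itself is not shown and the reconstructed image is simply passed in, all function names are hypothetical, images are assumed to be equally sized lists of rows of pixel values, and the clipping of negative values and the small smoothing constant `eps` are choices made here so that the logarithm is always defined.

```python
import math


def projection_histograms(image):
    """Return (row_sums, column_sums) for an image given as a list of rows."""
    row_sums = [sum(row) for row in image]
    col_sums = [sum(col) for col in zip(*image)]
    return row_sums, col_sums


def normalize(hist, eps=1e-9):
    """Turn a histogram into a probability distribution.

    Negative bins (possible after a linear reconstruction) are clipped to 0,
    and eps is added to every bin so that no probability is exactly zero.
    Both are assumptions of this sketch.
    """
    smoothed = [max(h, 0.0) + eps for h in hist]
    total = sum(smoothed)
    return [h / total for h in smoothed]


def relative_entropy(p_hist, q_hist):
    """Kullback-Leibler divergence D(P || Q) between two histograms."""
    p = normalize(p_hist)
    q = normalize(q_hist)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def euclidean_distance(image_a, image_b):
    """Pixel-wise Euclidean distance between two images of equal size."""
    flat_a = [v for row in image_a for v in row]
    flat_b = [v for row in image_b for v in row]
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)))


def reconstruction_scores(original, reconstructed):
    """Collect the three dissimilarity scores for one candidate script."""
    rows_o, cols_o = projection_histograms(original)
    rows_r, cols_r = projection_histograms(reconstructed)
    return {
        "distance": euclidean_distance(original, reconstructed),
        "kl_rows": relative_entropy(rows_o, rows_r),
        "kl_cols": relative_entropy(cols_o, cols_r),
    }
```

In a setting like the one the abstract describes, such scores would presumably be computed once per script-specific eigenspace and then compared; how the distance and the relative entropy are combined into a final decision is not stated in the abstract, so no decision rule is sketched here.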