2016 Volume 30 Issue 6 Published: 15 December 2016
  

  • Review
    SUN Maosong; CHEN Xinxiong
    2016, 30(6): 1-6.
    This paper addresses the necessity and effectiveness of encoding a human-annotated knowledge base into a neural network language model, using HowNet as a case study. Traditional word embeddings are derived from neural network language models trained on large-scale unlabeled text corpora, which suffer from two problems: the quality of the resulting vectors for low-frequency words is unsatisfactory, and sense vectors for polysemous words are unavailable. We propose neural network language models that systematically learn embeddings for all the semantic primitives defined in HowNet and, consequently, obtain word vectors, in particular for low-frequency words, as well as word sense vectors in terms of the semantic primitive vectors. Preliminary experimental results show that our models improve performance on both word similarity and word sense disambiguation tasks. It is suggested that research on neural network language models incorporating human-annotated knowledge bases will be a critical issue deserving attention in the coming years.
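    The core idea above — giving a rare or unseen word a vector composed from the vectors of its HowNet semantic primitives (sememes) — can be illustrated with a minimal sketch. The sememe inventory, the toy annotations, and the composition-by-averaging step below are illustrative assumptions, not the authors' actual model.

```python
import numpy as np

# Hypothetical sememe (semantic primitive) vectors; in the paper these would be
# learned jointly with a neural language model, here they are random placeholders.
rng = np.random.default_rng(0)
sememe_vecs = {s: rng.normal(size=50) for s in ["human", "occupation", "teach", "institute"]}

# Hypothetical HowNet-style annotation: word -> list of sememe sets, one per sense.
word_senses = {
    "教师": [["human", "occupation", "teach"]],   # "teacher"
    "学院": [["institute", "teach"]],             # "college"
}

def sense_vector(sememes):
    """Compose a sense vector by averaging its sememe vectors."""
    return np.mean([sememe_vecs[s] for s in sememes], axis=0)

def word_vector(word):
    """A word vector is the average of its sense vectors, so even a low-frequency
    word gets a usable vector as long as it is annotated in HowNet."""
    return np.mean([sense_vector(s) for s in word_senses[word]], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(word_vector("教师"), word_vector("学院")))
```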
  • Review
    KANG Shiyong; ZHANG Chen
    2016, 30(6): 7-14.
    Drawing on linking theory and event structure theory, this paper studies the correlation among lexical semantic categories, semantic roles and syntactic elements. We annotate the texts of the Chinese textbooks for primary and middle schools published by the People's Education Press to build an annotated corpus. Based on this corpus, we analyze the correlation between lexical semantic categories and semantic roles, and summarize the correlation characteristics of each lexical semantic category. We hope this study will benefit automatic syntactic parsing and semantic analysis.
  • Review
    PARK Minjun; YUAN Yulin
    2016, 30(6): 15-25.
    This study describes a propositional representation model for the Chinese BI (比) structure. The model is based on seven types of Comparative Elements (CEs), enhancing the resolution of the existing five-CE framework for analyzing comparatives. The model is fully visualized as the relational structure of two propositional descriptions, based on which we reveal three basic patterns of comparison and explicitly define the standard of asymmetrical comparison. Consequently, it provides an intuitive and easy way to analyze the complex, multi-layer predications embedded in the BI structure, which are the most elusive and tricky part of comparative relation extraction. Moreover, the model is compatible with the OWL ontology language owing to its basis in propositional logic. Accordingly, a small-scale ontology is built to demonstrate automatic relation extraction for BI comparatives.
  • Review
    TIAN Yuanhe; LIU Yang
    2016, 30(6): 26-34.
    In previous research on sense prediction for Chinese unknown words, lexical knowledge related to word formation has been used but not treated as a valuable form of knowledge representation in its own right. On the basis of morphemic concepts, this paper provides a multi-level solution to knowledge representation for Chinese unknown words. A model based on a Bayesian network is also constructed to analyze the semantic word formation of Chinese unknown words, effectively predicting their multi-level lexical knowledge. This kind of lexical knowledge representation is simple, intuitive and easy to extend. Experimental results show that this knowledge representation is of substantial value for sense guessing of Chinese unknown words and can meet application needs at different levels.
  • Review
    LIN Ju; XIE Yanlu; ZHANG Jinsong; ZHANG Wei
    2016, 30(6): 35-39.
    Prosodic boundaries play an important role in the naturalness and intelligibility of verbal expression, so prosody modeling is an important aspect of speech synthesis and understanding. Focusing on the interaction of adjacent tones, we propose a method of prosodic boundary detection based on tone nucleus features and a DNN model. The method computes boundary-related parameters from the tone nucleus features, which are then modeled by a deep neural network. For comparison, the baseline system uses syllable-level acoustic features. The experimental results show a relative 4% improvement achieved by the proposed method.
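    A minimal sketch of the classification step described above, with a small feed-forward network standing in for the paper's DNN. The six "tone-nucleus" features and the random labels are placeholders; real feature extraction from speech is outside this sketch.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for boundary-related parameters derived from tone-nucleus
# features of adjacent syllables (e.g. F0 of the preceding and following
# tone nuclei and the pitch reset across the juncture).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))          # 6 hypothetical tone-nucleus features
y = rng.integers(0, 2, size=200)       # 1 = prosodic boundary, 0 = none

# A small feed-forward network as a stand-in for the deep model.
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=1)
clf.fit(X, y)
print(clf.predict(X[:5]))
```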
  • Review
    QIU Likun; HUANG Kun; HE Baorong; KANG Shiyong
    2016, 30(6): 40-48.
    Negative expressions play an important role in deep semantic representation. Using corpus-based methods, this paper analyzes negative expressions and their usage in contemporary Chinese. First, we collect negative expressions and classify them into three types: explicit negatives, implicit negatives and negative constructions. Second, we analyze the rules governing negative expressions, covering those used in single-predicate structures, modality elements, predicate-complement structures, verbal coordinate structures, serial verb structures and pivotal sentences; we especially focus on the effect of negative expressions in multi-predicate structures on the meanings of propositions. An annotation scheme is also developed under the deep semantic representation framework. Finally, we investigate the distribution of negative expressions in multi-domain treebanks.
  • Review
    RAO Gaoqi; LI Yuming
    2016, 30(6): 49-58.
    Based on a diachronic corpus of modern Chinese newspapers spanning 70 years, statistical measures are applied to detect steady-state words. Altogether, 3,013 words are identified as candidates according to their corpus coverage, time sensitivity and diachronic classification. Among them, verbs and nouns each account for about one third, and the rest consists of adjectives and function words. The average word length is 1.7 characters; the words fall within the top 7,609 of the frequency list and cover 90% of the corpus. Basic morphemes and core words shape the features of the set in terms of POS and length.
  • Review
    MA Jianjun; PEI Jiahuan; HUANG Degen
    2016, 30(6): 59-66.
    The study of the automatic identification of English functional noun phrases (NPs) can transform the task of resolving the structural ambiguity caused by noun phrases into an NP chunking task. Functional noun phrases are noun phrases defined by their syntactic functions in clauses. On a corpus from the business domain, this study identifies both the scope of NP chunks and their syntactic function types by refining the part-of-speech (POS) tagset and adopting a conditional random fields (CRFs) model combined with semantic information. The Penn Treebank tagset is modified in pre-processing, and semantic features are added to the CRFs model to improve the recognition of adjunct types of noun phrases. Test results show that the system achieves an F-score of 89.04% in the open test using our gold-standard tags, and that refining the POS tagset is a better approach for NP chunking, increasing the F-score by 2.21% compared with the model using the Penn Treebank POS tags. This knowledge of English functional noun phrases is then combined with the NiuTrans SMT system, slightly improving English-Chinese translation performance.
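    The CRF chunking setup described above can be sketched roughly as follows, using the sklearn-crfsuite package as a generic CRF implementation. The toy sentence, the refined-POS/semantic feature names, and the function-encoding BIO tags are all assumptions for illustration, not the paper's actual feature set.

```python
import sklearn_crfsuite   # pip install sklearn-crfsuite

def token_features(sent, i):
    """Features for one token: word form, refined POS, and a coarse semantic
    class placeholder, plus the same for the left neighbour."""
    word, pos, sem = sent[i]
    feats = {"w": word, "pos": pos, "sem": sem}
    if i > 0:
        pw, pp, ps = sent[i - 1]
        feats.update({"-1:w": pw, "-1:pos": pp, "-1:sem": ps})
    else:
        feats["BOS"] = True
    return feats

# One toy sentence: (word, refined POS, semantic class) with BIO chunk tags
# that also encode the NP's syntactic function (here SBJ = subject).
sent = [("The", "DT", "det"), ("company", "NN", "org"), ("expanded", "VBD", "act")]
tags = ["B-NP-SBJ", "I-NP-SBJ", "O"]

X = [[token_features(sent, i) for i in range(len(sent))]]
y = [tags]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, y)
print(crf.predict(X))
```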
  • Review
    YE Dashu; HUANG Peijie; DENG Zhenpeng; HUANG Qiang
    2016, 30(6): 67-74.
    This paper applies product feature mining to the dialogue system of a mobile phone recommendation assistant, enhancing the system's focus during the interaction. A CBOW (continuous bag of words) language model is used to represent semantic clues. A feature framework with an exponentially elongated static window is introduced to capture important features of interactions between words at varying distances. We finally use a convolutional neural network (CNN) to perform the product feature mining task. The word embeddings representing semantic clues give the relation between the current word and the product feature, while the feature framework alleviates word ambiguity. Experiments show that our model outperforms state-of-the-art methods on product feature mining.
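    A minimal sketch of the CBOW step only, using gensim (>= 4.0), where sg=0 selects the CBOW architecture; the segmented review sentences and the query word are made up. In the paper these embeddings would then feed the exponential-window feature framework and the CNN feature miner, which are not reproduced here.

```python
from gensim.models import Word2Vec   # gensim >= 4.0

# Toy segmented dialogue/review sentences about mobile phones.
sentences = [
    ["这", "款", "手机", "屏幕", "很", "清晰"],
    ["电池", "续航", "不错", "但", "屏幕", "偏", "小"],
    ["拍照", "效果", "和", "电池", "都", "满意"],
]

# sg=0 selects CBOW, the language model used to encode semantic clues
# between words and candidate product features.
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=0, epochs=50)

# Word embeddings such as these would then be consumed by the CNN.
print(model.wv.most_similar("屏幕", topn=3))
```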
  • Review
    ZHAO Hongyan; LI Ru; ZHANG Sheng; ZHANG Liwen
    2016, 30(6): 75-83.
    Frame identification is a basic task in semantic role labeling, which assigns a correct frame to the labeled target word based on its semantic scene. At present, the state-of-the-art methods are primarily based on statistical machine learning, in which performance depends heavily on the quality of the extracted features. This paper proposes a DNN-based frame identification method that tries to capture the context of the target word automatically. Experiments on the Chinese FrameNet and the People's Daily (March 2003) show 79.64% and 78.58% accuracy, respectively.
  • Review
    AN Bo; HAN Xianpei; SUN Le; WU Jian
    2016, 30(6): 84-89.
    Triple classification is crucial for knowledge base completion and relation extraction. However, the state-of-the-art methods for triple classification fail to handle 1-to-n, m-to-1 and m-to-n relations. In this paper, we propose TCSF (Triple Classification based on Synthesized Features), which jointly exploits the triple distance, the prior probability of the relation, and the context compatibility between the entity pair and the relation. Experimental results on four datasets (WN11, WN18, FB13, FB15K) show that TCSF achieves significant improvements over TransE and other state-of-the-art triple classification approaches.
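    The synthesis of the three signals named above can be sketched as a single scoring function. The embeddings, the prior and compatibility tables, the combination weights, and the decision threshold below are all toy assumptions; only the TransE-style distance ||h + r - t|| follows a standard formulation.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 20

# Toy TransE-style embeddings for entities and relations.
ent = {e: rng.normal(size=dim) for e in ["Beijing", "China", "Paris"]}
rel = {"capital_of": rng.normal(size=dim)}

# Hypothetical statistics gathered from a training KB.
relation_prior = {"capital_of": 0.02}                     # P(r)
type_compat = {("city", "country", "capital_of"): 0.9}    # entity-pair/relation compatibility

def transe_distance(h, r, t):
    """TransE score: a correct triple should satisfy h + r ≈ t."""
    return float(np.linalg.norm(ent[h] + rel[r] - ent[t]))

def tcsf_score(h, r, t, h_type, t_type):
    """Synthesize distance, relation prior, and context compatibility
    into one score (weights are made up for illustration)."""
    d = transe_distance(h, r, t)
    prior = relation_prior.get(r, 1e-4)
    compat = type_compat.get((h_type, t_type, r), 0.1)
    return -d + 2.0 * np.log(prior) + 3.0 * compat

# Classify a triple as true if its score clears a tuned threshold.
score = tcsf_score("Beijing", "capital_of", "China", "city", "country")
print(score > -15.0)
```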
  • Review
    LI Bin; SONG Li; YIN Siqi; QU Weiguang; WANG Meng
    2016, 30(6): 90-99.
    As an important theory in cognitive science, prototype theory holds that properties can be used to distinguish the central and peripheral members of a category. However, there has been little quantitative evidence to support the theory. In this paper, we use a cognitive property bank of 230,000 "word-property" pairs to examine the theory on three categories: bird, fruit and transportation. The results show that in Chinese, the typical members of bird are sparrow and swallow, which share many properties with bird, whereas penguin and ostrich share very few properties with bird, notably lacking the key property fly. The data in the cognitive property bank basically support prototype theory, but we also notice that "little bird" has many properties, which makes it a candidate typical member of the category. We also distinguish between tree-based ontology and graph-based categorization by means of a bipartite graph.
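    The typicality comparison described above reduces to measuring how much of the category's property set each member shares. The property sets below are invented excerpts, not entries from the actual property bank, and the overlap ratio is only one of several possible typicality measures.

```python
# Hypothetical excerpts from a "word-property" bank for the category "bird".
properties = {
    "bird":    {"can_fly", "has_feathers", "lays_eggs", "has_beak", "sings"},
    "sparrow": {"can_fly", "has_feathers", "lays_eggs", "has_beak", "small"},
    "penguin": {"swims", "has_feathers", "lays_eggs", "has_beak", "lives_in_cold"},
}

def overlap(member, category="bird"):
    """Share of the category's properties that the member also has;
    a crude proxy for typicality under prototype theory."""
    cat = properties[category]
    return len(properties[member] & cat) / len(cat)

for m in ("sparrow", "penguin"):
    print(m, round(overlap(m), 2))   # sparrow scores higher: more typical
```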
  • Review
    DU Jiali; YU Pingfang
    2016, 30(6): 100-116.
    Based on a time-restricted experiment in which 126 English-major sophomores were required to decode 100 garden path sentences and control sentences, this article investigates the breakdown effect produced by Chinese learners of English in garden path sentence processing, quantifies the intensity of the breakdown effect, and makes a comparative study against a machine translation system with the Stanford parser. The garden path phenomenon is a conscious and controlled behavior. The encoding and decoding reflect both processing breakdown and cognitive overload, as well as the complex psychological and cognitive activities of human beings. The experiment shows that breakdown effects appear asymmetrically, with the highest frequency and intensity occurring in the multi-category breakdown, in contrast to the complementizer breakdown, object breakdown and embedded breakdown. In the human-computer comparative study, the machine's program decoding and the learners' cognitive decoding are not found to be completely resonant or absolutely co-occurrent.
  • Review
    YANG Siqin; JIANG Minghu
    2016, 30(6): 117-125.
    Adopting the event-related potential (ERP) technique and measuring reaction time, error rate and the N400, this paper investigates whether advanced Chinese-English bilinguals retrieve their second language while processing their native language. The results reveal that, under the implicit conditions, English pronunciation had no effect on reaction time. In the ERP results, when bilinguals made semantically related judgments, the N400 evoked in the language areas showed no significant difference across the implicit English pronunciation conditions. However, for semantically unrelated judgments, the N400 differed significantly across the implicit English pronunciation conditions. It is concluded that when advanced bilinguals make comparatively complex semantic judgments, the second language can be retrieved unconsciously.
  • Review
    TIAN Kun; KE Yonghong; SUI Zhifang
    2016, 30(6): 126-132.
    In semantic role annotation, searching for similar annotated sentences is a common way to analyze the corpus. Existing methods cannot take full advantage of verbs and related elements, so they fail to meet this need. This article develops a new verb-centered method to calculate Chinese sentence similarity. Based on semantic role annotation, the algorithm finds similar sentences by analyzing the semantic roles, matching the annotated sentences, and computing the similarity between the matched sentences. To obtain better results, the article also compares several word similarity measures, including algorithms based on HowNet and on distributed representations, and applies the best one in our algorithm. The experimental results indicate that the sentence similarity algorithm based on semantic role annotation performs better than traditional methods.
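    A minimal sketch of the verb-centered, role-aligned similarity described above. The role labels, the verb weight, and the character-overlap word similarity are placeholders; the paper itself compares HowNet-based and distributed-representation word similarity and keeps the better one.

```python
from difflib import SequenceMatcher

def word_sim(a, b):
    """Placeholder word similarity (character overlap); swap in a HowNet-based
    or embedding-based measure as the paper does."""
    return SequenceMatcher(None, a, b).ratio()

def sentence_sim(roles_a, roles_b, verb_weight=0.5):
    """Similarity of two role-annotated sentences, centred on the verb.
    roles_* map role labels (V, A0, A1, ...) to their filler strings."""
    verb = word_sim(roles_a["V"], roles_b["V"])
    shared = [r for r in roles_a if r != "V" and r in roles_b]
    if shared:
        args = sum(word_sim(roles_a[r], roles_b[r]) for r in shared) / len(shared)
    else:
        args = 0.0
    return verb_weight * verb + (1 - verb_weight) * args

s1 = {"V": "购买", "A0": "他", "A1": "一台电脑"}
s2 = {"V": "购置", "A0": "她", "A1": "一台笔记本"}
print(round(sentence_sim(s1, s2), 3))
```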
  • Review
    ZAN Hongying; XU Hongfei; ZHANG Kunli; SUI Zhifang
    2016, 30(6): 133-139.
    With the rapid development of the Internet, internet slang has become common, and new slang terms are constantly appearing. To deal with this challenge for natural language processing tasks such as sentiment analysis, product recommendation and question answering, an internet slang dictionary is necessary. This paper analyzes the problems encountered when collecting and annotating micro-blog texts, together with other internet resources, to build such a dictionary and the related corpus. Further, the potential applications of the dictionary and corpus are discussed.
  • Review
    WANG Shan; LIU Rui
    2016, 30(6): 140-146.
    The construction of a speech corpus is the foundation of research on spoken language. In this paper, a small-scale corpus is constructed from two representative talk shows, QiangqiangSanrenxing and LuYuYouyue. An annotation scheme consisting of 5 primary categories and 16 subtypes is developed to annotate the conversational structures. According to the statistics, there are 309 interrupted structures, 141 inserted structures, 111 repetitive structures, 653/589 question-answer structures, and 51/21 obstruction-correction structures, reflecting the unbalanced distribution of conversational structures. The form, nature and communicative tasks of the talk shows are the main factors influencing this distribution. In addition, conversational structures show certain patterns, so a trigram analysis is carried out to explore their combinations. It is found that the highest-frequency combination in the corpus is the question-answer adjacency pair, in addition to a large number of contingent combinations. The combination patterns of conversational structures not only reflect the style of the talk shows, but also help to analyze the functional modules in conversation and the formation of conversation strategies, and thus help us understand the operational mechanisms of conversation more deeply.
  • Review
    LU Dawei; WANG Xingyou; YUAN Yulin
    2016, 30(6): 147-155.
    Semantic knowledge resources containing extensive linguistic information are an important interface between linguistics and language engineering. In this paper, we study the automatic expansion of semantic knowledge resources, taking the Adjective Syntactic-Semantic Dictionary as an example. We aim to extend the vocabulary of the dictionary and its syntactic patterns using a large corpus. Specifically, our method classifies the words in the dictionary into 97 categories by their syntactic patterns and maps new words that do not exist in the dictionary into these categories, so that the whole task can be treated as a multi-class classification problem. The method rests on the fact that new words and dictionary words exhibit similar syntactic patterns in a large corpus. We construct the training data by distant supervision so as to reduce the effort of manual annotation. The training process combines shallow learning and a deep neural network, achieving promising results. The experimental results show that the deep neural network is able to learn the syntactic information and effectively improves the accuracy of the mapping task.
  • Review
    YAN Rong; GAO Guanglai
    2016, 30(6): 156-163.
    Classical pseudo-relevance feedback (PRF) usually takes the document as the unit, and this large extraction unit can decrease the quality of expansion. Applying topic analysis techniques, this paper proposes to use the semantic content of text as the expansion unit. Based on the proposed pseudo-document description of each document in the collection, the expansion terms are selected by implicit diversification at the finer level of document content. The experimental results on the real NTCIR-8 dataset show a clear improvement in PRF performance.
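    The general shape of the expansion step — scoring candidate terms over smaller content units rather than whole documents — can be sketched as follows. This is a deliberately simplified unit-frequency scorer under made-up data; it does not reproduce the paper's topic analysis or implicit diversification.

```python
from collections import Counter

def expansion_terms(pseudo_relevant_units, query_terms, k=5):
    """Score candidate terms by how many distinct pseudo-relevant content
    units (e.g. topic segments instead of whole documents) they occur in,
    then keep the top-k terms not already in the query."""
    df = Counter()
    for unit in pseudo_relevant_units:
        for term in set(unit):
            df[term] += 1
    candidates = [(t, c) for t, c in df.items() if t not in query_terms]
    candidates.sort(key=lambda x: -x[1])
    return [t for t, _ in candidates[:k]]

units = [
    ["语言", "信息", "处理", "模型"],
    ["语言", "资源", "建设", "处理"],
    ["信息", "检索", "反馈", "扩展"],
]
print(expansion_terms(units, query_terms={"信息", "检索"}))
```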
  • Review
    LI Guochen; LIU Shulin; YANG Zhizhuo; LI Ru; ZHANG Hu; QIAN Yili
    2016, 30(6): 164-172.
    Reading comprehension QA for the Chinese college entrance examination is much more difficult than general reading comprehension QA, in that it requires deeper linguistic analysis to understand the question and the semantic correlation between answers and questions. This paper proposes to extract candidate answer sentences by frame semantic matching and frame-to-frame semantic relations, and a manifold-ranking model is applied to propagate the frame semantic relevancy and select the top four candidate answers. The accuracy and recall on the Beijing college entrance examinations of the past twelve years are 53.65% and 79.06%, respectively.
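    The propagation step named above follows the standard manifold-ranking iteration f ← αSf + (1−α)y over a sentence affinity graph; the sketch below shows that iteration on a toy affinity matrix. The matrix values, the seed vector and α are made up, not the paper's frame-semantic relevancy scores.

```python
import numpy as np

def manifold_ranking(W, y, alpha=0.85, iters=100):
    """Iterate f = alpha * S @ f + (1 - alpha) * y, where S is the
    symmetrically normalized affinity matrix of candidate sentences."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))      # D^{-1/2} W D^{-1/2}
    f = y.copy()
    for _ in range(iters):
        f = alpha * S @ f + (1 - alpha) * y
    return f

# Toy affinity between 4 candidate answer sentences.
W = np.array([[0.0, 0.8, 0.1, 0.0],
              [0.8, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.7],
              [0.0, 0.1, 0.7, 0.0]])
y = np.array([1.0, 0.0, 0.0, 0.0])       # sentence 0 matches the question's frame
print(manifold_ranking(W, y).round(3))   # ranking scores; keep the top four
```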
  • Review
    WANG Yaohua; LI Zhoujun; HE Yueying; CHAO Wenhan; ZHOU Jianshe
    2016, 30(6): 173-181.
    Building on existing methods, including the LDA model, paragraph vectors and word vectors, we extract four kinds of text semantic dispersion representations and apply them to automatic essay scoring. This paper gives a vector form of text semantic dispersion from a statistical point of view and a matrix form that captures how the semantics are dispersed across the text, and experiments with multiple linear regression, a convolutional neural network and a recurrent neural network. The results show that, on a test set of 50 essays, adding the text semantic dispersion features reduces the root mean square error by 10.99% and increases the Pearson correlation coefficient by a factor of 2.7.
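    One plausible reading of a "statistical" dispersion feature — the average distance of sentence vectors from the essay centroid, fed to a linear regressor — is sketched below. The sentence vectors, essay scores, and the single-feature regression are illustrative assumptions, not the paper's four representations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def semantic_dispersion(sentence_vecs):
    """Average cosine distance of each sentence vector from the essay's
    centroid vector: one simple statistical view of semantic dispersion."""
    V = np.asarray(sentence_vecs)
    centroid = V.mean(axis=0)
    cos = V @ centroid / (np.linalg.norm(V, axis=1) * np.linalg.norm(centroid))
    return float(np.mean(1.0 - cos))

rng = np.random.default_rng(3)
# 50 toy essays: each a stack of sentence vectors (e.g. from LDA or doc2vec).
essays = [rng.normal(size=(int(rng.integers(5, 15)), 100)) for _ in range(50)]
scores = rng.uniform(60, 95, size=50)          # hypothetical human scores

X = np.array([[semantic_dispersion(e)] for e in essays])
reg = LinearRegression().fit(X, scores)
rmse = float(np.sqrt(np.mean((reg.predict(X) - scores) ** 2)))
print(round(rmse, 2))
```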
  • Review
    HUANG Peijie; WANG Jundong; KE Zixuan; LIN Piyuan
    2016, 30(6): 182-189.
    Due to the short length, diversity, openness and colloquial nature of out-of-domain (OOD) utterances, dialogue act (DA) recognition for OOD utterances remains a challenge in domain-specific spoken dialogue systems. This paper proposes an effective DA recognition method using a random forest and external information. An unlabeled Weibo dataset, which is not domain-specific yet shares the colloquialism and diversity of spoken dialogue, is used to train word embeddings by unsupervised learning. The trained word embeddings provide similarity computation for out-of-vocabulary (OOV) words in the training and test OOD utterances. Evaluation on a Chinese dialogue corpus in a restricted domain shows that the proposed method outperforms several state-of-the-art short text classification methods for DA recognition.
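    A minimal sketch of the classifier side: average the pre-trained embeddings of an utterance's tokens and feed the result to a random forest. The embedding table is random here (standing in for vectors trained on unlabeled Weibo text), and the utterances and DA labels are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
# Stand-in for word embeddings pre-trained on unlabeled Weibo text.
embed = {w: rng.normal(size=50) for w in ["你好", "谢谢", "再见", "多少", "钱", "这个"]}

def utterance_vector(tokens):
    """Average the embeddings of known tokens; unseen tokens are skipped."""
    vecs = [embed[t] for t in tokens if t in embed]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

train = [(["你好"], "greeting"),
         (["这个", "多少", "钱"], "other_question"),
         (["谢谢", "再见"], "closing")]
X = np.array([utterance_vector(t) for t, _ in train])
y = [label for _, label in train]

clf = RandomForestClassifier(n_estimators=100, random_state=4).fit(X, y)
print(clf.predict([utterance_vector(["再见"])]))
```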
  • Review
    Ayiguli Halike; Hasan Wumaier; Tuergen Yibulayin;
    Kahaerjiang Abiderexiti; Maihemuti Maimaiti
    2016, 30(6): 190-200.
    The generalization ability of Chinese-Uyghur statistical machine translation systems for time expressions, numerals and quantifiers is relatively weak. This paper uses a corpus-based approach to mine and extract Chinese time expressions, numerals and quantifiers, realizing context-based translation of ambiguous quantifiers. Experimental results show that the proposed method achieves F-measures of 93.23%, 90.15%, 96.55% and 87.58% for the translation of time expressions, numerals, unambiguous quantifiers and ambiguous quantifiers, respectively.
  • Review
    WANG Nan; XU Jin’an; MING Fang; CHEN Yufeng; ZHANG Yujie
    2016, 30(6): 201-207.
    The suffixes of Japanese predicates exhibit complex formations across different voices. Passive and potential predicates are formed with the same suffix derived from the same stem, which causes mistranslations in statistical machine translation. In this paper, a new method is proposed for rule selection among different voices. Maximum entropy models are built to classify passive and potential voice effectively, and the voice features are then integrated into the log-linear translation model. In the Japanese-to-Chinese translation task, large-scale experiments show that our approach improves the translation performance from 41.50 to 42.01 BLEU, and informativeness is 2.71% higher according to the human evaluation results.
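    The voice classification step can be sketched with a logistic regression model, which is the usual implementation of a maximum entropy classifier. The feature dictionaries (nearby case particles, verb stem, subject animacy) and the tiny training set are invented for illustration; they are not the paper's feature templates.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy feature dicts around an ambiguous predicate ending in られる:
# a nearby case particle, the verb stem, and subject animacy.
train_feats = [
    {"particle": "に", "stem": "食べ", "subj_animate": True},    # passive use
    {"particle": "が", "stem": "食べ", "subj_animate": True},    # potential use
    {"particle": "に", "stem": "見",   "subj_animate": False},   # passive use
    {"particle": "が", "stem": "見",   "subj_animate": True},    # potential use
]
labels = ["passive", "potential", "passive", "potential"]

# Logistic regression serves as the maximum-entropy voice classifier.
model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_feats, labels)
print(model.predict([{"particle": "が", "stem": "読", "subj_animate": True}]))
```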
  • Review
    TANG Wenwu; GUO Yi; XU Yongbin; FANG Xu
    2016, 30(6): 208-214.
    Identifying the default (omitted) objects and attributes of a comment is important in sentiment analysis of e-commerce website reviews. To resolve default comment objects and attributes, this paper proposes an effective identification method based on conditional random fields (CRFs). After applying an emotion dictionary to locate the opinionated comments, we treat the task as a sequence labeling problem and choose lexical and dependency parsing elements as features. The evaluation results show that the proposed method achieves reasonably good accuracy and recall.
  • Review
    SUN Xiao; HE Jiajin; REN Fuji
    2016, 30(6): 215-223.
    In social media there are many ironic or satirical posts, which imply certain emotional tendencies. However, the pragmatic tendency of these special language phenomena is often far from their literal meaning, which challenges text sentiment analysis in social media. This paper studies irony recognition in Chinese social media and constructs a corpus containing irony and satire. It demonstrates the importance of structural and semantic features of irony for its recognition in text. The paper also presents an efficient multi-feature hybrid neural network model, which fuses a convolutional neural network with an LSTM sequential model. The experimental results show that the proposed model is superior to traditional neural network models and the bag-of-words (BOW) model.
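    A minimal PyTorch sketch of one way to fuse a CNN branch and an LSTM branch for sentence classification, in the spirit of the hybrid model above. The vocabulary size, embedding and hidden dimensions, pooling choice and fusion-by-concatenation are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CnnLstmIrony(nn.Module):
    """A minimal CNN + LSTM hybrid: the convolution captures local structural
    cues, the LSTM captures sequential semantics; their pooled outputs are
    concatenated for the final irony/non-irony prediction."""
    def __init__(self, vocab_size=5000, emb=100, hidden=64, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.conv = nn.Conv1d(emb, hidden, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.fc = nn.Linear(hidden * 2, n_classes)

    def forward(self, x):                      # x: (batch, seq_len) token ids
        e = self.emb(x)                        # (batch, seq, emb)
        c = torch.relu(self.conv(e.transpose(1, 2))).max(dim=2).values
        _, (h, _) = self.lstm(e)               # h: (1, batch, hidden)
        return self.fc(torch.cat([c, h[-1]], dim=1))

model = CnnLstmIrony()
dummy = torch.randint(0, 5000, (4, 20))        # a batch of 4 toy micro-blog posts
print(model(dummy).shape)                      # torch.Size([4, 2])
```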
  • Review
    Huaquecairang; ZHAO Haixing
    2016, 30(6): 224-229.
    This paper proposes a discriminative method of identifying clauses to address the performance degradation caused by Tibetan compound sentences. In this method, the compound sentence is first divided into different syntactic analysis units according to the inherent features of conjunctions. Each clause is then parsed independently. Finally, the whole dependency tree is generated by merging the parses of the clauses. Experimental results show that the method reduces parsing complexity and boosts parsing accuracy to 88.72%.
  • Review
    Hasi; Buyinqiqige
    2016, 30(6): 230-235.
    Mongolian homograph disambiguation is one of the difficulties of Mongolian information processing. This paper puts forward a method of homograph disambiguation based on a Mongolian noun semantic network. Experimental results of the homograph disambiguation are provided.