2017 Volume 31 Issue 6 Published: 15 December 2017
  

  • Language Analysis and Calculation
    WANG Mengxiang, RAO Qi, GU Cheng, WANG Houfeng
    2017, 31(6): 1-9.
    The acquisition and expression of metaphorical knowledge is the basis of metaphor computation. In this paper, metaphor knowledge is regarded as the relationship between the source and target domains, described by features and attributes. To obtain the metaphorical features of Chinese concrete nouns, we use two methods: importing idiomatic expressions and matching syntactic patterns. For the metaphor knowledge in idioms, we derive accurate metaphorical features and attributes from idiom dictionary definitions. For general concrete nouns, whose metaphor knowledge is more complex, we rely mainly on corpora and search engines to acquire the different metaphorical features and corresponding attributes of the same noun by keyword and syntactic pattern matching. This work may contribute to the construction of a semantic feature system for Chinese nouns and to the understanding of Chinese metaphorical sentences.
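    A minimal sketch of the syntactic-pattern-matching idea described above, assuming a single hypothetical simile pattern "像…一样…"; the pattern and example sentence are illustrative, not the paper's inventory.

    ```python
    # A toy regex for the simile pattern "像 N 一样 A": the noun N is a metaphor
    # source and A is a candidate metaphorical attribute of that noun.
    import re

    pattern = re.compile(r"像(?P<noun>\w+?)一样(?P<attr>\w+)")
    sentence = "他的心像石头一样硬。"
    for m in pattern.finditer(sentence):
        print(m.group("noun"), "->", m.group("attr"))   # 石头 -> 硬
    ```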
  • Language Analysis and Calculation
    BEI Chao, HU Po
    2017, 31(6): 10-17.
    Neural network models are widely used in natural language processing, image recognition and other fields. Whether the prior knowledge exploited by traditional methods also benefits neural network models remains an open issue. In this paper, we explore the influence of linguistic prior knowledge on neural network models in several NLP tasks. According to the characteristics of each task, we compare the effects of different kinds of prior knowledge and of different input locations on different neural network models. A large number of comparative experiments show that, when injected at reasonable locations of some neural networks, prior knowledge can speed up convergence and improve the results, although this does not hold under all conditions.
  • Language Analysis and Calculation
    RAO Gaoqi, LI Yuming
    2017, 31(6): 18-24.
    State-of-the-art research tends to divide modern Chinese into four periods according to political history: from the New Culture Movement to 1949, 1950—1966, 1967—1976, and 1977 to the present. Although written language is deeply influenced by social and political movements, language evolves according to its own patterns, and language staging should be based on language data.
    In this paper, we regard language staging as a text classification problem. Using time-sensitive words and their frequencies as features, the K-means and EM algorithms are applied to cluster a corpus covering 70 years of the People's Daily. A hierarchical staging scheme is formed and shown as a dividing tree, revealing the beginning of the Reform and Opening-up as the major divide in written language use over the past century.
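    A minimal sketch of the clustering step, assuming a per-year matrix of time-sensitive word frequencies; the random matrix, word count and number of stages below are placeholders, not the paper's data or settings.

    ```python
    # Cluster 70 "years" of newspaper text into 4 stages with K-means, using
    # per-year frequencies of time-sensitive words as features (random here).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    years = np.arange(1946, 2016)            # 70 yearly feature vectors
    X = rng.random((len(years), 50))         # rows: years, cols: word frequencies

    stages = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    for k in range(4):
        member_years = years[stages == k]
        print(f"stage {k}: {len(member_years)} years, e.g. {member_years[:5]}")
    ```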
  • Language Analysis and Calculation
    LIU Tong, HUANG Degen, ZHANG Cong
    2017, 31(6): 25-32.
    A prepositional phrase recognition method based on the fusion of multiple models is proposed to handle coordinate prepositional phrases and improve the recognition of nested prepositional phrases. First, a simple noun phrase recognition model is used to identify and merge such phrases in the corpus, simplifying the corpus and reducing the internal complexity of prepositional phrases. Then, a CRF model is used to identify the inner layer of nested prepositional phrases: if a prepositional phrase is nested, its inner layer is recognized; otherwise, the whole prepositional phrase is recognized. Finally, the recognized inner prepositional phrases are merged back into the corpus and the feature information is revised in order to train a new model for outer prepositional phrase recognition. In addition, after both inner and outer prepositional phrases are recognized, a double error correction step is applied to the recognized phrases. Five-fold cross validation on the 2000 People's Daily corpus, which contains 7 028 prepositional phrases, yields 94.11% precision, 94.02% recall, and 94.06% F-measure, outperforming the baseline by 1.09%, 1.07%, and 1.08%, respectively.
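    A minimal sketch of the CRF labeling step with BIO tags over a toy sentence; it assumes the third-party sklearn-crfsuite package and uses only word/POS features, not the paper's feature set or its inner/outer two-pass scheme.

    ```python
    # Label prepositional phrases with BIO tags using a linear-chain CRF.
    import sklearn_crfsuite

    def featurize(sent):
        # sent: list of (word, pos) pairs -> one feature dict per token
        return [{"word": w, "pos": p,
                 "prev_pos": sent[i - 1][1] if i > 0 else "BOS",
                 "next_pos": sent[i + 1][1] if i < len(sent) - 1 else "EOS"}
                for i, (w, p) in enumerate(sent)]

    # toy training data: "在/p 北京/ns 工作/v", with "在 北京" as the PP
    train_sents  = [[("在", "p"), ("北京", "ns"), ("工作", "v")]]
    train_labels = [["B-PP", "I-PP", "O"]]

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit([featurize(s) for s in train_sents], train_labels)
    print(crf.predict([featurize(train_sents[0])]))   # e.g. [['B-PP', 'I-PP', 'O']]
    ```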
  • Language Analysis and Calculation
    LI Tianshi, LI Qi, WANG Wenhui, CHANG Baobao
    2017, 31(6): 33-40.
    Paraphrase identification is an important sentence-level semantic understanding task. In this paper, we present a lightweight memory-based recurrent neural network with semantic role features for this task. The proposed single-layer recurrent network alleviates the vanishing and exploding gradient problems, which are aggravated in multilayer neural networks. We employ semantic role features to describe the semantic relationships between two sentences. On the test set of the Microsoft Research Paraphrase Corpus, we achieve an F1 score of 84.3%, which is competitive with multilayer neural network models.
  • Language Analysis and Calculation
    WU Jialin, TANG Jintao, LI Shasha, WANG Ting
    2017, 31(6): 41-49.
    This paper proposes a neural network based method for Chinese word segmentation to enhance its adaptability and flexibility when transferred to a new domain. Our method is based on the idea of revising the results of an existing segmenter. This two-phase correction model depends on neither the source-domain data nor the way the segmenter was built. However, existing correction-based methods rely on feature engineering, which is hard to adapt automatically to different domains. We propose a neural network based corrector for domain adaptation that does not require any hand-crafted features. Experimental results show that the proposed method achieves better performance and higher robustness on domain text segmentation than the state-of-the-art approach, especially on the recall of OOV (out-of-vocabulary) words.
  • Language Analysis and Calculation
    ZHANG Liwen, WANG Ruibo, LI Ru, ZHANG Sheng
    2017, 31(6): 50-57.
    Frame disambiguation is to assign a suitable frame to a target word from an existing frame library, which addresses the phenomenon of verb polysemy. Based on distributed representations of words and sentences, a frame disambiguation model using distance and a word similarity matrix is proposed. Compared with traditional methods, the model effectively avoids manual feature selection. The accuracy of frame disambiguation reaches 65.71%, and significance and consistency tests against the current best model are also reported.
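    A minimal sketch of the distance-based selection idea, assuming each candidate frame and the target word's sentence are already represented as vectors; the vectors and frame names below are placeholders.

    ```python
    # Pick the frame whose representation is closest (by cosine similarity)
    # to the sentence representation of the target word.
    import numpy as np

    rng = np.random.default_rng(1)
    sentence_vec = rng.random(100)                    # distributed sentence representation
    frame_vecs = {f"frame_{i}": rng.random(100) for i in range(3)}

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    best = max(frame_vecs, key=lambda name: cosine(sentence_vec, frame_vecs[name]))
    print("predicted frame:", best)
    ```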
  • Language Analysis and Calculation
    HUO Huan, ZHANG Wei, LIU Liang, LI Yang
    2017, 31(6): 58-66.
    While most neural network models still focus on sequential text, two recently proposed tree-based models, TreeLSTMs and TBCNNs, perform well on multiple natural language processing tasks by exploiting structural information. Since the former suffers from inefficient training due to structure-dependent computations, this paper proposes a hybrid neural network model, Quasi-TreeLSTMs, which builds on the tree convolution and pooling of TBCNNs to mimic the operations of TreeLSTMs. The model has two variants, Dependency Quasi-TreeLSTMs and Constituency Quasi-TreeLSTMs, corresponding to the two kinds of syntax trees. Experimental results show that Quasi-TreeLSTMs perform well on both sentiment classification and semantic similarity tasks.
  • Language Analysis and Calculation
    DU Shujing, XU Fan, WANG Mingwen
    2017, 31(6): 67-74.
    Discourse coherence modeling is a fundamental problem in natural language processing. Existing coherence models fall into two categories: entity-based models and deep learning based models. Entity-based coherence models rely on exact features, while deep learning based models do not take full account of the explicit entity links among sentences across a discourse. This paper extracts entity information from adjacent sentences within a discourse, represents it with distributed embeddings, and integrates it into a sentence-level bidirectional LSTM model. Experiments on Chinese and English sentence ordering tasks and on a Chinese-English statistical machine translation coherence detection task show that our model outperforms the state-of-the-art ones, especially on Chinese.
  • Language Resources Construction
    ZHOU Qiang
    2017, 31(6): 75-82.
    Dialog act (DA) analysis is a key entry point for building dialog understanding models. Based on previous dialog act annotation schemes, this paper designs a new DA annotation scheme for Chinese daily conversation. Core function subclasses for subjective and objective statements and for positive and negative responses are introduced to enhance the descriptive power of the DA set. Two descriptive mechanisms, coherent rhetorical pairs and DA dependency pairs, are combined to form a more complete description of session structure. A topic thread analysis mechanism is introduced to effectively organize topic-changing trends in conversation. DA tagging experiments on 500 daily conversation fragments show about 90% macro tagging consistency between two independent annotators. The results indicate that the design of the current DA tag set is operable and can meet the needs of describing the functional behavior patterns of Chinese daily conversation.
  • Language Resources Construction
    TAN Xiaoping
    2017, 31(6): 83-92.
    This paper builds a comparable corpus of 110 000 characters containing three types of data: natural texts, TCSL (Teaching Chinese as a Second Language) textbook texts, and interlanguage texts. It also presents a syntax-semantic annotation scheme for the Ba-sentence and annotates 1 556 Ba-sentences. Statistical analyses are conducted from both semantic and syntactic perspectives. The data show that Ba-sentences expressing specific spatial transfer, describing the manner and frequency of an action, and expressing information transfer and causative meaning occur more often than necessary in the teaching texts, whereas Ba-sentences expressing abstract spatial transfer and judgment occur less often than necessary. Learners are familiar with Ba-sentences expressing results, but need to pay more attention to S+ba+N+V+directional complement, S+ba+N1+V+dao+N2, and another 15 types of Ba-sentences.
  • Language Resources Construction
    LI Bin, WEN Yuan, SONG Li, BU Lijun, QU Weiguang, XUE Nianwen
    2017, 31(6): 93-102.
    As a new sentence-level meaning representation, abstract meaning representation (AMR) uses a rooted, directed acyclic graph to represent the meaning of a sentence. A large AMR bank has been constructed for English, but the concepts in an AMR graph are not aligned to the words in the sentence, which increases the difficulty of both manual annotation and automatic parsing. This paper describes the construction of a Chinese AMR corpus, based on guidelines adapted from English to Chinese-specific properties. We also design an efficient annotation framework that incorporates concept-to-word alignment, taking advantage of the morphologically poor nature of Chinese. We have annotated the AMRs of 6 923 sentences selected from the Chinese TreeBank, among which 48% of the sentences are graphs, 1% contain cycles, and 32% have non-projective subtrees. We plan to publicly release this data for linguistic and NLP research.
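    A minimal sketch of the graph structure AMR uses, relying on the third-party penman package and a standard English example ("The boy wants to go"); the Chinese corpus described above additionally aligns each concept to a word.

    ```python
    # Decode a small AMR in PENMAN notation and list its triples; the reused
    # variable "b" is a reentrancy, which makes this a graph rather than a tree.
    import penman

    g = penman.decode("(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-01 :ARG0 b))")
    for source, role, target in g.triples:
        print(source, role, target)
    ```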
  • Machine Translation
    LI Yachao, XIONG Deyi, ZHANG Min, JIANG Jing, MA Ning, YIN Jianmin
    2017, 31(6): 103-109.
    Neural machine translation (NMT), a new machine translation approach based on sequence-to-sequence learning with neural networks, has gradually surpassed statistical machine translation (SMT) on several language pairs. This paper conducts experiments with attention-based NMT on the Tibetan-Chinese translation task and adopts transfer learning to overcome the data sparsity problem. Experimental results show that the proposed transfer learning method is simple and effective, yielding an improvement of 3 BLEU points over a phrase-based SMT system. An analysis of the translations is also conducted to discuss the merits and shortcomings of NMT.
  • Other Language in/around China
    Maihemuti Maimaiti, Kahaerjiang Abiderexiti, Aishan Wumaier, Tuergen Yibulayin, WANG Lulu
    2017, 31(6): 110-118.
    According to the characteristics of Uyghur location names, a conditional random fields model combined with rules is proposed for Uyghur location name recognition. In addition to word and part-of-speech features, morphological and contextual features are employed, including syllables, similar words obtained from word embeddings, a common gazetteer, location trigger words, and common affixes of location names. The experimental results show that these features have a great impact on recognition performance. Errors are analyzed, and a rule-based post-processing method is applied to further improve the recognition performance. The precision, recall and F-score of the final system reach 94.68%, 89.52% and 92.03%, respectively.
  • Other Language in/around China
    JIN Guozhe, CUI Rongyi
    2017, 31(6): 119-124.
    Automatic Korean word spacing, a task parallel to Chinese word segmentation, is essential to Korean natural language processing. First, to overcome the dependence of traditional methods on manually extracted features, we propose KWSE, a word-spacing-enhanced Korean character embedding model, which yields character embeddings containing both semantic and spacing-polarity information. Second, we combine these spacing-enhanced character embeddings with an LSTM-CRF model to perform the Korean word spacing task. The experimental results show that our method achieves an F1-score of 92.86%, which is better than other methods.
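    A minimal PyTorch sketch of the character-level tagging architecture (a BiLSTM emitting per-character spacing tags); the CRF layer and the KWSE embeddings from the abstract are omitted, and all sizes are placeholders.

    ```python
    # Character-level BiLSTM tagger: for each character, predict whether a
    # space should follow it (2 tags).  A CRF would replace the final softmax.
    import torch
    import torch.nn as nn

    class CharSpacingTagger(nn.Module):
        def __init__(self, vocab_size, emb_dim=64, hidden=128, n_tags=2):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_tags)

        def forward(self, char_ids):              # (batch, seq_len)
            h, _ = self.lstm(self.emb(char_ids))  # (batch, seq_len, 2*hidden)
            return self.out(h)                    # (batch, seq_len, n_tags)

    model = CharSpacingTagger(vocab_size=2000)
    scores = model(torch.randint(0, 2000, (1, 20)))
    print(scores.shape)                           # torch.Size([1, 20, 2])
    ```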
  • Other Language in/around China
    TANG Moming, ZHU Mingwei, YU Zhengtao, TANG Peili, GAO Shengxiang
    2017, 31(6): 125-131,139.
    This paper studies the correlation analysis of Chinese-Vietnamese bilingual news events, which is essentially a multilingual, multi-document understanding problem, and proposes a local intimacy propagation algorithm based on factor graphs. Specifically, we use a bilingual topic model to obtain bilingual topics and topic probability distributions from the bilingual documents. We then build a factor graph of events based on event text similarity and calculate the influence between interrelated events under the same topic on the factor graph with the local intimacy propagation algorithm. Finally, we obtain the influence topology of events under different topics. Experimental results show that the proposed method achieves better results than the traditional method.
  • Information Retrieval and Question Answering
    SUN Xin, WANG Houfeng
    2017, 31(6): 132-139.
    Intent determination (ID) and slot filling (SF) are two major tasks in spoken language understanding (SLU). The former is a classification problem that judges the intention of an utterance; the latter can be treated as a sequence labeling problem that assigns key information to specific labels. This paper proposes a joint LSTM (long short-term memory) model combined with attention and a CRF (conditional random field). For ID, the attention-weighted sum of the output layer's vectors is used as the utterance representation for classification. For SF, the model considers transitions between labels and computes probabilities at the sequence level. The model is verified on both a Chinese corpus and the English ATIS corpus.
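    A minimal PyTorch sketch of a joint architecture of this kind: a shared BiLSTM encoder, an attention-weighted sum of its outputs for intent classification, and per-token outputs for slot labels (the CRF layer over slot labels is omitted); sizes are illustrative.

    ```python
    # Joint intent detection and slot filling over one shared BiLSTM encoder.
    import torch
    import torch.nn as nn

    class JointIDSF(nn.Module):
        def __init__(self, vocab, n_intents, n_slots, emb=100, hid=128):
            super().__init__()
            self.emb = nn.Embedding(vocab, emb)
            self.lstm = nn.LSTM(emb, hid, batch_first=True, bidirectional=True)
            self.attn = nn.Linear(2 * hid, 1)
            self.intent = nn.Linear(2 * hid, n_intents)
            self.slots = nn.Linear(2 * hid, n_slots)

        def forward(self, x):                        # x: (batch, seq_len)
            h, _ = self.lstm(self.emb(x))            # (batch, seq_len, 2*hid)
            a = torch.softmax(self.attn(h), dim=1)   # attention weights over tokens
            utterance = (a * h).sum(dim=1)           # weighted sum as utterance vector
            return self.intent(utterance), self.slots(h)

    model = JointIDSF(vocab=5000, n_intents=10, n_slots=30)
    intent_logits, slot_logits = model(torch.randint(0, 5000, (2, 12)))
    print(intent_logits.shape, slot_logits.shape)    # (2, 10) and (2, 12, 30)
    ```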
  • Information Retrieval and Question Answering
    LI Weikang, LI Wei, WU Yunfang
    2017, 31(6): 140-146.
    This paper investigates the combination of Chinese character and word embeddings in deep learning. We conduct experiments on both shallow and deep combinations of word and character representations. To demonstrate the effectiveness of the combination, we present a compare-aggregate model for question answering. Extensive experiments on the open DBQA dataset demonstrate that an effective combination of characters and words significantly improves the system, achieving results comparable with state-of-the-art systems.
  • Information Extraction and Text Mining
    HE Xinyu, LI Lishuang
    2017, 31(6): 147-154.
    Trigger detection is of great significance in biomedical event extraction. Most existing trigger detection methods are one-stage methods based on shallow machine learning, which rely heavily on rich domain knowledge and extensive manual features. In this paper, we propose a two-stage trigger detection method based on a bidirectional long short-term memory network (BLSTM), which divides trigger detection into a recognition stage and a classification stage. This approach effectively relieves the class imbalance problem and avoids the cost of manual feature extraction. In addition, to obtain more semantic information, we use a large-scale corpus downloaded from the PubMed database to train dependency-based word embeddings, which effectively improves the recognition performance of trigger detection. On the multi-level event extraction (MLEE) corpus, our method achieves an F-score of 78.46%, outperforming state-of-the-art systems.
  • Information Extraction and Text Mining
    YING Wenhao, LI Sujian, SUI Zhifang
    2017, 31(6): 155-161.
    The key to extractive summarization lies in determining the importance of sentences. This paper proposes a method to model the relations between a sentence and its topics and to evaluate the topical importance of the sentence. To deal with the lack of gold references, the paper proposes a semi-supervised training framework based on learning-to-rank, which is able to exploit unlabeled news documents. Experiments on the DUC 2004 multi-document summarization data verify that the proposed topical importance feature is an effective supplement to heuristic features and can improve the quality of summaries.
  • Information Extraction and Text Mining
    LIU Zeyu, MA Longlong, WU Jian, SUN Le
    2017, 31(6): 162-171.
    Image captioning is a cross-domain task connecting computer vision, natural language processing and machine learning. As a key technology of multimodal processing, it has made remarkable progress in recent years. Research on image caption generation has typically focused on generating English captions, while generating Chinese captions remains under-explored. In this paper, we propose a method for generating Chinese image captions based on a multimodal neural network, following the encoder-decoder framework. The encoder, based on a convolutional neural network, consists of a single-label visual feature extraction network and a multi-label keyword prediction network. The decoder, based on a long short-term memory network, is a multimodal caption generation network. For decoding, we propose four multimodal caption generation methods: CNIC-X, CNIC-H, CNIC-C and CNIC-HC. Experimental results on the Chinese multimodal dataset Flickr8k-CN show that the proposed method outperforms state-of-the-art Chinese image captioning methods.
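    A minimal PyTorch sketch of the decoder side of such an encoder-decoder captioner: an image feature vector (a stand-in for the CNN encoder output) initializes an LSTM that scores caption tokens; the keyword-prediction branch and the CNIC-X/H/C/HC fusion variants are not reproduced here.

    ```python
    # LSTM caption decoder conditioned on a (placeholder) image feature vector.
    import torch
    import torch.nn as nn

    class CaptionDecoder(nn.Module):
        def __init__(self, vocab, emb=256, hid=512, feat=2048):
            super().__init__()
            self.init_h = nn.Linear(feat, hid)       # image feature -> initial hidden state
            self.init_c = nn.Linear(feat, hid)       # image feature -> initial cell state
            self.emb = nn.Embedding(vocab, emb)
            self.lstm = nn.LSTM(emb, hid, batch_first=True)
            self.out = nn.Linear(hid, vocab)

        def forward(self, img_feat, tokens):
            h0 = self.init_h(img_feat).unsqueeze(0)  # (1, batch, hid)
            c0 = self.init_c(img_feat).unsqueeze(0)
            h, _ = self.lstm(self.emb(tokens), (h0, c0))
            return self.out(h)                        # (batch, seq_len, vocab)

    decoder = CaptionDecoder(vocab=8000)
    logits = decoder(torch.randn(2, 2048), torch.randint(0, 8000, (2, 15)))
    print(logits.shape)                               # torch.Size([2, 15, 8000])
    ```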
  • Information Extraction and Text Mining
    LI Guochen, ZHANG Yaxing, LI Ru
    2017, 31(6): 172-179,189.
    Discourse relation recognition is a challenging sub-task in discourse analysis. Traditional discourse relation analysis uses local features of discourses to analyze discourse relations. Since local features cannot directly capture the semantic relations between discourse units, we recognize discourse relations based on Chinese FrameNet and analyze them via frame semantics, identifying discourse relations by analyzing the discourse units with their target words in Chinese FrameNet. Experiments show that the core targets can well express the core semantics of discourse units and improve the performance of discourse relation recognition.
  • Information Extraction and Text Mining
    XU Yuhong, HUANG Peijie
    2017, 31(6): 180-189.
    This paper addresses the challenge that instant text information poses to existing text classification methods. Considering the small scale of labeled data, it proposes a semi-supervised classification method based on ensemble learning with optimized sampling. First, through a new optimized sampling strategy, several sub-classifier training sets are obtained, with the purpose of increasing the diversity between training sets and reducing the diffusion of noise. Then, a voting mechanism based on confidence multiplication is used to integrate the prediction results and label the unlabeled data. Finally, an appropriate amount of data is selected to update the training model. The experimental results show that our approach achieves better classification performance on both long and short texts.
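    A minimal sketch of the confidence-multiplication vote described above, with made-up class probabilities from three hypothetical sub-classifiers; how the real system scores and selects data for self-training is not specified here.

    ```python
    # Combine sub-classifier predictions by multiplying per-class confidences,
    # then keep the item for self-training only if the vote is confident enough.
    import numpy as np

    probs = np.array([[0.9, 0.1],     # classifier 1: P(class 0), P(class 1)
                      [0.7, 0.3],     # classifier 2
                      [0.8, 0.2]])    # classifier 3
    combined = probs.prod(axis=0)     # per-class product of confidences
    label = int(combined.argmax())
    score = combined[label] / combined.sum()
    if score > 0.9:                   # placeholder confidence threshold
        print("pseudo-label:", label, "score:", round(score, 3))
    ```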
  • Sentiment Analysis and Social Computing
    SUN Qingying, WANG Zhongqing, ZHU Qiaoming, ZHOU Guodong
    2017, 31(6): 190-195,204.
    Business attributes are attributes possessed by a business itself, such as the dining environment and parking spaces, and they have a significant influence on customers' decisions. For example, they can tell customers whether a restaurant provides parking spaces if they drive there for dinner. In this paper, we propose a novel business attribute extraction model based on integer linear programming, which extracts business attributes from customers' published comments. First we use a maximum entropy classifier to extract individual business attributes from the comments; then we employ an integer linear programming model to connect the different attributes for collaborative optimization. The experimental results show that the proposed method can effectively extract business attributes.
  • NLP Application
    CHEN Gong, LIANG Maocheng
    2017, 31(6): 196-204.
    This study takes Pattern Grammar as its theoretical basis, formalizing verb patterns with the Link Grammar formalism to reconstruct the Link Grammar verb dictionary, with the aim of building a better verb form error detection system for Chinese EFL learners' written English. The test results show that both the verb form error detection capability and the parsing performance of the extended Link Grammar dictionary are improved: the recall of verb form error detection with the reconstructed dictionary is 4.5% higher than with the original, the accuracy is 15.7% higher, and the parser with the reconstructed dictionary parses 12.2% more correct sentences than the original. The study shows that the linguistic theory (Pattern Grammar) and grammar formalism (Link Grammar) adopted here can be satisfactorily applied to building a verb form error detection system for Chinese EFL learners' written English.
  • NLP Application
    LI Xia, WEN Qifan
    2017, 31(6): 205-213.
    Existing off-topic essay detection methods mainly use a content vector to represent an essay, which sometimes results in low accuracy due to noise words. In this paper, we propose an unsupervised off-topic essay detection method based on topic words and local density thresholds. First, Latent Dirichlet Allocation is used to predict an essay's topic distribution, and topic words are extracted according to the weights of the topics. Second, we use distributed word vector representations to find similar words as an expansion of the title, and then compute the on-topic scores of all test essays with our new similarity calculation method. Finally, we propose a local density threshold extraction method to determine the off-topic threshold automatically and identify off-topic essays. Experimental results on eight sets totaling 9 381 essays show that our algorithm significantly improves the F-measure compared to the baseline method. After adding spelling-correction preprocessing, the average F-measure over all essay sets reaches 79.64%, and the best F-measure among the eight sets is 96.1%.
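    A minimal sketch of the first step, extracting topic words with LDA; the toy essays, vectorizer settings and topic count are placeholders, and the title-expansion and local-density thresholding steps are not shown.

    ```python
    # Fit LDA on a toy essay collection and print the top words per topic.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    essays = ["the city park was green and quiet",
              "exams and homework fill the school term",
              "trees and grass make the park a calm place"]
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(essays)

    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    words = vec.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = [words[i] for i in topic.argsort()[::-1][:5]]
        print(f"topic {k}: {top}")
    ```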
  • NLP Application
    CHEN Xin, WANG Suge, LI Deyu, TAN Hongye, CHEN Qian, WANG Yuanlong
    2017, 31(6): 214-222.
    Language style plays a significant role in the reading comprehension and appreciation questions of the college entrance examination, but the classification hierarchy varies with different exam points. We treat language style discrimination and appreciation as a hierarchical classification problem. Guided by class labels, initial clusters corresponding to the specific categories are obtained with a graph segmentation algorithm. Hierarchical clustering is then applied to generate a hierarchy of language styles, on which an SVM-based hierarchical classifier is trained. When answering an appreciation question, the classification hierarchy determined by the question stem is used to discriminate the language style of the reading material. Finally, the answers are generated by combining the language style with a knowledge base. The experimental results show the effectiveness of the proposed method, which can provide technical support for reading comprehension and appreciation.