Most Read
  • ZHANG Yu-jie, Kazuhide Yamamoto
    . 2003, 17(6): 32-39.
    One of the key issues in spoken language translation is how to deal with unrestricted expressions in spontaneous utterances. This research is centered on the development of a Chinese paraphraser that automatically paraphrases utterances prior to transfer in Chinese-Japanese spoken language translation. In this paper, a pattern-matching approach to paraphrasing is proposed for which only morphological analysis is required. In addition, a pattern construction method is described through which paraphrasing patterns can be efficiently learned from a paraphrase corpus and human experience. Using the implemented paraphraser and the obtained patterns, a paraphrasing experiment was conducted and the results were evaluated.
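As background for the pattern-matching approach described in this abstract, here is a minimal illustrative sketch of surface-pattern paraphrasing over a word-segmented utterance; the two patterns and the sample input are invented placeholders, not the patterns learned by the authors.

```python
# A minimal sketch of surface-pattern paraphrasing over a morphologically
# analyzed (word-segmented) utterance; the rules below are illustrative only.
import re

# Each rule maps a colloquial spoken pattern to a plainer equivalent.
PARAPHRASE_PATTERNS = [
    (re.compile(r"能 不 能"), "能否"),   # "can or not" -> "whether ... can"
    (re.compile(r"麻烦 您"), "请"),      # "trouble you to" -> "please"
]

def paraphrase(segmented_utterance: str) -> str:
    """Apply every matching pattern, left to right."""
    result = segmented_utterance
    for pattern, replacement in PARAPHRASE_PATTERNS:
        result = pattern.sub(replacement, result)
    return result

if __name__ == "__main__":
    print(paraphrase("麻烦 您 告诉 我 能 不 能 换 房间"))
```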
  • Review
    . 1988, 2(3): 42-49.
    SYNAC (The Syntactic Analyzer for Chinese) is a Chinese syntactic parser that incorporates a degree of semantic analysis and uses Definite Clause Grammar as its description formalism; it can analyze fairly complex Chinese sentences. It adopts slot-filling techniques and a backward-reasoning strategy. This paper focuses on the basic ideas and main methods behind the design of SYNAC and concludes with a brief evaluation.
  • Survey
    FENG Yang, SHAO Chenze
    . 2020, 34(7): 1-18.
    Machine translation is the task of using a computer to translate a source language into a target language of equivalent meaning, and it has become an important research direction in natural language processing. Neural machine translation models, now the mainstream in the research community, can perform end-to-end translation from the source language to the target language. In this paper, we select several main research directions of neural machine translation, including model training, simultaneous translation, multi-modal translation, non-autoregressive translation, document-level translation, domain adaptation, and multilingual translation, and briefly introduce the research progress in these directions.
  • Survey
    BYAMBASUREN Odmaa, YANG Yunfei, SUI Zhifang, DAI Damai, CHANG Baobao, LI Sujian, ZAN Hongying
    . 2019, 33(10): 1-7.
    The medical knowledge graph is the cornerstone of intelligent medical applications. Existing medical knowledge graphs fall short of the needs of intelligent medical applications in terms of scale, specification, taxonomy, formalization, and the precision of knowledge description. We apply natural language processing and text mining techniques in a semi-automated approach to develop the Chinese Medical Knowledge Graph (CMeKG 1.0). The construction of CMeKG 1.0 refers to international medical coding systems such as ICD-10, ATC, and MeSH, as well as large-scale, multi-source heterogeneous clinical guidelines, medical standards, diagnostic protocols, and medical encyclopedia resources. CMeKG covers types such as diseases, drugs, and diagnosis/treatment technologies, with more than 1 million medical concept relationships. This paper presents the description system, key technologies, construction process, and medical knowledge description of CMeKG 1.0, serving as a reference for the construction and application of knowledge graphs in the medical field.
  • Survey
    WEI Zhongyu, FAN Zhihao, WANG Ruize, CHENG Yijing, ZHAO Wangrong, HUANG Xuanjing
    . 2020, 34(7): 19-29.
    In recent years, research on cross-modality, especially vision and language, has attracted increasing attention. This survey focuses on the task of image captioning and summarizes the literature from four aspects: the overall architecture, key questions for cross-modality research, the evaluation of image captioning, and state-of-the-art approaches to image captioning. In conclusion, we suggest three directions for future research, i.e., cross-modality representation, automatic evaluation metrics, and diverse text generation.
  • Sentiment Analysis and Social Computing
    WU Xiaohua, CHEN Li, WEI Tiantian, FAN Tingting
    . 2019, 33(6): 100-107.
    Sentiment analysis of short texts judges the emotional orientation of texts and has important applications in commodity reviews and public opinion monitoring. The performance of bidirectional recurrent neural network models with word-level attention relies heavily on the accuracy of word segmentation; moreover, the attention mechanism introduces many parameter dependencies, making the model pay less attention to the internal sequence relationships of short texts. To address these problems, this paper proposes a Chinese short text sentiment analysis algorithm that combines character-level vector representations with self-attention and BiLSTM. First, the short text is vectorized at the character level; then a BiLSTM network extracts contextual features of the text; finally, the feature weights are dynamically adjusted by the self-attention mechanism and a Softmax classifier outputs the emotion category. Experimental results on the COAE 2014 Weibo dataset and a hotel review dataset show that character vectors are more suitable for short texts than word-level vector representations, and that the self-attention mechanism reduces external parameter dependence, allowing the model to learn more key features of the text itself. Classification performance is increased by 1.15% and 1.41%, respectively.
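The architecture sketched in this abstract (character embedding, BiLSTM, self-attention weighting, Softmax classifier) can be illustrated with a minimal PyTorch module; all layer sizes and the exact attention form below are assumptions for illustration, not the paper's configuration.

```python
# A minimal sketch of a character-level BiLSTM + self-attention classifier.
import torch
import torch.nn as nn

class CharBiLSTMSelfAttention(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)      # scores each time step
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, char_ids):                      # (batch, seq_len)
        h, _ = self.bilstm(self.embed(char_ids))      # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over time steps
        context = (weights * h).sum(dim=1)            # weighted sum of states
        return self.fc(context)                       # class logits

model = CharBiLSTMSelfAttention(vocab_size=5000)
logits = model(torch.randint(1, 5000, (4, 30)))       # 4 toy sentences, 30 chars
```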
  • Survey
    WU Youzheng, LI Haoran, YAO Ting, HE Xiaodong
    . 2022, 36(5): 1-20.
    Over the past decade, there has been steady momentum of innovation and breakthroughs that convincingly push the limits of modeling single modalities, e.g., vision, speech, and language. Beyond this progress in single modalities, the rise of multimodal social networks, short video applications, video conferencing, live video streaming, and digital humans strongly demands the development of multimodal intelligence and offers fertile ground for multimodal analysis. This paper reviews recent multimodal applications that have attracted intensive attention in the field of natural language processing, and summarizes mainstream multimodal fusion approaches from the perspectives of single-modal representation, multimodal fusion stage, fusion network, fusion of unaligned modalities, and fusion of missing modalities. In addition, this paper elaborates on the latest progress in vision-language pre-training.
  • Machine Reading Comprehension
    LIU Kai, LIU Lu, LIU Jing, LV Yajuan, SHE Qiaoqiao, ZHANG Qian, SHI Yingchao
    . 2018, 32(10): 118-129.
    Machine Reading Comprehension (MRC) is a challenging task in the fields of Natural Language Processing (NLP) and Artificial Intelligence (AI). The 2018 NLP Challenge on Machine Reading Comprehension (MRC2018) aims to advance MRC technologies and applications. The challenge releases the largest-scale, open-domain, application-oriented Chinese MRC dataset, provides an open-sourced baseline system, and adopts improved evaluation metrics. Over one thousand teams registered for the challenge, and the overall performance of the participating systems has been greatly improved. This paper presents an overall introduction to MRC2018 and gives a detailed description of the evaluation task settings, evaluation organization, evaluation results, and corresponding result analysis.
  • Information Retrieval and Question Answering
    LI Weikang, LI Wei, WU Yunfang
    . 2017, 31(6): 140-146.
    This paper investigates the combination of Chinese character and word embeddings in deep learning. We experiment with both shallow and deep combinations of word- and character-level representations. To demonstrate the effectiveness of the combination, we present a compare-aggregate model for the question answering task. Extensive experiments conducted on the open DBQA data demonstrate that an effective combination of characters and words significantly improves the system, achieving results comparable to state-of-the-art systems.
  • Survey
    CUI Lei, XU Yiheng, LYU Tengchao, WEI Furu
    . 2022, 36(6): 1-19.
    Document AI, or Document Intelligence, is a relatively new research topic that refers to techniques for automatically reading, understanding, and analyzing business documents. It is an important interdisciplinary study involving natural language processing and computer vision. In recent years, the popularity of deep learning technology has greatly advanced the development of Document AI tasks, such as document layout analysis, document information extraction, document visual question answering, and document image classification. This paper briefly introduces early-stage heuristic rule-based document analysis and statistical machine learning based algorithms, as well as deep learning-based approaches, especially pre-training approaches. Finally, we also look into future directions of Document AI.
  • Article
    DU Hui; XU Xueke; WU Dayong; LIU Yue; YU Zhihua; CHENG Xueqi
    . 2017, 31(3): 170-176.
    We present a method for sentiment classification based on sentiment-specific word embedding (SSWE). A word embedding is a distributed, fixed-length vector representation of a word in a real-valued space. Algorithms for learning word embeddings, such as word2vec, obtain this representation from a large unannotated corpus without considering sentiment information. We refine the initial word embeddings with sentiment information to obtain sentiment-specific word embeddings that contain both syntactic and sentiment information. Text representations are then built from the sentiment-specific word embeddings, and sentiment polarities of texts are obtained through machine learning approaches. Experiments show that the presented algorithm performs better than sentiment classification methods based on modeling texts with words, N-grams, or word2vec word embeddings.
  • Review
    LIU Dexi, NIE Jianyun, ZHANG Jing, LIU Xiaohua, WAN Changxuan, LIAO Guoqiong
    . 2016, 30(4): 193-205.
    Sentiment analysis heavily relies on resources such as sentiment lexicons. However, it is difficult to manually build such resources with satisfactory coverage. A promising avenue is to automatically extract sentiment lexicons from microblog data. In this paper, we target the problem of identifying new sentiment words in a Chinese microblog collection provided at COAE 2014. We observe that traditional measures based on co-occurrences, such as pointwise mutual information, are not effective in determining new sentiment words. Therefore, we propose a group of context-based features, N-gram features, for classification, which can capture the lexical surroundings and lexical patterns of sentiment words. A classifier trained on known sentiment words is then employed to classify the candidate words. We show that this method works better than the traditional approaches. In addition, we observe that, unlike in English, many sentiment words in Chinese are nouns, which cannot be discriminated by co-occurrence-based measures but can be better determined by our classification method.
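A minimal sketch of the classification setup this abstract describes, assuming character N-gram context features and a generic classifier; the toy contexts, labels, and scikit-learn pipeline are stand-ins, not the COAE 2014 data or the authors' exact feature set.

```python
# Candidate words are represented by character n-grams of their surrounding
# context and classified with a model trained on known sentiment words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_contexts = ["这 部 电影 真心 好看", "客服 态度 坑爹 极 了",
                  "今天 天气 晴朗", "我 在 北京 上班"]
train_labels = [1, 1, 0, 0]          # 1 = context of a known sentiment word

clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 3)),  # N-gram features
    LogisticRegression(max_iter=1000),
)
clf.fit(train_contexts, train_labels)
print(clf.predict(["这 家 店 也 太 给力 了"]))   # classify a new candidate's context
```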
  • Knowledge Representation and Acquisition
    HONG Wenxing, HU Zhiqiang, WENG Yang, ZHANG Heng, WANG Zhu, GUO Zhixin
    . 2020, 34(1): 34-44.
    Legal-knowledge-centered cognitive intelligence is an important topic for judicial artificial intelligence. This paper proposes an automated knowledge graph construction approach for judicial case facts. Based on pre-trained models, models for entity recognition and relation extraction are presented. For the entity recognition task, two pre-training based entity recognition models are compared. For the relation extraction task, a multi-task joint semantic relation extraction model incorporating translating embeddings is proposed, so that a knowledge representation of the case facts is learned while the relation extraction task is completed. On "motor vehicle traffic accident liability dispute" cases, the entity recognition F1 score improves by 0.36 and the relation extraction F1 score by 2.37 over the baseline models. Based on the proposed method, a knowledge graph of case facts is constructed from several hundred thousand judicial documents, enabling semantic computing for judicial artificial intelligence applications such as case retrieval.
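The abstract mentions incorporating translating embeddings into relation extraction; the canonical translating-embedding score is TransE, sketched below as background. The dimensions and L2 distance are standard TransE choices and are not claimed to match the paper's joint model.

```python
# A minimal TransE-style scoring module: head + relation should be close to tail.
import torch
import torch.nn as nn

class TransEScore(nn.Module):
    def __init__(self, num_entities, num_relations, dim=100):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)

    def forward(self, head, relation, tail):
        # Lower distance = more plausible (head, relation, tail) triple.
        return torch.norm(self.ent(head) + self.rel(relation) - self.ent(tail),
                          p=2, dim=-1)

scorer = TransEScore(num_entities=1000, num_relations=20)
dist = scorer(torch.tensor([3]), torch.tensor([5]), torch.tensor([7]))
```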
  • Survey
    ZHU Zhangli, RAO Yuan, WU Yuan, QI Jiangnan, ZHANG Yu
    . 2019, 33(6): 1-11.
    The attention mechanism has gradually become one of the popular methods and research topics in deep learning. By improving the representation of the source language and dynamically selecting relevant source information during decoding, it greatly alleviates the limitations of the classic Encoder-Decoder framework. Starting from issues of the conventional Encoder-Decoder framework such as limited long-term memory, interrelationships in sequence transformation, and the output quality of dynamic model structures, this paper surveys the attention mechanism from several aspects, including its definition, principle, classification, state-of-the-art research, and applications in image recognition, speech recognition, and natural language processing. It further discusses multi-modal attention, the evaluation of attention, model interpretability, and the integration of attention with new models, providing new research issues and directions for the development of the attention mechanism in deep learning.
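For reference, the scaled dot-product form of attention surveyed above can be written in a few lines; this is the standard formulation, not a construction specific to this survey.

```python
# Standard scaled dot-product attention: weights sum to 1 over the keys.
import math
import torch

def scaled_dot_product_attention(query, key, value):
    # query/key/value: (batch, seq_len, d_k)
    scores = query @ key.transpose(-2, -1) / math.sqrt(query.size(-1))
    weights = torch.softmax(scores, dim=-1)
    return weights @ value, weights

q = k = v = torch.randn(2, 5, 64)
context, attn_weights = scaled_dot_product_attention(q, k, v)
```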
  • Sentiment Analysis and Social Computation
    LIANG Jun, CHAI Yumei, YUAN Huibin, ZAN Hongying, LIU Ming
    . 2014, 28(5): 155-161.
    Chinese micro-blog sentiment analysis aims to discover user attitudes towards hot events. Most current studies analyze micro-blog sentiment with traditional algorithms such as SVM and CRF based on hand-engineered features. This paper explores the feasibility of performing Chinese micro-blog sentiment analysis with deep learning. We try to avoid task-specific features and use recursive neural networks to discover features relevant to the task. We propose a novel sentiment polarity transition model, based on the relationship between neighboring words of a sentence, to strengthen the text association. The proposed method achieves performance close to state-of-the-art methods based on hand-engineered features, while saving a lot of manual annotation work.
  • Question-answering and Dialogue
    WANG Mengyu, YU Dingyao, YAN Rui, HU Wenpeng, ZHAO Dongyan
    . 2020, 34(8): 78-85.
    Multi-turn dialogue tasks require the system to take context information into account while generating fluent answers. Recently, a large number of multi-turn dialogue models based on the HRED (Hierarchical Recurrent Encoder-Decoder) model have been developed, reporting good results on English dialogue datasets such as Movie-DiC. On a high-quality real-world customer service dialogue corpus released to contestants by Jingdong in 2018, this article investigates the performance of the HRED model and explores possible improvements. It is revealed that combining the attention and ResNet mechanisms with the HRED model achieves significant improvements.
  • Survey
    LIN Wangqun, WANG Miao, WANG Wei, WANG Chongnan, JIN Songchang
    . 2020, 34(12): 9-16.
    A knowledge graph describes concepts, entities, and their relationships in the form of a semantic network. In this paper, we formally describe the basic concepts and the hierarchical architecture of knowledge graphs. We then review state-of-the-art technologies for information extraction, knowledge fusion, schema construction, and knowledge management. Finally, we probe into the application of knowledge graphs in the military field, revealing challenges and trends of future development.
  • Article
    ZHANG Hainan, WU Dayong, LIU Yue, CHENG Xueqi
    . 2017, 31(4): 28-35.
    Chinese NER is challenged by implicit word boundaries, the lack of capitalization, and the polysemy of a single character across different words. This paper proposes a novel character-word joint encoding method in a deep learning framework for Chinese NER. It reduces the effect of improper word segmentation and sparse word dictionaries in word-only embedding, while alleviating the loss of context in character-only embedding. Experiments on the 1998 Chinese People's Daily corpus demonstrate good results: at least 1.6%, 8%, and 3% improvements in location, person, and organization recognition, respectively, compared with character or word features alone; and F1 scores of 96.8%, 94.6%, and 88.6% on location, person, and organization recognition, respectively, when integrated with part-of-speech features.
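A minimal sketch of character-word joint encoding for sequence labeling, in the spirit of this abstract: each character position concatenates its character embedding with the embedding of the word containing it. Sizes and the plain softmax tagger are illustrative assumptions, not the paper's exact architecture.

```python
# Character-word joint encoding for per-character tag prediction.
import torch
import torch.nn as nn

class CharWordTagger(nn.Module):
    def __init__(self, char_vocab, word_vocab, char_dim=64, word_dim=128,
                 hidden=128, num_tags=7):
        super().__init__()
        self.char_embed = nn.Embedding(char_vocab, char_dim, padding_idx=0)
        self.word_embed = nn.Embedding(word_vocab, word_dim, padding_idx=0)
        self.bilstm = nn.LSTM(char_dim + word_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)

    def forward(self, char_ids, word_ids):            # both: (batch, seq_len)
        joint = torch.cat([self.char_embed(char_ids),
                           self.word_embed(word_ids)], dim=-1)
        h, _ = self.bilstm(joint)
        return self.out(h)                            # per-character tag logits

tagger = CharWordTagger(char_vocab=6000, word_vocab=30000)
logits = tagger(torch.randint(1, 6000, (2, 20)), torch.randint(1, 30000, (2, 20)))
```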
  • Language Analysis and Calculation
    YU Jingsong, WEI Yi, ZHANG Yongwei
    . 2019, 33(11): 57-63.
    Ancient Chinese differs from modern Chinese in vocabulary and grammar. Since there are no explicit sentence boundaries in ancient Chinese texts, today's readers find them hard to understand, and segmenting ancient texts is difficult and requires expertise in a variety of fields. We investigate automatic sentence segmentation and punctuation of ancient Chinese texts based on recent deep learning technologies. By pre-training a BERT (Bidirectional Encoder Representations from Transformers) model for ancient Chinese texts ourselves, we obtain the current state-of-the-art results on both tasks via fine-tuning. Compared with traditional statistical methods and the current BiLSTM+CRF solution, our approach significantly outperforms them, achieving F1 scores of 89.97% and 91.67% on a small-scale single-category corpus and a large-scale multi-category corpus, respectively. Our approach also shows good generalization ability, achieving an F1 score of 88.76% on an entirely new Taoist corpus. On the punctuation task, our method reaches an F1 score of 70.40%, exceeding the BiLSTM+CRF baseline by 12.15%.
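A minimal sketch of treating segmentation/punctuation as token classification with a BERT-style model, as this abstract describes; the `bert-base-chinese` checkpoint and the two-label scheme are placeholders, since the authors' self-pre-trained ancient-Chinese BERT is not assumed to be available here.

```python
# Fine-tuning setup for per-character boundary prediction (sketch only;
# the model below is not trained on ancient Chinese).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-SENT"]                    # B-SENT: a boundary follows this char
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(labels))

text = "學而時習之不亦說乎有朋自遠方來不亦樂乎"
enc = tokenizer(list(text), is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)   # per-token label ids (untrained head)
```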
  • Sentiment Analysis and Social Computing
    SONG Shuangyong, WANG Chao, CHEN Chenglong, ZHOU Wei, CHEN Haiqing
    . 2020, 34(2): 80-95.
    AliMe is a recently developed chatbot focused on the intelligent customer service domain. Emotion analysis technologies have been successfully applied in many modules of AliMe. This paper presents the technical details of these emotion analysis based modules, including user emotion detection, user emotion comfort, emotional generative chatting, customer service quality control, session satisfaction prediction, and an intelligent entrance for manual customer service. Furthermore, some user interface examples of these emotional modules are introduced to improve understanding of their effects.
  • Information Retrieval and Question Answering
    SONG Haoyu, ZHANG Weinan, LIU Ting
    . 2018, 32(7): 99-108,136.
    Open-domain dialogue systems are challenged by effective multi-turn dialogue. Current neural dialogue generation models tend to fall into conversational black holes by generating safe responses without considering future information. Inspired by the global view of reinforcement learning methods, we present an approach to learning a multi-turn dialogue policy with a DQN (deep Q-network). We introduce a deep neural network to evaluate each candidate sentence and choose the sentence with the maximum future reward, rather than the highest generation probability, as the response. The results show that our method improves the average number of dialogue turns by 2 in the automatic evaluation and outperforms the baseline model by 45% in the human evaluation.
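A minimal sketch of the selection step described in this abstract: a Q-network scores each candidate response given the dialogue state, and the highest-scoring candidate is returned. The encoders are replaced by random vectors and all sizes are assumptions, not the paper's model.

```python
# Score candidates with a Q-network and pick the one with the largest
# estimated future reward.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim=256, cand_dim=256, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(state_dim + cand_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))                      # scalar Q(state, candidate)

    def forward(self, state, candidates):
        # state: (state_dim,), candidates: (num_candidates, cand_dim)
        expanded = state.unsqueeze(0).expand(candidates.size(0), -1)
        return self.mlp(torch.cat([expanded, candidates], dim=-1)).squeeze(-1)

q_net = QNetwork()
dialogue_state = torch.randn(256)                      # encoded dialogue history
candidate_vecs = torch.randn(10, 256)                  # encoded candidate replies
best = q_net(dialogue_state, candidate_vecs).argmax().item()
```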
  • CHEN Kai-qu, ZHAO Jie, PENG Zhi-wei
    . 2004, 18(2): 59-66.
    Currently there are two effective approaches to approximate string matching: the bit-vector method and the filter method. Since the Chinese alphabet has many characters, the bit-vector method requires a large amount of memory, which is a problem for small-memory machines such as embedded systems. We present a new bit-vector method that needs only about 5% of the memory of the original bit-vector method. We also exploit the fact that the Chinese alphabet is very large to develop a new filter method, BPM-BM, for approximate string matching of Chinese text. It runs at least 14% faster than the fastest known algorithms; in most cases, our algorithm is 1.5 to 2 times faster.
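For background, the standard bit-parallel approximate matcher of Myers (1999) is sketched below; storing the per-character match masks in a dictionary keyed only by the pattern's characters keeps memory independent of alphabet size, which is the concern this entry raises for Chinese. This is not the paper's reduced-memory method or its BPM-BM filter.

```python
# Myers' bit-parallel approximate string search (standard formulation).
def approx_search(pattern: str, text: str, max_errors: int):
    """Yield (end_position, edit_distance) for matches with <= max_errors edits."""
    m = len(pattern)
    mask = (1 << m) - 1
    peq = {}                                   # masks only for pattern characters
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    pv, mv, score = mask, 0, m
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask
        mh = pv & xh
        if ph & (1 << (m - 1)):
            score += 1
        elif mh & (1 << (m - 1)):
            score -= 1
        ph = (ph << 1) & mask                  # top DP row is 0 for text search
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= max_errors:
            yield j, score

print(list(approx_search("中文信息", "中文信息学报的中文倍息处理", 1)))
```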
  • ZHENG Shi-fu, LIU Ting, QIN Bing, LI Sheng
    . 2002, 16(6): 47-53.
    Question answering is a hot research field in natural language processing that involves many kinds of NLP technology. This paper introduces the current research status of question answering and the methods that are commonly used. In general, a question answering system is made up of three parts: question analysis, information retrieval, and answer extraction. This paper describes the main functions of these three parts and the common approaches used in each of them in detail. Finally, it introduces the evaluation of question answering systems.
  • Review
    SUN Maosong; CHEN Xinxiong
    . 2016, 30(6): 1-6.
    This paper addresses the necessity and effectiveness of encoding a human-annotated knowledge base into a neural network language model, using HowNet as a case study. Traditional word embeddings are derived from neural network language models trained on large-scale unlabeled text corpora and suffer from two problems: the resulting vectors of low-frequency words are of unsatisfactory quality, and sense vectors for polysemous words are not available. We propose neural network language models that can systematically learn embeddings for all the semantic primitives defined in HowNet and, consequently, obtain word vectors, in particular for low-frequency words, as well as word sense vectors, in terms of the semantic primitive vectors. Preliminary experimental results show that our models improve performance on both word similarity and word sense disambiguation tasks. We suggest that neural network language models incorporating human-annotated knowledge bases will be a critical issue deserving attention in the coming years.
  • Survey
    HOU Shengluan, ZHANG Shuhan, FEI Chaoqun
    . 2019, 33(5): 1-16.
    Text summarization has become an essential way of acquiring knowledge from the mass of text documents on the Internet. Existing surveys of text summarization mostly focus on methods, without reviewing the experimental datasets. This survey concentrates on evaluation datasets and summarizes the public and private datasets together with the corresponding approaches. The public datasets are recorded with their data source, language, and means of access, and the private datasets are recorded with their scale, access, and annotation methods. In addition, the formal definition of text summarization adopted by each public dataset is provided. We analyze the experimental results of classical and recent text summarization methods on one specific dataset, and conclude with the present situation of existing datasets and methods and some open issues concerning them.
  • Review
    ZHAO Yan-yan, QIN Bing, CHE Wan-xiang, LIU Ting
    . 2008, 22(1): 3-8.
    Event extraction is an important research topic in the area of information extraction. This paper studies the two stages of Chinese event extraction, namely event type recognition and event argument recognition. For event type recognition, a novel method combining event trigger expansion and a binary classifier is presented; for argument recognition, multi-class classification based on maximum entropy is introduced. These methods effectively alleviate the data imbalance problem in model training and the data sparseness problem caused by the small training set, and the resulting event extraction system achieves improved performance.
  • Language Resources Construction
    YAO Yuanlin, WANG Shuwei, XU Ruifeng, LIU Bin, GUI Lin, LU Qin, WANG Xiaolong
    . 2014, 28(5): 83-91.
    Research on text emotion analysis has made substantial progress in recent years. However, emotion-annotated corpora remain underdeveloped, especially for micro-blog text. To support the analysis of emotion expression in Chinese micro-blog text and the evaluation of emotion classification algorithms, an emotion-annotated corpus of Chinese micro-blog text is designed and constructed. Based on observation and analysis of emotion expression in micro-blog text, a set of emotion annotation specifications is developed. Following these specifications, annotation is first performed at the micro-blog level, recording whether the micro-blog text expresses emotion and, if so, the corresponding emotion categories. Sentence-level annotation is then conducted, recording whether each sentence expresses emotion, its emotion categories, and the strength of each category. Currently, this emotion-annotated corpus consists of 14,000 micro-blogs, totaling 45,431 sentences. The corpus was used as the standard resource in the NLP&CC 2013 Chinese micro-blog emotion analysis evaluation, greatly facilitating research on emotion analysis.
  • Article
    WU Dongyin; GUI Lin; CHEN Zhao; XU Ruifeng
    . 2017, 31(1): 169-176.
    Sentiment analysis is an important topic in natural language processing research. Most existing sentiment analysis techniques have difficulty handling domain dependence and sample bias, which restrains the development and application of sentiment analysis. To address these issues, this paper presents a sentiment analysis approach based on deep representation learning and Gaussian process transfer learning. First, distributed representations of text samples are learned with a deep neural network. Next, based on deep Gaussian processes, the approach selects from an additional dataset high-quality samples whose distribution is similar to that of the test set in order to expand the training set. The sentiment classifier trained on the expanded dataset is expected to achieve higher performance. Experimental results on the COAE 2014 dataset show that the proposed approach improves sentiment classification performance while alleviating the influence of training sample bias and domain dependence.
  • Language Resources Construction
    LI Yanqun, HE Yunqi, QIAN Longhua, ZHOU Guodong
    . 2018, 32(8): 19-26.
    Nested named entities contain rich entities and semantic relations between them, which helps improve the effectiveness of information extraction. Due to the lack of uniform, standard Chinese nested named entity corpora, it is currently difficult to compare research on Chinese nested named entities. Based on existing named entity corpora, this paper proposes a semi-automatic method to construct two Chinese nested named entity corpora. First, we use the annotation information in the Chinese named entity corpora to automatically construct as many nested named entities as possible, and then manually adjust them to meet our annotation requirements for Chinese nested entities in order to build high-quality corpora. A preliminary experiment on nested named entity recognition both within and across the corpora shows that Chinese nested named entity recognition is still a difficult problem requiring further research.
  • Review
    MEI Lili, HUANG Heyan, ZHOU Xinyu, MAO Xianling
    . 2016, 30(5): 19-27.
    Sentiment analysis is a rapidly developing research topic with great research and application value. Sentiment lexicon construction plays an increasingly important role in this task. This paper summarizes the research progress on sentiment lexicon construction. First, four kinds of methods are summarized and analyzed: methods based on heuristic rules, on graphs, on word alignment models, and on representation learning. Then, popular corpora, dictionary resources, and evaluation campaigns are introduced. Finally, we conclude and discuss the development trends of sentiment lexicon construction.
  • Article
    LI Yang; GAO Daqi
    . 2017, 31(1): 140-146.
    Entity similarity is useful in many areas, such as recommendation systems on e-commerce platforms and patient grouping in healthcare. In our task of calculating entity similarity in a given knowledge graph, the attributes of every entity are provided, and a sample of entity pairs is provided with similarity scores. We therefore treat this task as a supervised learning problem, testing SVM, logistic regression, random forest, and learning-to-rank models.
  • Survey
    CAO Qi, SHEN Huawei, GAO Jinhua, CHENG Xueqi
    . 2021, 35(2): 1-18,32.
    Popularity prediction over online social networks plays an important role in various applications, e.g., recommendation, advertising, and information retrieval. Recently, the rapid development of deep learning and the availability of information diffusion data have provided a solid foundation for deep learning based popularity prediction research. Existing surveys of popularity prediction mainly cover traditional methods. To systematically summarize deep learning based popularity prediction, this paper reviews existing methods, categorizes recent research into deep representation based and deep fusion based methods, and discusses future research directions.
  • Survey
    CHEN Yulong, FU Qiankun, ZHANG Yue
    . 2021, 35(3): 1-23.
    In recent years, neural networks have gradually overtaken classical machine learning models and become the de facto paradigm for natural language processing tasks. Most typical neural networks are designed for data in Euclidean space; due to the nature of language, however, linguistic information such as discourse and syntax has graph structure. Therefore, a growing body of research uses graph neural networks to explore structures in natural language. This paper systematically introduces applications of graph neural networks in natural language processing. It first discusses the fundamental concepts and introduces three main categories of graph neural networks, namely graph recurrent neural networks, graph convolutional networks, and graph attention networks. It then introduces methods to construct appropriate graph structures for different tasks and to apply graph neural networks to encode those structures. The paper suggests that, compared with focusing on novel structures, exploring how to use the key information in specific tasks to create corresponding graphs is more universal and of greater academic value, and can be a promising direction for future research.
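As a concrete reference for the graph convolutional networks the survey introduces, a single GCN layer with self-loops and symmetric normalization can be sketched as follows; the toy graph is an invented example.

```python
# One GCN layer: normalized neighbourhood averaging followed by a linear map.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # adj: (num_nodes, num_nodes) adjacency matrix, 1 where an edge exists.
        a_hat = adj + torch.eye(adj.size(0))           # add self-loops
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm_adj = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        return torch.relu(self.linear(norm_adj @ node_feats))

# Toy dependency-like graph over 4 token nodes with 16-dim features.
adj = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 1], [0, 1, 0, 0], [0, 1, 0, 0]])
h = GCNLayer(16, 32)(torch.randn(4, 16), adj)
```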
  • Review
    LIU Longfei, YANG Liang, ZHANG Shaowu, LIN Hongfei
    . 2015, 29(6): 159-165.
    Chinese micro-blog sentiment analysis aims to discover user attitudes towards hot events. The task is challenged by heavy noise, abundant new words, numerous abbreviations, and flexible collocations, together with the limited contextual information provided by short texts. This paper explores the feasibility of performing Chinese micro-blog sentiment analysis with convolutional neural networks (CNN). To avoid task-specific features, character-level and word-level embeddings are adopted as CNN inputs. On the COAE 4th task corpus, the character-level CNN achieves a sentiment prediction accuracy of 95.42% in binary positive/negative classification, slightly better than the 94.65% accuracy of the word-level CNN. The results show that convolutional neural network models are promising for Chinese micro-blog sentiment analysis.
    Key words: deep learning; sentiment analysis; convolutional neural networks; word embedding
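A minimal sketch of a character-level CNN classifier in the spirit of the model above; filter widths, dimensions, and the toy input are illustrative assumptions rather than the paper's configuration.

```python
# Character-level CNN: multiple filter widths, max-pooling over time, softmax.
import torch
import torch.nn as nn

class CharCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, num_filters=100,
                 kernel_sizes=(2, 3, 4), num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, char_ids):                       # (batch, seq_len)
        x = self.embed(char_ids).transpose(1, 2)       # (batch, embed, seq_len)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))       # class logits

logits = CharCNN(vocab_size=6000)(torch.randint(1, 6000, (4, 50)))
```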
  • Survey
    CEN Keting, SHEN Huawei, CAO Qi, CHENG Xueqi
    Journal of Chinese Information Processing. 2023, 37(5): 1-21.
    As a self-supervised deep learning paradigm, contrastive learning has achieved remarkable results in computer vision and natural language processing. Inspired by the success of contrastive learning in these fields, researchers have tried to extend it to graph data and promoted the development of graph contrastive learning. To provide a comprehensive overview of graph contrastive learning, this paper summarizes recent works under a unified framework to highlight the development trends. It also catalogues the popular datasets and evaluation metrics for graph contrastive learning, and concludes with the possible future direction of the field.
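A common objective in graph contrastive learning is the InfoNCE loss between two augmented views of the same nodes, sketched below as background; it is a standard formulation, not a method proposed by this survey.

```python
# InfoNCE over two augmented views: embeddings of the same node are positives,
# all other nodes in the batch are negatives.
import torch
import torch.nn.functional as F

def info_nce(view1, view2, temperature=0.5):
    # view1, view2: (num_nodes, dim) node embeddings from two graph augmentations.
    z1, z2 = F.normalize(view1, dim=1), F.normalize(view2, dim=1)
    logits = z1 @ z2.t() / temperature            # pairwise cosine similarities
    targets = torch.arange(z1.size(0))            # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 64), torch.randn(32, 64))
```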
  • Information Extraction and Text Mining
    ZHONG Weifeng, YANG Hang, CHEN Yubo, LIU Kang, ZHAO Jun
    . 2019, 33(9): 88-95,106.
    Current research on automatic event extraction focuses on sentence-level corpora. However, due to the complexity and diversity of event descriptions in text, a complete event is in many cases mentioned across multiple sentences. This paper first proposes an attention-based sequence labeling model for the joint extraction of entities and events; compared with a pipeline of entity extraction followed by event recognition, this joint labeling model improves the F-score by 1%. A multi-layer perceptron is then used to label the entities in the events and identify their roles. Finally, based on the labeling and identification results, integer linear programming is leveraged for global reasoning, improving the F-score of document-level event extraction by 3% over the baseline.
  • Information Extraction and Text Mining
    XU Zhihao, HUI Haotian, QIAN Longhua, ZHU Qiaoming
    . 2015, 29(5): 91-98.
    Classifying Wikipedia entities is of great significance to NLP and machine learning. This paper presents a machine learning based method to classify Chinese Wikipedia articles. Besides using semi-structured data and unstructured text as basic features, we further use Chinese-oriented features and semantic features to improve classification performance. Experimental results on a manually tagged corpus show that the additional features significantly boost entity classification performance, with an overall F1-measure as high as 96% on the ACE entity type hierarchy and 95% on the extended entity type hierarchy.
  • Review
    HONG Yu, ZHANG Yu, LIU Ting, LI Sheng
    . 2007, 21(6): 71-87.
    Topic detection and tracking is a natural language processing technology for detecting unknown topics and tracking known topics in news media streams. Since its pilot research in 1996, several large-scale evaluation conferences have provided a good environment for evaluating technologies of recognition, collection, and organization. Because topic detection and tracking shares challenges with information retrieval, data mining, and information extraction over bursty and streaming data, it has become a hot research issue in the field of natural language processing. This paper introduces the background, definition, evaluation, and methods of topic detection and tracking, and explores its future development trends by analyzing current research.
  • Review
    LI Yachao, JIANG Jing, JIA Yangji, YU Hongzhi
    . 2015, 29(6): 203-207.
    TIP-LAS is an open source toolkit for Tibetan word segmentation and POS tagging. The toolkit implements Tibetan word segmentation based on syllable tagging with a CRF model, and integrates a maximum entropy model with syllable features for Tibetan POS tagging. The system achieves good results in experiments. The source code is shared on the Internet, together with the experimental corpus.
    Key words: Tibetan; word segmentation; part-of-speech tagging; conditional random fields; maximum entropy
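A minimal sketch of the syllable-tagging formulation of Tibetan word segmentation with a CRF, using sklearn-crfsuite as a stand-in for the toolkit's CRF component; the toy syllables, features, and B/E labels are invented for illustration, not TIP-LAS itself.

```python
# Word segmentation as per-syllable word-position tagging with a linear-chain CRF.
import sklearn_crfsuite

def syllable_features(syllables, i):
    return {
        "syl": syllables[i],
        "prev": syllables[i - 1] if i > 0 else "<BOS>",
        "next": syllables[i + 1] if i < len(syllables) - 1 else "<EOS>",
    }

# One toy training sentence as a syllable sequence with word-position labels.
X_train = [[syllable_features(["bod", "skad", "yig", "gsar"], i) for i in range(4)]]
y_train = [["B", "E", "B", "E"]]                 # B/I/E/S-style position tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))
```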
  • Review
    LI Sheng, KONG Fang, ZHOU Guodong
    . 2016, 30(4): 81-89.
    Recognizing implicit discourse relation is a challenging task in discourse parsing. In this paper, we propose an implicit discourse relation recognizing method in the Penn Discourse Treebank (PDTB) considering some traditional features (e.g., verbs, polarity, production rules, and so on), and provide a systematic analysis for our implicit discourse relation method. We apply all labeled data to build multiple classifiers, and use the adding rule to identify final classification result for each instance. We also use forward feature selection method to select an optimal feature subset for each classification task. Experimental results in the PDTB corpus show that our proposed method can significantly improve the state-of-the-art performance of recognizing implicit discourse relation.