Most Downloaded
  • Review
    XU Jun, DING Yu-xin, WANG Xiao-long
    . 2007, 21(6): 95-100.
    Baidu(98)
    In this paper, we study how to apply machine learning techniques to sentiment classification. The main task of sentiment classification is to determine whether a news item or review is negative or positive. Naive Bayes and Maximum Entropy classifiers are used for the sentiment classification of Chinese news and reviews. The experimental results show that the methods perform well, with classification accuracy reaching about 90%. Moreover, we find that selecting words with polarity as features, applying negation tagging, and representing test documents as feature presence vectors all improve performance. In conclusion, sentiment classification remains a challenging problem.
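The negation tagging and feature presence vectors mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the negation cue list, the clause-ending punctuation set, the `NOT_` prefix convention, and the English tokens are all assumptions made for readability (the paper works on Chinese text).

```python
# Hypothetical negation cues and clause boundaries, chosen for illustration only.
NEGATIONS = {"not", "no", "never"}
CLAUSE_END = {".", ",", "!", "?", ";"}


def tag_negation(tokens):
    """Prefix NOT_ to each token between a negation cue and the next clause boundary."""
    tagged, negated = [], False
    for tok in tokens:
        if tok in CLAUSE_END:
            negated = False          # punctuation closes the negated scope
            tagged.append(tok)
        elif tok in NEGATIONS:
            negated = True           # the cue itself is left untouched
            tagged.append(tok)
        else:
            tagged.append("NOT_" + tok if negated else tok)
    return tagged


def presence_vector(tokens, vocabulary):
    """Binary features: 1 if the vocabulary word occurs at all, 0 otherwise."""
    seen = set(tokens)
    return [1 if word in seen else 0 for word in vocabulary]
```

With presence features, a word that appears five times contributes the same evidence as a word that appears once, which is the design choice the abstract reports as helpful.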
  • Review
    ZHAO Yan-yan, QIN Bing, CHE Wan-xiang, LIU Ting
    . 2008, 22(1): 3-8.
    Baidu(117)
    Event extraction is an important research topic in information extraction. This paper studies the two stages of Chinese event extraction in depth, namely event type recognition and event argument recognition. For event type recognition, a novel method combining event trigger expansion with a binary classifier is presented; for argument recognition, a multi-class classification method based on maximum entropy is introduced. These methods effectively address the data imbalance problem in model training and the data sparseness problem caused by the small training set, and our event extraction system achieves better performance as a result.
  • Survey
    ZHU Zhangli, RAO Yuan, WU Yuan, QI Jiangnan, ZHANG Yu
    . 2019, 33(6): 1-11.
    The attention mechanism has gradually become one of the most popular methods and research topics in deep learning. By improving the representation of the source language, it dynamically selects relevant source-language information during decoding, which greatly alleviates the shortcomings of the classic Encoder-Decoder framework. Starting from the issues in the conventional Encoder-Decoder framework, such as long-term memory limitation, interrelationships in sequence transformation, and output quality of the model's dynamic structure, this paper describes various aspects of the attention mechanism, including its definition, principle, classification, and state-of-the-art research, as well as its applications in image recognition, speech recognition, and natural language processing. This paper further discusses the multi-modal attention mechanism, the evaluation of attention, the interpretability of the model, and the integration of attention with new models, providing new research issues and directions for the development of the attention mechanism in deep learning.
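The idea of dynamically selecting relevant source information during decoding can be illustrated with a small sketch. It assumes plain dot-product scoring followed by a softmax, which is only one of the scoring functions such surveys cover; the function name `attention` and the list-of-lists vector format are hypothetical choices for this example.

```python
import math


def attention(query, keys, values):
    """Return (weights, context) for one decoder query over encoder states."""
    # Dot-product score between the query and every key (assumed scoring function).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Numerically stable softmax turns scores into weights that sum to 1.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the weighted sum of the value vectors.
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, context
```

Because the weights are recomputed for every decoder step, each output token can draw on a different part of the source sequence instead of a single fixed-length summary.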
  • Survey
    FENG Yang, SHAO Chenze
    . 2020, 34(7): 1-18.
    Machine translation is the task of using a computer to translate a source language into a target language with equivalent meaning, and it has become an important research direction in natural language processing. Neural machine translation models, now the mainstream in the research community, perform end-to-end translation from source language to target language. In this paper, we select several main research directions of neural machine translation, including model training, simultaneous translation, multi-modal translation, non-autoregressive translation, document-level translation, domain adaptation, and multilingual translation, and briefly introduce the research progress in these directions.
  • Survey
    WEI Zhongyu, FAN Zhihao, WANG Ruize, CHENG Yijing, ZHAO Wangrong, HUANG Xuanjing
    . 2020, 34(7): 19-29.
    In recent years, cross-modality research, especially on vision and language, has attracted increasing attention. This survey focuses on the task of image captioning and summarizes the literature from four aspects: the overall architecture, key questions for cross-modality research, the evaluation of image captioning, and state-of-the-art approaches to image captioning. In conclusion, we suggest three directions for future research: cross-modality representation, automatic evaluation metrics, and diverse text generation.
  • Survey
    YUE Zengying, YE Xia, LIU Ruiheng
    . 2021, 35(9): 15-29.
    Pre-training technology has moved to the center stage of natural language processing, especially with the emergence of ELMo, GPT, BERT, XLNet, T5, and GPT-3 in the last two years. In this paper, we analyze and classify the existing pre-training technologies from four aspects: language model, feature extractor, contextual representation, and word representation. We discuss the main issues and development trends of pre-training technologies in current natural language processing.
  • Language Resources Construction
    ZHANG Kunli, ZHAO Xu, GUAN Tongfeng, SHANG Baiyu, LI Yumeng, ZAN Hongying
    . 2020, 34(6): 36-44.
    The medical text is an important data foundation for the implementation of intelligent healthcare. As a kind of semi-structured or unstructured data, the medical text needs to be labeled for entity and entity relationships, paving the way for text structuring, named entity recognition, and automatic relationship extraction. Aimed at constructing the Chinese medical knowledge graph, a semi-automated entity and relationship labeling platform is designed to integrate multiple algorithms for pre-labeling, schedule control, quality control and data analysis. Based on this platform, the medical knowledge graph entity and relationship labeling are carried out. The results show that the labeling platform can control the labeling process in the construction of text resources, ensure the labeling quality, and improve the labeling efficiency.
  • Review
    HONG Yu, ZHANG Yu, LIU Ting, LI Sheng
    . 2007, 21(6): 71-87.
    Baidu(266)
    Topic detection and tracking, as a natural language processing technology, aims to detect unknown topics and track known topics in news media information. Since its pilot research in 1996, several large-scale evaluation conferences have provided a good environment for evaluating technologies for recognition, collection and organization. As topic detection and tracking shares similar challenges with information retrieval, data mining and information extraction on sudden and continuous data, it has become a hot research issue in natural language processing. This paper introduces the background, definition, evaluation and methods of topic detection and tracking, and explores its future development trends by analyzing current research.
  • Survey
    WU Youzheng, LI Haoran, YAO Ting, HE Xiaodong
    . 2022, 36(5): 1-20.
    Over the past decade, a steady stream of innovations and breakthroughs has convincingly pushed the limits of modeling single modalities, e.g., vision, speech and language. Going beyond this single-modality progress, the rise of multimodal social networks, short video applications, video conferencing, live video streaming and digital humans demands the development of multimodal intelligence and offers fertile ground for multimodal analysis. This paper reviews recent multimodal applications that have attracted intensive attention in the field of natural language processing, and summarizes the mainstream multimodal fusion approaches from the perspectives of single-modal representation, multimodal fusion stage, fusion network, fusion of unaligned modalities, and fusion of missing modalities. In addition, this paper elaborates on the latest progress in vision-language pre-training.
  • Sentiment Analysis and Social Computation
    LIANG Jun, CHAI Yumei, YUAN Huibin, ZAN Hongying, LIU Ming
    . 2014, 28(5): 155-161.
    Chinese micro-blog sentiment analysis aims to discover user attitudes towards hot events. Most current studies analyze micro-blog sentiment with traditional algorithms such as SVM and CRF based on hand-engineered features. This paper explores the feasibility of performing Chinese micro-blog sentiment analysis with deep learning. We try to avoid task-specific features and use recursive neural networks to discover features relevant to the task. We propose a novel model, the sentiment polarity transition model, based on the relationship between neighboring words of a sentence to strengthen the text association. The proposed method achieves performance close to state-of-the-art methods based on hand-engineered features, while saving a lot of manual annotation work.
  • Survey
    LIN Wangqun, WANG Miao, WANG Wei, WANG Chongnan, JIN Songchang
    . 2020, 34(12): 9-16.
    A knowledge graph describes concepts, entities and their relationships in the form of a semantic network. In this paper, we formally describe the basic concepts and the hierarchical architecture of the knowledge graph. Then we review the state-of-the-art technologies for information extraction, knowledge fusion, schema, and knowledge management. Finally, we probe into the application of knowledge graphs in the military field, revealing challenges and trends of future development.
  • Article
    ZHANG Hainan, WU Dayong, LIU Yue, CHENG Xueqi
    . 2017, 31(4): 28-35.
    Chinese NER is challenged by implicit word boundaries, the lack of capitalization, and the polysemy of a single character in different words. This paper proposes a novel character-word joint encoding method in a deep learning framework for Chinese NER. It reduces the effect of improper word segmentation and sparse word dictionaries in word-only embedding, while alleviating the missing-context problem in character-only embedding. Experiments on the 1998 People's Daily corpus show good results: improvements of at least 1.6%, 8% and 3% on location, person and organization recognition, respectively, compared with character or word features; and F1 scores of 96.8%, 94.6% and 88.6% on location, person and organization recognition, respectively, when part-of-speech features are integrated.
  • WANG Hou-feng
    . 2002, 16(6): 10-18.
    Anaphora occurs throughout discourse and dialogue. Its high frequency makes anaphora resolution a key problem in discourse processing, one that is attracting the attention of a growing number of researchers. This article discusses several issues in anaphora resolution, such as basic concepts, special referring phenomena, and the knowledge necessary for resolution. Some typical computational models of anaphora resolution and implementation techniques are also presented.
  • Sentiment Analysis and Social Computing
    BAI Ting, WEN Jirong, ZHAO Xin, YANG Bohua
    . 2017, 31(5): 185-193.
    Long-tail products, each with low demand, together account for a significant share of total revenue. Analyzing long-tail purchase behaviors is challenging because of the data sparsity that results from few purchases. This paper proposes leveraging online social media information to predict long-tail purchase behaviors. Specifically, we build user profiles from social media information, including status text, following links and temporal activity distributions, and predict purchases with weighted Multiple Additive Regression Trees (MART). Experiments on data from JingDong and SinaWeibo demonstrate the effectiveness of the proposed method, together with several interesting findings.
  • Language Resources Construction
    YAO Yuanlin, WANG Shuwei, XU Ruifeng, LIU Bin, GUI Lin, LU Qin, WANG Xiaolong
    . 2014, 28(5): 83-91.
    Baidu(9)
    Research on text emotion analysis has made substantial progress in recent years. However, emotion-annotated corpora are less developed, especially those for micro-blog text. To support the analysis of emotion expression in Chinese micro-blog text and the evaluation of emotion classification algorithms, an emotion-annotated corpus of Chinese micro-blog text is designed and constructed. Based on observation and analysis of emotion expression in micro-blog text, a set of emotion annotation specifications is developed. Following these specifications, emotion annotation at the micro-blog level is performed first. The annotated information includes whether the micro-blog text has emotion expression and, for micro-blogs with emotion expression, the corresponding emotion categories. Next, sentence-level annotation is conducted, covering whether the sentence has emotion expression, its emotion categories, and the strength corresponding to each emotion category. Currently, this emotion-annotated corpus consists of 14000 micro-blogs, totaling 45431 sentences. This corpus was used as the standard resource in the NLP&CC2013 Chinese micro-blog emotion analysis evaluation, facilitating research on emotion analysis to a great extent.
  • Review
    . 1994, 8(4): 43-54.
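    This paper analyzes the structure of complex sentences in detail and discusses representing their formal structure with box diagrams and their semantic structure with complex feature sets. The former is intuitive and easy for non-specialists to understand; the latter represents the meaning of complex sentences at a deep level and is convenient for computer processing. Finally, the paper explores methods for the automatic analysis of Chinese complex sentences.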
  • DAI Liu-ling, HUANG He-yan, CHEN Zhao-xiong
    . 2004, 18(1): 27-33.
    Baidu(258)
    This paper is a comparative study of feature selection methods in text categorization. Four methods were evaluated: document frequency (DF), information gain (IG), mutual information (MI) and the χ2-test (CHI). A Support Vector Machine (SVM) and a k-nearest neighbor (KNN) classifier were selected as the evaluating classifiers. We found that IG, MI and CHI performed poorly in our tests, though they behave well in English text categorization. We analyzed the reasons theoretically and put forward possible solutions. A further experiment showed that the combined feature selection method is effective.
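Two of the feature selection scores named in this abstract, CHI and MI, can be sketched from a term/class contingency table. This is a minimal sketch using the commonly cited textbook definitions, which may differ in detail from the variants the paper evaluated; the function names and the argument layout (a = documents in the class containing the term, b = documents outside the class containing the term, c = documents in the class without the term, d = documents outside the class without the term) are assumptions for this example.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic of a term/class pair from its 2x2 document counts."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0                      # degenerate table: no evidence either way
    return n * (a * d - c * b) ** 2 / denom


def mutual_information(a, b, c, d):
    """Pointwise mutual information estimate log(A*N / ((A+C)(A+B)))."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")            # term never occurs in the class
    return math.log(a * n / ((a + c) * (a + b)))
```

A term that appears in every document of one class and nowhere else gets a high score under both measures, while a term spread evenly across classes scores zero.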
  • Sentiment Analysis and Social Computing
    LI Ao, DAN Zhiping, DONG Fangmin, LIU Longwen, FENG Yang
    . 2020, 34(9): 78-88.
    Existing rumor detection algorithms, including general sequential models, are deficient in capturing text semantics and detecting key features, resulting in poor generalization capability. To address this issue, this paper proposes an improved generative adversarial network model named TGBiA for rumor detection. TGBiA adopts adversarial training to capture the augmentation, detraction, exaggeration and distortion that a rumor undergoes as it spreads. The generator extracts sequence semantics and features via a Transformer instead of an RNN, and the discriminator is a BiLSTM-based classification model with an attention mechanism. Through the mutual promotion of the generator and discriminator, the model progressively learns the indicative features of rumors. Experimental results on the Weibo and Twitter datasets show that the proposed method not only outperforms other existing detection methods but is also more robust.
  • Sentiment Analysis and Social Computing
    DU Chengyu, LIU Pengyuan
    . 2020, 34(9): 70-77.
    Aspect-level sentiment classification is a fine-grained sentiment analysis task that aims to identify the sentiment polarity for a particular aspect. This paper proposes BERT-based Helical Attention Networks (BHAN), which employ a helical attention mechanism to obtain a better representation of context and aspect. Specifically, we first compute a weighted context representation based on the averaged aspect vector and use it to compute the attention weights of the aspect. We then use the newly weighted aspect representation to recompute the context attention weights. Iterating this process until convergence yields a better representation of context and aspect. Evaluated on SemEval 2014 Task 4 and a Twitter dataset, the proposed method outperforms existing state-of-the-art methods.
  • NLP Application
    WANG Chencheng, YANG Liner, WANG Yingying, DU Yongping, YANG Erhong
    . 2020, 34(6): 106-114.
    Grammatical error correction is an important task in natural language processing that has attracted wide attention in recent years. This paper treats grammatical error correction as a translation task that translates erroneous text into correct text. We use the Transformer model with a multi-head attention mechanism as the framework, and propose a dynamic residual structure that dynamically combines the outputs of different neural blocks to better capture semantic information. Because training corpora are scarce, we propose a data augmentation method that generates parallel data by corrupting a monolingual corpus. The experimental results show that the proposed method based on dynamic residuals and data augmentation significantly improves error correction performance, achieving the best performance on the NLPCC 2018 Chinese grammatical error correction task.
  • Information Extraction and Text Mining
    ZHOU Ning, SHI Wenqian, ZHU Zhaozhao
    . 2020, 34(9): 44-52.
    The graph-based TextRank algorithm is an effective keyword extraction algorithm with high accuracy. However, when constructing the edges of the graph, the algorithm adopts a co-occurrence window rule that considers only the association between local words, which introduces considerable randomness and uncertainty. To address this issue, an improved TextRank keyword extraction algorithm based on rough data-deduction is proposed. In this method, candidate keywords are classified according to word meanings, and the association between candidate words in different classes is deduced by rough data-deduction. The experimental results show that the extraction precision of the improved algorithm is significantly higher.
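For context, the baseline co-occurrence window rule that this abstract criticizes can be sketched as below. This shows only plain TextRank on an unweighted, undirected graph, not the paper's rough data-deduction improvement; the function name, the default window size, the damping factor of 0.85 and the fixed iteration count are assumptions for this example.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=100):
    """Score words by iterating PageRank over a co-occurrence graph."""
    # Link two distinct words when they fall inside the same sliding window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Every linked word starts at 1.0 and repeatedly receives an equal share
    # of each neighbor's score, damped toward a constant baseline.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return scores
```

For example, `textrank("a b a c a d".split())` ranks `a` highest because it is the only word adjacent to all the others. Edges depend purely on position in the text, which is the locality the abstract points to as a source of randomness.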
  • Language Analysis and Calculation
    HUANG Haibin, CHANG Baobao, ZHAN Weidong
    . 2020, 34(9): 1-8.
    The paper introduces an approach to the automatic annotation of Chinese constructions. Without annotated corpora as training data, it is difficult to extract knowledge of the various constructions. To address this issue, we apply an unsupervised method based on the Gaussian Mixture Model, together with token position features, linguistic features of the construction, and regular expressions, to capture the structure of the construction, especially when the boundary is hard to identify. Compared with the results annotated by regular expressions and part-of-speech, the proposed method improves F1 by 17.9% (for semi-concretionary constructions), 19.3% (for phrasal constructions) and 14.9% (for sentential constructions).
  • Language Resources Construction
    CAO Ziyan, FENG Minxuan, MAO Xuefen, CHENG Ning, SONG Yang, LI Bin
    . 2020, 34(9): 28-35.
    Product reviews are an important research object in sentiment analysis. At present, most existing product review corpora are relatively coarse, and the three elements of target, attribute and polarity are not always annotated. This paper constructs a fine-grained emotional corpus of 9,343 short car review texts. The target, attribute and polarity are all annotated on specific words and further associated with the ontology tree of products and attributes. Implicit expressions without sentiment words and special texts (such as suggestions and comparative sentences) are also annotated with specific labels and corresponding triples. The statistics show that the co-occurrence of target and attribute is as high as 77.54%, indicating that complete annotation is necessary for sentiment corpora. The experiment on automatic annotation achieves an F1-score of up to 70.82%.
  • Language Resources Construction
    WANG Chengwen, QIAN Qingqing, XUN Endong, XING Dan, LI Meng, RAO Gaoqi
    . 2020, 34(9): 19-27.
    Research on semantic roles has always been a significant challenge in linguistics. Some resources on semantic relations have been constructed; however, most domestic research on Chinese word semantic relations focuses on labeling. This paper proposes a novel structure, the ternary collocation, to describe semantic relations with verbs at the core. The paper also puts forward a semantic role classification scheme, under which a semantic role bank for Chinese verbs is constructed. For every verb involved, the possible semantic roles are exhaustively identified and other related knowledge is annotated. Altogether 5,260 verbs are collected, among which 2,685 verbs are assigned 4,307 semantic roles as well as guiding words.
  • Sentiment Analysis and Social Computing
    YANG Liang, ZHOU Fengqing, ZHANG Li, MAO Guoqing, YI Bin, LIN Hongfei
    . 2020, 34(9): 89-96.
    In judicial trials, the prosecution and the defense often hold different views on the points of argument in a case, which are also key factors in the final judgment. To identify the arguments in cases, this paper introduces a text summarization model, since composing the argument mostly depends on analyzing and summarizing the case text. We construct a generation model for the argument by incorporating a generative adversarial network, and then obtain the argument of the case. Experiments on real judicial data obtained from the China Judgements Online website show that the proposed model improves accuracy in the task of argument recognition. In practice, this method can assist procuratorial personnel in pre-trial planning and in the trial of the case.
  • Question-answering and Dialogue
    WANG Mengyu, YU Dingyao, YAN Rui, HU Wenpeng, ZHAO Dongyan
    . 2020, 34(8): 78-85.
    The multi-turn dialogue task requires the system to take context information into account while generating fluent answers. Recently, a large number of multi-turn dialogue models based on the HRED (Hierarchical Recurrent Encoder-Decoder) model have been developed, reporting good results on some English dialogue datasets such as Movie-DiC. On a high-quality, real-world customer service dialogue corpus released by Jingdong to contestants in 2018, this article investigates the performance of the HRED model and explores possible improvements. The results show that combining the attention and ResNet mechanisms with the HRED model achieves significant improvements.
  • Information Extraction and Text Mining
    LI Wei, YAN Xiaodong, XIE Xiaoqing
    . 2020, 34(9): 36-43.
    For Tibetan text summarization, this paper proposes an improved TextRank for Tibetan extractive summarization. This method integrates information from an external corpus into the TextRank algorithm in the form of word vectors. Each sentence is represented by its word vectors, so that a sentence vector is used for sentence scoring. We select the sentences with the highest scores and reorder them as a summary of the text. The experimental results demonstrate that the method can effectively improve the quality of the summary according to the ROUGE evaluation method.
  • Language Resources Construction
    TANG Qiantong, CHANG Baobao, ZHAN Weidong
    . 2020, 34(9): 9-18.
    This paper proposes a fine-grained evaluation scheme for Chinese POS tagging. The key to this task is to determine the evaluation items and the samples (words) for each item. This paper presents an evaluation set of 5 873 sentences, totaling 2 326 words for 70 evaluation items. Several common open source POS taggers are evaluated. Finally, this paper discusses the merits of this evaluation approach, especially in contrast to classical methods.
  • Sentiment Analysis and Social Computing
    WANG Xiaohan, YU Zhengtao, XIANG Yan, GUO Xianwei, HUANG Yuxin
    . 2020, 34(9): 62-69.
    For case-related microblogs, opinion sentence recognition should consider whether a comment discusses the topic of a specific case. To address this issue, this paper proposes an opinion sentence recognition model that incorporates the microblog content as a feature. Under a CNN framework, the keyword vectors of the case-related microblog are concatenated with the corresponding comment word vectors at the input layer. Experiments show that the accuracy of the model on two datasets of case-related microblogs reaches 84.74% and 82.09%, respectively, a significant improvement over existing benchmarks.
  • ZHENG Shi-fu, LIU Ting, QIN Bing, LI Sheng
    . 2002, 16(6): 47-53.
    Baidu(145)
    Question answering is a hot research field in natural language processing that involves many kinds of NLP technology. This paper introduces the current research status and the methods often used in question answering. In general, a question answering system is made up of three parts: question analysis, information retrieval and answer extraction. This paper describes in detail the main functions of these three parts and the common approaches used in them. Finally, this paper introduces the evaluation of question answering systems.
  • ZHOU Qian, ZHAO Ming-sheng, HU Min
    . 2004, 18(3): 18-24.
    Baidu(163)
    This paper introduces and compares eight feature selection methods in text categorization. Among the eight methods, two are newly proposed: Multi-Class Odds Ratio (MC-OR), a variant of the Odds Ratio often used in binary classification, and a new feature selection method based on Class-Discriminating Words (CDW). Combined with the classic VSM classifier based on cosine similarity and the Naïve Bayes classifier, training and testing are carried out on two text sets with different class distributions. The results indicate that MC-OR and CDW achieve the best selection effect.
  • DONG Jing, SUN Le, FENG Yuan-yong, HUANG Rui-hong
    .
    Entity relation extraction is one of the important research fields in information extraction. This paper presents a novel method that divides entity relations into two categories: embedding relations and non-embedding relations. After some simple experiments, we find that some syntactic features have clearly different effects on identifying the two kinds of relations. We therefore suggest two different sets of syntactic features for extracting the two categories. Experiments show that the new method achieves improved performance on the ACE2007 corpus for the Chinese entity relation extraction task.
  • Review
    Xiao Ming, Hu Jinzhu, Zhao Hui
    . 1999, 13(6): 54-61.
    With the popularity of the Windows operating system and TrueType fonts, font technology is attracting more and more attention. This paper traces the latest developments from TrueType and OpenType to ClearType, analyzes the OpenType font file structure in detail, and explains the important descriptive tables in the file. A user who correctly understands the OTF font file format can create custom fonts and apply them effectively.
  • Review
    XU Lin-hong, LIN Hong-fei, ZHAO Jing
    . 2008, 22(1): 116-122.
    Baidu(81)
    This paper introduces our experience in constructing an emotional corpus and discusses several basic questions, including the tagging criteria, tagging set, tagging tools and quality monitoring. The corpus contains about 40 000 sentences. Based on it, statistical data on emotional distribution and rules of emotional transference are obtained, and the characteristics and applications of the corpus are analyzed. The emotional corpus thus provides support for text affective computing.
  • Review
    ZHAO Jun, LIU Kang, ZHOU Guangyou, CAI Li
    . 2011, 25(6): 98-111.
    Research on information extraction is evolving toward open information extraction, i.e., extracting open categories of entities, relations and events from open-domain text resources. The methods used are also shifting from purely statistical machine learning models based on human-annotated corpora to statistical learning models that incorporate knowledge bases mined from large-scale, heterogeneous Web resources. This paper first reviews the history of research on information extraction, then introduces in detail the task definitions, difficulties, typical methods, evaluations, performance and challenges of three main open-domain information extraction tasks: entity extraction, entity disambiguation and relation extraction. Finally, based on our research in this field, we analyze and discuss the development directions of open information extraction research and its applications in large-scale knowledge engineering, question answering, etc.
    Key words: open information extraction; knowledge engineering; text understanding
  • Information Extraction and Text Mining
    ZHANG Jiashuo, HONG Yu, LI Zhifeng, YAO Jianmin, ZHU Qiaoming
    . 2020, 34(9): 53-61.
    The attention-based encoder-decoder framework is widely used in image captioning. In previous methods, the single-directional attention mechanism does not check the consistency between semantic information and image content, causing low accuracy in the generated caption. To solve this problem, this paper proposes an image captioning method based on a bi-directional attention mechanism. On top of the single-directional attention mechanism, an attention calculation from image features to semantic information is added, enabling interaction between the image and the semantic information in both directions. This paper designs a gated network to fuse the information from the two directions. In contrast to previous studies, this paper uses historical semantic information to assist current word generation in the attention module. Using two types of image features, the experimental results show that on the MSCOCO dataset, the BLEU4 score increases by 1.3 and the CIDEr score by 6.3 on average. On Flickr30k, the BLEU4 score increases by 0.9 and the CIDEr score by 2.4 on average.
  • Survey
    BYAMBASUREN Odmaa, YANG Yunfei, SUI Zhifang, DAI Damai, CHANG Baobao, LI Sujian, ZAN Hongying
    . 2019, 33(10): 1-7.
    The medical knowledge graph is the cornerstone of intelligent medical applications. Existing medical knowledge graphs fall short of the needs of intelligent medical applications in terms of scale, specification, taxonomy, formalization and precise description of knowledge. We apply natural language processing and text mining techniques with a semi-automated approach to develop the Chinese Medical Knowledge Graph (CMeKG 1.0). The construction of CMeKG 1.0 refers to international medical coding systems such as ICD-10, ATC, and MeSH, as well as large-scale, multi-source heterogeneous clinical guidelines, medical standards, diagnostic protocols, and medical encyclopedia resources. CMeKG covers types such as diseases, drugs, and diagnosis/treatment technologies, with more than 1 million medical concept relationships. This paper presents the description system, key technologies, construction process and medical knowledge description of CMeKG 1.0, serving as a reference for the construction and application of knowledge graphs in the medical field.
  • LIU Qun
    . 2003, 17(4): 2-13.
    Baidu(47)
    This paper surveys three approaches to statistical machine translation and the evaluation methods used in SMT. The basic idea of the parallel-grammar-based approach is to build parallel grammars for the source and target languages that conform to the same probabilistic distribution. In the source-channel approach, the translation probability is expressed in terms of a language model and a translation model. In the maximum entropy approach, the optimal translation is searched for according to a linear combination of a series of real-valued feature functions. The source-channel approach can be regarded as a special case of the maximum entropy approach.
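The closing claim of this abstract, that the source-channel approach is a special case of the maximum entropy approach, can be written out as follows. This is a sketch in standard notation rather than the paper's own: the symbols f (source sentence), e (candidate translation), h_m (feature functions) and \lambda_m (feature weights) are assumptions introduced here.

```latex
% Source-channel model: pick the translation that maximizes the product
% of a language model P(e) and a translation model P(f | e).
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\, P(f \mid e)

% Maximum entropy (log-linear) model: pick the translation that maximizes
% a linear combination of real-valued feature functions.
\hat{e} = \arg\max_{e} \sum_{m=1}^{M} \lambda_m h_m(e, f)

% Special case: with M = 2, \lambda_1 = \lambda_2 = 1 and
% h_1(e, f) = \log P(e), \quad h_2(e, f) = \log P(f \mid e),
% the second criterion reduces to the first, since \log is monotonic.
```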
  • . 2005, 19(3): 54-54.
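     This book was co-authored by Professor Christopher D. Manning of the United States and Professor Hinrich Schutze of Germany. Professor YUAN Chunfa of Tsinghua University organized the translation and was responsible for the unified revision, review and finalization of the whole book. It is a monograph in the Publishing House of Electronics Industry's series of foreign computer science textbooks that systematically introduces statistical natural language processing (or statistical linguistics), and it has been adopted as a textbook by many universities abroad. Statistical linguistics has also become the mainstream of natural language processing research in China, and we hope this book will be helpful to everyone's research and teaching.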
  • NLP Application
    XUE Yang, LIANG Xun, XIE Hualun, DU Wei
    . 2020, 34(9): 97-110.
    A document embedding model is designed and trained on a corpus of 51 contemporary and Ming and Qing literary works, including A Dream of Red Mansions. To obtain the optimal high-dimensional document embedding vectors representing the semantic characteristics of words and document topics, the document embedding matrix and loss function of different authors are defined according to the unitary invariance of the document embedding vector. An authorship identification method is designed using an unsupervised manifold learning dimensionality reduction mapping algorithm and a supervised classification algorithm. The classification accuracy for known authors reaches 99.6%, and even authors with similar styles, such as Lu Yao and Chen Zhongshi, can be effectively distinguished. A variable-scale sliding window classification model is further proposed to conduct an in-depth analysis of A Dream of Red Mansions. It is found that the first 80 chapters and the last 40 chapters may come from different authors, and that there are also some style differences between the first 100 and the last 20 chapters.