2011 Volume 25 Issue 6 Published: 15 December 2011
  

    Review
  • Review
    DONG Zhendong, DONG Qiang, HAO Changling
    2011, 25(6): 3-12.
  • Review
    YU Shiwen, SUI Zhifang, ZHU Xuefeng
    2011, 25(6): 12-21.
    Since 1986, the Institute of Computational Linguistics at Peking University has been working on the Comprehensive Language Knowledge Base (CLKB), which consists of 6 language knowledge bases, 10 specifications and standards, 4 application systems and a software tool kit. These components provide support for each other and integrate into CLKB to describe linguistic knowledge on morphological, syntactic and semantic levels. The language data that have been collected include words, phrases, sentences and discourse in Chinese and many other languages, which occur in specific fields as well as the general domain. After 25 years of development, significant progress in CLKB has been made, and it is still growing. This paper gives an introduction to CLKB and explores its potential in the future.
    Key words: natural language processing; computational linguistics; language engineering; comprehensive language knowledge base; grammatical knowledge base of contemporary Chinese
  • Review
    HUANG Changning
    2011, 25(6): 21-26.
    IBM's DeepQA question answering system beat two human champions on the U.S. quiz show Jeopardy! on 14-16 February 2011. The result shows that Watson, IBM's supercomputer, can outperform intelligent human behavior, and it is a great milestone in the history of artificial intelligence. This paper reviews some gains and losses of natural language processing and automatic question answering technologies. It was written for the thirtieth anniversary of the Chinese Information Processing Society of China.
    Key words: artificial intelligence; natural language processing; automatic question answering; DeepQA; IBM Watson
  • Review
    SUN Maosong
    2011, 25(6): 26-33.
    This article proposes the idea of “natural language processing based on naturally annotated Web resources”. The discussion is carried out from three perspectives: the definition and types of naturally annotated resources, computing based on naturally annotated resources, and several key points at the methodological level. Finally, a fundamental problem is posed for further exploration: if we could systematically explore and integrate all the information provided by all available naturally annotated resources in different respects, could the machine, as expected, ultimately achieve some degree of deep understanding of natural language?
    Key words: naturally annotated resource; user-generated data; Web; natural language processing
  • Review
    ZHOU Ming
    2011, 25(6): 33-38.
    The ideal search engine should not only find information but also provide personalized service. Furthermore, it should act like a domain expert, helping the user make decisions and finish tasks efficiently. We believe that there is an unprecedented opportunity for Internet search engines. The large number of Internet users provides enormous room for their advancement. The rapid development of social networks and the mobile Internet will certainly change the structure of search engines, and the growing capability of NLP to understand user intent and documents will continue to improve search quality. However, the success of a new search engine is the result of multiple factors, including both advanced technology and an effective strategy. This paper analyzes the technical trends of search engines, discusses the important areas that deserve more effort, and presents a strategy for developing the next generation of search engines.
    Key words: search engine; R&D strategy; real-time search; social network
  • Review
    Benjamin K. Tsou, Oi Yee Kwong, LU Bin, Wing Fu Tsoi
    2011, 25(6): 38-46.
    The advancement of information technology and the Internet has offered important solutions to many classical problems in Chinese natural language processing. It has also opened up new opportunities for corpus linguistics, particularly the cultivation and utilization of large corpora for monitoring and tracking various language phenomena from the linguistic perspective, and investigating such language development in relation to the underlying social and cultural implications traditionally studied by humanities and social sciences. Over the past 17 years, the LIVAC corpus has grown into a very large corpus of its kind, containing results from the analysis of about 400 million Chinese characters drawn from news media from 7 communities of pan-Chinese regions. The long-term effort behind LIVAC has enabled it to function as serial time capsules, which provide a solid foundation for scientifically tracking and monitoring various phenomena of language changes together with the associated social and cultural developments within and across pan-Chinese regions. This paper introduces how the LIVAC synchronous corpus has evolved into a monitoring corpus of Chinese communities.
    Key words: corpus linguistics; LIVAC corpus; synchronous corpus; monitoring corpus
  • Review
    LIU Kaiying
    2011, 25(6): 46-53.
    Chinese FrameNet (CFN) is a computational lexical semantic database based on Fillmore's frame semantics, modeled on the FrameNet project of the University of California, Berkeley, and supported by a corpus of real Chinese text. This article introduces CFN's underlying theory, frame semantics, and the English FrameNet project. It then analyzes the construction technology and introduces ongoing research on automatic semantic annotation based on CFN. In experiments on data for all 25 frames, precision, recall and F1 reached 74.16%, 52.70% and 61.62%, respectively. Finally, it surveys other research based on CFN.
    Key words: Chinese FrameNet; semantic role labeling; frame semantic dependency graph
  • Review
    LIU Ting, CHE Wanxiang, LI Zhenghua
    2011, 25(6): 53-63.
    Chinese information processing needs the support of not only a basic data platform but also a basic technology platform. This paper introduces our “Language Technology Platform” (LTP), which has been developed and improved over the past eight years. The platform comprises several Chinese processing technologies, including word processing, syntactic parsing and semantic parsing. Among these, our syntactic and semantic parsing system won first place in the Conference on Computational Natural Language Learning (CoNLL) 2009 shared task. We have shared the platform freely with the academic community since 2006, and in 2010 LTP received the First Prize of the WeiChang Qian Chinese Information Processing Scientific and Technological Award issued by the Chinese Information Processing Society of China. So far, the platform has been shared with more than 400 research institutes. In June 2011, we released the source code, in the hope that users can better study application technologies based on the platform and modify the system together with us.
    Key words: Chinese information processing; language technology platform
  • Review
    LIU Qun
    2011, 25(6): 63-72.
    This paper gives an overview of our research on syntax-based statistical machine translation in recent years. Our work has focused on a series of syntax-based statistical machine translation models and approaches: the translation model based on maximum entropy bracketing transduction grammar; the tree-to-string translation model based on source phrase structure, together with the translation approaches built on it (the tree-based approach, the forest-based approach, and the joint parsing and decoding approach); and the translation model based on source dependency.
    Key words: statistical machine translation; syntax-based translation model; syntax-based translation approach
  • Review
    WANG Haifeng, WU Hua, LIU Zhanyi
    2011, 25(6): 72-81.
    This paper examines the characteristics and challenges of web-based machine translation and proposes possible solutions. First, we look back on the history of machine translation and summarize its methods. Next, we characterize Internet bilingual and monolingual corpora as large-scale, noisy, real-time and sometimes sparse. Based on these characteristics, we propose a hybrid machine translation method, corpus mining and filtering methods, and distributed computing methods. Furthermore, the pivot language approach is adopted to tackle the data sparseness problem, enabling the quick development of multilingual machine translation systems. We then discuss an approach to personalizing machine translation by combining translation technology with search technology. Finally, applications and products of machine translation technology are presented.
    Key words: web-based machine translation; hybrid machine translation; combination of translation and search
  • Review
    ZHAO Tiejun, CAO Hailong
    2011, 25(6): 81-90.
    This paper reviews the progress and achievements in multilingual information processing at the MI&T Lab of HIT. We briefly survey related research and then introduce our work on statistical machine translation, machine translation applications, machine translation evaluation and cross-language information retrieval.
    Key words: machine translation; multi-lingual information processing; natural language processing
  • Review
    MA Shaoping, LIU Yiqun, LIU Jian, ZHANG Min, ZHU Jianhua, RU Liyun
    2011, 25(6): 90-98.
    Search engines have become one of the most important tools for acquiring information on the Web. To meet users’ information needs, most commercial search engines rely on user behavior analysis to improve result ranking, data quality estimation, Web spam detection and other related techniques. However, such work seldom focuses on long-term dynamic analysis of user behavior, which may be essential for both the system architecture and the user interface design of future search techniques. Using large-scale user behavior data provided by one of the most popular Chinese search engines, we studied search behavior between 2006 and 2011 and obtained many findings that may help us better understand how users grow with search engines.
    Key words: search engine; user behavior analysis; dynamic analysis
  • Review
    ZHAO Jun, LIU Kang, ZHOU Guangyou, CAI Li
    2011, 25(6): 98-111.
    Research on information extraction is developing into open information extraction, i.e. extracting open categories of entities, relations and events from open-domain text resources. The methods used are also shifting from purely statistical machine learning models based on human-annotated corpora to statistical learning models that incorporate knowledge bases mined from large-scale, heterogeneous Web resources. This paper first reviews the history of research on information extraction, then introduces in detail the task definitions, difficulties, typical methods, evaluations, performance and challenges of three main open-domain information extraction tasks: entity extraction, entity disambiguation and relation extraction. Finally, based on our own research in this field, we analyze and discuss the development directions of open information extraction research and its applications in large-scale knowledge engineering, question answering, etc.
    Key words: open information extraction; knowledge engineering; text understanding
  • Review
    CHENG Xueqi, GUO Jiafeng, JIN Xiaolong
    2011, 25(6): 111-118.
    With the vigorous development of the Internet, the massive body of Web information has evolved into the largest data source thus far. As a consequence, how to utilize this information, provide intelligent applications to users and satisfy their information needs has become a challenging issue in the Internet community, giving rise to extensive studies on Web information retrieval and mining. Drawing on recent research progress and practice in Web information related fields, this paper summarizes and analyzes the history, problems and trends of Web information retrieval and mining from three specific perspectives: information representation, information retrieval and information mining.
    Key words: information representation; information retrieval; information mining
  • Review
    HUANG Xuanjing, ZHANG Qi, WU Yuanbin
    2011, 25(6): 118-127.
    In recent years, sentiment analysis (opinion mining) has received increasing attention and has become one of the most popular research topics in natural language processing, information retrieval and data mining. As research has deepened, a large number of new problems have been discovered and a variety of novel sentiment analysis methods have been proposed. In this survey, we focus on summarizing and analyzing the classical and novel methods on this topic. First, we discuss the definition of sentiment analysis and summarize the expression methods under different definitions. The following sections cover techniques for sentiment classification, feature-based sentiment analysis, benchmark datasets, evaluation and applications. Finally, we conclude the topic and briefly describe the trends in sentiment analysis. Research communities in China have put much effort into this topic, achieving a number of influential results and publishing many high-impact papers, and this paper also covers their work.
    Key words: sentiment analysis; opinion mining; sentiment classification; survey
  • Review
    HU Yu, LING Zhenhua, WANG Renhua, DAI Lirong
    2011, 25(6): 127-137.
    This paper introduces speech synthesis technologies based on acoustic statistical modeling. Emphasis is placed on the research progress contributed by the USTC iFLYTEK speech laboratory, which includes: integrating articulatory and acoustic features to improve the flexibility of acoustic parameter generation; proposing a minimum generation error criterion to replace maximum likelihood and improve synthesized speech quality; and using unit selection and waveform concatenation in place of a parametric synthesizer to avoid the speech quality limitations of HMM-based parametric synthesis. These innovative techniques improve the naturalness, expressiveness, flexibility and multilingual ability of speech synthesis systems. This progress has led to the wide use of speech synthesis in call center information services, human-machine speech interaction on mobile embedded devices, and intelligent speech-enabled electronic education systems.
    Key words: speech synthesis; hidden Markov model; parametric synthesis; unit selection
  • Review
    CAI Lianhong, JIA Jia, ZHENG Fang
    2011, 25(6): 137-142.
    This paper introduces progress in speech information processing, especially research on Chinese speech processing. Speech information processing includes speech recognition, speaker recognition, speech synthesis and computational speech perception. Research on speech recognition with accents and personal styles supports language learning and evaluation systems, while speaker recognition focuses on improving performance under different conditions. Research on speech synthesis pays more attention to cross-language, emotional and audio-visual speech synthesis. Computational speech perception focuses on applications in speech testing and rehabilitation, denoising, and speech enhancement. Through this research, and especially through the combination of speech information processing, linguistics and Web technology, we can build more harmonious human-computer speech interaction systems.
    Key words: speech recognition; speaker recognition; speech synthesis; computational speech perception
  • Review
    WANG Shijin, LI Hongyan, KE Dengfeng, LI Peng, GAO Peng, XU Bo
    2011, 25(6): 142-149.
    Chinese education faces the vital problem of finding effective ways for Chinese people to learn English and for ethnic minorities to learn Mandarin Chinese. Research on intelligent techniques for objective and fair assessment and diagnosis of oral language is of great importance for promoting computer-assisted language learning (CALL). In response to recent demand for large-scale oral assessment in high school English proficiency tests and the ethnic minority Chinese proficiency test (MHK), this article summarizes technical advances at the Institute of Automation, Chinese Academy of Sciences, in content recognition and rejection and in the assessment of pronunciation, fluency and prosody.
    Key words: Chinese information processing; CALL; pronunciation assessment; fluency assessment; rhythm assessment
  • Review
    Turgun·Ibrahim, YUAN Baoshe
    2011, 25(6): 149-157.
    This article focuses on information processing for the minority languages of Xinjiang, including Uygur, Kazak and Kirgiz, and provides a brief survey of research developments and technology prospects. It summarizes the history and state of the art of operating systems, information technology standards, language information processing and application systems for the Uygur, Kazak and Kirgiz languages. Future research directions for these languages are then outlined. The paper is an effort to explore ways to improve research and technology development for minority language information processing in Xinjiang.
    Key words: Uighur; Kazakh; Kirgiz; information processing; operating system; natural language; standard
  • Review
    2011, 25(6): 157-162.
    The construction of bilingual corpora and research on their automatic alignment are of vital importance to the development of computational linguistics. So far, various types of Chinese-English bilingual corpora, including substantial sentence-aligned corpora for MT, have been developed both in China and abroad. To start MT research involving minority languages with state-of-the-art technology, research on automatic alignment of Chinese and Tibetan bi-texts at the discourse, paragraph and sentence levels is necessary. This paper introduces a project on Sino-Tibetan bilingual corpus alignment, Chinese-Tibetan bilingual dictionary extraction, and the key technologies in corpus collection, storage and retrieval. The project has accomplished Tibetan encoding identification and conversion, Tibetan corpus construction, Sino-Tibetan bilingual dictionary extraction, and Sino-Tibetan sentence and word alignment, finally producing a large-scale aligned Sino-Tibetan bilingual corpus for Chinese-Tibetan machine translation.
    Key words: Chinese-Tibetan machine translation; Chinese-Tibetan bilingual corpus; coding; alignment technology
  • Review
    2011, 25(6): 162-166.
    Building a comprehensive Mongolian language knowledge bank is essential for supporting all kinds of Mongolian language processing systems. We have completed certain parts of the knowledge bank, yet many theoretical and technological issues remain. This paper introduces the main structure and contents of the Mongolian language knowledge bank in the first section, and discusses its applications and the problems we face in the following sections.
    Key words: Mongolian; knowledge base; language resources; semantic information; semantic dictionary
  • Review
    BI Yude
    2011, 25(6): 166-170.
    Korean is a cross-border language, and Korean natural language processing has been carried out in certain areas in China, North Korea and South Korea. This paper provides a brief overview of Korean language processing research in the three countries in three areas, namely basic research, resource development and applied research (including system development), and puts forward some suggestions for Korean natural language processing research and development in China.
    Key words: natural language processing; Korean; China; North Korea; South Korea
  • Review
    Samarayi
    2011, 25(6): 170-175.
    After 30 years of research and development, Yi language processing technologies have been widely applied in news and publishing, education and research, government offices, and national meetings of the Party, the National People’s Congress and the People’s Political Consultative Conference. The ancient language of the Yi nationality has left the age of lead and flame and stepped into the era of electronic and laser typesetting. Yi language processing technology has advanced IT development in Yi society and has thus played a unique role in the modernization of the Yi nationality.
    Key words: Yi language; information processing; achievements in 30 years
  • Review
    LIU Lianfang, GU Lin, HUANG Jiayu, WEN Jiakai
    2011, 25(6): 175-183.
    The language of the Zhuang nationality can be divided into ancient Zhuang and modern Zhuang. Ancient Zhuang survives mainly in ancient books and in everyday folk use. For this active branch, character-defining, editing and typesetting tools, dictionaries, and a pronunciation-borrowing ancient Zhuang database have been developed in Guangxi since 1990. Modern Zhuang was created in 1955 and is formally used in education (illiteracy elimination), in official government documents, on the currency (RMB), and in public places in Guangxi. For this branch, editing tools, English-Chinese-Zhuang dictionaries and CAT software have been developed. Further research and development in Zhuang information processing lies in full-text retrieval and mutual translation among ancient Zhuang, modern Zhuang, Chinese and English, so as to facilitate education, publication, communication and heritage protection for the Zhuang language.
    Key words: ancient Zhuang language; modern Zhuang language; information processing