2018 Volume 32 Issue 3 Published: 15 March 2018
  

  • Language Recognition and Its Computing
  • Language Recognition and Its Computing
    YAO Dengfeng, JIANG Minghu, CHANG Jung-hsing, Abudoukelimu Abulizi
    2018, 32(3): 1-8.
    Classifier predicates are a unique linguistic phenomenon in sign language. Research on classifier predicates by Chinese scholars has only just begun, and no systematic report has yet been produced. This paper attempts to explain the phenomenon of classifier predicates from a linguistic perspective. It accounts for classifier predicates in Chinese Sign Language and, through a combined analysis of Talmy's dynamic events and proforms, analyzes how figure proforms and ground proforms are formed to meet the simultaneity and sequentiality requirements of sign language. It shows that figure proforms and ground proforms are usually produced by the non-moving handshape. It also illustrates the interaction between sign language and Chinese, giving a detailed description and classification of the two sign proforms "motion" and "be located".
  • Language Recognition and Its Computing
    CHEN Xiuqing
    2018, 32(3): 9-16.
    This study discusses the formation mechanism of and constraints on redundant negation by combining quantitative and qualitative analysis. The formation of redundant negation is the syntactic realization of semantically implicit negation, built on implicit negative words, which also impose essential formal restrictions on its formation. The negative meaning of implicit negative words varies with its source, leading to different abilities to form redundant negation: words whose negative meaning comes from entailment have the strongest ability to form redundant negation, followed by those from presupposition and those from conversational implicature, while those from assertion have the weakest ability. The statistics reveal this ordering as a clear, steep curve.
  • Morphology, Syntax, Semantics
  • Morphology, Syntax, Semantics
    ZHANG Jing, HUANG Kaiyu, LIANG Chen, HUANG Degen
    2018, 32(3): 17-25,33.
    Aiming to extract new words from Chinese social media data, a novel unsupervised method that combines traditional statistical measures with word embeddings is proposed. Traditional statistical measures are first applied to extract a new-word candidate list from the segmented social media corpus; word embeddings are then trained with multiple strategies to filter noise from the candidate list by constructing an anti-word set, i.e. segments that are unlikely to form a new word in combination with other segments. In addition, to analyze the traditional statistical measures and tune thresholds, we annotated 10,000 tweets as a development corpus, which the experimental results show to be reliable. To assess the proposed method, the training corpus released by the NLPCC2015 evaluation on microblog-oriented Chinese word segmentation is used as the test corpus. The results show that our method significantly improves new word extraction over the baseline systems, reaching an F1-measure of 56.2% for bigram and trigram new words on the test corpus.
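    The statistics-then-filter idea in the abstract above can be illustrated with a small sketch: pointwise mutual information over adjacent segments proposes candidates, and an anti-word set (which the paper builds with word embeddings) prunes them. The corpus format, thresholds, and anti-word construction below are illustrative assumptions, not the authors' implementation.

        # Candidate generation by PMI over adjacent segments, followed by
        # anti-word filtering. Thresholds and data layout are assumptions.
        import math
        from collections import Counter

        def pmi_candidates(segmented_sentences, min_count=5, pmi_threshold=3.0):
            """Return adjacent-segment pairs whose PMI exceeds a threshold."""
            unigrams, bigrams = Counter(), Counter()
            for sent in segmented_sentences:       # each sentence is a list of segments
                unigrams.update(sent)
                bigrams.update(zip(sent, sent[1:]))
            n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
            out = []
            for (a, b), c in bigrams.items():
                if c < min_count:
                    continue
                pmi = math.log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
                if pmi > pmi_threshold:
                    out.append((a, b, pmi))
            return out

        def filter_candidates(candidates, anti_words):
            """Drop candidates that contain a segment from the anti-word set."""
            return [(a + b, pmi) for a, b, pmi in candidates
                    if a not in anti_words and b not in anti_words]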
  • Knowledge Representation and Acquisition
  • Knowledge Representation and Acquisition
    DUAN Pengfei, WANG Yuan, XIONG Shengwu, MAO Jingjing
    2018, 32(3): 26-33.
    Humanoid intelligence has developed rapidly, and it benefits from complete knowledge graphs, especially elementary-education knowledge graphs such as the geography knowledge graph. Traditional knowledge graphs are represented as networks, which leads to high computational complexity. This paper puts forward a new algorithm named PTransW (Path-based TransE Considering Relation Type by Weight). It combines space projection with the semantic information of relation paths, and further exploits the semantic information of relation types. Experimental results on the FB15K and GEOGRAPHY data sets show that the PTransW model significantly improves the handling of complex relations in knowledge graphs.
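    PTransW builds on TransE-style translation embeddings. For orientation only, the sketch below shows the standard TransE scoring and margin loss that such models start from; the path information and relation-type weighting that PTransW adds are not reproduced here.

        # Base TransE idea: a relation is a translation in vector space, h + r ≈ t.
        import numpy as np

        def transe_score(h, r, t, norm=1):
            """Lower is better: distance between (head + relation) and tail."""
            return np.linalg.norm(h + r - t, ord=norm)

        def margin_loss(pos_triple, neg_triple, margin=1.0):
            """Margin-based ranking loss over a positive and a corrupted triple."""
            return max(0.0, margin + transe_score(*pos_triple) - transe_score(*neg_triple))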
  • Knowledge Representation and Acquisition
    JIANG Tianwen, QIN Bing, LIU Ting
    2018, 32(3): 34-41.
    Knowledge bases are usually represented as networks, with nodes for entities and edges for relations. To use such knowledge bases, people have to design algorithms of high complexity, which are not well suited to knowledge reasoning, and especially not to expanding knowledge bases. This paper uses a TransE-based model to learn representations for knowledge reasoning, covering both relation indication reasoning and tail entity reasoning. The experiments on relation indication reasoning obtain excellent results without designing complex algorithms: only simple operations in vector space are required. In addition, we improve the original loss function of the TransE model to make it better suited to representation learning for open-domain Chinese knowledge bases.
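    With translation-style embeddings, tail entity reasoning reduces to a nearest-neighbour search around h + r, which is the kind of simple vector-space operation the abstract refers to. The sketch below assumes entity and relation vectors from a trained TransE-style model; it is not the paper's code.

        # Tail-entity reasoning as a nearest-neighbour query around h + r.
        import numpy as np

        def predict_tail(h, r, entity_matrix, top_k=5):
            """Rank all candidate tail entities by distance to h + r."""
            distances = np.linalg.norm(entity_matrix - (h + r), axis=1)
            return np.argsort(distances)[:top_k]   # indices of the top-k tails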
  • Machine Translation
  • Machine Translation
    ZHANG Xueqiang, CAI Dongfeng, YE Na, WU Chuang
    2018, 32(3): 42-48,63.
    Neural Machine Translation (NMT) performs poorly on long sentences with complex structure, owing to its neglect of linguistic knowledge about sentence structure. Adopting a divide-and-conquer strategy, this paper proposes to identify and extract the Maximal Noun Phrases (MNPs) in a sentence, retaining special marks or head words in their place so that the remaining components form a sentence frame. The MNPs and the sentence frames are then translated by NMT separately. Experimental results show that the proposed method yields an improvement of 0.89 BLEU points over the baseline system.
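    The divide-and-conquer flow described above can be sketched schematically as follows. The MNP extractor, the translator, the placeholder format, and the re-insertion step are hypothetical stand-ins for illustration, not the paper's actual components.

        # Schematic divide-and-conquer translation; extract_mnps and translate
        # are hypothetical callables supplied by the caller.
        def translate_with_mnp_frames(sentence, extract_mnps, translate):
            """Translate maximal noun phrases and the sentence frame separately."""
            mnps = extract_mnps(sentence)                  # list of MNP strings
            frame = sentence
            for i, span in enumerate(mnps):
                frame = frame.replace(span, f"<MNP{i}>")   # or the phrase's head word
            frame_translation = translate(frame)
            for i, span in enumerate(mnps):                # re-insert translated phrases
                frame_translation = frame_translation.replace(f"<MNP{i}>", translate(span))
            return frame_translation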
  • Ethnic Language Processing and Cross Language Processing
  • Ethnic Language Processing and Cross Language Processing
    LIU Ruolan, NIAN Mei, Maierhaba Aisaiti
    2018, 32(3): 49-54.
    Emotion words are a fundamental resource for accurately analyzing opinions in the Uighur language. We investigate the automatic expansion of web emotion words on the basis of an existing Uighur sentiment lexicon. First, we summarize the collocation rules of conjunctions, degree adverbs and sentiment words by analyzing the linguistic features of Uighur emotional expression. Based on these rules, we design an algorithm to obtain candidate emotion words from an emotional corpus, forming a candidate sentiment lexicon. Finally, treating the Internet as a very large corpus, we design a search-engine-based polarity discrimination algorithm that exploits the characteristics of Uighur conjunctions together with the established sentiment lexicon and a Uighur antonym dictionary. The polarity of each candidate emotion word is decided according to the score calculated by the algorithm, and the word is then added to the sentiment lexicon. Experimental results show that, compared with the unexpanded lexicon, the expanded lexicon significantly improves the accuracy and recall of Uighur sentence-level sentiment orientation analysis.
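    A common way to decide polarity with a search engine is PMI-IR-style scoring against seed words, which is close in spirit to the discrimination algorithm described above. The sketch below is only an illustration of that idea: hit_count is a hypothetical wrapper around a search-engine query API, and the conjunction-based query templates of the paper are not shown.

        # PMI-IR style polarity scoring from search-engine hit counts.
        # hit_count is a hypothetical function returning the number of hits
        # for a query string; total approximates the number of indexed pages.
        import math

        def polarity_score(candidate, pos_seeds, neg_seeds, hit_count, total=1e10):
            def pmi(word, seed):
                joint = hit_count(f'"{word}" "{seed}"') + 1          # smooth zero hits
                return math.log(joint * total /
                                ((hit_count(word) + 1) * (hit_count(seed) + 1)))
            return (sum(pmi(candidate, s) for s in pos_seeds)
                    - sum(pmi(candidate, s) for s in neg_seeds))     # > 0 => positive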
  • Ethnic Language Processing and Cross Language Processing
    LIU Jiao, CUI Rongyi, ZHAO Yahui
    2018, 32(3): 55-63.
    The paper analyses cross-lingual document similarity measurement across Chinese, English, and Korean. First, a document vector in one language is mapped into another language using co-occurrence information. Latent Semantic Analysis is then employed to remedy the deficiencies caused by cross-lingual polysemy. Finally, the cosine similarity between two documents is computed in the same space with equivalent semantic information. The method does not rely on a pre-existing external dictionary or knowledge base, but uses a parallel corpus to establish lexical correspondences among Chinese, English, and Korean. The results show that co-occurrence mapping contributes substantially to document similarity measurement, yielding a translation retrieval accuracy of 95%.
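    The LSA-plus-cosine step of such a pipeline can be sketched with scikit-learn as below. The co-occurrence-based mapping into a shared vocabulary is assumed to have been applied already, and the documents are assumed to be whitespace-tokenized; only the dimensionality reduction and the similarity computation are shown.

        # LSA space plus cosine similarity over already-mapped documents.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.metrics.pairwise import cosine_similarity

        def lsa_similarity(mapped_docs_a, mapped_docs_b, n_topics=200):
            """Cosine similarity between two document sets in a shared LSA space."""
            vectorizer = TfidfVectorizer(token_pattern=r"\S+")
            X = vectorizer.fit_transform(mapped_docs_a + mapped_docs_b)
            lsa = TruncatedSVD(n_components=n_topics).fit(X)
            A = lsa.transform(vectorizer.transform(mapped_docs_a))
            B = lsa.transform(vectorizer.transform(mapped_docs_b))
            return cosine_similarity(A, B)        # |A| x |B| similarity matrix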
  • Ethnic Language Processing and Cross Language Processing
    TANG Liang, XI Yaoyi, PENG Bo, LIU Xiangwei, YI Mianzhu
    2018, 32(3): 64-70.
    This paper proposes a novel word-vector-based cross-lingual event retrieval method for Vietnamese and Chinese. First, the semantic feature vectors of the Chinese event keywords are computed from word vectors. Then the corresponding Vietnamese translation vectors are computed. Finally, cross-language keyword alignment is performed via the similarity between the two semantic feature vector spaces, so that an input query can be mapped into the other language and cross-lingual event retrieval is realized. Experiments on a Vietnamese-Chinese bilingual corpus of South China Sea related events show the effectiveness of the method.
  • Ethnic Language Processing and Cross Language Processing
    LONG Congjun, Douge Tsiring, LIU Huidan
    2018, 32(3): 71-76.
    With the development of information technology, the Tibetan language has come to be widely used on the Internet. To address the transliteration of Chinese texts into Tibetan, this paper collects texts from five Tibetan websites and examines how consistently transliterations are rendered. After analyzing the causes of inconsistent transliteration between Chinese and Tibetan, the paper proposes several transliteration principles and suggests that the relevant government agencies actively promote the standardization of transliteration norms.
  • Information Extraction and Text Mining
  • Information Extraction and Text Mining
    LI Yichen, HU Po, WANG Lijun
    2018, 32(3): 77-83.
    Writing sports news is often time-consuming and laborious for journalists. In this paper, we propose a neural network model that automatically generates sports news from sports live-text scripts. The model avoids manual feature extraction, and it can exploit sentence-level and global information within the script as well as the semantic relevance between script sentences and the corresponding news content. Experimental results on an open data set verify the feasibility and effectiveness of the proposed method. In addition, we also attempt to generate sports news titles by using rules and templates to extract the key content of the news.
  • Information Extraction and Text Mining
    WANG Lei, XIE Yun, ZHOU Junsheng, GU Yanhui, QU Weiguang
    2018, 32(3): 84-90,100.
    Chinese Named Entity Recognition (NER) is an important task in Chinese information processing. In this paper, we explore segmental neural architectures for Chinese NER, regarding the task as a joint segmentation and labelling problem. The proposed method learns effective segment-level representations together with contextual information and then assigns tags to the segments. Experimental results on the MSRA corpus show that our method achieves performance comparable to state-of-the-art Chinese NER systems, with an F1 score of 90.44%.
  • Information Extraction and Text Mining
    ZHANG Chengzhi, MA Shutian, KIT Chunyu, YAO Xuchen
    2018, 32(3): 91-100.
    Parallel corpora are one of the most important resources for natural language processing, and a large volume of parallel text can be mined from bilingual parallel web pages. This paper formulates a practical algorithm for recognizing parallel web pages based on the credibility of automatically discovered bilingual URL pairing patterns (or keys), and then extends it in two ways to find more parallel web pages: rescuing weak keys of low local credibility in terms of their global credibility, and unearthing bilingual parallel deep-web pages by applying strong keys of high global credibility. Furthermore, we detect more bilingual web sites according to their credibility, measured by their link relationship with the seed set of web sites in use, and also utilize search engines to recognize bilingual web sites efficiently with only a small set of URL pairing patterns of high credibility. To further enhance recognition accuracy on top of these five methods, we calculate the cross-lingual similarity of candidate parallel web pages and filter out weak pairs with a threshold. The effectiveness of our approach is confirmed by a series of experiments.
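    The notion of local credibility of a URL pairing pattern can be illustrated as the fraction of matching URLs on a site whose transformed counterparts actually exist there. The key format and the transform function below are assumptions for illustration, not the paper's exact definitions.

        # Local credibility of a URL pairing key on one web site.
        def key_credibility(site_urls, key, transform):
            """key: e.g. ('/en/', '/zh/'); transform maps a URL to its counterpart."""
            url_set = set(site_urls)
            matched = [u for u in site_urls if key[0] in u]
            if not matched:
                return 0.0
            paired = [u for u in matched if transform(u, key) in url_set]
            return len(paired) / len(matched)

        # usage: key_credibility(urls, ('/en/', '/zh/'),
        #                        lambda u, k: u.replace(k[0], k[1]))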
  • Information Extraction and Text Mining
    LI Siliang, XU Bin, YANG Yuji
    2018, 32(3): 101-109.
    Term extraction is an essential task in which terms are automatically extracted from unstructured text of a specific domain. Previous methods rely largely on term statistics. However, terms in the K-12 education domain exhibit a serious long-tail effect, which makes it hard for statistics-based methods to extract terms in the tail. In this paper, we propose DRTE, a method that focuses on extracting terms from definitions and term relations, complemented by term-formation rules and boundary detection strategies. Experiments on mathematics textbooks for middle and high school show that our method achieves an F1 score of 82.7%, significantly outperforming the current method by 40.8%.
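    One ingredient of a definition-based approach like the one described above is matching definition cue patterns to propose term candidates. The patterns below are illustrative Chinese definition cues, not the rule set used by DRTE.

        # Term candidates from definition cue patterns (illustrative cues only).
        import re

        DEFINITION_PATTERNS = [
            re.compile(r"(?:叫做|称为)(?P<term>[^，。；]{1,12})(?=[，。；]|$)"),  # "... is called <term>"
            re.compile(r"(?P<term>[^，。；]{1,12})是指"),                        # "<term> refers to ..."
        ]

        def extract_definition_terms(sentence):
            """Return term candidates matched by the definition cue patterns."""
            terms = []
            for pattern in DEFINITION_PATTERNS:
                terms += [m.group("term") for m in pattern.finditer(sentence)]
            return terms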
  • Information Retrieval and Question Answering
  • Information Retrieval and Question Answering
    LIU Dexi, FU Qi, WEI Yaxiong, WAN Changxuan, LIU Xiping, ZHONG Minjuan, QIU Jiahong
    2018, 32(3): 110-119.
    Social short texts from Twitter, Sina Microblog, etc. are limited in length but cover diverse topics, carry complex social relationships, and are strongly correlated with Web pages. Traditional information retrieval methods are therefore not well suited to social short texts. This paper proposes SSTR, a social short-text retrieval method based on a multiple-enhanced graph. The multiple-enhanced graph algorithm is based on Markov chain theory and considers three types of relationships among short texts, authors, and tokens. In SSTR, an LDA topic model is employed when computing the similarity between short texts, which overcomes the disadvantages of TF-IDF features. Experimental results show that, compared with cosine-similarity-based and LDA-based re-ranking, SSTR produces better re-ranking results.
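    The LDA-based similarity between short texts can be sketched with scikit-learn as below. Chinese word segmentation is assumed to have been applied already (texts are whitespace-separated segments), and the parameters are illustrative.

        # Topic-distribution similarity between pre-segmented short texts.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.metrics.pairwise import cosine_similarity

        def lda_topic_similarity(segmented_texts, n_topics=50):
            """Pairwise similarity matrix over LDA topic distributions."""
            counts = CountVectorizer(token_pattern=r"\S+").fit_transform(segmented_texts)
            doc_topics = LatentDirichletAllocation(n_components=n_topics,
                                                   random_state=0).fit_transform(counts)
            return cosine_similarity(doc_topics)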
  • Information Retrieval and Question Answering
    GU Rong, WANG Shanyong, GUO Chen, YUAN Chunfeng, HUANG Yihua
    2018, 32(3): 120-134.
    With the rapid growth of semantic data in recent years, backward chaining reasoning has become a new research direction, since it is insensitive to data changes. Owing to its complex reasoning procedure and the large search space of rule expansion, backward chaining reasoning still suffers in performance and scalability. Building on previous work on backward chaining, this paper makes a thorough analysis of the characteristics of semantic rules and proposes the design of an efficient, scalable, parallelized large-scale backward chaining reasoning engine on Apache Spark, a state-of-the-art big data processing platform.
    The main contributions of this paper can be summarized as follows: (1) duplicate reasoning over terminological patterns at query time is avoided by pre-computing the terminological closure; (2) optimization methods for the backward reasoning and querying procedures are designed for better performance; (3) a Spark-based implementation of the proposed algorithm is presented. Experimental results on both synthetic and real-world datasets show that our method needs only seconds to tens of seconds to reason over hundreds of millions of triples, while maintaining good data scalability and node scalability.
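    The goal-driven rule expansion that such an engine parallelizes can be illustrated with a toy backward-chaining query for a single RDFS rule (type propagation through rdfs:subClassOf). This single-machine sketch is a simplification and is not the paper's Spark-based implementation.

        # Backward chaining for one RDFS rule: x is an instance of cls if it is
        # typed with cls or with any (transitive) subclass of cls.
        def instances_of(cls, type_facts, subclass_facts, seen=None):
            """type_facts: (entity, class) pairs; subclass_facts: (sub, super) pairs."""
            seen = set() if seen is None else seen
            if cls in seen:                        # guard against subclass cycles
                return set()
            seen.add(cls)
            result = {x for (x, c) in type_facts if c == cls}
            for (sub, sup) in subclass_facts:      # expand the goal to subclasses
                if sup == cls:
                    result |= instances_of(sub, type_facts, subclass_facts, seen)
            return result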
  • NLP Application
  • NLP Application
    QIAO Pei, WANG Suge, CHEN Xin, TAN Hongye, CHEN Qian, WANG Yuanlong
    2018, 32(3): 135-142.
    In the reading comprehension questions of the Chinese college entrance examination, there is a substantial semantic gap between the question words and the article words, arising from the complexity and diversity of the questions, the abstract meanings of words, and the rich, implicit semantics of the articles. To address this issue, this paper investigates word association. Specifically, all words in the corpus are clustered into topics through LDA, the topic words are filtered by part of speech and frequency, and the topics are then augmented with related words selected by word-embedding similarity. Experiments on a prose reading comprehension dataset of the college entrance examination show that our method performs better than traditional word-expansion methods.
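    The embedding-based augmentation step can be sketched as below: each LDA topic's word list is expanded with its nearest neighbours in a pretrained word-vector space. The vector file, similarity threshold, and filtering details are illustrative assumptions.

        # Expanding an LDA topic's word list with embedding neighbours.
        from gensim.models import KeyedVectors

        def expand_topic(topic_words, vectors_path="vectors.kv", topn=10, min_sim=0.6):
            """Add near neighbours (by cosine similarity) of each topic word."""
            kv = KeyedVectors.load(vectors_path)   # hypothetical pretrained vectors
            expanded = set(topic_words)
            for word in topic_words:
                if word in kv:
                    expanded.update(w for w, sim in kv.most_similar(word, topn=topn)
                                    if sim >= min_sim)
            return expanded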