2018 Volume 32 Issue 3 Published: 15 March 2018
  

  • Language Recognition and Its Computing
  • Language Recognition and Its Computing
    YAO Dengfeng, JIANG Minghu, CHANG Jung-hsing, Abudoukelimu Abulizi
    2018, 32(3): 1-8.
    Classifier predicates are a unique linguistic phenomenon in sign language. Research on classifier predicates by Chinese scholars has only just begun, and no systematic report has yet been produced. This paper attempts to explain the phenomenon of classifier predicates from a linguistic perspective. It accounts for classifier predicates in Chinese Sign Language and, through a combined analysis of Talmy's dynamic events and proforms, analyzes how figure proforms and ground proforms are formed to meet the simultaneity and sequentiality requirements of sign language. It shows that figure proforms and ground proforms are usually produced by the non-moving handshape. It also illustrates the interaction between sign language and Chinese, giving a detailed description and classification of the two sign proforms "motion" and "be located".
  • Language Recognition and Its Computing
    CHEN Xiuqing
    2018, 32(3): 9-16.
    This study discusses the formation mechanism of and constraints on redundant negation by combining quantitative and qualitative analysis. The formation of redundant negation is the syntactic realization of semantically implicit negation, built on implicit negative words, which also impose essential formal restrictions on its formation. The negative meaning of implicit negative words varies with its source, leading to different abilities to form redundant negation: words whose negative meaning comes from entailment have the strongest ability to form redundant negation, followed by those from presupposition and those from conversational implicature, while those from assertion have the weakest ability. The statistics reveal this ordering as a clear, steep curve.
  • Morphology, Syntax, Semantics
  • Morphology, Syntax, Semantics
    ZHANG Jing, HUANG Kaiyu, LIANG Chen, HUANG Degen
    2018, 32(3): 17-25,33.
    Aiming to extract new words from Chinese social media data, a novel unsupervised method that combines traditional statistical measures with word embeddings is proposed. Traditional statistical measures are first applied to extract a new-word candidate list from the segmented social media corpus; word embeddings are then trained with multiple strategies to filter noise from the candidate list by constructing an anti-word set, i.e. segments that are unlikely to form a new word in combination with other segments. In addition, to analyze the traditional statistical measures and tune thresholds, we annotated 10,000 tweets as a development corpus, which the experimental results show to be reliable. To assess the proposed method, the training corpus released by the NLPCC2015 evaluation on microblog-oriented Chinese word segmentation is used as the test corpus. The results show that our method significantly improves new word extraction over the baseline systems, reaching an F1-measure of 56.2% for bigram and trigram new words on the test corpus.
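    The statistics-then-filter idea in the abstract above can be illustrated with a small sketch: pointwise mutual information over adjacent segments proposes candidates, and an anti-word set (which the paper builds with word embeddings) prunes them. The corpus format, thresholds, and anti-word construction below are illustrative assumptions, not the authors' implementation.

        # Candidate generation by PMI over adjacent segments, followed by
        # anti-word filtering. Thresholds and data layout are assumptions.
        import math
        from collections import Counter

        def pmi_candidates(segmented_sentences, min_count=5, pmi_threshold=3.0):
            """Return adjacent-segment pairs whose PMI exceeds a threshold."""
            unigrams, bigrams = Counter(), Counter()
            for sent in segmented_sentences:       # each sentence is a list of segments
                unigrams.update(sent)
                bigrams.update(zip(sent, sent[1:]))
            n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
            out = []
            for (a, b), c in bigrams.items():
                if c < min_count:
                    continue
                pmi = math.log((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
                if pmi > pmi_threshold:
                    out.append((a, b, pmi))
            return out

        def filter_candidates(candidates, anti_words):
            """Drop candidates that contain a segment from the anti-word set."""
            return [(a + b, pmi) for a, b, pmi in candidates
                    if a not in anti_words and b not in anti_words]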
  • Knowledge Representation and Acquisition
  • Knowledge Representation and Acquisition
    DUAN Pengfei, WANG Yuan, XIONG Shengwu, MAO Jingjing
    2018, 32(3): 26-33.
    Humanoid intelligence has developed rapidly, and it benefits from complete knowledge graphs, especially elementary-education knowledge graphs such as the geography knowledge graph. Traditional knowledge graphs are represented as networks, which leads to high computational complexity. This paper puts forward a new algorithm named PTransW (Path-based TransE Considering Relation Type by Weight). It combines space projection with the semantic information of relation paths, and further exploits the semantic information of relation types. Experimental results on the FB15K and GEOGRAPHY data sets show that the PTransW model significantly improves the handling of complex relations in knowledge graphs.
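    PTransW builds on TransE-style translation embeddings. For orientation only, the sketch below shows the standard TransE scoring and margin loss that such models start from; the path information and relation-type weighting that PTransW adds are not reproduced here.

        # Base TransE idea: a relation is a translation in vector space, h + r ≈ t.
        import numpy as np

        def transe_score(h, r, t, norm=1):
            """Lower is better: distance between (head + relation) and tail."""
            return np.linalg.norm(h + r - t, ord=norm)

        def margin_loss(pos_triple, neg_triple, margin=1.0):
            """Margin-based ranking loss over a positive and a corrupted triple."""
            return max(0.0, margin + transe_score(*pos_triple) - transe_score(*neg_triple))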
  • Knowledge Representation and Acquisition
    JIANG Tianwen, QIN Bing, LIU Ting
    2018, 32(3): 34-41.
    Knowledge bases are usually represented as networks, with nodes for entities and edges for relations. To use such knowledge bases, people have to design algorithms of high complexity, which are not well suited to knowledge reasoning, and especially not to expanding knowledge bases. This paper uses a TransE-based model to learn representations for knowledge reasoning, covering both relation indication reasoning and tail entity reasoning. The experiments on relation indication reasoning obtain excellent results without designing complex algorithms: only simple operations in vector space are required. In addition, we improve the original loss function of the TransE model to make it better suited to representation learning for open-domain Chinese knowledge bases.
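    With translation-style embeddings, tail entity reasoning reduces to a nearest-neighbour search around h + r, which is the kind of simple vector-space operation the abstract refers to. The sketch below assumes entity and relation vectors from a trained TransE-style model; it is not the paper's code.

        # Tail-entity reasoning as a nearest-neighbour query around h + r.
        import numpy as np

        def predict_tail(h, r, entity_matrix, top_k=5):
            """Rank all candidate tail entities by distance to h + r."""
            distances = np.linalg.norm(entity_matrix - (h + r), axis=1)
            return np.argsort(distances)[:top_k]   # indices of the top-k tails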
  • Machine Translation
  • Machine Translation
    ZHANG Xueqiang, CAI Dongfeng, YE Na, WU Chuang
    2018, 32(3): 42-48,63.
    Neural Machine Translation (NMT) performs poorly on long sentences with complex structure, owing to its neglect of linguistic knowledge about sentence structure. Adopting a divide-and-conquer strategy, this paper proposes to identify and extract the Maximal Noun Phrases (MNPs) in a sentence, retaining special marks or head words in their place so that the remaining components form a sentence frame. The MNPs and the sentence frames are then translated by NMT separately. Experimental results show that the proposed method yields an improvement of 0.89 BLEU points over the baseline system.
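    The divide-and-conquer flow described above can be sketched schematically as follows. The MNP extractor, the translator, the placeholder format, and the re-insertion step are hypothetical stand-ins for illustration, not the paper's actual components.

        # Schematic divide-and-conquer translation; extract_mnps and translate
        # are hypothetical callables supplied by the caller.
        def translate_with_mnp_frames(sentence, extract_mnps, translate):
            """Translate maximal noun phrases and the sentence frame separately."""
            mnps = extract_mnps(sentence)                  # list of MNP strings
            frame = sentence
            for i, span in enumerate(mnps):
                frame = frame.replace(span, f"<MNP{i}>")   # or the phrase's head word
            frame_translation = translate(frame)
            for i, span in enumerate(mnps):                # re-insert translated phrases
                frame_translation = frame_translation.replace(f"<MNP{i}>", translate(span))
            return frame_translation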
  • Ethnic Language Processing and Cross Language Processing
  • Ethnic Language Processing and Cross Language Processing
    LIU Ruolan, NIAN Mei, Maierhaba Aisaiti
    2018, 32(3): 49-54.
    Emotion words are a fundamental resource for accurately analyzing opinions in the Uighur language. We investigate the automatic expansion of web emotion words on the basis of an existing Uighur sentiment lexicon. First, we summarize the collocation rules of conjunctions, degree adverbs and sentiment words by analyzing the linguistic features of Uighur emotional expression. Based on these rules, we design an algorithm to obtain candidate emotion words from an emotional corpus, forming a candidate sentiment lexicon. Finally, treating the Internet as a very large corpus, we design a search-engine-based polarity discrimination algorithm that exploits the characteristics of Uighur conjunctions together with the established sentiment lexicon and a Uighur antonym dictionary. The polarity of each candidate emotion word is decided according to the score calculated by the algorithm, and the word is then added to the sentiment lexicon. Experimental results show that, compared with the unexpanded lexicon, the expanded lexicon significantly improves the accuracy and recall of Uighur sentence-level sentiment orientation analysis.
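    A common way to decide polarity with a search engine is PMI-IR-style scoring against seed words, which is close in spirit to the discrimination algorithm described above. The sketch below is only an illustration of that idea: hit_count is a hypothetical wrapper around a search-engine query API, and the conjunction-based query templates of the paper are not shown.

        # PMI-IR style polarity scoring from search-engine hit counts.
        # hit_count is a hypothetical function returning the number of hits
        # for a query string; total approximates the number of indexed pages.
        import math

        def polarity_score(candidate, pos_seeds, neg_seeds, hit_count, total=1e10):
            def pmi(word, seed):
                joint = hit_count(f'"{word}" "{seed}"') + 1          # smooth zero hits
                return math.log(joint * total /
                                ((hit_count(word) + 1) * (hit_count(seed) + 1)))
            return (sum(pmi(candidate, s) for s in pos_seeds)
                    - sum(pmi(candidate, s) for s in neg_seeds))     # > 0 => positive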
  • Ethnic Language Processing and Cross Language Processing
    LIU Jiao, CUI Rongyi, ZHAO Yahui
    2018, 32(3): 55-63.
    The paper analyses cross-lingual document similarity measurement across Chinese, English, and Korean. First, a document vector in one language is mapped into another language using co-occurrence information. Latent Semantic Analysis is then employed to remedy the deficiencies caused by cross-lingual polysemy. Finally, the cosine similarity between two documents is computed in the same space with equivalent semantic information. The method does not rely on a pre-existing external dictionary or knowledge base, but uses a parallel corpus to establish lexical correspondences among Chinese, English, and Korean. The results show that co-occurrence mapping contributes substantially to document similarity measurement, yielding a translation retrieval accuracy of 95%.
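    The LSA-plus-cosine step of such a pipeline can be sketched with scikit-learn as below. The co-occurrence-based mapping into a shared vocabulary is assumed to have been applied already, and the documents are assumed to be whitespace-tokenized; only the dimensionality reduction and the similarity computation are shown.

        # LSA space plus cosine similarity over already-mapped documents.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import TruncatedSVD
        from sklearn.metrics.pairwise import cosine_similarity

        def lsa_similarity(mapped_docs_a, mapped_docs_b, n_topics=200):
            """Cosine similarity between two document sets in a shared LSA space."""
            vectorizer = TfidfVectorizer(token_pattern=r"\S+")
            X = vectorizer.fit_transform(mapped_docs_a + mapped_docs_b)
            lsa = TruncatedSVD(n_components=n_topics).fit(X)
            A = lsa.transform(vectorizer.transform(mapped_docs_a))
            B = lsa.transform(vectorizer.transform(mapped_docs_b))
            return cosine_similarity(A, B)        # |A| x |B| similarity matrix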
  • Ethnic Language Processing and Cross Language Processing
    TANG Liang, XI Yaoyi, PENG Bo, LIU Xiangwei, YI Mianzhu
    2018, 32(3): 64-70.
    This paper proposes a novel word-vector-based cross-lingual event retrieval method for Vietnamese and Chinese. First, the semantic feature vectors of the Chinese event keywords are computed from word vectors. Then the corresponding Vietnamese translation vectors are computed. Finally, cross-language keyword alignment is performed via the similarity between the two semantic feature vector spaces, so that an input query can be mapped into the other language and cross-lingual event retrieval is realized. Experiments on a Vietnamese-Chinese bilingual corpus of South China Sea related events show the effectiveness of the method.
  • Ethnic Language Processing and Cross Language Processing
    LONG Congjun, Douge Tsiring, LIU Huidan
    2018, 32(3): 71-76.
    With the development of information technology, the Tibetan language has come to be widely used on the Internet. To address the transliteration of Chinese texts into Tibetan, this paper collects texts from five Tibetan websites and examines how consistently transliterations are rendered. After analyzing the causes of inconsistent transliteration between Chinese and Tibetan, the paper proposes several transliteration principles and suggests that the relevant government agencies actively promote the standardization of transliteration norms.
  • Information Extraction and Text Mining
  • Information Extraction and Text Mining
    LI Yichen, HU Po, WANG Lijun
    2018, 32(3): 77-83.
    Writing sports news is often time-consuming and laborious for journalists. In this paper, we propose a neural network model that automatically generates sports news from sports live-text scripts. The model avoids manual feature extraction, and it can exploit sentence-level and global information within the script as well as the semantic relevance between script sentences and the corresponding news content. Experimental results on an open data set verify the feasibility and effectiveness of the proposed method. In addition, we also attempt to generate sports news titles by using rules and templates to extract the key content of the news.
  • Information Extraction and Text Mining
    WANG Lei, XIE Yun, ZHOU Junsheng, GU Yanhui, QU Weiguang
    2018, 32(3): 84-90,100.
    Chinese Named Entity Recognition (NER) is an important task in Chinese information processing. In this paper, we explore segmental neural architectures for Chinese NER, regarding the task as a joint segmentation and labelling problem. The proposed method learns effective segment-level representations together with contextual information and then assigns tags to the segments. Experimental results on the MSRA corpus show that our method achieves performance comparable to state-of-the-art Chinese NER systems, with an F1 score of 90.44%.
  • Information Extraction and Text Mining
    ZHANG Chengzhi, MA Shutian, KIT Chunyu, YAO Xuchen
    2018, 32(3): 91-100.
    Parallel corpora are one of the most important resources for natural language processing, and a large volume of parallel text can be mined from bilingual parallel web pages. This paper formulates a practical algorithm for recognizing parallel web pages based on the credibility of automatically discovered bilingual URL pairing patterns (or keys), and then extends it in two ways to find more parallel web pages: rescuing weak keys of low local credibility in terms of their global credibility, and unearthing bilingual parallel deep-web pages by applying strong keys of high global credibility. Furthermore, we detect more bilingual web sites according to their credibility, measured by their link relationship with the seed set of web sites in use, and also utilize search engines to recognize bilingual web sites efficiently with only a small set of URL pairing patterns of high credibility. To further enhance recognition accuracy on top of these five methods, we calculate the cross-lingual similarity of candidate parallel web pages and filter out weak pairs with a threshold. The effectiveness of our approach is confirmed by a series of experiments.
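    The notion of local credibility of a URL pairing pattern can be illustrated as the fraction of matching URLs on a site whose transformed counterparts actually exist there. The key format and the transform function below are assumptions for illustration, not the paper's exact definitions.

        # Local credibility of a URL pairing key on one web site.
        def key_credibility(site_urls, key, transform):
            """key: e.g. ('/en/', '/zh/'); transform maps a URL to its counterpart."""
            url_set = set(site_urls)
            matched = [u for u in site_urls if key[0] in u]
            if not matched:
                return 0.0
            paired = [u for u in matched if transform(u, key) in url_set]
            return len(paired) / len(matched)

        # usage: key_credibility(urls, ('/en/', '/zh/'),
        #                        lambda u, k: u.replace(k[0], k[1]))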
  • Information Extraction and Text Mining
    LI Siliang, XU Bin, YANG Yuji
    2018, 32(3): 101-109.
    Term extraction is an essential task in which terms are automatically extracted from unstructured text of a specific domain. Previous methods rely largely on term statistics. However, terms in the K-12 education domain exhibit a serious long-tail effect, which makes it hard for statistics-based methods to extract terms in the tail. In this paper, we propose DRTE, a method that focuses on extracting terms from definitions and term relations, complemented by term-formation rules and boundary detection strategies. Experiments on mathematics textbooks for middle and high school show that our method achieves an F1 score of 82.7%, significantly outperforming the current method by 40.8%.
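    One ingredient of a definition-based approach like the one described above is matching definition cue patterns to propose term candidates. The patterns below are illustrative Chinese definition cues, not the rule set used by DRTE.

        # Term candidates from definition cue patterns (illustrative cues only).
        import re

        DEFINITION_PATTERNS = [
            re.compile(r"(?:叫做|称为)(?P<term>[^，。；]{1,12})(?=[，。；]|$)"),  # "... is called <term>"
            re.compile(r"(?P<term>[^，。；]{1,12})是指"),                        # "<term> refers to ..."
        ]

        def extract_definition_terms(sentence):
            """Return term candidates matched by the definition cue patterns."""
            terms = []
            for pattern in DEFINITION_PATTERNS:
                terms += [m.group("term") for m in pattern.finditer(sentence)]
            return terms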
  • Information Retrieval and Question Answering
  • Information Retrieval and Question Answering
    LIU Dexi, FU Qi, WEI Yaxiong, WAN Changxuan, LIU Xiping, ZHONG Minjuan, QIU Jiahong
    2018, 32(3): 110-119.
    Social short texts from Twitter, Sina Microblog, etc. are limited in length but cover diverse topics, carry complex social relationships, and are strongly correlated with Web pages. Traditional information retrieval methods are therefore not well suited to social short texts. This paper proposes SSTR, a social short-text retrieval method based on a multiple-enhanced graph. The multiple-enhanced graph algorithm is based on Markov chain theory and considers three types of relationships among short texts, authors, and tokens. In SSTR, an LDA topic model is employed when computing the similarity between short texts, which overcomes the disadvantages of TF-IDF features. Experimental results show that, compared with cosine-similarity-based and LDA-based re-ranking, SSTR produces better re-ranking results.
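    The LDA-based similarity between short texts can be sketched with scikit-learn as below. Chinese word segmentation is assumed to have been applied already (texts are whitespace-separated segments), and the parameters are illustrative.

        # Topic-distribution similarity between pre-segmented short texts.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.metrics.pairwise import cosine_similarity

        def lda_topic_similarity(segmented_texts, n_topics=50):
            """Pairwise similarity matrix over LDA topic distributions."""
            counts = CountVectorizer(token_pattern=r"\S+").fit_transform(segmented_texts)
            doc_topics = LatentDirichletAllocation(n_components=n_topics,
                                                   random_state=0).fit_transform(counts)
            return cosine_similarity(doc_topics)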
  • Information Retrieval and Question Answering
    GU Rong, WANG Shanyong, GUO Chen, YUAN Chunfeng, HUANG Yihua
    2018, 32(3): 120-134.
    With the rapid growth of semantic data in recent years, backward chaining reasoning has become a new research direction, since it is insensitive to data changes. Owing to its complex reasoning procedure and the large search space of rule expansion, backward chaining reasoning still suffers in performance and scalability. Building on previous work on backward chaining, this paper makes a thorough analysis of the characteristics of semantic rules and proposes the design of an efficient, scalable, parallelized large-scale backward chaining reasoning engine on Apache Spark, a state-of-the-art big data processing platform.
    The main contributions of this paper can be summarized as follows: (1) duplicate reasoning over terminological patterns at query time is avoided by pre-computing the terminological closure; (2) optimization methods for the backward reasoning and querying procedures are designed for better performance; (3) a Spark-based implementation of the proposed algorithm is presented. Experimental results on both synthetic and real-world datasets show that our method needs only seconds to tens of seconds to reason over hundreds of millions of triples, while maintaining good data scalability and node scalability.
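    The goal-driven rule expansion that such an engine parallelizes can be illustrated with a toy backward-chaining query for a single RDFS rule (type propagation through rdfs:subClassOf). This single-machine sketch is a simplification and is not the paper's Spark-based implementation.

        # Backward chaining for one RDFS rule: x is an instance of cls if it is
        # typed with cls or with any (transitive) subclass of cls.
        def instances_of(cls, type_facts, subclass_facts, seen=None):
            """type_facts: (entity, class) pairs; subclass_facts: (sub, super) pairs."""
            seen = set() if seen is None else seen
            if cls in seen:                        # guard against subclass cycles
                return set()
            seen.add(cls)
            result = {x for (x, c) in type_facts if c == cls}
            for (sub, sup) in subclass_facts:      # expand the goal to subclasses
                if sup == cls:
                    result |= instances_of(sub, type_facts, subclass_facts, seen)
            return result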
  • NLP Application
  • NLP Application
    QIAO Pei, WANG Suge, CHEN Xin, TAN Hongye, CHEN Qian, WANG Yuanlong
    2018, 32(3): 135-142.
    In the reading comprehension questions of the Chinese college entrance examination, there is a substantial semantic gap between the question words and the article words, arising from the complexity and diversity of the questions, the abstract meanings of words, and the rich, implicit semantics of the articles. To address this issue, this paper investigates word association. Specifically, all words in the corpus are clustered into topics through LDA, the topic words are filtered by part of speech and frequency, and the topics are then augmented with related words selected by word-embedding similarity. Experiments on a prose reading comprehension dataset of the college entrance examination show that our method performs better than traditional word-expansion methods.
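    The embedding-based augmentation step can be sketched as below: each LDA topic's word list is expanded with its nearest neighbours in a pretrained word-vector space. The vector file, similarity threshold, and filtering details are illustrative assumptions.

        # Expanding an LDA topic's word list with embedding neighbours.
        from gensim.models import KeyedVectors

        def expand_topic(topic_words, vectors_path="vectors.kv", topn=10, min_sim=0.6):
            """Add near neighbours (by cosine similarity) of each topic word."""
            kv = KeyedVectors.load(vectors_path)   # hypothetical pretrained vectors
            expanded = set(topic_words)
            for word in topic_words:
                if word in kv:
                    expanded.update(w for w, sim in kv.most_similar(word, topn=topn)
                                    if sim >= min_sim)
            return expanded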