2013 Volume 27 Issue 2 Published: 15 April 2013
  

    Review
  • Review
    QI Zhenyu, LIU Kang, ZHAO Jun
    2013, 27(2): 1-10.
    Entity set expansion is an important problem in open information extraction. It refers to expanding a few given seed entities of a concept into a more complete set. Most approaches rely on context or distributional information and suffer from seed ambiguity, which leads to poor results. In this paper we present a novel method that introduces semantic knowledge by leveraging the Wikipedia knowledge base, and we combine it with the traditional template-based method. Experimental results show that the proposed method improves precision by 18.5%, recall by 6.8% and MAP by 22.8%.
    Key words: Entity Set Expansion; knowledge base; semantic knowledge
  • Review
    ZHANG Yonglei, WANG Hongling, ZHOU Guodong
    2013, 27(2): 10-17.
    With the rapid growth of information in recent years, sentence compression, a subtask of summarization, has attracted increasing attention. However, research on sentence compression is still at an initial stage: performance is far from satisfactory, and uniform evaluation metrics are unavailable. This paper works within the framework of shortening a sentence by deleting words or constituents, and adopts a structured learning approach coupled with large-margin training. Furthermore, it proposes two new automatic evaluation metrics (N-Gram and BLEU) for sentence compression. Experimental results show that structured learning maintains a good compression ratio while preserving the main information of the source sentence, and that the two proposed evaluation metrics effectively reflect the quality of sentence compression.
    Key words: sentence compression; structured learning; automatic evaluation
  • Review
    ZHOU Xiaopei, HONG Yu, CHE Tingting, YAO Jianmin, ZHU Qiaoming
    2013, 27(2): 17-26.
    In this paper, we propose an unsupervised approach, based on information retrieval, to inferring implicit discourse relations (i.e., relations such as contingency or comparison that are not marked with a connective). Using the Google search engine, we extract candidate explicit relations that are similar to the implicit relation at the syntactic and semantic levels. These explicit relations, which can be identified with high accuracy, are used to infer the implicit relation. The proposed approach contains three modules: first, we construct high-quality queries and extract candidate explicit relations; then, three inference models (Similarity, Confidence, Relevance) are used to evaluate the quality of the queries and candidate relations; finally, based on learning to rank the candidate relations, we acquire the distribution statistics of discourse senses to predict the implicit discourse relation. Experimental results on PDTB 2.0 show an accuracy of 54.3%, a significant improvement of 14.3% over the supervised system.
    Key words: implicit discourse relation; unsupervised; information retrieval; PDTB 2.0
  • Review
    CAO Xinyu, CAO Cungen, WU Yuming
    2013, 27(2): 26-34.
    With the rapid development of the Internet, the Web has become an important resource for knowledge acquisition. The acquisition of part-whole relations is an important subtask of knowledge acquisition. We propose a method for acquiring part-whole relations from the Web using a search engine. First, to acquire a corpus rich in part-whole relations from the Web, we construct a type of query intended for part-whole relations. Second, we extract part-whole relations by filtering the corpus according to the HTML tags and the query formats. Finally, we define a measure for verifying the part-whole relations according to the characteristics of part-whole relation expressions and the patterns of Chinese word formation. The experimental results show that our method achieves an accuracy of 86% in the top twenty results and a best F-measure of 64%.
    Key words: part-whole relation; knowledge acquisition; relation acquisition
  • Review
    WANG Zhiqiang, LI Ru, YIN Zhizhou, LIU Haijing, LI Shuanghong
    2013, 27(2): 34-41.
    Semantic role labeling is a kind of shallow semantic analysis. Currently, Chinese frame semantic role labeling is generally viewed as a sequence labeling task with the word as the basic tagging unit. Existing work is limited in that it considers only word and POS information. This paper studies the impact of dependency features on semantic role labeling under the T-CRF model, integrating the dependency features among words in the dependency syntax with the word and POS information. The experiment with 24 feature templates in 8 categories shows that the F-measure of the best templates is improved by 3%. In particular, the results on long frame semantic roles are improved more significantly.
    Key words: frame semantic roles; dependency features; T-CRF model
  • Review
    NetAlifu·KUERBAN, Wumaierjiang·KUERBAN, FANG Dingyi
    2013, 27(2): 41-47.
    This article establishes the semantic roles and types of UFN (Uyghur FrameNet) at the word level in order to build the Uyghur FrameNet. It further studies the tagset, phrase types and syntactic functions of UFN's semantic roles. This work paves the way for the dependency relations of semantic frame elements, the decomposition and identification of semantic roles, the construction of the semantic role database, and the automatic labeling of UFN in Arabic script.
    Key words: Uyghur language; frame semantics; semantic roles; phrase types; syntactic function; tagset
  • Review
    SHU Yan, LV Xueqiang
    2013, 27(2): 47-52.
    Corpus annotation is fundamental to corpus construction. Based on Sogou logs, this paper develops an annotation specification tailored to the characteristics of the corpus, in order to build a phrase dictionary for search engines. In practice, the annotation process is carried out as a task of filling in node attributes in an XML file. With the proposed guideline, 145,645 query strings have been annotated with high quality.
    Key words: corpus annotation; Sogou logs; phrases dictionary; annotation specification
  • Review
    ZHENG Jianghua, WANG Guansheng, Wahap·HALIK, Adili·ROUZ
    2013, 27(2): 52-58.
    Because little research and few applications have been completed in Uyghur geographical information services, Uyghur users cannot enjoy geographic information services in the Uyghur language. This paper takes the graph and text information service of the Xinjiang Dynamic Weather System as an example and puts forward a WebGIS service solution for Uyghur websites. First, it builds a county-level Xinjiang weather information service in Uyghur by integrating the free Google Map API service with real-time weather information from the free Yahoo Weather RSS feed. Second, it packages the module as a reference object that can be embedded into any public website. This work contributes to the application of WebGIS in Uyghur, providing a convenient graph and text dynamic weather information service for people in Xinjiang.
    Key words: Uighur; Google Maps API; Yahoo Weather; custom font; WebGIS; embedded web service
  • Review
    AN-jiancairang
    2013, 27(2): 58-65.
    This paper proposes an algorithm for the automatic correction of Tibetan words that combines rules with grammatical parsing of Tibetan words. The algorithm does not need any dictionary or large corpus. By studying the rules of Tibetan word formation, we obtain the structural characteristics of Tibetan words and then process the words segment by segment. In this way, the complexity of Tibetan word formation is reduced, and we summarize the rules that govern the segments in word formation. The experiments indicate that the correction rate of our system can reach 100%.
    Key words: Tibetan word; segments; rules; automatic correction
  • Review
    HAN Pu, WANG Dongbo, LIU Yanyun, SU Xinning
    2013, 27(2): 65-74.
    Different parts of speech play different roles in document clustering. Using 4 popular English and Chinese datasets, the paper chooses three clustering algorithms to investigate the influence of 4 major parts of speech, as well as their combinations, on Chinese and English document clustering. The experimental results reveal that nouns are the most important in representing the content of a document. Verbs, adjectives and adverbs also contribute to document clustering. Although similar result is obtained from the experiments, nouns. Using only nouns to characterize a document cannot produce the best clustering result, but it can reduce the document dimensions to a great extent. The combination of the 4 parts of speech produces the best clustering result. Individual parts of speech vary considerably in their clustering performance on Chinese and English documents, and the differences are more consistent in Chinese document clustering.
    Key words: part of speech tagging; document clustering; text feature
  • Review
    WANG Daoping, HUANG Wenli
    2013, 27(2): 74-79.
    Chinese Character Component Standard of GB 13000.1 Character Set for Information Processing and Specification of Common Modern Chinese Character Components and Components Names are two important documents for Chinese character component standardization. In practice, they have two shortcomings: the large number of coding components is hard to remember, and they lack convenient and reliable disassembly rules. To deal with these issues, we need to re-examine the component disassembly rules and combine them closely with the components, so as to improve the standards for the joint benefit of Chinese character input, teaching and indexing.
    Key words: Chinese character component; disassembly rule; Chinese character input; Chinese character teaching; Chinese character indexing; standard
  • Review
    Feng Gefei, Gu Shaotong, Yang Yiming
    2013, 27(2): 79-86.
    Feature extraction is an indispensable step in the computer-aided restoration, identification and periodization of oracle bone inscriptions. First, the original oracle bone inscriptions are preprocessed to separate the shapes accurately from the background noise. Then, based on mathematical morphology, 12 kinds of shape features are extracted from the inscriptions. A shape feature extraction system for oracle bone inscriptions using mathematical morphology is implemented. We collected data from “Collection of Research on Bone Shell Inscriptions” for the experiments. The results show that the mathematical morphology method can extract shape features effectively.
    Key words: Mathematical Morphology; oracle bone inscriptions; shape features; feature extraction
  • Review
    YU Hui, XIE Jun, XIONG Hao, LV Yajuan, LIU Qun, LIN Shouxun
    2013, 27(2): 86-91.
    Current statistical machine translation (SMT) models exploit only the context information within a sentence and neglect the context of the document, which is more useful for finding the correct translation. In this paper, we propose a new method that uses the context of the whole document to improve the quality of SMT. We compute the similarities between the documents of the training corpus and the documents of the test set using the Vector Space Model. The similarity is then treated as a new feature and integrated into a phrase-based model. Large-scale experiments show that our approach improves BLEU by more than one point on NIST-08 and CWMT-08 in the English-to-Chinese translation task.
    Key words: Statistical Machine Translation; context information; Vector Space Model
  • Review
    FAN Liang, DAI Yong, QIN Bingmei
    2013, 27(2): 91-98.
    Taking the writing of Chinese characters on a touch screen as an example, we propose a fuzzy method to analyze the stroke force of writing and to predict the writing quality from this analysis. The proposed method first detects the key points of the character's strokes and sets up the fuzzy template matrix. Then the similarities between the actual writing and the template are estimated. The experiment shows that the method is feasible, robust and language independent.
    Key words: touch screen; handwriting of Chinese character; writing quality; stroke force; fuzzy method
  • Review
    Mirigu·ROUZI, Tuergen·YIBULAYIN, Mairehaba·AILI
    2013, 27(2): 98-103.
    Intelligent input methods, as a research topic in Uyghur input, are one of the fundamental issues in Uyghur information processing. Based on the characteristics of Uyghur, this paper analyzes the errors that occur during user input, and designs and implements a knowledge base of word collocations. It also proposes a novel Uyghur input method based on a bi-gram language model, capable of automatic prediction, association and correction.
    Key words: Chinese information processing; Uyghur; intelligent input method; language model; automatic prediction; automatic association
  • Review
    ZHU Jie, Ngodrup, GeSang Dorje, ZHA Xijia, GAO Hongmei
    2013, 27(2): 103-112.
    Tibetan syllables have a unique structure. Different positions in the structure admit different Tibetan characters, and combining these characters generates the wide variety of Tibetan syllables. Because of the phonetic features of the letters, there are many restrictions on how the characters can combine. Using Tibetan grammar rules and the Tibetan-Chinese Dictionary, a Tibetan syllable rule base is established and its applications are analyzed.
    Key words: Tibetan; rule of Tibetan; word frequency
  • Review
    LU Shidan, CUI Rongyi
    2013, 27(2): 112-118.
    The identification of six pairs of Korean and Chinese monophthongs that are similar in pronunciation is studied on the basis of formants. First, the first three formants (F1, F2 and F3) are extracted from the audio file. Second, the differences between the formant distributions of the six pairs of monophthongs are examined, and different characteristic parameters of the formants, or combinations of them, are employed as classification features for different pairs. Finally, the classification threshold is determined using information gain. The experimental results reveal that Korean and Chinese monophthongs with similar pronunciation can be distinguished by the method proposed in this paper.
    Key words: Korean monophthong; Chinese monophthongs; language recognition; formant; information gain
  • Review
    ZOU Zhihua, TIAN Shengwei, YU Long, FENG Guanjun
    2013, 27(2): 118-127.
    The paper proposes an improved suffix tree clustering algorithm for Uyghur Web text (STCU), with the Uyghur word as the basic unit in constructing the suffix tree. According to the characteristics of Uyghur and of Web texts, we design a Uyghur word stemmer and construct an absolute stop word table and a relative stop word table for Uyghur. We adopt document frequency and part-of-speech information to extract key phrases, and then automatically adjust the clustering threshold according to the size of the Web corpus. Finally, we use the most general phrases to describe the cluster category information, effectively improving the quality of the clustering results. Compared with traditional suffix tree clustering, the error rate drops by 0.94%, while the overall rate and the precision improve by 44.51% and 11.74%, respectively.
    Key words: Uyghur; suffix tree; phrase clustering; stop word list; document frequency
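
The key-phrase step mentioned in the last abstract above (stop word filtering plus document frequency) can be illustrated with a small sketch. This is not the authors' STCU implementation: it omits the suffix tree, the stemmer and the part-of-speech information, and the function names, the `min_df` and `max_len` parameters, and the English sample tokens are all assumptions chosen for illustration.

```python
from collections import Counter


def candidate_phrases(tokens, stop_words, max_len=3):
    """Yield word n-grams (length 1..max_len) that neither start nor end with a stop word."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            yield gram


def key_phrases_by_df(docs, stop_words, min_df=2, max_len=3):
    """Return {phrase: document frequency} for phrases found in at least min_df documents."""
    df = Counter()
    for tokens in docs:
        # A set ensures each phrase is counted once per document (document frequency).
        df.update(set(candidate_phrases(tokens, stop_words, max_len)))
    return {phrase: count for phrase, count in df.items() if count >= min_df}


if __name__ == "__main__":
    # Hypothetical pre-tokenized documents; real input would be stemmed Uyghur words.
    sample_docs = [
        ["the", "weather", "service", "map"],
        ["weather", "service", "in", "xinjiang"],
        ["map", "of", "xinjiang"],
    ]
    sample_stop_words = {"the", "in", "of"}
    for phrase, count in sorted(key_phrases_by_df(sample_docs, sample_stop_words).items()):
        print(" ".join(phrase), count)
```

Counting each phrase once per document, rather than once per occurrence, is what makes the score a document frequency; a phrase shared by several documents is then a candidate label for a cluster.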