2013 Volume 27 Issue 4 Published: 15 August 2013
  

    Review
  • Review
    DU Jinhua1, ZHANG Meng1, ZONG Chengqing2, SUN Le3
    2013, 27(4): 1-9.
    In recent years, statistical methods have dominated research and applications in machine translation (MT). Meanwhile, steadily rising scores in MT evaluation campaigns have raised confidence and expectations, increasing the demand for high-quality MT systems. However, on the one hand, it is difficult to achieve a major breakthrough in MT theory and methodology in terms of translation quality; on the other hand, current practical systems cannot fully satisfy users. Where and how should the field go forward? To address this question, the Eighth China Workshop on Machine Translation (CWMT) was held to carry out a comprehensive and in-depth discussion of the challenges and opportunities in current MT research. This paper describes the six MT sessions in detail and summarizes their key points and important findings.
    Key words: MT theories; machine translation application; spoken translation; minority languages; machine translation evaluation
  • Review
    SHEN Shiqi, LIU Yang, SUN Maosong
    2013, 27(4): 9-16.
    Word alignment aims to determine the correspondence between words in parallel texts. It has an important influence on machine translation, bilingual dictionary construction and many other natural language processing tasks. Although word alignment has made significant progress in modeling and training algorithms in recent years, its search algorithms often use greedy strategies and suffer from large search errors. This paper proposes a word alignment search algorithm based on dual decomposition, which splits a complex problem into two relatively simple sub-problems and solves them iteratively until they converge to the optimal solution. Because dual decomposition can guarantee the convergence and optimality of the solution, the algorithm significantly outperforms GIZA++ and a discriminative word alignment system in alignment error rate on the 2005 863 Project word alignment evaluation data set, reducing the alignment error rate by 4.2% and 1.1%, respectively.
    Key words: word alignment; discriminative model; search algorithm; dual decomposition
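    The abstract reports its results as alignment error rate (AER) but does not define it. A minimal sketch, assuming the commonly used definition over sure links S, possible links P (with S a subset of P) and predicted links A, is given below. The function name and the toy links are illustrative assumptions, not taken from the paper.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), assuming S is a subset of P."""
    predicted, sure = set(predicted), set(sure)
    possible = set(possible) | sure  # sure links also count as possible links
    denom = len(predicted) + len(sure)
    if denom == 0:
        return 0.0  # nothing predicted and nothing annotated: no error
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denom


# Toy example (hypothetical data): links are (source_index, target_index) pairs.
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
predicted = {(0, 0), (1, 1), (2, 2)}
aer = alignment_error_rate(predicted, sure, possible)
```

    Lower values are better; a prediction that recovers every sure link and adds only possible links scores 0.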
  • Review
    YU Heng1, TU Zhaopeng1, LIU Qun1, LIU Yang2
    2013, 27(4): 16-22.
    Machine transliteration is an important approach to named entity translation. In English-to-Chinese transliteration, the translation granularity is of great importance. In this paper we introduce a multi-granularity machine transliteration method. We use a word lattice to combine multiple syllable segmentations and decode with a hierarchical phrase-based translation model. Experimental results show that our method combines the advantages of different granularities and improves the robustness of the system. We achieve an improvement of 3.1% in precision and 2.2 BLEU points over the baseline system.
    Key words: named entity machine transliteration; multi-granularity; word lattice
  • Review
    LI Maoxi, JIANG Aiwen, WANG Mingwen
    2013, 27(4): 22-30.
    Automatic evaluation of machine translation plays an important role in promoting the rapid development of machine translation. In this paper, we apply the ListMLE learning-to-rank approach to automatic machine translation evaluation. In addition, we introduce translation fluency and adequacy features to further improve the consistency between automatic evaluation results and human judgments. When assessing the translation quality of the system outputs submitted to the WMT11 German-English task and the IWSLT08 BTEC CE ASR task, the experimental results indicate that the prediction accuracy of the proposed approach is higher than that of the BLEU metric and of the RankSVM-based metric.
    Key words: machine translation evaluation; learning to rank; ListMLE approach; automatic evaluation; human evaluation
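    For readers unfamiliar with ListMLE, the sketch below shows the standard listwise loss it minimizes: the negative log-likelihood, under a Plackett-Luce model, of the ordering given by human judgments. It is an illustration only; the function name, the plain-list inputs and the toy scores are assumptions, and the paper's features and training setup are not reproduced.

```python
import math


def listmle_loss(scores, relevance):
    """Negative log Plackett-Luce likelihood of the ground-truth ordering.

    scores:    model scores, one per candidate translation
    relevance: human judgments (higher = better), same length as scores
    """
    # Arrange the predicted scores in the order given by the human judgments.
    order = sorted(range(len(scores)), key=lambda i: relevance[i], reverse=True)
    ranked = [scores[i] for i in order]
    loss = 0.0
    for i in range(len(ranked)):
        tail = ranked[i:]
        m = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in tail))
        loss += log_sum_exp - ranked[i]
    return loss


# Hypothetical scores: agreeing with the human ranking gives a lower loss.
good = listmle_loss([2.0, 1.0, 0.0], [3, 2, 1])
bad = listmle_loss([2.0, 1.0, 0.0], [1, 2, 3])
```

    The loss is smallest when the model scores the candidates in the same order as the human judgments.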
  • Review
    LIU Yijia, CHE Wanxiang, LIU Ting, ZHANG Meishan
    2013, 27(4): 30-37.
    In this paper, we compare three different Chinese word segmentation and POS tagging models in terms of accuracy and speed. The first is a pipeline (sequential) model. The second is a joint model for word segmentation and POS tagging, and the third combines the two models above in a stacked learning framework. We conduct experiments on four data sets: People's Daily, CoNLL09, CTB5.0 and CTB7.0. Experimental results show that the joint model achieves the fastest speed while the stacked learning model achieves the highest accuracy. Finally, we compare our stacked learning model with state-of-the-art systems on CTB5.0 and CTB7.0, and our model achieves the best performance in this comparison.
    Key words: Chinese word segmentation; POS tagging; stacked learning
  • Review
    ZHANG Guiping, LI Wenbo, WANG Peiyan
    2013, 27(4): 37-44.
    According to the characteristics of the relation judgment task, this paper applies active learning to ontology conceptual relation judgment and compares active learning query generation strategies, including margin sampling, entropy sampling and least confident sampling. From a practical point of view, we discuss the application of active learning under three different initial sample conditions. When the initial sample contains sufficient positive and negative examples, we use margin sampling and entropy sampling to generate queries; when the initial sample contains only positive examples, we generate candidate negative samples with a similarity-based active learning strategy; when no initial sample is available, we use a concept distance model and other statistical information to generate candidate positive and negative samples. In this way, user feedback is used effectively in the conceptual relation judgment process.
    Key words: ontology; concept relation; assistant judgment; active learning
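    The three query strategies named in the abstract can be sketched in their textbook form as follows. This assumes a classifier that outputs a class probability distribution for each unlabeled item; the pool, its identifiers and the function names are hypothetical and do not come from the paper.

```python
import math


def least_confident(probs):
    """Uncertainty = 1 - probability of the most likely class."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely classes (smaller = more uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def entropy(probs):
    """Shannon entropy of the predicted distribution (larger = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_query(pool, strategy):
    """pool maps an item id to its predicted class probabilities."""
    if strategy == "margin":
        return min(pool, key=lambda k: margin(pool[k]))
    if strategy == "entropy":
        return max(pool, key=lambda k: entropy(pool[k]))
    if strategy == "least_confident":
        return max(pool, key=lambda k: least_confident(pool[k]))
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical unlabeled pool with three-class predictions.
pool = {
    "a": [0.5, 0.5, 0.0],
    "b": [0.4, 0.3, 0.3],
    "c": [0.9, 0.05, 0.05],
}
picked = {s: select_query(pool, s) for s in ("margin", "entropy", "least_confident")}
```

    With more than two classes the strategies can disagree: here margin sampling picks the item whose top two classes are tied, while entropy and least confident sampling pick the item whose probability mass is most spread out.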
  • Review
    LI Guochen1,3, ZHANG Lifan1, LI Ru1,2, LIU Haijing3, SHI Jiao1
    2013, 27(4): 44-52.
    Frame disambiguation aims to assign an appropriate frame to a target word according to the consistency between the semantic scene and the candidate frames evoked by the target word. The key step in frame disambiguation is feature selection, which is currently a manual process. However, this manual method does not effectively use the semantic features of each target word. In addition, it has been shown that the feature templates with which different target words achieve their best results differ. Hence, this paper proposes an automatic feature template selection algorithm that sets a feature template for each target word. First, a candidate feature set is built from the corpus; then the feature that achieves the highest score is added to the feature template, repeatedly, until the score no longer increases between two adjacent iterations. The paper applies a maximum entropy model to the Chinese FrameNet corpus, evaluates the automatic feature selection algorithm by 5-fold cross-validation, and achieves an average precision of 84.46%.
    Key words: Chinese frame disambiguation; Chinese FrameNet; automatic feature selection; semantic feature of lexical units
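    The selection loop described in the abstract resembles greedy forward selection, which can be sketched as below. This is an illustration under stated assumptions: `score` stands in for whatever evaluation the authors use (for example, cross-validated precision of the maximum entropy model), and the toy scoring function and feature names are hypothetical.

```python
def greedy_template(candidates, score):
    """Add the best-scoring feature each round; stop when the score stops increasing."""
    template, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current template.
        trials = [(score(template + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score <= best:
            break  # no improvement over the previous round
        template.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return template, best


# Hypothetical scoring: useful features add to the score, others subtract a little.
usefulness = {"word": 0.5, "pos": 0.2, "dep": 0.1}


def toy_score(features):
    return sum(usefulness.get(f, -0.05) for f in features)


template, best = greedy_template(["noise", "dep", "word", "pos"], toy_score)
```

    Running the loop once per target word would yield a separate template for each, which is the per-word behavior the abstract describes.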
  • Review
    LI Yachao1, JAM Yangkyi1, ZONG Chengqing2, YU Hongzhi1
    2013, 27(4): 52-59.
    Tibetan automatic word segmentation (TAWS) is an important problem in Tibetan information processing, and abbreviated word recognition is one of the key and most difficult problems in TAWS. All existing methods for Tibetan abbreviated word recognition are rule-based and require vocabulary support. In this paper, we propose a method based on conditional random fields (CRF) for abbreviated word recognition, and then implement a TAWS system with CRF. The experimental results show that our abbreviated word recognition method is fast and effective and can be combined easily with the CRF-based segmentation model, which significantly improves Tibetan word segmentation performance.
    Key words: Tibetan automatic word segmentation; conditional random fields; abbreviated word recognition; case-auxiliary words
  • Review
    LI Lin1,2, LONG Congjun1,3
    2013, 27(4): 59-63.
    Tibetan linking verbs and existential verbs are widely used and highly ambiguous. In various contexts, their meanings include judgment, existence, possession, evidentiality and egocentricity. This diversity of meaning makes Tibetan text labeling and sentence pattern identification difficult. We examine the characteristics of these words, build a rule library, and propose a recognition method for linking verbs and existential verbs.
    Key words: Tibetan; linking verb; existential verb; automatic recognition
  • Review
    NUO Minghua, LIU Huidan, MA Longlong, WU Jian, DING Zhiming
    2013, 27(4): 63-70.
    This paper presents a Chinese-Tibetan base noun phrase alignment method. It is a two-phase procedure: identifying Chinese base noun phrases and finding their Tibetan correspondences. We propose a head-phrase extension based method for Tibetan base noun phrase identification, in accordance with the morphological characteristics of Tibetan. In the first phase, we use a sequence intersection operation to obtain the Tibetan head-phrase. In the second phase, a head-phrase extension confidence is defined and applied to determine the boundary of the correspondence. Experimental results indicate that sequence intersection outperforms other methods in head-phrase extension. The Chinese-Tibetan base noun phrase pairs produced by our method effectively reduce subsequent manual checking and facilitate the construction of a phrase-level translation lexicon.
    Key words: Tibetan information processing; BaseNP; head-phrase extension
  • Review
    BAO Xiaorong, HUA Shabao, Dabhurbayar
    2013, 27(4): 70-74.
    Based on Mongolian dependency grammar, this paper develops a semantic role classification method and designs tag sets for Mongolian from the perspective of Mongolian information processing. With reference to semantic role labeling theory and methods for other languages, semantic roles are manually annotated on a portion of the Mongolian Syntactic Dependency Treebank.
    Key words: Mongolian corpus; dependency grammar; semantic role
  • Review
    WANG Lijun1, WANG Xiaoming2, WU Jian3
    2013, 27(4): 74-83.
    Differences in language usage among the mainland, Hong Kong, Macao and Taiwan give rise to the problem of conversion between Simplified and Traditional Chinese. The key issue is the correspondence table between Simplified and Traditional Chinese characters and terms, which is a complex task with no immediate solution. A fundamental step is to decompose the correspondence between Simplified and Traditional Chinese characters for the mainland. The conversion system should include six steps, and the proposed concept of “character context” provides strong support for improving the accuracy of Simplified-Traditional conversion.
    Key words: Simplified characters; Traditional characters; Simplified and Traditional correspondence; Simplified and Traditional conversion
  • Review
    GAN Lixin1, TU Wei1, WANG Mingwen2, SHI Song3
    2013, 27(4): 83-89.
    Query expansion is effective in improving retrieval efficiency. In this paper, the mixed correlation between terms is quantified using term cliques obtained from a Markov network, which makes it possible to compute the relationship between terms that do not co-occur in the corpus. The enhanced mixed correlation is then applied to query expansion. The experimental results show that the proposed method outperforms the method based on direct correlation. In addition, the method is slightly better than a clique-based Markov network model while significantly reducing the computational overhead of term cliques.
    Key words: mixed correlation; Markov network; query expansion
  • Review
    WANG Hailei1,2, ZHANG Yanxing3, ZHAO Haiyu3, ZHANG Ming3
    2013, 27(4): 89-96.
    Based on theories of consumer behavior analysis and discrete choice analysis, we perform feature-based sentiment analysis on user reviews, with features selected from both objective product features and subjective user opinion data. We then train an MNL model to predict each product's consumer surplus, which serves as the ranking criterion. With this method, we implemented a product search engine for cell phones, laptops and digital cameras. A double-blind trial with users shows that our model significantly outperforms the baselines.
    Key words: product search; MNL model; sentiment analysis; feature selection; user generated content
  • Review
    PENG Zehuan, SUN Le, HAN Xianpei, SHI Bei
    2013, 27(4): 96-103.
    This paper summarizes four types of recommendation-related user information in micro-blog systems: user content (UC), personal information (PI), interaction (IA) and social topological information (ST). Based on these four types of information, a user recommendation framework using learning-to-rank technology is built. Experimental results show that (1) recommending with several features usually gives a better result than using a single feature; (2) recommendation performance based on UC, PI, IA respectively is better than that based on UC.
    Key words: learning to rank; user recommendation; micro-blog
  • Review
    YU Long1, TIAN Shengwei2, HUANG Jun3
    2013, 27(4): 103-113.
    Topic extraction is one of the core tasks of opinion mining. This paper proposes a claim-level topic extraction method that aims to extract explicit and implicit topics from Uighur comment texts. The method uses GLR-Cascaded LDA (a cascaded LDA model for global topics, local topics and the relations between them) to extract paragraph-level local topics and document-level global topics, establish the global-local topic relationships, and map these relationships to each opinion claim. It adopts bootstrapping and pattern matching to extract the topics of explicit claims. Finally, an implicit topic inference algorithm is applied to deduce the topics of implicit claims. The ultimate goal of topic extraction is to establish a claim-topic opinion quadruple <OC, GT, LT, LT> for each opinion claim. Experimental results indicate the effectiveness of the proposed method in the topic extraction task.
    Key words: topic extraction; claim level; explicit topic; implicit topic; Uighur
  • Review
    WANG Zhihao, WANG Zhongqing, LI Shoushan, LI Peifeng
    2013, 27(4): 113-119.
    With the rapid development of the Internet, sentiment classification has attracted great attention from researchers in natural language processing. In this paper, we focus on sentiment classification tasks in which the data distribution is imbalanced (imbalanced sentiment classification). To reduce the high-dimensional feature space in imbalanced sentiment classification, we investigate four classic feature selection (FS) methods that are widely studied in traditional text categorization. Furthermore, three different feature selection modes are proposed and compared on this task. The experimental results demonstrate that feature selection can significantly reduce the dimension of the feature vector without any loss in classification performance. In addition, the results show that the information gain (IG) method combined with the mode “feature selection after random under-sampling” performs best.
    Key words: sentiment classification; imbalanced data; feature selection
  • Review
    LI Zhe, WANG Zhihai, HE Yingjing, FU Bin
    2013, 27(4): 119-127.
    In multi-label classification, the goal of a classifier is to assign a set of labels to an instance. One classical method transforms a multi-label classification problem into several traditional binary classification problems, so some relations may exist among these binary classifiers. Simply taking label dependency into account can improve classification performance to a certain extent, but it is also necessary to consider the computational complexity. This paper proposes an ordered ensemble of classifiers algorithm, which selects a proper order for the classifiers using a heuristic search strategy to make better use of label dependency. In the experiments, a broad range of multi-label datasets and a variety of evaluation metrics are used, and the results show that the proposed method outperforms some state-of-the-art methods.
    Key words: multi-label classification; text categorization; data mining