Abstract
Rule-based text processing systems require a large number of hand-written rules, and their robustness on large-scale real text is hard to guarantee. This paper therefore explores a statistical approach to prosodic phrasing. The text is first segmented into Chinese words and tagged automatically with a part-of-speech (POS) tagger. Prosodic phrase breaks are then predicted from boundary patterns and boundary distribution probabilities derived from a hand-annotated corpus, and the errors introduced by the statistical method are corrected by rules. In closed and open tests on about 1000 sentences of real text, POS tagging accuracy is about 95%, the recall of prosodic phrasing is around 60%, and its precision is about 80%.
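The statistical step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a boundary pattern is simply the pair of POS tags on either side of a word junction, that the break probability is a relative frequency counted from an annotated corpus, and that a break is predicted when that probability exceeds a threshold. The function names, the tags `n`, `v`, `r`, the toy corpus, and the 0.5 threshold are all hypothetical, and the rule-based correction step is left out.

```python
from collections import Counter


def train_boundary_model(corpus):
    """Estimate P(break | left POS, right POS) by relative frequency.

    corpus: list of sentences; each sentence is a list of
    (pos_tag, break_after) pairs taken from hand-annotated data.
    """
    seen = Counter()    # how often each POS pair occurs at a word junction
    breaks = Counter()  # how often that junction carries a phrase break
    for sentence in corpus:
        for (left, has_break), (right, _) in zip(sentence, sentence[1:]):
            pattern = (left, right)
            seen[pattern] += 1
            if has_break:
                breaks[pattern] += 1
    return {pattern: breaks[pattern] / seen[pattern] for pattern in seen}


def predict_breaks(tags, model, threshold=0.5):
    """Return each index i where a break is predicted after word i."""
    return [
        i
        for i, pattern in enumerate(zip(tags, tags[1:]))
        if model.get(pattern, 0.0) > threshold  # unseen patterns: no break
    ]


def precision_recall(predicted, reference):
    """Precision and recall of predicted break positions."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


# Toy annotated corpus (hypothetical tags and break marks).
toy_corpus = [
    [("n", False), ("v", True), ("n", False), ("v", False)],
    [("n", False), ("v", True), ("r", False), ("v", False)],
]
model = train_boundary_model(toy_corpus)
predicted = predict_breaks(["n", "v", "n", "v"], model)
print(predicted, precision_recall(predicted, [1]))
```

The recall and precision figures quoted in the abstract correspond to the two values returned by `precision_recall`: recall is the share of annotated breaks that are found, and precision is the share of predicted breaks that are correct.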
Key words
prosodic phrasing /
part-of-speech tagging /
corpus /
statistical approach