Abstract: This paper briefly introduces the Sense Matrix Model (SMM), which employs a matrix representation of text for information retrieval. By taking into account the distribution of words along the sense dimension, SMM represents a document as a term-sense matrix and a document collection as a term-sense-document space. This representation allows a number of useful data analysis techniques to be introduced or developed, including similarities based on matrix norms, sense weighting, document transforms using the discrete cosine transform (DCT) and multi-way data decomposition (MAD), and kNN and SVM classification over the sense matrix representation. The model also provides novel techniques for cross-lingual IR and multilingual text classification that require neither separate or integrated translation nor "model training". Some initial experimental results for document DCT with the SMART IR system are also discussed.
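To make the term-sense matrix idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a document is stored as a small matrix whose rows are terms and whose columns are senses, that a matrix-norm-based similarity can be taken as the Frobenius inner product divided by the product of Frobenius norms, and that a document DCT is applied along the sense axis of each row. The abstract does not specify any of these choices; the function names and the toy documents are hypothetical.

```python
import math


def frobenius_norm(m):
    """Frobenius norm of a matrix given as a list of rows."""
    return math.sqrt(sum(x * x for row in m for x in row))


def frobenius_similarity(a, b):
    """Cosine-style similarity between two equally shaped matrices.

    Assumption: the Frobenius inner product normalized by both norms
    stands in for the matrix-norm-based similarities named in the abstract.
    """
    inner = sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    denom = frobenius_norm(a) * frobenius_norm(b)
    return inner / denom if denom else 0.0


def dct2(v):
    """Unnormalized DCT-II of a sequence of numbers."""
    n = len(v)
    return [
        sum(v[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def dct_along_senses(m):
    """Apply the DCT to each term row, i.e. along the sense axis (an assumption)."""
    return [dct2(row) for row in m]


# Hypothetical toy documents: 2 terms (rows) x 3 senses (columns).
doc_a = [[2, 0, 0], [0, 1, 1]]
doc_b = [[1, 0, 0], [0, 1, 0]]  # same terms, overlapping senses
doc_c = [[0, 0, 2], [1, 0, 0]]  # same terms, different senses

sim_ab = frobenius_similarity(doc_a, doc_b)
sim_ac = frobenius_similarity(doc_a, doc_c)
transformed_a = dct_along_senses(doc_a)
```

In this toy setting doc_a and doc_c use the same two terms but in disjoint senses, so their similarity is zero, whereas a plain term-vector model would see them as overlapping. That is the kind of distinction a term-sense representation is meant to capture.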