Analysis and Extension of the Lucene Full-Text Search System

Updated: 2009-11-28 23:35:46

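【Summary】 Full-text search is among the most widely used information-retrieval applications: every day people use search engines such as Google and Baidu to find the information they need, and full-text search is one of the core technologies behind these engines. Lucene is a member project of the Apache Software Foundation's Jakarta project group. It is an open-source full-text search engine toolkit that makes it easy to add full-text search to a target system, or to build a complete full-text search system on top of it. Lucene ships with search support for only two Western languages, English and German, and has no Chinese search capability, so any full-text search system built on Lucene needs a Chinese search module. To segment words more accurately while avoiding ambiguity, this thesis improves on the currently popular statistics-based segmentation methods, using dictionary training to handle some ambiguous words and to segment out-of-vocabulary words. The algorithm rests on a custom dictionary, which is not a dictionary in the traditional sense used by mechanical segmentation. Within a text, the more often two characters occur in a given order, the more likely they are to form a word. We therefore define a statistical dictionary built from the statistical analysis of a large corpus. Its entries are not words in the usual sense but the "bonding degree" between two adjacent characters: the higher the bonding degree, the higher the probability that the pair forms a word. Lucene's core is designed to be very compact, and it handles only plain-text data. This thesis therefore defines a common interface and develops a unified processing framework that can index documents in many formats. Through this framework the content of each document is indexed and added to the index database, giving the full-text search system a unified ability to handle multiple document formats.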
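
The statistical dictionary described above can be illustrated with a minimal sketch. The text does not give the formula for the "bonding degree", so this example assumes the simplest possible one: the raw count of each ordered pair of adjacent characters in a corpus. The function names, the toy corpus, and the cut-off threshold are all hypothetical, and the code is written in Python rather than Lucene's Java for brevity.

```python
from collections import Counter


def build_bonding_dictionary(corpus):
    """Count how often each ordered pair of adjacent characters occurs.

    Assumption: the raw pair count stands in for the "bonding degree".
    """
    pair_counts = Counter()
    for sentence in corpus:
        for left, right in zip(sentence, sentence[1:]):
            pair_counts[left + right] += 1
    return pair_counts


def segment(text, bonding, threshold=2):
    """Cut between adjacent characters whose bonding is below the threshold."""
    if not text:
        return []
    pieces = [text[0]]
    for left, right in zip(text, text[1:]):
        if bonding[left + right] >= threshold:
            pieces[-1] += right      # strong bond: extend the current word
        else:
            pieces.append(right)     # weak bond: start a new word
    return pieces


# Toy corpus, invented for illustration only.
corpus = ["我用全文检索", "全文很长", "检索很快", "他用检索"]
bonding = build_bonding_dictionary(corpus)

print(bonding["检索"])               # 3
print(segment("全文检索", bonding))  # ['全文', '检索']
```

A raw count favors frequent characters, so a real dictionary would probably normalize by the frequency of each character; that refinement is left out here to keep the idea visible.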

【Abstract】 In the early days of the Internet there were few sites, and information was easy to find. As the Internet grew, however, the number of sites and the volume of queries increased, and finding information became harder. Search engines were created to meet this need for information retrieval.

Full-text search is one of the most widespread information applications in everyday use. People find the information they need through Google, Baidu, and other search engines, and full-text search is one of the core technologies of these engines. In this thesis, full-text search refers to a means of information retrieval that takes electronic data of various kinds, such as text, sound, and images, as its objects and retrieves them by the content of the data rather than by their external characteristics. By letting users build search conditions from a series of queries, it helps people organize and manage large numbers of documents, so that they can quickly and easily find the information they need.

Full-text search software is fairly mature and widely used abroad, but Western text differs greatly from Chinese text, so foreign full-text search software does not suit Chinese users. Some Chinese full-text databases exist, but in essence they use a relational database to search structured data such as title, author, keywords, and abstract, and then reach the full text through a link. Genuine Chinese full-text search engines are rare.

Lucene is a project of the Apache Software Foundation's Jakarta project group. It is an open-source full-text search engine toolkit, which means it is not a complete full-text search engine but an open-source framework written in Java that provides data access and management through simple interfaces. It can easily be embedded into applications to give them full-text search functions. Lucene aims to provide a simple, easy-to-use toolkit for adding full-text search to a target system, or for serving as the basis of a complete full-text search system.

Lucene provides search functions only for English and German among Western languages, and Chinese search is not included. Therefore, when developing a Lucene-based full-text search system, a Chinese search module is necessary.

Chinese word segmentation involves four main problems: the segmentation standard, which defines what counts as a word and can serve as a segmentation unit; the segmentation algorithm, which determines how to cut text so that word boundaries carry real meaning; segmentation ambiguity, which concerns what methods to use to eliminate ambiguities; and unknown-word recognition, which concerns how to handle words absent from the dictionary, such as place names, personal names, and transliterated terms. There are currently three main approaches to Chinese word segmentation: mechanical methods, statistical methods, and understanding-based methods.

The mechanical method is not prone to ambiguity, but only given a "sufficiently large" dictionary. As new things keep emerging in modern society, bringing new vocabulary along with a steady stream of translated and transliterated foreign words, keeping a dictionary up to date is an expensive undertaking. Because general knowledge of the Chinese language is complex and difficult to capture, understanding-based segmentation systems are still at the experimental stage.

To segment more accurately while avoiding ambiguity, I improve on a popular statistics-based segmentation method, using dictionary training to handle some ambiguous words and to segment unknown words. The algorithm is built on a custom dictionary, which is not the kind of dictionary used in traditional mechanical segmentation. Within a text, the more often two Chinese characters appear together, the more likely they are to form a word, so I define a dictionary based on the "bonding" between two adjacent characters.

To make integration convenient and seamless, Lucene's core has been designed to be very small, and it processes only plain-text data. With the development of computer applications and networks, plain text is no longer the mainstream format. A variety of file formats are used in all walks of life, such as Microsoft Word, Excel, and PowerPoint, so a full-text search engine has to deal with documents saved in many different file formats.

Therefore, I establish a common interface through which a variety of file formats can be indexed. Through this interface, documents in different file formats can be added to the index database. In this way the system hides the differences between document file formats from its users.
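
The common interface for multiple file formats can be sketched roughly as follows. The abstract does not show the interface itself, so every name here (`DocumentHandler`, `extract_text`, `HANDLERS`, `to_plain_text`) is hypothetical, and the sketch uses Python with only the standard library rather than Lucene's Java. It covers plain text and HTML only; formats such as PDF, Word, Excel, and PowerPoint would each need a handler backed by a suitable parsing library.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class DocumentHandler(ABC):
    """Hypothetical common interface: turn raw file bytes into plain text."""

    @abstractmethod
    def extract_text(self, data: bytes) -> str:
        ...


class PlainTextHandler(DocumentHandler):
    def extract_text(self, data: bytes) -> str:
        return data.decode("utf-8")


class _TextCollector(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


class HtmlHandler(DocumentHandler):
    def extract_text(self, data: bytes) -> str:
        collector = _TextCollector()
        collector.feed(data.decode("utf-8"))
        collector.close()
        return " ".join(collector.chunks)


# Registry mapping a file extension to its handler (an assumed design).
HANDLERS = {".txt": PlainTextHandler(), ".html": HtmlHandler()}


def to_plain_text(filename: str, data: bytes) -> str:
    """Pick a handler by file extension and return indexable plain text."""
    dot = filename.rfind(".")
    extension = filename[dot:].lower() if dot != -1 else ""
    handler = HANDLERS.get(extension)
    if handler is None:
        raise ValueError(f"no handler registered for {filename!r}")
    return handler.extract_text(data)


page = b"<html><body><h1>Lucene</h1><p>full-text search</p></body></html>"
print(to_plain_text("index.html", page))  # Lucene full-text search
```

The indexing code only ever calls `to_plain_text`, so supporting a new format means registering one more handler, and the format differences stay hidden from the rest of the system.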

【Key words】 full-text search engine; Lucene; Chinese search; document format processing
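Contents

Summary
Chapter 1 Introduction
    1.1 Research background
    1.2 Full-text search technology and its research significance
    1.3 Current state of research on and application of full-text search
    1.4 Work of this thesis
Chapter 2 Theoretical foundations of full-text search systems
    2.1 Fundamentals of full-text search systems
        2.1.1 Information retrieval systems
        2.1.2 The information retrieval process
        2.1.3 Shortcomings of traditional lookup and the inverted index
    2.2 The full-text search engine toolkit Lucene
        2.2.1 Introduction to Lucene
        2.2.2 Application characteristics and advantages of Lucene
        2.2.3 Source code structure of Lucene
        2.2.4 Lucene data flow analysis
        2.2.5 Lucene index file format analysis
Chapter 3 Adding a Chinese search module
    3.1 Research background of Chinese search
    3.2 Adding the Chinese word segmentation module
    3.3 Chinese search algorithm
        3.3.1 Preprocessing
        3.3.2 Statistical analysis
        3.3.3 Dictionary definition and generation
Chapter 4 Processing module for multiple common document formats
    4.1 Research background of the multi-format document processing module
    4.2 Design of the multi-format document processing module
    4.3 Implementation of the multi-format document processing module
        4.3.1 Processing interface for multi-format documents
        4.3.2 Processing PDF documents
        4.3.3 Processing Word documents
        4.3.4 Processing Excel documents
        4.3.5 Processing PowerPoint documents
        4.3.6 Processing HTML documents
Chapter 5 Conclusions and outlook
References
Acknowledgements
Abstract (Chinese)
Abstract (English)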
