Using Whoosh for Full-Text Search in Django - 侯体宗's Blog

Frameworks (Architecture) / Posted by admin 8 years ago

Whoosh is a full-text search engine implemented in pure Python. It makes it easy to add full-text indexing to your documents.

What is full-text search

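Put simply, it has two parts: tokenization and search. Take the following sentence:

上次舞蹈演出直接在上海路的弄堂里 ("Last time, the dance performance was held right in an alley off Shanghai Road")

Suppose we want to find that last performance. We would normally search for the keywords 上次演出 ("last time" + "performance"), but a traditional SQL LIKE query cannot match the sentence above, because 舞蹈 ("dance") sits between 上次 and 演出. Full-text search instead splits the text into tokens, like this:

上次/舞蹈/演出/直接/在/上海路/的/弄堂/里

The tokens are then stored in an inverted index, which maps each token to the documents that contain it, so a keyword lookup finds results quickly.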
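To make the contrast concrete, here is a minimal sketch using only the Python standard library. The table name `notes`, the helpers `like_search`, `build_inverted_index` and `token_search`, and the hand-written token list are all illustrative assumptions. A real setup would get its tokens from a tokenizer and let a search library such as Whoosh manage the index.

```python
import sqlite3
from collections import defaultdict

SENTENCE = "上次舞蹈演出直接在上海路的弄堂里"
# Tokens written out by hand, copied from the segmentation shown above.
TOKENS = ["上次", "舞蹈", "演出", "直接", "在", "上海路", "的", "弄堂", "里"]


def like_search(keyword):
    """Substring search with SQL LIKE against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
        conn.execute("INSERT INTO notes (id, body) VALUES (1, ?)", (SENTENCE,))
        rows = conn.execute(
            "SELECT id FROM notes WHERE body LIKE ?", ("%" + keyword + "%",)
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def token_search(index, query_tokens):
    """Return the ids of documents that contain every query token."""
    if not query_tokens:
        return set()
    postings = [index.get(token, set()) for token in query_tokens]
    return set.intersection(*postings)


index = build_inverted_index({1: TOKENS})
like_hits = like_search("上次演出")                 # not a contiguous substring
token_hits = token_search(index, ["上次", "演出"])   # both tokens are indexed
print(like_hits, token_hits)
```

The LIKE query needs the keyword to appear as one contiguous substring, while the token lookup only needs each query token to be present in the document.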

Solving the tokenization problem

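Tokenization is technically hard. One difficulty in the sentence above is deciding whether the token should be 上海路 (Shanghai Road) or 上海 (Shanghai). Python has a Chinese tokenization library, jieba (结巴分词), which can handle the tokenization step during indexing. jieba provides a Whoosh analyzer that can be integrated directly; see the code at the end of this post.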
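The 上海路 versus 上海 ambiguity can be illustrated with a toy forward-maximum-matching segmenter. This is not how jieba works internally, and jieba remains the library to use in practice. The function `forward_max_match` and the tiny word sets are hypothetical and exist only to show how the dictionary decides the split.

```python
SENTENCE = "上次舞蹈演出直接在上海路的弄堂里"


def forward_max_match(text, dictionary, max_len=4):
    """Greedy segmentation: at each position take the longest dictionary word.

    A character that starts no dictionary word becomes a one-character token.
    """
    tokens = []
    pos = 0
    while pos < len(text):
        match = text[pos]  # fallback: a single character
        # Try the longest candidate first, down to two characters.
        for size in range(min(max_len, len(text) - pos), 1, -1):
            candidate = text[pos:pos + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        pos += len(match)
    return tokens


# A deliberately tiny, made-up dictionary.
BASE_WORDS = {"上次", "舞蹈", "演出", "直接", "上海", "弄堂"}

with_road = forward_max_match(SENTENCE, BASE_WORDS | {"上海路"})
without_road = forward_max_match(SENTENCE, BASE_WORDS)
print("/".join(with_road))
print("/".join(without_road))
```

With 上海路 in the dictionary the sentence is split exactly as in the earlier example. Without it, the same text yields 上海 followed by 路, so the index would contain no 上海路 token to look up.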

Problems encountered

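If indexing is very slow when testing on a VPS, the cause may be insufficient memory. For example, building a blog index with 512 MB of RAM was very slow; after upgrading to 1 GB it worked normally.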

Code

import logging
import os
import shutil

from django.conf import settings
from whoosh.fields import Schema, ID, TEXT
from whoosh.index import create_in, exists_in, open_dir
from whoosh.qparser import MultifieldParser
from jieba.analyse import ChineseAnalyzer

from .models import Article

log = logging.getLogger(__name__)

index_dir = os.path.join(settings.BASE_DIR, "whoosh_index")
# Open the index only if it already exists; otherwise rebuild() creates it.
indexer = open_dir(index_dir) if exists_in(index_dir) else None


def articles_search(keyword):
    # Search both fields, weighting title matches five times higher.
    mp = MultifieldParser(
        ['content', 'title'], schema=indexer.schema, fieldboosts={'title': 5.0})
    query = mp.parse(keyword)

    with indexer.searcher() as searcher:
        results = searcher.search(query, limit=15)
        articles = []
        for hit in results:
            log.debug(hit)
            articles.append({
                'id': hit['id'],
                'slug': hit['slug'],
            })
    return articles


def rebuild():
    # Rebind the module-level index so that later reads and writes use the
    # newly created index rather than the deleted one.
    global indexer

    if os.path.exists(index_dir):
        shutil.rmtree(index_dir)
    os.makedirs(index_dir)

    analyzer = ChineseAnalyzer()
    schema = Schema(
        id=ID(stored=True, unique=True),
        slug=TEXT(stored=True),
        title=TEXT(),
        content=TEXT(analyzer=analyzer))
    indexer = create_in(index_dir, schema)

    __index_all_articles()


def __index_all_articles():
    writer = indexer.writer()
    published_articles = Article.objects.exclude(is_draft=True)
    for article in published_articles:
        writer.add_document(
            id=str(article.id),
            slug=article.slug,
            title=article.title,
            content=article.content,
        )
    writer.commit()


def article_update_index(article):
    '''
    Update an article in the index, adding it if it is not there yet.
    '''
    writer = indexer.writer()
    writer.update_document(
        id=str(article.id),
        slug=article.slug,
        title=article.title,
        content=article.content,
    )
    writer.commit()


def article_delete_index(article):
    writer = indexer.writer()
    writer.delete_by_term('id', str(article.id))
    writer.commit()

That is all for this post. I hope it helps with your learning.
