How to search

2024-01-20 20:45 | Source: compiled from the web

How to search¶

Once you have created an index and added documents to it, you can search for those documents.

The Searcher object¶

To get a whoosh.searching.Searcher object, call searcher() on your Index object:

    searcher = myindex.searcher()

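You will usually want to open the searcher with a with statement so that it is closed automatically when you are done with it. Searcher objects represent a number of open files, so if you do not close them explicitly and the system is slow to collect them, you can run out of file handles:

    with ix.searcher() as searcher:
        ...

This is of course equivalent to:

    searcher = ix.searcher()
    try:
        ...
    finally:
        # Runs whether or not the body raised an exception
        searcher.close()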

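The Searcher object is the main high-level interface for reading the index. It has many useful methods for getting information about the index, such as lexicon(fieldname).

    >>> list(searcher.lexicon("content"))
    [u"document", u"index", u"whoosh"]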

The most important method on the Searcher object, however, is search(), which takes a whoosh.query.Query object and returns a Results object:

    from whoosh.qparser import QueryParser

    qp = QueryParser("content", schema=myindex.schema)
    q = qp.parse(u"hello world")

    with myindex.searcher() as s:
        results = s.search(q)

By default the results contain at most the first 10 matching documents. To get more results, use the limit keyword:

    results = s.search(q, limit=20)

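If you want all results, use limit=None. However, setting a limit whenever possible makes searches faster, because Whoosh does not need to examine and score every document.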
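One reason a bounded limit can be cheaper is that only the N best candidates ever need to be tracked. The following is a minimal plain-Python sketch of that idea; top_hits and the sample data are hypothetical names, and this is not how Whoosh's collectors are actually implemented.

```python
import heapq


def top_hits(scored_docs, limit=10):
    """Return the best (score, docnum) pairs, highest score first.

    Hypothetical helper: with limit=None every pair is sorted and returned.
    """
    if limit is None:
        return sorted(scored_docs, reverse=True)
    # nlargest only needs to remember `limit` candidates at a time
    return heapq.nlargest(limit, scored_docs)


# (score, document number) pairs in arbitrary order
scored = [(0.4, 7), (2.5, 3), (1.1, 9), (0.9, 1), (3.2, 4)]

best_two = top_hits(scored, limit=2)
everything = top_hits(scored, limit=None)
print(best_two)
```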

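Since displaying one page of results at a time is a common pattern, the search_page method lets you conveniently retrieve only the results on a given page:

    results = s.search_page(q, 1)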
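Conceptually, a page is a slice of the ranked hit list. The sketch below shows that arithmetic in plain Python, assuming page numbers start at 1 as in the call above. The function page_slice and its parameters are hypothetical names used for illustration, not part of Whoosh.

```python
def page_slice(hits, pagenum, pagelen=10):
    """Return the hits that fall on page `pagenum` (1-based)."""
    if pagenum < 1:
        raise ValueError("pagenum must be 1 or greater")
    start = (pagenum - 1) * pagelen
    return hits[start:start + pagelen]


# 25 ranked hits, best first
hits = ["doc%d" % n for n in range(1, 26)]

first_page = page_slice(hits, 1)   # hits 1 to 10
third_page = page_slice(hits, 3)   # the last, partial page
print(third_page)
```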

The default page length is 10 hits. You can use the pagelen keyword argument to set a different page length:

    results = s.search_page(q, 5, pagelen=20)

The Results object¶

The Results object acts like a list of the matched documents. You can use it to access the stored fields of each hit document in order to display them to the user.

    >>> # Show the best hit's stored fields
    >>> results[0]
    {"title": u"Hello World in Python", "path": u"/a/b/c"}
    >>> results[0:2]
    [{"title": u"Hello World in Python", "path": u"/a/b/c"},
     {"title": u"Foo", "path": u"/bar"}]

By default, Searcher.search(myquery) limits the number of hits to 10, so the number of scored hits in the Results object may be less than the number of matching documents in the index.

    >>> # How many documents in the entire index would have matched?
    >>> len(results)
    27
    >>> # How many scored and sorted documents in this Results object?
    >>> # This will often be less than len() if the number of hits was limited
    >>> # (the default).
    >>> results.scored_length()
    10

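Calling len(results) runs a fast (unscored) version of the query again to count the total number of matching documents. This is usually very fast, but on large indexes it can cause a noticeable delay. To avoid that delay on very large indexes, you can use the has_exact_length(), estimated_length(), and estimated_min_length() methods to estimate the number of matching documents without calling len():

    found = results.scored_length()
    if results.has_exact_length():
        print("Scored", found, "of exactly", len(results), "documents")
    else:
        low = results.estimated_min_length()
        high = results.estimated_length()
        print("Scored", found, "of between", low, "and", high, "documents")

Scoring and sorting¶

Scoring¶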

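The list of result documents is normally sorted by score. The whoosh.scoring module contains implementations of various scoring algorithms. The default is BM25F.

You can set the scoring object to use when you create the searcher with the weighting keyword argument:

    from whoosh import scoring

    with myindex.searcher(weighting=scoring.TF_IDF()) as s:
        ...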

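A weighting model is a WeightingModel subclass with a scorer() method that produces a "Scorer" instance. This instance has a method that takes the current matcher and returns a floating point score.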
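The two-level design described above (a model that produces a scorer, and a scorer that scores one matcher at a time) can be sketched in plain Python. Everything below is a hypothetical stand-in written for illustration: FakeMatcher, FrequencyScorer, and FrequencyWeighting are not Whoosh classes, and the real scorer() method receives additional arguments that this sketch omits.

```python
class FakeMatcher:
    """Hypothetical stand-in for a matcher positioned on one document."""

    def __init__(self, term_weight):
        self._weight = term_weight

    def weight(self):
        # Weight (for example, frequency) of the term in the current document
        return self._weight


class FrequencyScorer:
    """Scores a document by the raw weight of the matched term."""

    def score(self, matcher):
        return float(matcher.weight())


class FrequencyWeighting:
    """Model object whose only job is to hand out scorer instances."""

    def scorer(self):
        return FrequencyScorer()


scorer = FrequencyWeighting().scorer()
score = scorer.score(FakeMatcher(term_weight=3))
print(score)
```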

Sorting¶

See Sorting and faceting.

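Highlighting snippets and more¶

See How to create highlighted search result excerpts and Query expansion and key word extraction for information on these topics.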

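Filtering results¶

You can use the filter keyword argument to search() to specify a set of documents to permit in the results. The argument can be a whoosh.query.Query object, a whoosh.searching.Results object, or a set-like object containing document numbers. The searcher caches filters, so if you use the same query filter with a searcher several times, the later searches will be faster because the results of running the filter query are already cached.

You can also use the mask keyword argument to specify a set of documents that are not permitted in the results.

    from whoosh import qparser, query

    with myindex.searcher() as s:
        qp = qparser.QueryParser("content", myindex.schema)
        user_q = qp.parse(query_string)

        # Only show documents in the "rendering" chapter
        allow_q = query.Term("chapter", "rendering")
        # Don't show any documents where the "tag" field contains "todo"
        restrict_q = query.Term("tag", "todo")

        results = s.search(user_q, filter=allow_q, mask=restrict_q)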

(If you specify both a filter and a mask, and a matching document appears in both, the mask "wins" and the document is not permitted.)
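The interaction between filter and mask can be modeled with ordinary set operations on document numbers. This is a minimal sketch of the rule stated above, not Whoosh's implementation; apply_filter_and_mask, allow, and removed_count are hypothetical names.

```python
def apply_filter_and_mask(matches, allow=None, mask=None):
    """Return the permitted document numbers and how many were removed."""
    permitted = set(matches)
    if allow is not None:
        # Keep only documents that the filter permits
        permitted &= set(allow)
    if mask is not None:
        # The mask is applied last, so it "wins" over the filter
        permitted -= set(mask)
    removed_count = len(set(matches)) - len(permitted)
    return permitted, removed_count


matches = {1, 2, 3, 4, 5}
# Document 4 is in both the filter and the mask, so it is not permitted
permitted, removed_count = apply_filter_and_mask(
    matches, allow={2, 3, 4, 9}, mask={4}
)
print(sorted(permitted), removed_count)
```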

To find out how many documents were filtered out of the results, use results.filtered_count (or resultspage.results.filtered_count):

    from datetime import datetime, timedelta

    from whoosh import qparser, query

    with myindex.searcher() as s:
        qp = qparser.QueryParser("content", myindex.schema)
        user_q = qp.parse(query_string)

        # Filter documents older than 7 days
        old_q = query.DateRange("created", None, datetime.now() - timedelta(days=7))
        results = s.search(user_q, mask=old_q)

        print("Filtered out %d older documents" % results.filtered_count)

Which terms from my query matched?¶

You can use the terms=True keyword argument to search() to have the search record which terms in the query matched which documents:

    with myindex.searcher() as s:
        results = s.search(myquery, terms=True)

You can then get information about which terms matched from the whoosh.searching.Results and whoosh.searching.Hit objects:

    # Was this results object created with terms=True?
    if results.has_matched_terms():
        # What terms matched in the results?
        print(results.matched_terms())

        # What terms matched in each hit?
        for hit in results:
            print(hit.matched_terms())

Collapsing results¶

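Whoosh lets you eliminate all but the top N documents with the same facet key from the results. This is useful in a few situations:

Eliminating duplicates at search time.

Restricting the number of matches per source. For example, in a web search application, you might want to show at most three matches from any website.

Whether a document should be collapsed is determined by the value of a "collapse facet". If a document has an empty collapse key, it will never be collapsed. Otherwise only the top N documents with the same collapse key will appear in the results.

See Sorting and faceting for information on facets.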

    with myindex.searcher() as s:
        # Set the facet to collapse on and the maximum number of documents per
        # facet value (default is 1)
        results = s.search(myquery, collapse="hostname", collapse_limit=3)

        # Dictionary mapping collapse keys to the number of documents that
        # were filtered out by collapsing on that key
        print(results.collapsed_counts)

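Collapsing works with both scored and sorted results. You can use any of the facet types available in the whoosh.sorting module.

By default, Whoosh uses the results order (score or sort key) to decide which documents to collapse. For example, in scored results the best scoring documents are kept. You can optionally specify a collapse_order facet to control which documents are kept when collapsing.

For example, in a product search you could display results sorted by decreasing price and eliminate all but the highest rated item of each product type: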

    from whoosh import sorting

    with myindex.searcher() as s:
        price_facet = sorting.FieldFacet("price", reverse=True)
        type_facet = sorting.FieldFacet("type")
        rating_facet = sorting.FieldFacet("rating", reverse=True)

        results = s.search(myquery,
                           sortedby=price_facet,         # Sort by reverse price
                           collapse=type_facet,          # Collapse on product type
                           collapse_order=rating_facet)  # Collapse to highest rated

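Collapsing happens during the search, so it is usually more efficient than finding everything and post-processing the results. However, if collapsing eliminates a large number of documents, a collapsed search can take longer, because the search has to consider more documents and remove many documents it has already collected.

Since this collector must sometimes go back and remove already-collected documents, if you use it in combination with TermsCollector and/or FacetCollector, those collectors may contain information about documents that were filtered out of the final results by collapsing.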
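The collapsing rule itself (keep the top N documents per key, never collapse documents with an empty key, and count what was dropped) can be illustrated in plain Python. This sketch post-processes a ranked list for clarity, whereas Whoosh collapses during the search; collapse, key_of, and the sample hits are hypothetical, and an empty key is modeled here as None.

```python
from collections import defaultdict


def collapse(ranked_hits, key_of, collapse_limit=1):
    """Keep at most `collapse_limit` hits per collapse key, in rank order."""
    kept = []
    seen = defaultdict(int)              # key -> number of hits kept so far
    collapsed_counts = defaultdict(int)  # key -> number of hits dropped
    for hit in ranked_hits:
        key = key_of(hit)
        if key is None:
            # Documents with an empty collapse key are never collapsed
            kept.append(hit)
        elif seen[key] < collapse_limit:
            seen[key] += 1
            kept.append(hit)
        else:
            collapsed_counts[key] += 1
    return kept, dict(collapsed_counts)


hits = [
    {"url": "a.com/1", "hostname": "a.com"},
    {"url": "a.com/2", "hostname": "a.com"},
    {"url": "b.org/1", "hostname": "b.org"},
    {"url": "a.com/3", "hostname": "a.com"},
    {"url": "local", "hostname": None},
]
kept, counts = collapse(hits, lambda h: h["hostname"], collapse_limit=2)
print([h["url"] for h in kept], counts)
```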

Time limited searches¶

To limit the amount of time a search can take:

    from whoosh.collectors import TimeLimitCollector, TimeLimit

    with myindex.searcher() as s:
        # Get a collector object
        c = s.collector(limit=None, sortedby="title_exact")
        # Wrap it in a TimeLimitCollector and set the time limit to 10 seconds
        tlc = TimeLimitCollector(c, timelimit=10.0)

        # Try searching
        try:
            s.search_with_collector(myquery, tlc)
        except TimeLimit:
            print("Search took too long, aborting!")

        # You can still get partial results from the collector
        results = tlc.results()

Convenience methods¶

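The document() and documents() methods on the Searcher object let you retrieve the stored fields of documents that match terms you pass as keyword arguments.

This is especially useful for fields such as dates/times, identifiers, and paths.

    >>> list(searcher.documents(indexeddate=u"20051225"))
    [{"title": u"Christmas presents"}, {"title": u"Turkey dinner report"}]
    >>> print(searcher.document(path=u"/a/b/c"))
    {"title": "Document C"}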

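These methods have some limitations:

The results are not scored.

Multiple keywords are always AND-ed together.

The entire value of each keyword argument is treated as a single term; you cannot search for multiple terms in the same field.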
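The limitations above follow naturally from the matching rule. As a rough plain-Python model (not Whoosh's implementation), the two methods behave like an exact-match lookup in which every keyword must match. The helper functions and sample records below are hypothetical, and stored field values stand in for indexed terms.

```python
def documents(stored, **terms):
    """Yield stored-field dicts whose fields equal every keyword value."""
    for fields in stored:
        # Every keyword must match (AND), and each value is compared whole
        if all(fields.get(name) == value for name, value in terms.items()):
            yield fields


def document(stored, **terms):
    """Return the first matching dict, or None if nothing matches."""
    return next(documents(stored, **terms), None)


stored = [
    {"path": "/a/b/c", "date": "20051225", "title": "Christmas presents"},
    {"path": "/a/b/d", "date": "20051225", "title": "Turkey dinner report"},
    {"path": "/a/b/e", "date": "20060101", "title": "New year"},
]

same_day = list(documents(stored, date="20051225"))
one = document(stored, date="20051225", path="/a/b/d")
missing = document(stored, path="/nope")
print(len(same_day), one["title"], missing)
```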

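Combining Results objects¶

It is sometimes useful to use the results of another query to influence the order of a whoosh.searching.Results object.

For example, you might have a "best bet" field. This field contains hand-picked keywords for documents. When the user searches for those keywords, you want those documents to be placed at the top of the results list. You could try to do this by boosting the "bestbet" field tremendously, but that can have unpredictable effects on scoring. It is much easier to simply run the query twice and combine the results:

    from whoosh.query import Or, Term

    # Parse the user query
    userquery = queryparser.parse(querystring)

    # Get the searched-for terms that exist in the index
    # (assumes a Whoosh 2.x existing_terms() that takes a reader
    # and returns a set of (fieldname, text) pairs)
    termset = userquery.existing_terms(s.reader())

    # Formulate a "best bet" query for the terms the user
    # searched for in the "content" field
    bbq = Or([Term("bestbet", text) for fieldname, text
              in termset if fieldname == "content"])

    # Find documents matching the searched for terms
    results = s.search(bbq, limit=5)

    # Find documents that match the original query
    allresults = s.search(userquery, limit=10)

    # Add the user query results on to the end of the "best bet"
    # results. If documents appear in both result sets, push them
    # to the top of the combined results.
    results.upgrade_and_extend(allresults)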

The Results object supports the following methods:

Results.extend(results)

Adds the documents in 'results' on to the end of the list of result documents.

Results.filter(results)

Removes the documents in 'results' from the list of result documents.

Results.upgrade(results)

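Any result documents that also appear in 'results' are moved to the top of the list of result documents.

Results.upgrade_and_extend(results)

Any result documents that also appear in 'results' are moved to the top of the list of result documents. Then any other documents in 'results' are added on to the list of result documents.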
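The four operations can be modeled on plain lists of document numbers to make their ordering effects concrete. This is an illustrative sketch based on the descriptions above, not Whoosh's code. The function names mirror the methods but are standalone hypothetical helpers, and skipping documents that are already present when extending is an assumption.

```python
def extend(base, other):
    """Append documents from `other` that are not already in `base`."""
    return base + [d for d in other if d not in base]


def filter_out(base, other):
    """Remove the documents in `other` from `base`."""
    return [d for d in base if d not in other]


def upgrade(base, other):
    """Move documents that also appear in `other` to the top of `base`."""
    shared = [d for d in base if d in other]
    rest = [d for d in base if d not in other]
    return shared + rest


def upgrade_and_extend(base, other):
    """Upgrade shared documents, then append the rest of `other`."""
    return upgrade(base, other) + [d for d in other if d not in base]


best_bets = [4, 8, 2]    # hits from the "best bet" query
all_hits = [7, 2, 9, 4]  # hits from the user's original query

combined = upgrade_and_extend(best_bets, all_hits)
print(combined)
```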
