Developing a Text Search Engine using the Whoosh Library in Python


Whoosh is a Python library of classes and functions for indexing text and then searching that index. Suppose you are building an application that needs to go through many documents and find similarities, extract data that meets a few predefined conditions, or count how many times a project title is mentioned in a research paper. In any of these cases, the search engine we build in this tutorial will come in handy.

Getting Started

To build our text search engine, we will work with the Whoosh library.

This library does not come pre-packaged with Python, so we will download and install it using the pip package manager.

To install the Whoosh library, run the following command.

pip install whoosh

Now we can import it into our script as follows.

from whoosh.fields import Schema, TEXT, ID
from whoosh import index

Building a Text Search Engine using Python

First, let us create a folder in which the index files will be saved. Using os.makedirs with exist_ok=True means the script does not fail if the folder already exists.

import os
os.makedirs("dir", exist_ok=True)

Next, let us define a schema, which specifies the fields of the documents in an index. We then create the index in our folder, add a single document to it, and commit the change.

schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
ind = index.create_in("dir", schema)
writer = ind.writer()
writer.add_document(title=u"doc", content=u"Py doc hello big world", path=u"/a")
writer.commit()

Now that we have indexed the document, we can search it.

from whoosh.qparser import QueryParser
with ind.searcher() as searcher:
    query = QueryParser("content", ind.schema).parse("hello world")
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
        if results.has_matched_terms():
            print(results.matched_terms())

Output

It will produce the following output:

<Hit {'path': '/a', 'title': 'doc', 'content': 'Py doc hello big world'}> 1.7906976744186047
{('content', b'hello'), ('content', b'world')}

Example

Here is the complete code:

import os

from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Create the folder that will hold the index files
os.makedirs("dir", exist_ok=True)

# Define the fields of each document and create the index
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
ind = index.create_in("dir", schema)

# Add one document to the index and commit the change
writer = ind.writer()
writer.add_document(title=u"doc", content=u"Py doc hello big world", path=u"/a")
writer.commit()

# Search the "content" field and print each hit, its score, and the matched terms
with ind.searcher() as searcher:
    query = QueryParser("content", ind.schema).parse("hello world")
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
        if results.has_matched_terms():
            print(results.matched_terms())

Conclusion

You have now learnt how to build a simple text search engine in Python with the Whoosh library: defining a schema, indexing a document, and querying the index. The same approach lets you search through many documents and retrieve the matching content quickly.

Updated on: 31-Aug-2023
