 
Developing a Text Search Engine using the Whoosh Library in Python
Whoosh is a Python library of classes and functions for indexing text and then searching the index. If you are building an application that has to go through many documents to find matches or pull out data based on predefined conditions, for example counting how many times a project title is mentioned in a research paper, the search engine built in this tutorial will come in handy.
Getting Started
For building our text search engine, we will be working with the whoosh library.
This library does not come pre-packaged with Python, so we'll download and install it using the pip package manager.
To install the whoosh library, run the following command.
pip install whoosh
Now we can import it into our script using the lines below.
from whoosh.fields import Schema, TEXT, ID
from whoosh import index
Building a Text Search Engine using Python
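First, let us create a folder where the index files will be saved. Using os.makedirs with exist_ok=True means the script does not fail if the folder already exists from an earlier run.
import os
os.makedirs("dir", exist_ok=True)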
Next, let us define a schema. A schema specifies the fields of the documents in an index. We then create the index in our folder, add one document through a writer, and commit it so that it becomes searchable.
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
ind = index.create_in("dir", schema)
writer = ind.writer()
writer.add_document(title="doc", content="Py doc hello big world", path="/a")
writer.commit()
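Now that the document is indexed, we can search it. Passing terms=True asks Whoosh to record which query terms matched, so we can print them after the hits.
from whoosh.qparser import QueryParser
with ind.searcher() as searcher:
    # Parse the query string against the "content" field
    query = QueryParser("content", ind.schema).parse("hello world")
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
    # Matched terms are collected for the whole result set, so print them once
    if results.has_matched_terms():
        print(results.matched_terms())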
Output
It will produce the following output:
<Hit {'path': '/a', 'title': 'doc', 'content': 'Py doc hello big world'}> 1.7906976744186047
{('content', b'hello'), ('content', b'world')}
Example
Here is the complete code:
import os
from whoosh import index
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser

# Folder that holds the index files; no error if it already exists
os.makedirs("dir", exist_ok=True)

# Define the fields and build a fresh index with one document
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))
ind = index.create_in("dir", schema)
writer = ind.writer()
writer.add_document(title="doc", content="Py doc hello big world", path="/a")
writer.commit()

# Search the "content" field and report hits with their scores
with ind.searcher() as searcher:
    query = QueryParser("content", ind.schema).parse("hello world")
    results = searcher.search(query, terms=True)
    for r in results:
        print(r, r.score)
    # Matched terms are collected for the whole result set, so print them once
    if results.has_matched_terms():
        print(results.matched_terms())
Conclusion
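You have now learnt how to create a simple text search engine in Python: define a schema, index documents with a writer, and query the index with QueryParser and a searcher. The same steps can be applied to larger collections of documents to find the content you need quickly, and they give a first look at what the Whoosh library can do.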
