How to scan for a string in multiple document formats (CSV, Text, MS Word) with Python?


Problem..

Assume you have a directory full of files in different formats, and you need to find which of them contain a particular keyword.

Getting ready..

Install the following packages.

1. beautifulsoup4

2. python-docx

How to do it...

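1. Write a function to search for a string in a CSV file. It uses csv.reader to go through the file row by row and returns True as soon as the string is found in any column, otherwise False. Each column is lowercased before the comparison, so the search string should be passed in lowercase.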

Example

import csv


def csv_stringsearch(input_file, input_string):
    """
    Function: search a string in csv files.
    args: input file , input string
    """
    with open(input_file) as file:
        for row in csv.reader(file):
            for column in row:
                # Columns are lowercased, so input_string must be lowercase
                if input_string in column.lower():
                    return True
    return False
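To see the CSV search run end to end, here is a minimal sketch that writes a small throwaway CSV file and searches it. The function name csv_contains and the sample rows are made up for illustration; the variant also lowercases the search string itself and opens the file with newline='', which the csv module documentation recommends for files passed to csv.reader.

```python
import csv
import os
import tempfile


def csv_contains(input_file, input_string):
    """Return True if input_string occurs in any cell, ignoring case."""
    needle = input_string.lower()
    # newline='' lets the csv module handle line endings itself
    with open(input_file, newline='') as file:
        for row in csv.reader(file):
            if any(needle in column.lower() for column in row):
                return True
    return False


# Build a small temporary CSV file to search (hypothetical sample data)
with tempfile.NamedTemporaryFile('w', suffix='.csv', newline='', delete=False) as tmp:
    csv.writer(tmp).writerows([['id', 'greeting'],
                               ['1', 'Hello, World'],
                               ['2', 'Goodbye']])
    sample_path = tmp.name

found_hello = csv_contains(sample_path, 'HELLO')
found_missing = csv_contains(sample_path, 'python')
os.remove(sample_path)
print(found_hello, found_missing)
```

Lowercasing inside the function means the caller does not have to remember to do it, at the cost of repeating the work on every call.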

2. Write a function to search a text file. This is a bit trickier because we need to deal with the file's encoding. There are many encodings in use, and determining which one a file uses is probably the hardest part. Of course, we could go back to the person who created the text file, but the point here is to automate the search.

So, we will use UnicodeDammit (from beautifulsoup4) to guess the encoding from the first 1024 bytes of the file.

Example

from bs4 import UnicodeDammit


def text_stringsearch(input_file, input_string):
    """
    Function: search a string in text files.
    args: input file , input string
    """
    # Read the first 1024 bytes in binary mode to guess the encoding
    with open(input_file, 'rb') as file:
        content = file.read(1024)

    guessencoding = UnicodeDammit(content)
    encoding = guessencoding.original_encoding

    # Open and read
    with open(input_file, encoding=encoding) as file:
        for line in file:
            if input_string in line.lower():
                return True

    return False
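If you would rather not depend on an encoding detector, one alternative (not the approach used in this recipe) is to try a short list of candidate encodings in order and keep the first one that decodes cleanly. The sketch below uses only the standard library; the name read_text_with_fallback and the candidate list are assumptions you would adapt to the files you expect.

```python
import os
import tempfile

# Hypothetical candidate list; order matters, strictest encodings first
CANDIDATE_ENCODINGS = ('utf-8', 'cp1252', 'latin-1')


def read_text_with_fallback(path, encodings=CANDIDATE_ENCODINGS):
    """Decode the file with the first candidate encoding that succeeds."""
    with open(path, 'rb') as file:
        raw = file.read()
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeError:
            continue
    # Last resort: decode as UTF-8 and replace undecodable bytes
    return raw.decode('utf-8', errors='replace')


# Write bytes that are valid cp1252 but not valid UTF-8
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'Caf\xe9 HELLO')
    sample_path = tmp.name

text = read_text_with_fallback(sample_path)
os.remove(sample_path)
print('hello' in text.lower())
```

This reads the whole file into memory and can pick a wrong but decodable encoding, so it is a trade-off rather than a drop-in replacement for detection.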

3. Write a function to search for a string in an MS Word (.docx) document.

Example

import docx


def MSDocx_stringsearch(input_file, input_string):
    """
    Function: search a string in MS Word documents.
    args: input file , input string
    """
    doc = docx.Document(input_file)
    # Check the text of every paragraph in the document body
    for paragraph in doc.paragraphs:
        if input_string in paragraph.text.lower():
            return True
    return False
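The function above looks only at doc.paragraphs, which in python-docx covers the paragraphs of the document body and, as far as I know, not the text inside tables. A possible extension, sketched below under that assumption, also walks doc.tables. The names document_contains and docx_file_contains are hypothetical; the search logic takes an already opened document so that it can be exercised without a real .docx file.

```python
def document_contains(doc, input_string):
    """Search body paragraphs and table cells of an opened document."""
    needle = input_string.lower()
    if any(needle in paragraph.text.lower() for paragraph in doc.paragraphs):
        return True
    # Table text lives in cells, reached through table.rows and row.cells
    for table in doc.tables:
        for row in table.rows:
            for cell in row.cells:
                if needle in cell.text.lower():
                    return True
    return False


def docx_file_contains(input_file, input_string):
    """Open a .docx file with python-docx and search it."""
    import docx  # provided by the python-docx package
    return document_contains(docx.Document(input_file), input_string)
```

To use it in the recipe, docx_file_contains could take the place of MSDocx_stringsearch in the extension mapping. Headers and footers are still not searched in this sketch.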

4. Now we need a main function that loops through the files and calls the corresponding function with the string to search for. Here I assume the code and the input files are in the same directory. You can add a path parameter if your directory is in a different location.

Example

import os


def main(input_string):
    """
    Function: Open the current directory and search for a string in all the files
    args: input string
    """
    for root, dirs, files in os.walk('.'):
        for file in files:

            # Get the file extension
            extension = file.split('.')[-1]

            # function_mapping is the dictionary defined in step 5
            if extension in function_mapping:
                search_file = function_mapping.get(extension)
                full_file_path = os.path.join(root, file)

                if search_file(full_file_path, input_string):
                    print(f' *** Yeah String found in {full_file_path}')

5. Map our functions to the file extensions by creating a dictionary.

Example

function_mapping = {
    'csv': csv_stringsearch,
    'txt': text_stringsearch,
    'docx': MSDocx_stringsearch,
}
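Splitting the file name on '.' treats a file that is literally named txt, or a hidden file such as .csv, as if it had that extension, and it is case-sensitive. A slightly more defensive lookup, sketched here with os.path.splitext, is shown below. The helper pick_search_function and the stand-in search functions are hypothetical and exist only so the snippet runs on its own.

```python
import os


def pick_search_function(filename, mapping):
    """Return the search function registered for the file's extension, or None."""
    # splitext returns ('name', '.ext'); drop the dot and normalise the case
    extension = os.path.splitext(filename)[1].lstrip('.').lower()
    return mapping.get(extension)


# Stand-in search functions (hypothetical) in place of the real ones
def fake_csv_search(path, text):
    return False


def fake_txt_search(path, text):
    return False


demo_mapping = {'csv': fake_csv_search, 'txt': fake_txt_search}

# An upper-case extension still resolves to the CSV search function
print(pick_search_function('report.CSV', demo_mapping) is fake_csv_search)
# Files without an extension are skipped, including one named 'txt'
print(pick_search_function('README', demo_mapping))
print(pick_search_function('txt', demo_mapping))
```

Because the dictionary lookup returns None for unknown extensions, the caller can skip unsupported files with a single check.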

6. Putting it all together.

Example

import os
import argparse
import csv
import docx
from bs4 import UnicodeDammit


def csv_stringsearch(input_file, input_string):
    """
    Function: search a string in csv files.
    args: input file , input string
    """
    with open(input_file) as file:
        for row in csv.reader(file):
            for column in row:
                if input_string in column.lower():
                    return True
    return False


def MSDocx_stringsearch(input_file, input_string):
    """
    Function: search a string in MS Word documents.
    args: input file , input string
    """
    doc = docx.Document(input_file)
    for paragraph in doc.paragraphs:
        if input_string in paragraph.text.lower():
            return True

    return False

def text_stringsearch(input_file, input_string):
    """
    Function: search a string in text files.
    args: input file , input string
    """
    with open(input_file, 'rb') as file:
        content = file.read(1024)

    guessencoding = UnicodeDammit(content)
    encoding = guessencoding.original_encoding

    # Open and read
    with open(input_file, encoding=encoding) as file:
        for line in file:
            if input_string in line.lower():
                return True

    return False

def main(input_string):
    """
    Function: Open the current directory and search for a string in all the files
    args: input string
    """
    for root, dirs, files in os.walk('.'):
        for file in files:

            # Get the file extension
            extension = file.split('.')[-1]

            if extension in function_mapping:
                search_file = function_mapping.get(extension)
                full_file_path = os.path.join(root, file)

                if search_file(full_file_path, input_string):
                    print(f' *** Yeah String found in {full_file_path}')

function_mapping = {
    'csv': csv_stringsearch,
    'txt': text_stringsearch,
    'docx': MSDocx_stringsearch,
}

if __name__ == '__main__':
    string_to_search = 'Hello'
    print(f'Output \n')
    main(string_to_search.lower())

Output

*** Yeah String found in .\Hello_World.docx
*** Yeah String found in .\My_Amazing_WordDoc.docx

7. If you want to pass the search string on the command line instead of hard-coding it, use argparse.

Example

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-s', type=str, help='Input string to search', default='Hello')
    args = parser.parse_args()
    main(args.s.lower())
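Step 4 mentions adding a path parameter when the files live in another directory. One way that might look on the command line is sketched below. The -p/--path option and the build_parser helper are assumptions, not part of the original program, and main would need to accept the path and pass it to os.walk for the option to have any effect.

```python
import argparse


def build_parser():
    """Create a parser with the search string and a directory option."""
    parser = argparse.ArgumentParser(description='Search files for a string.')
    parser.add_argument('-s', '--string', default='Hello',
                        help='Input string to search')
    # Hypothetical extra option: the directory to walk instead of '.'
    parser.add_argument('-p', '--path', default='.',
                        help='Directory to search')
    return parser


# Passing a list to parse_args simulates command-line arguments
args = build_parser().parse_args(['-s', 'Invoice', '-p', 'reports'])
print(args.string.lower(), args.path)
```

Keeping '.' as the default means running the script without -p behaves like the original version.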

Updated on: 10-Nov-2020
