How to improve file reading performance in Python with the mmap() function?


Introduction..

mmap, short for memory mapping, uses the operating system's virtual memory to access a file's data directly instead of going through the normal I/O functions. This can improve I/O performance, because it requires neither a separate system call for each access nor copying data between buffers.
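As a rough illustration of the two access styles, the sketch below reads small chunks at many offsets, once with seek() and read() and once by slicing a memory map. The helper names (sample_with_seek, sample_with_mmap), the temporary sample file, and the chunk size are assumptions made for this example. The timings it prints depend on the operating system, file size, and cache state, so treat them as an experiment rather than a benchmark.

```python
import mmap
import os
import random
import tempfile
import time


def sample_with_seek(path, offsets, size):
    """Read `size` bytes at each offset using seek() and read()."""
    chunks = []
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            chunks.append(f.read(size))
    return chunks


def sample_with_mmap(path, offsets, size):
    """Read `size` bytes at each offset by slicing a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return [m[offset:offset + size] for offset in offsets]


# Build a throwaway sample file (hypothetical data, not the article's input.txt).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"Lorem ipsum dolor sit amet\n" * 20_000)
    sample_path = tmp.name

file_size = os.path.getsize(sample_path)
rng = random.Random(0)  # fixed seed so both approaches see the same offsets
offsets = [rng.randrange(0, file_size - 8) for _ in range(10_000)]

results = {}
for name, func in (("seek/read", sample_with_seek), ("mmap", sample_with_mmap)):
    start = time.perf_counter()
    results[name] = func(sample_path, offsets, 8)
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(results[name])} chunks in {elapsed:.4f}s")

os.remove(sample_path)
```

Both helpers return the same list of byte chunks; only the way the data is reached differs.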

As a matter of fact, anything held in memory tends to be faster than its on-disk counterpart. An SQLite database created in memory, for instance, performs better than one stored on disk.

Memory-mapped files can be treated as mutable byte strings (similar to a bytearray) or as file-like objects, depending on what you want to do.

A mmap object supports many methods, such as close(), flush(), read(), readline(), seek(), tell() and write(), and it works well with slice operations and even regular expressions.
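The following minimal sketch shows slicing, find() and a regular expression applied directly to a memory map. The temporary file, its one-line content, and the variable names are assumptions made for this example. Note that the pattern has to be a bytes pattern, because the map exposes bytes rather than text.

```python
import mmap
import os
import re
import tempfile

# Throwaway one-line sample file (hypothetical, for illustration only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Lorem ipsum dolor sit amet, causae apeirian ea his.\n")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        first_word = m[:5]                            # slicing returns bytes
        position = m.find(b"dolor")                   # byte offset of the match
        words_with_or = re.findall(rb"\w*or\w*", m)   # regex over the map

os.remove(path)

print(first_word)      # b'Lorem'
print(position)        # 12
print(words_with_or)   # [b'Lorem', b'dolor']
```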

How to do it..

1. Assume a text file with the contents below. You can get similar text by searching the web for sample text. Copy these contents into an input.txt file.

Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.

Id porro facete cum. No est veritus detraxit facilisis, sit ea clita decore essent. Ut eam labores fuisset menandri, ex sit brute viderer eleifend, altera argumentum vel ex. Duo at zril sensibus, eu vim ullum assentior, quando possit at his.

Te nam tempor posidonium scripserit, eam mundi reprimique dissentias ne. Vim te soleat offendit democritum. Nam an diam elaboraret, quaeque dissentias an has. Autem legendos dignissim ad vis, sea ex amet petentium reprehendunt, inermis constituam philosophia ne mel. Esse noster lobortis usu ne.

Nec reque postea urbanitas ut, mea in nulla invidunt ocurreret. Ei duo iuvaret numquam. Ferri nemore audire te est, mel et detracto noluisse. Nec eu habeo justo, id pro posse apeirian volutpat. Mea sonet quaestio ne.

Atqui quaeque alienum te vim. Graeco aliquip liberavisse pro ut. Te similique reformidans usu, te mundi aliquando ius. Meis scripta minimum quo no, meis prima fabellas eu eam, laoreet delicata forensibus ut vim. Et quo vocibus mediocritatem, atqui summo an eam.

2. We will use the mmap() function to create a memory-mapped file. Its first argument is a file descriptor, which we can get either from the fileno() method of a file object or from os.open().

Note: The user is responsible for opening the file before invoking mmap(), and closing it.

The second argument to mmap() is a size in bytes indicating the portion of the file to map. If the value is 0, the entire file is mapped. The optional access argument can be ACCESS_READ for read-only access, ACCESS_WRITE for write-through access, or ACCESS_COPY for copy-on-write access.

import mmap

input_text = """Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.

Id porro facete cum. No est veritus detraxit facilisis, sit ea clita decore essent. Ut eam labores fuisset menandri, ex sit brute viderer eleifend, altera argumentum vel ex. Duo at zril sensibus, eu vim ullum assentior, quando possit at his.

Te nam tempor posidonium scripserit, eam mundi reprimique dissentias ne. Vim te soleat offendit democritum. Nam an diam elaboraret, quaeque dissentias an has. Autem legendos dignissim ad vis, sea ex amet petentium reprehendunt, inermis constituam philosophia ne mel. Esse noster lobortis usu ne.

Nec reque postea urbanitas ut, mea in nulla invidunt ocurreret. Ei duo iuvaret numquam. Ferri nemore audire te est, mel et detracto noluisse. Nec eu habeo justo, id pro posse apeirian volutpat. Mea sonet quaestio ne.

Atqui quaeque alienum te vim. Graeco aliquip liberavisse pro ut. Te similique reformidans usu, te mundi aliquando ius. Meis scripta minimum quo no, meis prima fabellas eu eam, laoreet delicata forensibus ut vim. Et quo vocibus mediocritatem, atqui summo an eam.

"""

# Create an input file with some text
input_file = 'input.txt'
with open(input_file, "w+") as f:
    f.write(input_text)

# Open the file in read mode and map it read-only
with open(input_file, 'r') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        print(f"Output \n*** Output first 5 bytes of the {input_file} is {m.read(5)} ")
        print(f"*** Output Next 10 bytes of the {input_file} is {m.read(10)} ")

Output

*** Output first 5 bytes of the input.txt is b'Lorem'
*** Output Next 10 bytes of the input.txt is b' ipsum dol'

3. We have mapped the file to memory and used .read() to read the first 5 bytes. The file pointer therefore moves ahead 5 bytes after the first read. If you now do one more read, say read(10), it gives you bytes 6 - 15.
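The pointer movement can be checked with tell(), as in this small sketch. The temporary file and the variable names are assumptions for illustration; tell() reports 0-based offsets, so position 5 means the next read starts at the sixth byte.

```python
import mmap
import os
import tempfile

# Throwaway sample file (hypothetical, for illustration only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Lorem ipsum dolor sit amet")
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        positions = [m.tell()]       # pointer starts at 0
        first = m.read(5)
        positions.append(m.tell())   # pointer after reading 5 bytes
        second = m.read(10)
        positions.append(m.tell())   # pointer after reading 10 more bytes
        m.seek(0)                    # rewind to the beginning
        again = m.read(5)

os.remove(path)

print(positions)             # [0, 5, 15]
print(first, second, again)  # b'Lorem' b' ipsum dol' b'Lorem'
```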

4. To make the memory-mapped file updatable, open it in 'r+' mode (not 'w', which would truncate the file) before mapping it.

The example below shows how to modify part of a line in place.

import mmap
import shutil

input_file = 'input.txt'
input_copy = input_file.replace('input','input_copy')

# Make a Copy of the file just to make sure original is un-modified.
shutil.copyfile(input_file,input_copy)

# word
word = b'ipsum'

# modified word
modified_word = word[::-1]

# Open the file to receive updates
with open(input_copy, 'r+') as f:
    with mmap.mmap(f.fileno(), 0) as m:
        print(f"output \n *** Line before updates \n {m.readline().rstrip()}")

        # Rewind using seek
        m.seek(0)

        # find the word and reverse it
        loc = m.find(word)
        m[loc:loc + len(word)] = modified_word
        m.flush()

        # Rewind using seek
        m.seek(0)
        print(f" \n *** Line after updates \n {m.readline().rstrip()}")

        # Read the first line again through the regular file object
        f.seek(0)
        print(f" \n *** Final file \n {f.readline().rstrip()}")

Output

*** Line before updates
b'Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.'

*** Line after updates
b'Lorem muspi dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.'

*** Final file
Lorem muspi dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.

5. The word “ipsum” in the middle of the first line is reversed to “muspi”, both in memory and in the file on disk.
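One detail worth knowing about in-place edits is that a slice assignment cannot change the size of the map, so the replacement must have exactly the same length as the slice it overwrites. The sketch below is a minimal illustration on a throwaway file. The file content and variable names are assumptions, and since the exact exception type may differ between Python versions, the code catches both IndexError and ValueError.

```python
import mmap
import os
import tempfile

# Throwaway sample file (hypothetical, for illustration only).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"Lorem ipsum dolor sit amet")
    path = tmp.name

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as m:
        loc = m.find(b"ipsum")

        # A replacement of a different length is rejected.
        try:
            m[loc:loc + 5] = b"longer-word"
            error_name = None
        except (IndexError, ValueError) as exc:
            error_name = type(exc).__name__

        # A replacement of the same length is written through to the file.
        m[loc:loc + 5] = b"IPSUM"
        m.flush()

with open(path, "rb") as f:
    final = f.read()

os.remove(path)

print(error_name)
print(final)  # b'Lorem IPSUM dolor sit amet'
```

To insert or delete bytes you would typically rewrite the file instead of editing the map.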

6. If, for whatever reason, you want to see the changes in memory but do not want to update the file on disk, use ACCESS_COPY.

import mmap
import shutil

input_file = 'input.txt'
input_copy = input_file.replace('input','input_copy')

# Make a Copy of the file just to make sure original is un-modified.
shutil.copyfile(input_file,input_copy)

# word
word = b'ipsum'

# modified word
modified_word = word[::-1]

# Open the file to receive updates
with open(input_copy, 'r+') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as m:
        print(f"output \n *** Line before updates \n {m.readline().rstrip()}")

        # Rewind using seek
        m.seek(0)

        # find the word and reverse it
        loc = m.find(word)
        m[loc:loc + len(word)] = modified_word
        m.flush()

        # Rewind using seek
        m.seek(0)
        print(f" \n *** Line after updates \n {m.readline().rstrip()}")

        # Read the first line again through the regular file object
        f.seek(0)
        print(f" \n *** Final file \n {f.readline().rstrip()}")

Output

*** Line before updates
b'Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.'

*** Line after updates
b'Lorem muspi dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.'

*** Final file
Lorem ipsum dolor sit amet, causae apeirian ea his, duo cu congue prodesset. Ut epicuri invenire duo, novum ridens eu has, in natum meliore noluisse sea. Has ei stet explicari. No nam eirmod deterruisset, nusquam electram rationibus ad sea, interesset delicatissimi et sit. Purto molestiae cu eum, in per hinc periculis intellegam.

7. Observe that the file on disk is unchanged: the change was applied only to the in-memory copy.
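To compare the three access modes side by side, the sketch below applies the same one-byte edit under each mode and records what the map and the file contain afterwards. The helper names (make_sample, edit_first_byte) and the temporary files are assumptions made for this example. It relies on CPython raising TypeError when a read-only map is modified.

```python
import mmap
import os
import tempfile


def make_sample():
    """Create a throwaway file (hypothetical data) and return its path."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"Lorem ipsum dolor sit amet")
        return tmp.name


def edit_first_byte(path, access):
    """Try to overwrite the first byte and report (in-memory, on-disk) views."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=access) as m:
            try:
                m[0:1] = b"X"
            except TypeError:
                return "rejected"   # read-only maps cannot be modified
            in_memory = m[:5]
    # Re-open the file after the map is closed to see what reached the disk.
    with open(path, "rb") as f:
        on_disk = f.read(5)
    return in_memory, on_disk


modes = {
    "ACCESS_READ": mmap.ACCESS_READ,
    "ACCESS_WRITE": mmap.ACCESS_WRITE,
    "ACCESS_COPY": mmap.ACCESS_COPY,
}

results = {}
for name, access in modes.items():
    path = make_sample()            # fresh file for every mode
    results[name] = edit_first_byte(path, access)
    os.remove(path)
    print(name, results[name])
```

With ACCESS_WRITE the edit shows up both in memory and in the file, with ACCESS_COPY only in memory, and with ACCESS_READ it is refused.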

raja
Published on 09-Nov-2020 11:01:22