Investigation Using Emails

The previous chapters discussed about the importance and the process of network forensics and the concepts involved. In this chapter, let us learn about the role of emails in digital forensics and their investigation using Python.

Role of Email in Investigation

Emails play a very important role in business communications and have emerged as one of the most important applications on internet. They are a convenient mode for sending messages as well as documents, not only from computers but also from other electronic gadgets such as mobile phones and tablets.

The negative side of emails is that criminals may leak important information about their company. Hence, the role of emails in digital forensics has been increased in recent years. In digital forensics, emails are considered as crucial evidences and Email Header Analysis has become important to collect evidence during forensic process.

An investigator has the following goals while performing email forensics −

To identify the main criminal
To collect necessary evidences
To presenting the findings
To build the case

Challenges in Email Forensics

Email forensics play a very important role in investigation as most of the communication in present era relies on emails. However, an email forensic investigator may face the following challenges during the investigation −

Fake Emails

The biggest challenge in email forensics is the use of fake e-mails that are created by manipulating and scripting headers etc. In this category criminals also use temporary email which is a service that allows a registered user to receive email at a temporary address that expires after a certain time period.

Spoofing

Another challenge in email forensics is spoofing in which criminals used to present an email as someone else’s. In this case the machine will receive both fake as well as original IP address.

Anonymous Re-emailing

Here, the Email server strips identifying information from the email message before forwarding it further. This leads to another big challenge for email investigations.

Techniques Used in Email Forensic Investigation

Email forensics is the study of source and content of email as evidence to identify the actual sender and recipient of a message along with some other information such as date/time of transmission and intention of sender. It involves investigating metadata, port scanning as well as keyword searching.

Some of the common techniques which can be used for email forensic investigation are

Header Analysis
Server investigation
Network Device Investigation
Sender Mailer Fingerprints
Software Embedded Identifiers

In the following sections, we are going to learn how to fetch information using Python for the purpose of email investigation.

Extraction of Information from EML files

EML files are basically emails in file format which are widely used for storing email messages. They are structured text files that are compatible across multiple email clients such as Microsoft Outlook, Outlook Express, and Windows Live Mail.

An EML file stores email headers, body content, attachment data as plain text. It uses base64 to encode binary data and Quoted-Printable (QP) encoding to store content information. The Python script that can be used to extract information from EML file is given below −

First, import the following Python libraries as shown below −

from __future__ import print_function
from argparse import ArgumentParser, FileType
from email import message_from_file

import os
import quopri
import base64

In the above libraries, quopri is used to decode the QP encoded values from EML files. Any base64 encoded data can be decoded with the help of base64 library.

Next, let us provide argument for command-line handler. Note that here it will accept only one argument which would be the path to EML file as shown below −

if __name__ == '__main__':
   parser = ArgumentParser('Extracting information from EML file')
   parser.add_argument("EML_FILE",help="Path to EML File", type=FileType('r'))
   args = parser.parse_args()
   main(args.EML_FILE)

Now, we need to define main() function in which we will use the method named message_from_file() from email library to read the file like object. Here we will access the headers, body content, attachments and other payload information by using resulting variable named emlfile as shown in the code given below −

def main(input_file):
   emlfile = message_from_file(input_file)
   for key, value in emlfile._headers:
      print("{}: {}".format(key, value))
print("\nBody\n")

if emlfile.is_multipart():
   for part in emlfile.get_payload():
      process_payload(part)
else:
   process_payload(emlfile[1])

Now, we need to define process_payload() method in which we will extract message body content by using get_payload() method. We will decode QP encoded data by using quopri.decodestring() function. We will also check the content MIME type so that it can handle the storage of the email properly. Observe the code given below −

def process_payload(payload):
   print(payload.get_content_type() + "\n" + "=" * len(payload.get_content_type()))
   body = quopri.decodestring(payload.get_payload())
   
   if payload.get_charset():
      body = body.decode(payload.get_charset())
else:
   try:
      body = body.decode()
   except UnicodeDecodeError:
      body = body.decode('cp1252')

if payload.get_content_type() == "text/html":
   outfile = os.path.basename(args.EML_FILE.name) + ".html"
   open(outfile, 'w').write(body)
elif payload.get_content_type().startswith('application'):
   outfile = open(payload.get_filename(), 'wb')
   body = base64.b64decode(payload.get_payload())
   outfile.write(body)
   outfile.close()
   print("Exported: {}\n".format(outfile.name))
else:
   print(body)

After executing the above script, we will get the header information along with various payloads on the console.

Analyzing MSG Files using Python

Email messages come in many different formats. MSG is one such kind of format used by Microsoft Outlook and Exchange. Files with MSG extension may contain plain ASCII text for the headers and the main message body as well as hyperlinks and attachments.

In this section, we will learn how to extract information from MSG file using Outlook API. Note that the following Python script will work only on Windows. For this, we need to install third party Python library named pywin32 as follows −

pip install pywin32

Now, import the following libraries using the commands shown −

from __future__ import print_function
from argparse import ArgumentParser

import os
import win32com.client
import pywintypes

Now, let us provide an argument for command-line handler. Here it will accept two arguments one would be the path to MSG file and other would be the desired output folder as follows −

if __name__ == '__main__':
   parser = ArgumentParser(‘Extracting information from MSG file’)
   parser.add_argument("MSG_FILE", help="Path to MSG file")
   parser.add_argument("OUTPUT_DIR", help="Path to output folder")
   args = parser.parse_args()
   out_dir = args.OUTPUT_DIR
   
   if not os.path.exists(out_dir):
      os.makedirs(out_dir)
   main(args.MSG_FILE, args.OUTPUT_DIR)

Now, we need to define main() function in which we will call win32com library for setting up Outlook API which further allows access to the MAPI namespace.

def main(msg_file, output_dir):
   mapi = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
   msg = mapi.OpenSharedItem(os.path.abspath(args.MSG_FILE))
   
   display_msg_attribs(msg)
   display_msg_recipients(msg)
   
   extract_msg_body(msg, output_dir)
   extract_attachments(msg, output_dir)

Now, define different functions which we are using in this script. The code given below shows defining the display_msg_attribs() function that allow us to display various attributes of a message like subject, to , BCC, CC, Size, SenderName, sent, etc.

def display_msg_attribs(msg):
   attribs = [
      'Application', 'AutoForwarded', 'BCC', 'CC', 'Class',
      'ConversationID', 'ConversationTopic', 'CreationTime',
      'ExpiryTime', 'Importance', 'InternetCodePage', 'IsMarkedAsTask',
      'LastModificationTime', 'Links','ReceivedTime', 'ReminderSet',
      'ReminderTime', 'ReplyRecipientNames', 'Saved', 'Sender',
      'SenderEmailAddress', 'SenderEmailType', 'SenderName', 'Sent',
      'SentOn', 'SentOnBehalfOfName', 'Size', 'Subject',
      'TaskCompletedDate', 'TaskDueDate', 'To', 'UnRead'
   ]
   print("\nMessage Attributes")
   for entry in attribs:
      print("{}: {}".format(entry, getattr(msg, entry, 'N/A')))

Now, define the display_msg_recipeints() function that iterates through the messages and displays the recipient details.

def display_msg_recipients(msg):
   recipient_attrib = ['Address', 'AutoResponse', 'Name', 'Resolved', 'Sendable']
   i = 1
   
   while True:
   try:
      recipient = msg.Recipients(i)
   except pywintypes.com_error:
      break
   print("\nRecipient {}".format(i))
   print("=" * 15)
   
   for entry in recipient_attrib:
      print("{}: {}".format(entry, getattr(recipient, entry, 'N/A')))
   i += 1

Next, we define extract_msg_body() function that extracts the body content, HTML as well as Plain text, from the message.

def extract_msg_body(msg, out_dir):
   html_data = msg.HTMLBody.encode('cp1252')
   outfile = os.path.join(out_dir, os.path.basename(args.MSG_FILE))
   
   open(outfile + ".body.html", 'wb').write(html_data)
   print("Exported: {}".format(outfile + ".body.html"))
   body_data = msg.Body.encode('cp1252')
   
   open(outfile + ".body.txt", 'wb').write(body_data)
   print("Exported: {}".format(outfile + ".body.txt"))

Next, we shall define the extract_attachments() function that exports attachment data into desired output directory.

def extract_attachments(msg, out_dir):
   attachment_attribs = ['DisplayName', 'FileName', 'PathName', 'Position', 'Size']
   i = 1 # Attachments start at 1
   
   while True:
      try:
         attachment = msg.Attachments(i)
   except pywintypes.com_error:
      break

Once all the functions are defined, we will print all the attributes to the console with the following line of codes −

print("\nAttachment {}".format(i))
print("=" * 15)
   
for entry in attachment_attribs:
   print('{}: {}'.format(entry, getattr(attachment, entry,"N/A")))
outfile = os.path.join(os.path.abspath(out_dir),os.path.split(args.MSG_FILE)[-1])
   
if not os.path.exists(outfile):
os.makedirs(outfile)
outfile = os.path.join(outfile, attachment.FileName)
attachment.SaveAsFile(outfile)
   
print("Exported: {}".format(outfile))
i += 1

After running the above script, we will get the attributes of message and its attachments in the console window along with several files in the output directory.

Structuring MBOX files from Google Takeout using Python

MBOX files are text files with special formatting that split messages stored within. They are often found in association with UNIX systems, Thunderbolt, and Google Takeouts.

In this section, you will see a Python script, where we will be structuring MBOX files got from Google Takeouts. But before that we must know that how we can generate these MBOX files by using our Google account or Gmail account.

Acquiring Google Account Mailbox into MBX Format

Acquiring of Google account mailbox implies taking backup of our Gmail account. Backup can be taken for various personal or professional reasons. Note that Google provides backing up of Gmail data. To acquire our Google account mailbox into MBOX format, you need to follow the steps given below −

Open My account dashboard.
Go to Personal info & privacy section and select Control your content link.
You can create a new archive or can manage existing one. If we click, CREATE ARCHIVE link, then we will get some check boxes for each Google product we wish to include.
After selecting the products, we will get the freedom to choose file type and maximum size for our archive along with the delivery method to select from list.
Finally, we will get this backup in MBOX format.

Python Code

Now, the MBOX file discussed above can be structured using Python as shown below −

First, need to import Python libraries as follows −

from __future__ import print_function
from argparse import ArgumentParser

import mailbox
import os
import time
import csv
from tqdm import tqdm

import base64

All the libraries have been used and explained in earlier scripts, except the mailbox library which is used to parse MBOX files.

Now, provide an argument for command-line handler. Here it will accept two arguments− one would be the path to MBOX file, and the other would be the desired output folder.

if __name__ == '__main__':
   parser = ArgumentParser('Parsing MBOX files')
   parser.add_argument("MBOX", help="Path to mbox file")
   parser.add_argument(
      "OUTPUT_DIR",help = "Path to output directory to write report ""and exported content")
   args = parser.parse_args()
   main(args.MBOX, args.OUTPUT_DIR)

Now, will define main() function and call mbox class of mailbox library with the help of which we can parse a MBOX file by providing its path −

def main(mbox_file, output_dir):
   print("Reading mbox file")
   mbox = mailbox.mbox(mbox_file, factory=custom_reader)
   print("{} messages to parse".format(len(mbox)))

Now, define a reader method for mailbox library as follows −

def custom_reader(data_stream):
   data = data_stream.read()
   try:
      content = data.decode("ascii")
   except (UnicodeDecodeError, UnicodeEncodeError) as e:
      content = data.decode("cp1252", errors="replace")
   return mailbox.mboxMessage(content)

Now, create some variables for further processing as follows −

parsed_data = []
attachments_dir = os.path.join(output_dir, "attachments")

if not os.path.exists(attachments_dir):
   os.makedirs(attachments_dir)
columns = [
   "Date", "From", "To", "Subject", "X-Gmail-Labels", "Return-Path", "Received", 
   "Content-Type", "Message-ID","X-GM-THRID", "num_attachments_exported", "export_path"]

Next, use tqdm to generate a progress bar and to track the iteration process as follows −

for message in tqdm(mbox):
   msg_data = dict()
   header_data = dict(message._headers)
for hdr in columns:
   msg_data[hdr] = header_data.get(hdr, "N/A")

Now, check weather message is having payloads or not. If it is having then we will define write_payload() method as follows −

if len(message.get_payload()):
   export_path = write_payload(message, attachments_dir)
   msg_data['num_attachments_exported'] = len(export_path)
   msg_data['export_path'] = ", ".join(export_path)

Now, data need to be appended. Then we will call create_report() method as follows −

parsed_data.append(msg_data)
create_report(
   parsed_data, os.path.join(output_dir, "mbox_report.csv"), columns)
def write_payload(msg, out_dir):
   pyld = msg.get_payload()
   export_path = []
   
if msg.is_multipart():
   for entry in pyld:
      export_path += write_payload(entry, out_dir)
else:
   content_type = msg.get_content_type()
   if "application/" in content_type.lower():
      content = base64.b64decode(msg.get_payload())
      export_path.append(export_content(msg, out_dir, content))
   elif "image/" in content_type.lower():
      content = base64.b64decode(msg.get_payload())
      export_path.append(export_content(msg, out_dir, content))

   elif "video/" in content_type.lower():
      content = base64.b64decode(msg.get_payload())
      export_path.append(export_content(msg, out_dir, content))
   elif "audio/" in content_type.lower():
      content = base64.b64decode(msg.get_payload())
      export_path.append(export_content(msg, out_dir, content))
   elif "text/csv" in content_type.lower():
      content = base64.b64decode(msg.get_payload())
      export_path.append(export_content(msg, out_dir, content))
   elif "info/" in content_type.lower():
      export_path.append(export_content(msg, out_dir,
      msg.get_payload()))
   elif "text/calendar" in content_type.lower():
      export_path.append(export_content(msg, out_dir,
      msg.get_payload()))
   elif "text/rtf" in content_type.lower():
      export_path.append(export_content(msg, out_dir,
      msg.get_payload()))
   else:
      if "name=" in msg.get('Content-Disposition', "N/A"):
         content = base64.b64decode(msg.get_payload())
      export_path.append(export_content(msg, out_dir, content))
   elif "name=" in msg.get('Content-Type', "N/A"):
      content = base64.b64decode(msg.get_payload())
      export_path.append(export_content(msg, out_dir, content))
return export_path

Observe that the above if-else statements are easy to understand. Now, we need to define a method that will extract the filename from the msg object as follows −

def export_content(msg, out_dir, content_data):
   file_name = get_filename(msg)
   file_ext = "FILE"
   
   if "." in file_name: file_ext = file_name.rsplit(".", 1)[-1]
   file_name = "{}_{:.4f}.{}".format(file_name.rsplit(".", 1)[0], time.time(), file_ext)
   file_name = os.path.join(out_dir, file_name)

Now, with the help of following lines of code, you can actually export the file −

if isinstance(content_data, str):
   open(file_name, 'w').write(content_data)
else:
   open(file_name, 'wb').write(content_data)
return file_name

Now, let us define a function to extract filenames from the message to accurately represent the names of these files as follows −

def get_filename(msg):
   if 'name=' in msg.get("Content-Disposition", "N/A"):
      fname_data = msg["Content-Disposition"].replace("\r\n", " ")
      fname = [x for x in fname_data.split("; ") if 'name=' in x]
      file_name = fname[0].split("=", 1)[-1]
   elif 'name=' in msg.get("Content-Type", "N/A"):
      fname_data = msg["Content-Type"].replace("\r\n", " ")
      fname = [x for x in fname_data.split("; ") if 'name=' in x]
      file_name = fname[0].split("=", 1)[-1]
   else:
      file_name = "NO_FILENAME"
   fchars = [x for x in file_name if x.isalnum() or x.isspace() or x == "."]
   return "".join(fchars)

Now, we can write a CSV file by defining the create_report() function as follows −

def create_report(output_data, output_file, columns):
   with open(output_file, 'w', newline="") as outfile:
      csvfile = csv.DictWriter(outfile, columns)
      csvfile.writeheader()
      csvfile.writerows(output_data)

Once you run the script given above, we will get the CSV report and directory full of attachments.