I have a script that I built following the blueprint in the book Learning Python for Forensics. The script will go through a directory specified by the user and collect the metadata for each file in the directory. The results are saved to a sqlite database and also written to a CSV or HTML file.
The script was originally written in Python 2.7.15. I am trying to update the code for Python 3.7. However, there is one line in the ingest directory function that is giving me issues.
The ingestDirectory
function looks like this:
def ingestDirectory(cur, source, custodian_id):
count = 0
for root, folders, files in os.walk(source):
for file_name in files:
meta_data = dict()
try:
meta_data['file_name'] = file_name
meta_data['file_path'] = os.path.join(root, file_name)
meta_data['extension'] = os.path.splitext(file_name)[-1]
file_stats = os.stat(meta_data['file_path'])
meta_data['mode'] = oct(file_stats.st_mode)
meta_data['inode'] = int(file_stats.st_ino)
meta_data['file_size'] = int(file_stats.st_size)
meta_data['atime'] = formatTimestamp(file_stats.st_atime)
meta_data['mtime'] = formatTimestamp(file_stats.st_mtime)
meta_data['ctime'] = formatTimestamp(file_stats.st_ctime)
except Exception as e:
logging.error('Could not gather data for file: ' + meta_data['file_path'] + e.__str__())
meta_data['custodian'] = custodian_id
columns = '","'.join(meta_data.keys())
values = '","'.join(str(x).encode('string_escape') for x in meta_data.values())
sql = 'INSERT INTO Files ("' + columns + '") VALUES ("' + values + '")'
cur.execute(sql)
count += 1
The line that is giving me errors is this:
values = '","'.join(str(x).encode('string_escape') for x in meta_data.values())
This line is meant to handle any string escape characters found in metadata.values
before writing the data to the database.
When I tried to run this code in Python 3 I got an error about an unrecognized codec. I did some research hear on Stack Overflow and found that string_escape
has been replaced with unicode-escape
in Python 3.
I am fairly new to Python 3 and Unicode. My question is this:
How do I update the line above so that it uses unicode-escape
instead of string_escape
and produces the same result as the Python 2.7 code?
Any help would be appreciated! I have been working on this for several days now and every solution I try just results in more error codes or corrupted output files.
You are generating SQL in that piece of code, and escaping was there to try and produce valid SQL. It's a very poor-man's attempt at avoiding SQL injection. It's not very effective, and isn't needed as the database driver already knows how to take care of this, in a much safer fashion!
For SQL databases, the correct way to put the values into SQL parameters. SQL parameters consist of two components: placeholders, and values passed in separately to the .execute()
method for the database to cleanly handle. The sqlite3
library is no exception, see the cursor.execute()
method for details. For your case, you can use named placeholders:
columns = [f'''"{name.replace('"', '""')}"''' for name in meta_data]
placeholders = [f':{name}' for name in meta_data]
sql = f'INSERT INTO Files ({", ".join(columns)}) VALUES ({", ".join(placeholders)})'
cur.execute(sql, meta_data)
Note that meta_data
is passed as a second argument; the database takes each :name
placeholder and takes the value for that placeholder from the meta_data
dictionary.
I also properly formatted the column names, by putting double quotes around them and doubling any "
characters in the name; see the SQLite keyword documentation:
'keyword' A keyword in single quotes is a string literal.
"keyword" A keyword in double-quotes is an identifier.
Your code has hardcoded those column names and none of them are reserved SQL keywords, so they don't really need this protection, but it is still good practice.
For your code, where meta_data
has a fixed number of keys, the above builds this sql
string:
>>> columns = [f'''"{name.replace('"', '""')}"''' for name in meta_data]
>>> placeholders = [f':{name}' for name in meta_data]
>>> sql = f'INSERT INTO Files ({", ".join(columns)}) VALUES ({", ".join(placeholders)})'
>>> from pprint import pprint
>>> pprint(sql)
('INSERT INTO Files ("file_name", "file_path", "extension", "mode", "inode", '
'"file_size", "atime", "mtime", "ctime") VALUES (:file_name, :file_path, '
':extension, :mode, :inode, :file_size, :atime, :mtime, :ctime)')
I'd also change the way you log the error, instead of
logging.error('Could not gather data for file: ' + meta_data['file_path'] + e.__str__())
I'd use
logging.exception('Could not gather data for file: %s', meta_data['file_path'])
and leave the error gathering to the logging framework. Even if you do include the exception object, use str(e)
or a %s
placeholder.