I'm developing a web app (using gevent, but that is not significant) that has to write some confidential information in log. The obvious idea is to encrypt the confidential information using a public key that is hard-coded into my application. To read it, one would need a private key, and 2048-bit RSA seems to be safe enough. I have chosen pycrypto (tried M2Crypto as well, but found nearly no differences for my purpose) and implemented log encryption as a logging.Formatter
subclass. However, I'm new to pycrypto and cryptoraphy, and I am not sure my choice of the way my data is encrypted is reasonable. Is PKCS1_OAEP
module what I need? Or there are more friendly ways of encryption without dividing the data in small chunks?
So, what I did is:
import logging
import sys
from Crypto.Cipher import PKCS1_OAEP as pkcs1
from Crypto.PublicKey import RSA
PUBLIC_KEY = """ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDe2mtK03UhymB+SrIbJJUwCPhWNMl8/gA9d7jex0ciSuFfShDaqJ4wYWG4OOl\
VqKMxPrPcZ/PMSwtc021yI8TXfgewb65H/YQw4JzzGANq2+mFT8jWRDn+xUc6vcWnXIG3OPg5DvIipGQvIPNIUUP3qE7yDHnS5xdVdFrVe2bUUXmZJ9\
0xJpyqlTuRtIgfIfEQC9cggrdr1G50tXdXZjS0M1WXl5P6599oH/ykjpDFrCnh5fz9WDwUc0mNJ+11Qh+yfDp3k7AhzhRaROKLVWnfkklFaFm7LsdVX\
KPjp7dPRcTb84c2OnlIjU0ykL74Fy0K3eaPvM6TLe/K1XuD3933 pupkin@pupkin"""
PUBLIC_KEY = RSA.importKey(PUBLIC_KEY)
LOG_FORMAT = '[%(asctime)-15s - %(levelname)s: %(message)s]'
# May be more, but there is a limit.
# I suppose, the algorithm requires enough padding,
# and size of padding depends on key length.
MAX_MSG_LEN = 128
# Size of a block encoded with padding. For a 2048-bit key seems to be OK.
ENCODED_CHUNK_LEN = 256
def encode_msg(msg):
res = []
k = pkcs1.new(PUBLIC_KEY)
for i in xrange(0, len(msg), MAX_MSG_LEN):
v = k.encrypt(msg[i : i+MAX_MSG_LEN])
# There are nicer ways to make a readable line from data than using hex. However, using
# hex representation requires no extra code, so let it be hex.
res.append(v.encode('hex'))
assert len(v) == ENCODED_CHUNK_LEN
return ''.join(res)
def decode_msg(msg, private_key):
msg = msg.decode('hex')
res = []
k = pkcs1.new(private_key)
for i in xrange(0, len(msg), ENCODED_CHUNK_LEN):
res.append(k.decrypt(msg[i : i+ENCODED_CHUNK_LEN]))
return ''.join(res)
class CryptoFormatter(logging.Formatter):
NOT_SECRET = ('CRITICAL',)
def format(self, record):
"""
If needed, I may encode only certain types of messages.
"""
try:
msg = logging.Formatter.format(self, record)
if not record.levelname in self.NOT_SECRET:
msg = encode_msg(logging.Formatter.format(self, record))
return msg
except:
import traceback
return traceback.format_exc()
def decrypt_file(key_fname, data_fname):
"""
The function decrypts logs and never runs on server. In fact,
server does not have a private key at all. The only key owner
is server admin.
"""
res = ''
with open(key_fname, 'r') as kf:
pkey = RSA.importKey(kf.read())
with open(data_fname, 'r') as f:
for l in f:
l = l.strip()
if l:
try:
res += decode_msg(l, pkey) + '\n'
except Exception: # A line may be unencrypted
res += l + '\n'
return res
# Unfortunately dictConfig() does not support altering formatter class.
# Anyway, in demo code I am not going to use dictConfig().
logger = logging.getLogger()
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(CryptoFormatter(LOG_FORMAT))
logger.handlers = []
logger.addHandler(handler)
logging.warning("This is secret")
logging.critical("This is not secret")
UPDATE: Thanks to the accepted answer below, now I see:
My solution seems to be pretty valid for now (very few log entries, no performance considerations, more or less trusted storage). Concerning security, the best thing I can do right now is not forgetting to prohibit the user who runs my daemon from writing to the .py
and .pyc
files of the program. :-) However, if the user is compromised, he still may try to attach a debugger to my daemon process, so I should also disable login for him. Pretty obvious moments, but very important ones.
Surely there are solutions being much more scalable. A very common technique is to encrypt AES keys with slow but reliable RSA, and to encrypt data with the AES that is pretty fast. Data encryption in the case is symmetric, but retrieving the AES key requires either breaking RSA, or getting it from memory when my program is running. Stream encryption with higher-level libraries and binary log file format also are a way to go, though binary log format encrypted as a stream should be very vulnerable to log corruption, even a sudden reboot due to electricity blackout may be a problem unless I do some things at a lower level (at least log rotation on each daemon start).
I changed .encode('hex')
to .encode('base64').replace('\n').replace('\r')
. Fortunately, the base64 codec works fine with no line ends. It saves some space.
Using an untrusted storage may require signing records, but that seems to be another story.
Checking if the string is encrypted based on catching exceptions is ok, since, unless the log is tampered with by a malicious user, it's base64 codec who raises an exception, not RSA decryption.
You seem to encrypt data directly with RSA. This is relatively slow, and has the problem that you can only encrypt small parts of data. Distinguishing encrypted from plaintext data based on "decryption doesn't work" is also not a very clean solution, although it will probably work. You do use OAEP, which is good. You may want to use base64 instead of hex to save space.
However, crypto is easy to get wrong. For this reason, you should always use high-level crypto libraries wherever possible. Anything where you have to specify padding schemes yourself isn't "high-level". I am not sure if you will be able to create an efficient, line-based log encryption system without resorting to rather low-level libraries, though.
If you have no reason to encrypt only individual parts of the log, consider just encrypting the entire thing.
If you are really desperate for a line-based encryption, what you could do is the following: Create a random symmetric AES key from a secure randomness source, and give it a short but unique ID. Encrypt this key with RSA, and write the result to the log file in a line prefixed with a tag, e.g. "KEY", together with the ID. For each log line, generate a random IV, encrypt the message with AES256 in CBC mode using said IV (you don't have any length limits per line now!) and write the key ID, IV and the encrypted message to the log, prefixed with a tag, e.g. "ENC". After a certain time, destroy the symmetric key and repeat (generate new one, write to log). The disadvantage of this approach is that an attacker who can recover the symmetric key from memory can read the messages encrypted with said key. The advantage is that you can use higher-level building blocks and it is much, much faster (on my CPU, you can encrypt 70,000 log lines of 1 KB per second with AES-128, but only around 3,500 chunks of max. 256 bytes with RSA2048). RSA decryption is REALLY slow, by the way (around 100 chunks per second).
Note that you have no authentication, i.e. you won't notice modifications to your logs. For this reason, I assume you trust the log storage. Otherwise, see RFC 5848.