postgresql, pgbadger

Parsing Postgres logs for table usage by user


I'm conducting an audit of how heavily our existing database tables are used, and by which users, as part of a database cleanup effort. The log files seem like a natural way to get at this data. We have pgBadger running for performance reports, but it doesn't produce a usage report like the one I've described. Does anyone know of a tool (pgBadger or otherwise) that will extract table and user information from the logs so that I can calculate summary stats on it? I'd like to leverage existing tools rather than rolling my own log parser.


Solution

  • I ended up writing a hacky log parser.

    import re
    import psqlparse
    import json
    
    # Matches the start of a logged statement and captures the user, database,
    # and statement text; assumes a log_line_prefix along the lines of
    # '%t:%r:%u@%d:' (timestamp, client, user@database).
    statement_re = r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} UTC:[^:]*:(?P<user>\w+)@(?P<db>\w+):.*statement:\s+(?P<stmt>.*)"
    # Any line beginning with a timestamp starts a new log entry; lines that
    # don't are continuations of a multi-line statement.
    log_re = r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
    
    def parse_logs(log_file):
        with open(log_file, 'r') as f:
            state = 'looking'
            info = None
            for line in f:
                if state == 'found':
                    if re.match(log_re, line) is None:
                        # Continuation line: keep accumulating the statement.
                        info['stmt'].append(line.strip())
                    else:
                        # A new log entry means the current statement is complete.
                        info['stmt'] = "\n".join(info['stmt']).strip()
                        try:
                            parsed_stmt = psqlparse.parse(info['stmt'])[0]
                            # Derive the statement type from the parse node's
                            # class name, e.g. SelectStmt -> "select".
                            info['stmt_type'] = str(type(parsed_stmt)).split(".")[-1][0:-6].lower()
                            info['tables'] = list(parsed_stmt.tables())
                        except Exception:
                            # Statements psqlparse can't handle are emitted
                            # without stmt_type/tables.
                            pass
                        print(json.dumps(info))
                        state = 'looking'
                if state == 'looking':
                    m = re.match(statement_re, line)
                    if m is not None:
                        stmt = m.group('stmt')
                        # Skip transaction control, keep-alive, and SHOW noise.
                        if stmt not in {'BEGIN', 'COMMIT', 'ROLLBACK', 'SELECT 1'} and 'show' not in stmt:
                            info = {'user': m.group('user'), 'stmt': [stmt]}
                            state = 'found'
    
    parse_logs('postgresql.log')
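
    To get from the JSON lines to summary stats, a small aggregation pass is enough. The following is a minimal sketch, assuming the parser's output has been redirected to a (hypothetical) file named parsed.jsonl; it counts statements per (user, table) pair with collections.Counter.

    import json
    from collections import Counter
    
    def summarize(jsonl_file):
        # Count one hit per (user, table) pair for every parsed statement.
        usage = Counter()
        with open(jsonl_file, 'r') as f:
            for line in f:
                info = json.loads(line)
                for table in info.get('tables', []):
                    usage[(info['user'], table)] += 1
        return usage
    
    for (user, table), count in summarize('parsed.jsonl').most_common():
        print(f"{user}\t{table}\t{count}")

    From there the counts can be pivoted into whatever report format the cleanup effort needs.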