Search code examples
pythonsqlpostgresqltext-parsing

Howto clean comments from raw sql file


I have problem with cleaning comments and empty lines from already existing sql file. The file has over 10k lines so cleaning it manually is not an option.

I have a little python script, but I have no idea how to handle comments inside multi line inserts.

Code:

f = file( 'file.sql', 'r' )
t = filter( lambda x: not x.startswith('--') \
            and not x.isspace() 
  , f.readlines() )
f.close()
t #<- here the cleaned data should be

How it should work:

This should be cleaned:

-- normal sql comment

This should stay as it is:

CREATE FUNCTION func1(a integer) RETURNS void
    LANGUAGE plpgsql
    AS $$
BEGIN
        -- comment
       [...]
END;
$$;

INSERT INTO public.texts (multilinetext) VALUES ('
and more lines here \'
-- part of text 
\'
[...]

');

Solution

  • Try the sqlparse module.

    Updated example: leaving comments inside insert values, and comments within CREATE FUNCTION blocks. You can tweak further to tune the behavior:

    import sqlparse
    from sqlparse import tokens
    
    queries = '''
    CREATE FUNCTION func1(a integer) RETURNS void
        LANGUAGE plpgsql
            AS $$
            BEGIN
                    -- comment
           END;
           $$;
    SELECT -- comment
    * FROM -- comment
    TABLE foo;
    -- comment
    INSERT INTO foo VALUES ('a -- foo bar');
    INSERT INTO foo
    VALUES ('
    a 
    -- foo bar'
    );
    
    '''
    
    IGNORE = set(['CREATE FUNCTION',])  # extend this
    
    def _filter(stmt, allow=0):
        ddl = [t for t in stmt.tokens if t.ttype in (tokens.DDL, tokens.Keyword)]
        start = ' '.join(d.value for d in ddl[:2])
        if ddl and start in IGNORE:
            allow = 1
        for tok in stmt.tokens:
            if allow or not isinstance(tok, sqlparse.sql.Comment):
                yield tok
    
    for stmt in sqlparse.split(queries):
        sql = sqlparse.parse(stmt)[0]
        print sqlparse.sql.TokenList([t for t in _filter(sql)])
    

    Output:

    CREATE FUNCTION func1(a integer) RETURNS void
        LANGUAGE plpgsql
            AS $$
            BEGIN
                    -- comment
           END;
           $$;
    
    SELECT * FROM TABLE foo;
    
    INSERT INTO foo VALUES ('a -- foo bar');
    
    INSERT INTO foo
    VALUES ('
    a
    -- foo bar'
    );