How to use a specific stop-character for shlex.split?


How can I tell shlex to stop splitting once it encounters the character ;?

Example:

shlex.split("""hello "column number 2" foo ; bar baz""")  

should give

["hello", "column number 2", "foo", "; bar baz"]

instead of ["hello", "column number 2", "foo", ";", "bar", "baz"].


More generally, is there a way to define "comment" separators with shlex? i.e.

shlex.split("""hello "column number 2" foo ;this is a comment; "last one" bye """)  

should give

["hello", "column number 2", "foo", ";this is a comment;", "last one", "bye"]

Solution

  • The shlex parser provides an option for specifying the comment character(s), but the simplified shlex.split interface does not expose it: its comments argument only switches the default # comment character on or off. Example:

    import shlex
    
    a = 'hello "bla bla" ; this is a comment'
    
    lex = shlex.shlex(a, posix=True)
    lex.commenters = ';'
    print(list(lex))  # ['hello', 'bla bla']
    

    Here is a slightly expanded split function, mostly copied from the Python standard library, with a slight modification to the comments parameter, allowing the specification of comment characters:

    import shlex
    def shlex_split(s, comments='', posix=True):
        """Split the string *s* using shell-like syntax."""
        if s is None:
            import warnings
            warnings.warn("Passing None for 's' to shlex.split() is deprecated.",
                          DeprecationWarning, stacklevel=2)
        lex = shlex.shlex(s, posix=posix)
        lex.whitespace_split = True
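        if isinstance(comments, str):
            # A string names the comment characters explicitly ('' disables comments).
            lex.commenters = comments
        elif not comments:
            # False or None: disable comments, as shlex.split does by default.
            lex.commenters = ''
        # Any other truthy value (e.g. True) keeps the parser's default, '#'.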
        return list(lex)
    

    You might want to change the default value of comments in the above code; as written, it has the same default as shlex.split, which is not to recognise comments at all. (The parser objects created by shlex.shlex default to # as the comment character, which is what you get if you specify comments=True. I preserved this behaviour for compatibility.)

    Note that comments are discarded; they do not appear in the result list at all. When the parser hits a comment character, it skips the rest of the current line, so a single-line input can contain at most one comment. The comments string is a set of possible comment characters, not a comment sequence. So if you want to recognise both # and ; as comment characters, specify comments='#;'.
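    To illustrate that each character in the commenters string starts a comment on its own, here is a minimal sketch. The helper name split_skipping_comments and the sample text are made up for illustration; only shlex.shlex, whitespace_split and commenters come from the standard library.

```python
import shlex

def split_skipping_comments(s, commenters="#;"):
    """Split s shell-style, dropping text from a comment character to the end of its line."""
    lex = shlex.shlex(s, posix=True)
    lex.whitespace_split = True
    # Every character in this string is treated as a comment starter.
    lex.commenters = commenters
    return list(lex)

# Hypothetical multi-line input with one '#' comment and one ';' comment.
text = 'alpha beta # first comment\ngamma ; second comment\n"delta epsilon"'
print(split_skipping_comments(text))
# ['alpha', 'beta', 'gamma', 'delta epsilon']
```

    With multi-line input, the parser picks up again on the line after each comment, which is why gamma and "delta epsilon" still appear in the result.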

    Here's a sample run:

    >>> # Default behaviour is the same as shlex.split
    >>> shlex_split("""hello "column number 2" foo ; bar baz""") 
    ['hello', 'column number 2', 'foo', ';', 'bar', 'baz']
    >>> # Supply a comments parameter to specify a comment character 
    >>> shlex_split("""hello "column number 2" foo ; bar baz""", comments=';') 
    ['hello', 'column number 2', 'foo']
    >>> shlex_split("""hello "column number 2" foo ;this is a comment; "last one" bye """, comments=';')
    ['hello', 'column number 2', 'foo']
    >>> # The ; is recognised as a comment even if it is not preceded by whitespace.
    >>> shlex_split("""hello "column number 2" foo;this is a comment; "last one" bye """, comments=';')
    ['hello', 'column number 2', 'foo']