Remove C and C++ comments using Python?


I'm looking for Python code that removes C and C++ comments from a string. (Assume the string contains an entire C source file.)

I realize that I could .match() substrings with a Regex, but that doesn't solve nesting /*, or having a // inside a /* */.

Ideally, I would prefer a non-naive implementation that properly handles awkward cases.


Solution

  • I don't know if you're familiar with sed, the UNIX-based (but Windows-available) stream editor, but I've found a sed script here that removes C/C++ comments from a file. It's very smart; for example, it ignores '//' and '/*' when they appear inside a string literal. From within Python, you can run it with the following code:

    import subprocess
    
    # source_code is a string with the source code.
    # -f tells sed to read its commands from the script file; check the
    # script's first line for any other flags it expects (such as -n).
    # The source is fed to sed on stdin and the result is read from stdout.
    process = subprocess.run(['sed', '-f', '/path/to/remccoms3.sed'],
        input=source_code, capture_output=True, text=True)
    return_code = process.returncode
    
    stripped_code = process.stdout
    

    In this program, source_code is the variable holding the C/C++ source code, and stripped_code will hold the same code with the comments removed. Of course, if you have the file on disk, you could instead pass open file handles as stdin and stdout (the input opened in read mode, the output in write mode). remccoms3.sed is the file from the above link, and it should be saved in a readable location on disk. sed comes installed by default on most GNU/Linux distros and Mac OS X, and is also available on Windows.

    This will probably be better than a pure Python solution; no need to reinvent the wheel.
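
If sed isn't an option, the idea of skipping over string literals before looking for comments can be sketched in pure Python with a single regex. This is only a rough illustration, not what remccoms3.sed does internally: the name strip_comments is made up here, and the sketch assumes no backslash line continuations inside // comments, no trigraphs, and no C++ raw string literals.

```python
import re

# One alternation, tried left to right at each position:
#   // line comment | /* block comment */ | 'char literal' | "string literal"
# Matching literals as whole tokens keeps any // or /* inside them intact.
_COMMENT_OR_LITERAL = re.compile(
    r'//.*?$|/\*.*?\*/|\'(?:\\.|[^\\\'])*\'|"(?:\\.|[^\\"])*"',
    re.DOTALL | re.MULTILINE,
)

def strip_comments(text):
    """Return text with C and C++ comments replaced by a single space."""
    def replacer(match):
        token = match.group(0)
        # Comments start with '/'; literals start with a quote and are kept.
        # A space (not an empty string) avoids gluing tokens such as a/**/b.
        return " " if token.startswith("/") else token
    return _COMMENT_OR_LITERAL.sub(replacer, text)

source_code = (
    'int a = 1; // set a\n'
    'char *s = "// not a comment"; /* block\n'
    ' comment */ int b;\n'
)
stripped_code = strip_comments(source_code)
print(stripped_code)
```

The newline after a // comment is left in place, so line numbers in the stripped output still match the original wherever no block comment spanned several lines.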