
Match comments against added lines to find the comment lines added to a file


I would like to extract all added comment lines for a specific file. To do this, I extract all the comments with tokenize and ast. In addition, I get all the added lines for this file from git show commit -- pathfile.

I am having trouble getting the added comment lines, especially when some of them are just empty lines. My matching code looks like this:

addedCommentLinesPerFile = []
for commentline in parsedCommentLines:
    for line in addedLinesList:
        if commentline == line or commentline in line:
            try:
                parsedCommentLines.remove(commentline)
                addedLinesList.remove(line)
            except ValueError:
                continue
            addedCommentLinesPerFile.append(commentline)

Let's say my file looks like this:

def function():
+    print("hello") #prints hello
+
"""
foo

"""

So the lists would look like this:

parsedCommentLines = ["#prints hello","foo",""]
addedLinesList = ['    print("hello") #prints hello',""]

The desired output would be:

addedCommentLinesPerFile = ["#prints hello"]

But I would get:

addedCommentLinesPerFile = ["#prints hello",""]

Solution

  • commentline in line will always be True when commentline is an empty string, whatever line contains.

    If you want to match exact lines first, and only then check whether the remaining comments are substrings of the remaining lines, you could write two loops:

    the first one matches only if commentline == line, the second one if commentline in line.

    You may also want to check extra conditions on commentline before testing commentline in line: a minimum length, at least one non-whitespace character, and so on.
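The two-loop idea can be sketched as follows. This is a minimal sketch, not a definitive implementation: the function name match_added_comments and the min_length parameter are hypothetical, and skipping blank lines in both passes is an assumption, since a blank comment line cannot be told apart from any other blank added line by content alone.

```python
def match_added_comments(comment_lines, added_lines, min_length=2):
    """Return the comment lines that also appear among the added lines."""
    # Work on copies so the caller's lists are not mutated while looping.
    remaining_comments = list(comment_lines)
    remaining_added = list(added_lines)
    matched = []

    # Pass 1: exact matches only; blank lines are skipped (assumption).
    for comment in list(remaining_comments):
        if comment.strip() and comment in remaining_added:
            matched.append(comment)
            remaining_comments.remove(comment)
            remaining_added.remove(comment)

    # Pass 2: substring matches, guarded by extra conditions on the comment.
    for comment in list(remaining_comments):
        if len(comment.strip()) < min_length:
            continue  # "" in line is always True, so never test empty comments
        for line in remaining_added:
            if comment in line:
                matched.append(comment)
                remaining_comments.remove(comment)
                remaining_added.remove(line)
                break  # stop right after mutating the list being iterated
    return matched
```

With the lists from the question, match_added_comments(parsedCommentLines, addedLinesList) returns ["#prints hello"]. Iterating over copies also avoids removing items from a list while looping over it, which the original code does.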


    If you want to check whether a one-line # comment sits at the end of a line, test exactly that:

    • check that commentline starts with a #
    • check that it is a suffix of the line: line.endswith(commentline)
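The two checks above could be combined into one small predicate. The helper name is_trailing_comment is hypothetical, and stripping surrounding whitespace before comparing is an assumption about how the comment strings were extracted.

```python
def is_trailing_comment(commentline, line):
    """Return True if commentline is a '#' comment at the end of line."""
    comment = commentline.strip()
    # An empty string never starts with '#', so blank lines are rejected here.
    return comment.startswith("#") and line.rstrip().endswith(comment)
```

For example, is_trailing_comment("#prints hello", '    print("hello") #prints hello') is True, while an empty commentline is always rejected.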

    Another approach could be to generate two files which contain only the comment lines, and compare these two files to see how comments were modified.

    On the git side of things:

    • to list the files affected by the commit, you can use:

      git show --format="" --name-only commit  # or --name-status
      
    • for each of the modified files, you can get:

      • the content of the file before:

        git show commit~:path/to/file
        
      • the content of the file after:

        git show commit:path/to/file
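From Python, the two git show calls above could be wrapped roughly like this. It is a sketch under assumptions: the names git_show_args and file_versions are hypothetical, git is assumed to be on the PATH, and path is assumed to be relative to the repository root.

```python
import subprocess

def git_show_args(commit, path, before=False):
    """Build the git command for one version of a file."""
    # "commit~" is the parent of the commit, i.e. the state before it.
    rev = f"{commit}~" if before else commit
    return ["git", "show", f"{rev}:{path}"]

def file_versions(commit, path, repo_dir="."):
    """Return (before, after) contents of path around commit, as text."""
    contents = []
    for before in (True, False):
        result = subprocess.run(
            git_show_args(commit, path, before),
            cwd=repo_dir, capture_output=True, text=True,
        )
        # A non-zero exit code usually means the file does not exist in that
        # revision (for example, it was added or deleted by the commit).
        contents.append(result.stdout if result.returncode == 0 else "")
    return tuple(contents)
```

Treating a failed git show as an empty file is a design choice for this sketch: it makes a newly added file look like "everything was added".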
        

    From these two contents, you can use your code to extract comments, and either

    • write them to two files (say /tmp/comments.before and /tmp/comments.after) and just run diff /tmp/comments.before /tmp/comments.after
    • keep the lists of comment lines in your program, and use a library that runs a diff-like algorithm on two lists of strings (Python's standard difflib module is one option)
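For the in-program option, here is a minimal sketch using difflib.SequenceMatcher from the standard library. The function name added_comment_lines is hypothetical, and counting "replace" blocks as added lines is an assumption: a modified comment is reported as if it were newly added.

```python
import difflib

def added_comment_lines(comments_before, comments_after):
    """Return comment lines in the 'after' list that have no match in 'before'."""
    added = []
    matcher = difflib.SequenceMatcher(
        a=comments_before, b=comments_after, autojunk=False
    )
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" means new lines; "replace" means changed lines (assumption:
        # both count as added comments).
        if tag in ("insert", "replace"):
            added.extend(comments_after[j1:j2])
    return added
```

With comments_before = ["foo", ""] and comments_after = ["#prints hello", "foo", ""], this returns ["#prints hello"]: the unchanged docstring lines, including the empty one, are matched as equal and left out.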