
Regex to split Git log with Python


I want to use Python's re module to split the output of git log, which looks like this:

commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6
Author: ISAAC.NEWTON <[email protected]>
Date:   Fri Apr 28 18:58:00 2023 +0800

    new cat

commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <[email protected]>
Date:   Wed Apr 19 11:04:04 2023 +0800

    meow

commit 4f113912741f753c75a44f18790ff5903e910fad
Author: ISAAC.NEWTON <[email protected]>
Date:   Fri Apr 14 17:55:55 2023 +0800

    Add test files

commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <[email protected]>
Date:   Wed Apr 19 11:04:04 2023 +0800

    Second commit test

commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <[email protected]>
Date:   Wed Apr 19 11:04:04 2023 +0800

    First commit

I then want to get an array of commits like this:

[
'
commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6
Author: ISAAC.NEWTON <[email protected]>
Date:   Fri Apr 28 18:58:00 2023 +0800

    new cat

',
'
commit 9274b33435238122c8d6d389e73266f6a3e68745
Author: ISAAC.NEWTON <[email protected]>
Date:   Wed Apr 19 11:04:04 2023 +0800

    meow

',
...
]

I'm struggling to find a clean, general pattern that matches a single commit.


Solution

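  • Frame the problem as locating blocks with known start and end patterns.

    Then define where each block starts and ends, here by anchoring on the commit hashes: a block starts at "commit" followed by a 40-character hex hash and runs, non-greedily, up to the next such header or the end of the string.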

    import re
    
    # [0-9a-f] without a comma: a comma inside the class would also match literal commas
    rgx = r'(commit\s[0-9a-f]{40}.*?)(?=commit\s[0-9a-f]{40}|\Z)'
    
    text = '''commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6
    Author: ISAAC.NEWTON <[email protected]>
    Date:   Fri Apr 28 18:58:00 2023 +0800
    
        new cat
    
    commit 9274b33435238122c8d6d389e73266f6a3e68745
    Author: ISAAC.NEWTON <[email protected]>
    Date:   Wed Apr 19 11:04:04 2023 +0800
    
        meow
    
    commit 4f113912741f753c75a44f18790ff5903e910fad
    Author: ISAAC.NEWTON <[email protected]>
    Date:   Fri Apr 14 17:55:55 2023 +0800
    
        Add test files
    
    commit 87053deb6ad07fa1ea6dd7a5acfee075ce5b6322
    Author: ISAAC.NEWTON <[email protected]>
    Date:   Fri Apr 14 15:16:57 2023 +0800
    
        Add cat.jpg
    '''
    
    re.findall(rgx, text, re.DOTALL)
    

    which gives the expected output

    ['commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6\nAuthor: ISAAC.NEWTON <[email protected]>\nDate:   Fri Apr 28 18:58:00 2023 +0800\n\n    new cat\n\n',
     'commit 9274b33435238122c8d6d389e73266f6a3e68745\nAuthor: ISAAC.NEWTON <[email protected]>\nDate:   Wed Apr 19 11:04:04 2023 +0800\n\n    meow\n\n',
     'commit 4f113912741f753c75a44f18790ff5903e910fad\nAuthor: ISAAC.NEWTON <[email protected]>\nDate:   Fri Apr 14 17:55:55 2023 +0800\n\n    Add test files\n\n',
     'commit 87053deb6ad07fa1ea6dd7a5acfee075ce5b6322\nAuthor: ISAAC.NEWTON <[email protected]>\nDate:   Fri Apr 14 15:16:57 2023 +0800\n\n    Add cat.jpg\n']
    

    EDIT: note that the end of the string is handled by the \Z alternative in the lookahead; without it, the last commit would not be matched.
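
As a variation on the same idea, here is a minimal sketch that splits on a zero-width lookahead with re.split instead of collecting blocks with re.findall. It anchors the header to the start of a line with re.MULTILINE, so a commit message that happens to quote "commit <hash>" (message lines are indented in the default git log output) should not start a new block. The names COMMIT_START and split_commits, the sample log, and the example.com address are made up for illustration; splitting on an empty match needs Python 3.7 or later.

```python
import re

# Hypothetical sample log; the e-mail address is a placeholder.
log = (
    "commit 8e018dbcdbff15c3fc9ef4460b4214f47f71ddf6\n"
    "Author: ISAAC.NEWTON <isaac@example.com>\n"
    "Date:   Fri Apr 28 18:58:00 2023 +0800\n"
    "\n"
    "    new cat\n"
    "\n"
    "commit 9274b33435238122c8d6d389e73266f6a3e68745\n"
    "Author: ISAAC.NEWTON <isaac@example.com>\n"
    "Date:   Wed Apr 19 11:04:04 2023 +0800\n"
    "\n"
    "    mention commit 4f113912741f753c75a44f18790ff5903e910fad here\n"
)

# Zero-width split point: the start of a line that begins with
# "commit " followed by exactly 40 lowercase hex digits.
COMMIT_START = re.compile(r"^(?=commit [0-9a-f]{40}\b)", re.MULTILINE)


def split_commits(text):
    """Split git log text into one string per commit (hypothetical helper)."""
    # The split at position 0 yields an empty first piece; drop blank pieces.
    return [block for block in COMMIT_START.split(text) if block.strip()]


commits = split_commits(log)
print(len(commits))
```

Because the pattern consumes no characters, joining the pieces back together reproduces the input, which makes the split easy to sanity-check.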