How to sort CSV file in ascending order by integer in middle of file name


I want to sort my CSV files before processing them. Here is the format for the file names:

4A085SHT_SITE12_TREE3_LAUB.csv
67E89SOIL_SITE9_TREE2_LAUB.csv
4BBA3DEND_SITE10TREE2_LAUB.csv

The files should be sorted by the integer after SITE (9, 10, and 12 in the example above), ignoring all other characters.

Currently, I am using:

for file_name in os.listdir(folder_path):

I tried:

file_names = sorted(os.listdir(folder_path), key=lambda x: int(x.split('SITE')[1].split('.')[0])) #sort by first integer after SITE (if available)

but this throws me the following error:

ValueError: invalid literal for int() with base 10: '10TREE2_LAUB'

So I also need to strip the trailing TREE2_LAUB in that example.

Edit: thanks to @kosciej16 for the solution. In my case, I added a check for whether the file name contains SITE followed by digits:

file_names = sorted(os.listdir(folder_path), key=lambda el: int(re.search(r"SITE(\d+)", el).group(1)) if re.search(r"SITE(\d+)", el) else float('inf'))

Solution

  • You are very close! You just need to change the sorting key:

    import re
    
    # l is the list of file names, e.g. os.listdir(folder_path)
    file_names = sorted(l, key=lambda el: int(re.search(r"SITE(\d+)", el).group(1)))
    

    Here we search for "SITE" followed by one or more digits. The parentheses put those digits in a capture group, so we can access them with .group(1).

    This assumes that every file name matches the pattern. If some do not, you can do the following:

    def foo(el):
        # the walrus operator (:=) requires Python 3.8+
        if (match := re.search(r"SITE(\d+)", el)) is not None:
            return int(match.group(1))
        # Put non-matching elements at the end
        return float("inf")
    
    file_names = sorted(l, key=foo)
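
As a rough end-to-end check, here is a minimal sketch that applies the same idea to the three file names from the question. The helper name site_number, the precompiled SITE_RE, and the extra notes.txt entry are assumptions made for illustration; they do not come from the original post.

```python
import re

# Precompiled pattern: "SITE" followed by one or more digits (captured).
SITE_RE = re.compile(r"SITE(\d+)")


def site_number(file_name):
    """Return the integer after 'SITE', or infinity if there is none."""
    match = SITE_RE.search(file_name)
    # Names without a SITE number sort after all numbered names.
    return int(match.group(1)) if match else float("inf")


# Sample names from the question, plus one hypothetical non-matching name.
names = [
    "4A085SHT_SITE12_TREE3_LAUB.csv",
    "67E89SOIL_SITE9_TREE2_LAUB.csv",
    "4BBA3DEND_SITE10TREE2_LAUB.csv",
    "notes.txt",
]

sorted_names = sorted(names, key=site_number)
for name in sorted_names:
    print(name)
```

This should list the SITE9 file first, then SITE10, then SITE12, with notes.txt last. Because the pattern only captures the digits directly after SITE, it does not matter whether an underscore follows the number, which is what broke the split-based attempt.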