I want to sort my CSV files before processing them. Here is the format for the file names:
4A085SHT_SITE12_TREE3_LAUB.csv
67E89SOIL_SITE9_TREE2_LAUB.csv
4BBA3DEND_SITE10TREE2_LAUB.csv
The files should be sorted by the integer after SITE (9, 10, and 12 in the example above), ignoring all other characters.
Currently, I am using:
for file_name in os.listdir(folder_path):
I tried:
file_names = sorted(os.listdir(folder_path), key=lambda x: int(x.split('SITE')[1].split('.')[0])) #sort by first integer after SITE (if available)
but this throws me the following error:
ValueError: invalid literal for int() with base 10: '10TREE2_LAUB'
So I need to somehow remove TREE2_LAUB also in that example.
Edit: thanks to @kosciej16 for the solution. In my case, I added a check for whether the file name contains the pattern, so that names without it sort last:
file_names = sorted(os.listdir(folder_path), key=lambda el: int(re.search(r"SITE(\d+)", el).group(1)) if re.search(r"SITE(\d+)", el) else float('inf'))
You are very close! You just need to change the sorting key:
import re
# l is the list of file names, e.g. l = os.listdir(folder_path)
file_names = sorted(l, key=lambda el: int(re.search(r"SITE(\d+)", el).group(1)))
Here we are saying that we want to find "SITE" followed by one or more digits. The parentheses create a capture group for those digits, so we can access them with .group(1).
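To make the regex step concrete, here is a minimal sketch that applies the same pattern to the three file names from the question. The helper name site_number and the variable names are only illustrative, and the sketch assumes every name contains SITE followed by digits.

```python
import re

# The three example file names from the question.
names = [
    "4A085SHT_SITE12_TREE3_LAUB.csv",
    "67E89SOIL_SITE9_TREE2_LAUB.csv",
    "4BBA3DEND_SITE10TREE2_LAUB.csv",
]

def site_number(name):
    # Hypothetical helper: capture the digits that directly follow "SITE"
    # and convert them to an int so the sort is numeric, not alphabetical.
    return int(re.search(r"SITE(\d+)", name).group(1))

# Extracted numbers, in the original order of the list.
site_numbers = [site_number(n) for n in names]

# File names ordered by the extracted number.
sorted_names = sorted(names, key=site_number)
```

Because the match stops at the first non-digit, it does not matter whether an underscore follows the number (SITE10TREE2 works as well as SITE12_TREE3).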
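This assumes that every file name matches the pattern; for a name without SITE followed by digits, re.search returns None and .group(1) raises an AttributeError. If that can happen, you can do the following: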
def foo(el):
    # assumes python 3.8+
    if (match := re.search(r"SITE(\d+)", el)) is not None:
        return int(match.group(1))
    # Put such elements on the end
    return float("inf")

file_names = sorted(l, key=foo)
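If you are on a Python version older than 3.8, the same idea can be written without the walrus operator. The following is only a sketch: the names SITE_PATTERN, site_key, and the extra file notes.txt are made up for illustration and do not come from the question.

```python
import re

# Compile once, since the pattern is reused for every file name.
SITE_PATTERN = re.compile(r"SITE(\d+)")

def site_key(name):
    # Hypothetical key function: return the integer after "SITE",
    # or infinity when the name does not contain the pattern.
    match = SITE_PATTERN.search(name)
    if match is None:
        return float("inf")
    return int(match.group(1))

# Example names from the question plus one invented non-matching name.
sample = [
    "4A085SHT_SITE12_TREE3_LAUB.csv",
    "notes.txt",  # assumption: a file without SITE, only for illustration
    "67E89SOIL_SITE9_TREE2_LAUB.csv",
    "4BBA3DEND_SITE10TREE2_LAUB.csv",
]

ordered = sorted(sample, key=site_key)
```

Python can compare int with float, so mixing real site numbers with float("inf") in the key is fine, and the non-matching names end up at the end of the list.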