Python detect dates containing commas and remove comma from text file


I have a text file that contains dates in columns like the ones below. The comma between the day and the year makes it hard to import the data into pandas with pd.read_csv(). The dates sit inside a file with other data that should be ignored, so I can't apply a single operation to the whole document. I need to go through the file, find the dates in this format, and remove the comma inside each date while keeping the commas between dates. What's a simple way to accomplish this?

May 15, 2023, May 22, 2023
August 14, 2023, August 21, 2023
November 14, 2023, November 21, 2023
February 14, 2024, February 22, 2024

Solution

  • You can use re.findall() to find the dates, remove the comma inside each one, and then write the output to a new file.

    import re
    
    with open('my_text_file', 'r') as infile, open('output_file', 'w') as outfile:
        # Process the file one line at a time
        for line in infile:
            # Match "<word> <digits>, <digits>", e.g. "May 15, 2023"
            dates = re.findall(r"\b[A-Za-z]+\s\d+,\s\d+\b", line)
            for date in dates:
                # Drop the comma inside the date; commas between dates stay
                line = line.replace(date, date.replace(",", ""))
            outfile.write(line)
    

    For the sample data, this produces the output below, which I believe is what you are looking for.


    May 15 2023, May 22 2023 
    August 14 2023, August 21 2023 
    November 14 2023, November 21 2023 
    February 14 2024, February 22 2024 
    

    Test Code:

    import re
    
    file_data = """May 15, 2023, May 22, 2023
    August 14, 2023, August 21, 2023
    November 14, 2023, November 21, 2023
    February 14, 2024, February 22, 2024
    """
    
    result_lines = []
    
    # splitlines() avoids the empty trailing element that split('\n') produces
    for line in file_data.splitlines():
        dates = re.findall(r"\b[A-Za-z]+\s\d+,\s\d+\b", line)
        for date in dates:
            line = line.replace(date, date.replace(",", ""))
        result_lines.append(line)
    
    print("\n".join(result_lines))
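
Because the file also contains other data that should be left alone, a stricter pattern may be safer: the `[A-Za-z]+` pattern above would also match text such as "item 5, 10". The sketch below is one possible variant, not part of the original answer. It assumes the dates always use full English month names, a one- or two-digit day, and a four-digit year, and it uses re.sub() with two capture groups so that only the comma inside each date is dropped. The names MONTHS, DATE_PATTERN, and strip_date_commas are made up for illustration.

```python
import re

# Assumption: dates look like "<full English month name> <day>, <4-digit year>".
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Group 1 is the month and day, group 2 is the whitespace plus the year.
# The comma sits between the two groups, so it is not copied into the result.
DATE_PATTERN = re.compile(rf"\b((?:{MONTHS})\s\d{{1,2}}),(\s\d{{4}})\b")


def strip_date_commas(text):
    """Remove the comma between day and year; leave every other comma alone."""
    return DATE_PATTERN.sub(r"\1\2", text)


sample = "May 15, 2023, May 22, 2023\nitem 5, 10 is not a date\n"
print(strip_date_commas(sample))
```

The function can be applied to each line inside the file loop shown earlier, or to the whole file contents at once, since text that does not match the date pattern is returned unchanged.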