Search code examples
pythoncsvtext

How to replace all characters before and after specific place in text in Python


I have a csv file with delimiter in text column. The number of delimiter in text column is different from row to row.

Example of csv data (delimiter is '_'): ID_GROUP_TEXT_DATE_PART 101_group_1_Some text is here_23.06.2023_1 102_group_2_Some text is _ here_23.06.2023_1 103_group_3_Some text _ is _ here_23.06.2023_1 104_group_4_Some text is here_23.06.2023_1

I would like to correctly split the text by the columns. The expected result is:

ID GROUP TEXT DATE PART
101 group_1 Some text is here 23.06.2023 1
102 group_2 Some text is _ here 23.06.2023 1
103 group_3 Some text _ is _ here 23.06.2023 1
104 group_4 Some text is here 23.06.2023 1

Solution

  • I would suggest writing a RegEx pattern in order to find the corresponding columns.

    In your case you should create a pattern going like: Number_group_n_text_date_Number

    SO the final code should be:

    import re
    import pandas as pd
    
    data = """
    101_group_1_Some text is here_23.06.2023_1
    102_group_2_Some text is _ here_23.06.2023_1
    103_group_3_Some text _ is _ here_23.06.2023_1
    104_group_4_Some text is here_23.06.2023_1
    """
    
    pattern = r"(\d+)_group_(\d+)_(.+)_(\d{2}.\d{2}.\d{4})_(\d)"
    
    matches = re.findall(pattern, data)
    
    df = pd.DataFrame(matches, columns=['ID', 'GROUP', 'TEXT', 'DATE', 'PART'])
    
    print(df)