I have a csv file with delimiter in text column. The number of delimiter in text column is different from row to row.
Example of csv data (delimiter is '_'): ID_GROUP_TEXT_DATE_PART 101_group_1_Some text is here_23.06.2023_1 102_group_2_Some text is _ here_23.06.2023_1 103_group_3_Some text _ is _ here_23.06.2023_1 104_group_4_Some text is here_23.06.2023_1
I would like to correctly split the text by the columns. The expected result is:
ID | GROUP | TEXT | DATE | PART |
---|---|---|---|---|
101 | group_1 | Some text is here | 23.06.2023 | 1 |
102 | group_2 | Some text is _ here | 23.06.2023 | 1 |
103 | group_3 | Some text _ is _ here | 23.06.2023 | 1 |
104 | group_4 | Some text is here | 23.06.2023 | 1 |
I would suggest writing a RegEx pattern in order to find the corresponding columns.
In your case you should create a pattern going like: Number_group_n_text_date_Number
SO the final code should be:
import re
import pandas as pd
data = """
101_group_1_Some text is here_23.06.2023_1
102_group_2_Some text is _ here_23.06.2023_1
103_group_3_Some text _ is _ here_23.06.2023_1
104_group_4_Some text is here_23.06.2023_1
"""
pattern = r"(\d+)_group_(\d+)_(.+)_(\d{2}.\d{2}.\d{4})_(\d)"
matches = re.findall(pattern, data)
df = pd.DataFrame(matches, columns=['ID', 'GROUP', 'TEXT', 'DATE', 'PART'])
print(df)