I have a csv file that comes from outside, so I am not able to change its definitions or formatting and I have to read it into a database.
The file has some things that look very odd for a csv format and I would like to know if there is a way of importing it using csv.reader or pandas without having to make some specific corrections using regex or similar.
The file uses pipe (|) as delimiter. Look at this two example lines:
1. '11|2000-02-18|Badjora|22|33|44|55|""|""Wood 6.5"" Long Line (Pine - Category:15 m | Humidity max: 13 g/m3)""|4'
2. '333|1999-02-18|Badjora|444|555|666|777|""|"""|4'
Is there a way of interpreting this into the following fields?
1. 11,2000-02-18,Badjora,22,33,44,55,,"Wood 6.5"" Long Line (Pine - Category:15 m | Humidity max: 13 g/m3)",4
2. 333,1999-02-18,Badjora,444,555,666,777,,",4
Something like "consider everything enclosed between quotes".
I tried to play with the different options of import settings but with no success. It seems that this format is not valid for csv.
Thank you very much in advance.
Regards to all
Came up with this hacky regex delimiter with pandas that would work for your given example. I don't recommend doing this and advice to actually fix the csv using regex. Anyway here you go:
pd.read_csv('sample.csv', sep=r'"?(?<!\s)\|(?!\s)"?', engine='python', header=None)
There are still areas this could break for example a pipe without any spaces near it. Also an actual separator pipe having spaces before / after it. The main problem is your csv has unescaped quotes causing these problems.