I have this file.
If I read it and do some cleanup with regex:
import pandas as pd
import re
df = pd.read_csv('data.csv', index_col=[0])
out = df[['X', 'Y']].apply(
    lambda s: s.str.extract(r'([a-z\d]+\.[a-z\d]+)', expand=False, flags=re.I)
               .str.replace(r'[^\d.]+', '', regex=True))
out.index+=1
out
I can see these results:
X Y
2 NaN NaN
3 NaN NaN
4 573456.81 3887265.85
5 573453.26 NaN
6 573450.98 NaN
7 NaN NaN
8 NaN NaN
9 573445.167 3887284.597
10 NaN NaN
11 NaN NaN
12 573703.6758759461 3887233.5301764384
....
which contains NaN values instead of the cleaned data.
The weird thing is that the cleaning itself works, because if I just copy the contents of the DataFrame I loaded into a list:
thelist = [['1', '573436.862', '3887259.269'],
['2', '573436.031', '3887248.472'],
['3', '573456.81', '3887265.85'],
['4', '573453.26', '3887273.017'],
['5', '573450.98', '3887275.878'],
['6', '573451.611', '3887276.346'],
['7', '573446.959', '3887285.738'],
['8', 'H5m7er3o4m45h.n1i6a7 print: 19/02/202', '4 15:24 3887284.597'],
['9', '573440.184', '3887292.487'],
['10', '573436.862', '3887259.269'],
['1', '573703.6758759461', '3887233.5301764384'],
['2', '573703.7165950707', '3887233.6523487056'],
['3', '573703.769', '3887233.809'],
['4', '573707.305', '3887241.398'],
['5', '573712.9489897437', '3887251.2139821625'],
['6', '57mro3m71hn2ia.949print: 19/02/2024', '15:22 3887251.2139999997'],
['7', '573712.981495283', '3887251.2813396226'],
['8', '573713H.0m3e2romhnia print: 19/0', '2/2024 15:24 3887251.386'],
['9', '573713.096', '3887251.567'],
['10', '573713.0960000466', '3887251.5670001707'],
['11', '573713.266443923', '3887252.1920684506'],
['12', '573725.815', '3887254.127'],
['13', '573733.267', '3887255.275'],
['14', '573736.197', '3887240.846'],
['15', '573742.399', '3887229.682'],
['16', '573701.647', '3887220.061'],
['17', '573703.6758759461', '3887233.5301764384']]
and apply the cleaning:
import numpy as np  # needed for np.hstack

arr = np.hstack(thelist)
arr = arr.reshape(arr.shape[0] // 3, 3)
new_df = pd.DataFrame(arr, columns=['K', 'X', 'Y'])
out = new_df[['X', 'Y']].apply(
    lambda s: s.str.extract(r'([a-z\d]+\.[a-z\d]+)', expand=False, flags=re.I)
               .str.replace(r'[^\d.]+', '', regex=True))
I receive the correct results!
X Y
0 573436.862 3887259.269
1 573436.031 3887248.472
2 573456.81 3887265.85
3 573453.26 3887273.017
4 573450.98 3887275.878
5 573451.611 3887276.346
6 573446.959 3887285.738
7 573445.167 3887284.597
8 573440.184 3887292.487
9 573436.862 3887259.269
10 573703.6758759461 3887233.5301764384
11 573703.7165950707 3887233.6523487056
12 573703.769 3887233.809
13 573707.305 3887241.398
....
It's normal to see different outputs between the two approaches because you're simply not using the same inputs. In the example below, we can see that some rows in the csv (unlike the list) don't contain a dot (.), while your regex pattern asks for a \., which leads to the NaN values in the first approach.
0,1,573436862,3887259269 # << first record of the .csv
['1', '573436.862', '3887259.269'] # << first record of the list
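If you want to confirm this on your side, a quick check like the one below makes the mismatch visible (a minimal sketch, assuming data.csv and thelist are the same objects as in the question):

import pandas as pd

# Read everything as strings so the raw values are shown untouched
raw = pd.read_csv('data.csv', index_col=[0], dtype=str)
print(raw.iloc[0].tolist())  # first record of the .csv, e.g. ['1', '573436862', '3887259269'] -- no dot
print(thelist[0])            # first record of the list:     ['1', '573436.862', '3887259.269'] -- dot present

Since str.extract returns NaN whenever the pattern doesn't match, every value without a dot ends up as NaN after the apply.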