Search code examples
pythonpandasdataframelistcsv

apply to pandas columns doesn't give the correct results


I have this file.

If I read it and do some clean up with regex:

import pandas as pd
import re

df = pd.read_csv('data.csv', index_col=[0])
out = df[['X', 'Y']].apply(lambda s: s.str.extract(r'([a-z\d]+\.[a-z\d]+)', 
                                             expand=False,flags=re.I).str.replace(r'[^\d.]+', '', regex=True))

out.index+=1
out

I can see these results:

      X     Y
2   NaN     NaN
3   NaN     NaN
4   573456.81   3887265.85
5   573453.26   NaN
6   573450.98   NaN
7   NaN     NaN
8   NaN     NaN
9   573445.167  3887284.597
10  NaN     NaN
11  NaN     NaN
12  573703.6758759461   3887233.5301764384
....

which contains NaN values instead of applying the cleaning.

The weird is that the cleaning works because If I just copy the contents of the dataframe I loaded into a list:

thelist = [['1', '573436.862', '3887259.269'],
 ['2', '573436.031', '3887248.472'],
 ['3', '573456.81', '3887265.85'],
 ['4', '573453.26', '3887273.017'],
 ['5', '573450.98', '3887275.878'],
 ['6', '573451.611', '3887276.346'],
 ['7', '573446.959', '3887285.738'],
 ['8', 'H5m7er3o4m45h.n1i6a7 print: 19/02/202', '4 15:24 3887284.597'],
 ['9', '573440.184', '3887292.487'],
 ['10', '573436.862', '3887259.269'],
 ['1', '573703.6758759461', '3887233.5301764384'],
 ['2', '573703.7165950707', '3887233.6523487056'],
 ['3', '573703.769', '3887233.809'],
 ['4', '573707.305', '3887241.398'],
 ['5', '573712.9489897437', '3887251.2139821625'],
 ['6', '57mro3m71hn2ia.949print: 19/02/2024', '15:22 3887251.2139999997'],
 ['7', '573712.981495283', '3887251.2813396226'],
 ['8', '573713H.0m3e2romhnia print: 19/0', '2/2024 15:24 3887251.386'],
 ['9', '573713.096', '3887251.567'],
 ['10', '573713.0960000466', '3887251.5670001707'],
 ['11', '573713.266443923', '3887252.1920684506'],
 ['12', '573725.815', '3887254.127'],
 ['13', '573733.267', '3887255.275'],
 ['14', '573736.197', '3887240.846'],
 ['15', '573742.399', '3887229.682'],
 ['16', '573701.647', '3887220.061'],
 ['17', '573703.6758759461', '3887233.5301764384']]

and apply the cleaning:

arr = np.hstack(thelist)
arr = arr.reshape(arr.shape[0] // 3, 3)
new_df = pd.DataFrame(arr, columns=['K', 'X', 'Y'])

out = new_df[['X', 'Y']].apply(lambda s: s.str.extract(r'([a-z\d]+\.[a-z\d]+)', 
                                               expand=False, flags=re.I)
                                               .str.replace(r'[^\d.]+', '', regex=True))

I receive the correct results!

    X   Y
0   573436.862  3887259.269
1   573436.031  3887248.472
2   573456.81   3887265.85
3   573453.26   3887273.017
4   573450.98   3887275.878
5   573451.611  3887276.346
6   573446.959  3887285.738
7   573445.167  3887284.597
8   573440.184  3887292.487
9   573436.862  3887259.269
10  573703.6758759461   3887233.5301764384
11  573703.7165950707   3887233.6523487056
12  573703.769  3887233.809
13  573707.305  3887241.398
....

Solution

  • That's normal if you see different outputs between the two approaches because you're simply not using the same inputs. In the example below, we can see that some rows in the csv (unlike the list) don't contain a dot (.) and in your regex pattern you ask for a \., which leads to NaN values in the first approach.

    0,1,573436862,3887259269           # << first record of the .csv
    
    ['1', '573436.862', '3887259.269'] # << first record of the list