I have this kind of dataset:
import pandas as pd
import numpy as np
x = np.array([
'355395.7037',
'355369.6383',
'355367.881',
'355381.419',
'357394.9D7a82te7o6fm4o9n4t3h7 print: 06/10/202',
'357405.7897626596'])
y = np.array([
'4521429.292',
'4521430.0229',
' 4521430.1191',
'4521430.1256',
'3 13:36 4521735.552137422',
'4521512.725'])
df = pd.DataFrame({'X':x, 'Y':y})
So, sometimes, I may have strings mixed with numbers.
A Solution I thought.
If you note 357394.9D7a82te7o5fm4o9n4t3h7 print: 06/10/202
for example, there are the words Date of month
inside number 357394.97827649437
.
and at y
column : '3 13:36 4521735.552137422'
there is the 3
that came from 2023
from previous print: 06/10/202
and the time 13:36
.
I want to get rid of them in order to have only the numbers.
I may have different situations like:
'357394.9D7a82te7o6fm4o9n4t3h7 print: 06/10/2023'
13:36 4521735.552137422
for example.
Or,
'357394.9 D7a82te7o6fm4o9n4t3h7 print: ',
'06/10/2023 13:36 4521735.552137422'
If you see the numbers, for X
column for example, all numbers have 6
digits before the decimal point, so we could take for example, the first 6 digits from
'357394.9D7a82te7o6fm4o9n4t3h7 print: 06/10/202',
-> 357394 and apply decimal point and fill with the rest numbers until a white space or a word (print) exists. So, the number to take is 357394.97827649437
But the thing is that we have a string and we cannot apply for example float
or int
to process it.
For the second case, for Y
column:
'3 13:36 4521735.552137422',
I think we must search from the end and when we see a decimal point, count 7
digits (Y columns has 7 digits before decimal) and stop there.
** UPD **
If we use:
x = np.array([
'355395.7037',
'355369.6383',
'355367.881',
'355381.419',
'357394.9D7a82te7o6fm4o9n4t3h7 p',
'357405.7897626596'])
y = np.array([
'4521429.292',
'4521430.0229',
'4521430.1191',
'4521430.1256',
'rint: 06/10/2023 13:36 4521735.552137422',
'4521512.725'])
then the solution gives:
X Y
0 355395.7037 4521429.292
1 355369.6383 4521430.0229
2 355367.881 4521430.1191
3 355381.419 4521430.1256
4 357394.97827649437 06102023
5 357405.7897626596 4521512.725
where the Y
values is: 06102023
instead of 4521735.552137422
The best would be to try avoiding this format in the first place, especially if numbers get mixed with your value.
That said, you could try to get rid of letters with replace
, and then to use a regex to str.extract
the number:
out = df.apply(lambda s: s.str.replace('[^\d .]+', '', regex=True)
.str.extract(r'(\d{6,}(?:\.\d+)?)', expand=False))
For a dynamic solution to specify the number of digits per column:
d = {'X': 6, 'Y': 7}
out = df.apply(lambda s: s.str.replace('[^\d .]+', '', regex=True)
.str.extract(fr'(\d{{{d[s.name]},}}(?:\.\d+)?)',
expand=False)
)
Output:
X Y
0 355395.7037 4521429.292
1 355369.6383 4521430.0229
2 355367.881 4521430.1191
3 355381.419 4521430.1256
4 357394.97827649437 4521735.552137422
5 357405.7897626596 4521512.725
Just avoid removing the date separator:
out = df.apply(lambda s: s.str.replace('[^\d ./]+', '', regex=True)
.str.extract(r'(\d{6,}(?:\.\d+)?)', expand=False))
Or only remove letters:
out = df.apply(lambda s: s.str.replace('[a-zA-Z]+', '', regex=True)
.str.extract(r'(\d{6,}(?:\.\d+)?)', expand=False))
Output:
X Y
0 355395.7037 4521429.292
1 355369.6383 4521430.0229
2 355367.881 4521430.1191
3 355381.419 4521430.1256
4 357394.97827649437 4521735.552137422
5 357405.7897626596 4521512.725