Search code examples
pythonarrayspandasdata-cleaning

Strip strings and date, time from mixed string with numbers


I have this kind of dataset:

import pandas as pd
import numpy as np

x = np.array([
            '355395.7037', 
            '355369.6383', 
            '355367.881', 
            '355381.419',
            '357394.9D7a82te7o6fm4o9n4t3h7 print: 06/10/202',
            '357405.7897626596'])

y = np.array([
            '4521429.292', 
            '4521430.0229', 
            ' 4521430.1191', 
            '4521430.1256',
            '3 13:36 4521735.552137422',
            '4521512.725'])

df = pd.DataFrame({'X':x, 'Y':y})

So, sometimes, I may have strings mixed with numbers.

A Solution I thought.

If you note 357394.9D7a82te7o5fm4o9n4t3h7 print: 06/10/202 for example, there are the words Date of month inside number 357394.97827649437.

and at y column : '3 13:36 4521735.552137422' there is the 3 that came from 2023 from previous print: 06/10/202 and the time 13:36.

I want to get rid of them in order to have only the numbers.

I may have different situations like:

'357394.9D7a82te7o6fm4o9n4t3h7 print: 06/10/2023'

13:36 4521735.552137422 for example.

Or,

'357394.9 D7a82te7o6fm4o9n4t3h7 print: ',

'06/10/2023 13:36 4521735.552137422'

If you see the numbers, for X column for example, all numbers have 6 digits before the decimal point, so we could take for example, the first 6 digits from

'357394.9D7a82te7o6fm4o9n4t3h7 print: 06/10/202', -> 357394 and apply decimal point and fill with the rest numbers until a white space or a word (print) exists. So, the number to take is 357394.97827649437

But the thing is that we have a string and we cannot apply for example float or int to process it.

For the second case, for Y column:

'3 13:36 4521735.552137422',

I think we must search from the end and when we see a decimal point, count 7 digits (Y columns has 7 digits before decimal) and stop there.

** UPD **

If we use:

x = np.array([
            '355395.7037', 
            '355369.6383', 
            '355367.881', 
            '355381.419',
            '357394.9D7a82te7o6fm4o9n4t3h7 p',
            '357405.7897626596'])

y = np.array([
            '4521429.292', 
            '4521430.0229', 
            '4521430.1191', 
            '4521430.1256',
            'rint: 06/10/2023 13:36 4521735.552137422',
            '4521512.725'])

then the solution gives:

                    X             Y
0         355395.7037   4521429.292
1         355369.6383  4521430.0229
2          355367.881  4521430.1191
3          355381.419  4521430.1256
4  357394.97827649437      06102023
5   357405.7897626596   4521512.725

where the Y values is: 06102023 instead of 4521735.552137422


Solution

  • The best would be to try avoiding this format in the first place, especially if numbers get mixed with your value.

    That said, you could try to get rid of letters with replace, and then to use a regex to str.extract the number:

    out = df.apply(lambda s: s.str.replace('[^\d .]+', '', regex=True)
                              .str.extract(r'(\d{6,}(?:\.\d+)?)', expand=False))
    

    For a dynamic solution to specify the number of digits per column:

    d = {'X': 6, 'Y': 7}
    
    out = df.apply(lambda s: s.str.replace('[^\d .]+', '', regex=True)
                              .str.extract(fr'(\d{{{d[s.name]},}}(?:\.\d+)?)',
                                           expand=False)
                  )
    

    Output:

                        X                  Y
    0         355395.7037        4521429.292
    1         355369.6383       4521430.0229
    2          355367.881       4521430.1191
    3          355381.419       4521430.1256
    4  357394.97827649437  4521735.552137422
    5   357405.7897626596        4521512.725
    

    updated example:

    Just avoid removing the date separator:

    out = df.apply(lambda s: s.str.replace('[^\d ./]+', '', regex=True)
                              .str.extract(r'(\d{6,}(?:\.\d+)?)', expand=False))
    
    

    Or only remove letters:

    out = df.apply(lambda s: s.str.replace('[a-zA-Z]+', '', regex=True)
                              .str.extract(r'(\d{6,}(?:\.\d+)?)', expand=False))
    

    Output:

                        X                  Y
    0         355395.7037        4521429.292
    1         355369.6383       4521430.0229
    2          355367.881       4521430.1191
    3          355381.419       4521430.1256
    4  357394.97827649437  4521735.552137422
    5   357405.7897626596        4521512.725