Search code examples
pythonpandasregexscientific-notation

Converting column with various types of numerical units


I have a dataframe with a column that has various unit types

The rawNo column is how the data comes in. I would like to change it to look like the ConvNo column

datasample = pd.DataFrame(columns=['rawNo','ConvNo'])

datasample = datasample.append({'rawNo': '-4.35%','ConvNo': -.0435},ignore_index = True)
datasample = datasample.append({'rawNo': '246.6K','ConvNo': 246600},ignore_index = True)
datasample = datasample.append({'rawNo': np.nan,'ConvNo': np.nan},ignore_index = True)
datasample = datasample.append({'rawNo': '$12.76B','ConvNo': 12760000000},ignore_index = True)
datasample = datasample.append({'rawNo': '4.68%','ConvNo': .0468},ignore_index = True)
datasample = datasample.append({'rawNo': '¥-459.5B','ConvNo': -459500000000},ignore_index = True)
datasample = datasample.append({'rawNo': '€-6.8B','ConvNo': -6800000000},ignore_index = True)
datasample = datasample.append({'rawNo': '£-15.623B','ConvNo': -15623000000},ignore_index = True)
datasample = datasample.append({'rawNo': '$-1,400B','ConvNo': -15623000000},ignore_index = True)

I figure I will have to use some type of conditional apply. This apply to remove the percent is failing

def rPercent(value):
    value = str(value)
    count = value.count('%')
    print(count)
    
    if (count != 0):
        return value.str.rstrip('% ').astype('float') / 100.0
    else:
        return value

datasample["ConvNo"] = datasample['actual'].apply(rPercent)

Error I get:

> AttributeError: 'str' object has no attribute 'str'

Data File. you can download file from this link https://projectcodesamples.s3.amazonaws.com/ForEx.csv

The columns I am trying to convert is "actual" the result is in the "CNactual" column


Solution

  • This is posted as a separate answer, since I want to retain the original reproducible example (in case the linked csv is no longer available).

    A careful analysis of the data will show that the non-numeric characters that cause trouble are: commas used as thousand separators, single dash symbols (presumably indicating nan). After incorporating these into the character_mapping the conversion works without raising any errors:

    from numpy import nan
    from pandas import read_csv
    
    df = read_csv("https://projectcodesamples.s3.amazonaws.com/ForEx.csv")
    
    character_mapping = {
        "%": "e-2",
        "K": "e3",
        "M": "e6",
        "B": "e9",
        "T": "e12",
        # note that the currency symbols are mapped to blank string
        # to facilitate the conversion to float
        "£": "",
        "\$": "",
        "¥": "",
        "€": "",
        # if an entry consists of a dash, replace with nan
        "^-$": nan,
        # some numbers have a comma as a thousands separator
        ",": "",
    }
    
    df["Converted"] = df["consensus"].replace(character_mapping, regex=True).astype(float)
    
    mask = (df["Converted"] != df["consensus"]) & (df["consensus"].notna())
    print(df.loc[mask, ["consensus", "Converted"]])