Search code examples
pythonpandasdataframedatetimeperiod

Convert a string column to period in pandas preserving the string


I would like to understand if I can convert a string column to a PeriodIndex (for instance year), preserving the string (suffix).

I have the following DataFrame:

company            date                          ...       revenue         taxes
Facebook           2017-01-01 00:00:00 Total     ...       1796.00          0.00
Facebook           2018-07-01 00:00:00 Total     ...       7423.20        -11.54
Facebook Total     -                             ...       1704.00          0.00
Google             2017-12-01 00:00:00 Total     ...       1938.60      -1938.60
Google             2018-12-01 00:00:00 Total     ...       1403.47       -102.01
Google             2018-01-01 00:00:00 Total     ...       2028.00        -76.38
Google Total       -                             ...        800.00       -256.98

I'm trying to apply the PeriodIndex to date:

df['date'] = pd.PeriodIndex(df['date'].values, freq='Y')

However, nothing happens because Pandas can't convert it to a string. I can't remove the word Total from my DataFrame.

This is what I expect to achieve:

company            date                          ...       revenue         taxes
Facebook           2017 Total                    ...       1796.00          0.00
Facebook           2018 Total                    ...       7423.20        -11.54
Facebook Total     -                             ...       1704.00          0.00
Google             2017 Total                    ...       1938.60      -1938.60
Google             2018 Total                    ...       1403.47       -102.01
Google             2018 Total                    ...       2028.00        -76.38
Google Total       -                             ...        800.00       -256.98

Any way I can get around with this?

Thanks!


Solution

  • Let's say there is a dummy dataframe, similiar with yours:

    dictionary = {'company' : ['Facebook', 'Facebook', 'Facebook_Total','Google','Google_Total'],
                  'date' : ['2019-09-14 09:00:08.279000+09:00 Total',
                           '2020-09-14 09:00:08.279000+09:00 Total',
                           '-',
                           '2021-09-14 09:00:08.279000+09:00 Total',
                           '-'],
                 'revenue' : [10,20,30,40,50]}
    df = pd.DataFrame(dictionary)
    

    I used regex module to delete Total behind the year column as following:

    substring = ' Total'
    for i in range(len(df)):
        if re.search(substring, df['date'][i] , flags=re.IGNORECASE):
            df['date'][i] = df['date'][i].replace(' Total','')
        else: pass 
    

    Then, I used pd.PeriodIndex as following:

    for i in range(len(df)) :
        if df['date'][i] == '-':
            pass
        else:
            df['date'][i] = pd.PeriodIndex(pd.Series(df['date'][i]), freq='Y')[0]
            
    for i in range(len(df)):
        if df['date'][i] == '-':
            pass
        else:
            df['date'][i] = str(df['date'][i]) + ' Total'
    

    The above code returns :

    Out[1]: 
              company        date  revenue
    0        Facebook  2019 Total       10
    1        Facebook  2020 Total       20
    2  Facebook_Total           -       30
    3          Google  2021 Total       40
    4    Google_Total           -       50