Search code examples
pythonpandasvalueerror

Getting a valueerror on merging two dataframes in Pandas


I tried to merge two dataframes using panda but this is the error code that I get:

ValueError: You are trying to merge on datetime64[ns] and datetime64[ns, UTC] columns. If you wish to proceed you should use pd.concat

I have tried different solutions found online but nothing works!! The code has been provided to me and it seems to work on other PCs but not on my computer.

This is my code:

import sys
import os
from datetime import datetime
import numpy  as np
import pandas as pd


# --------------------------------------------------------------------
# -- price, consumption and production                              --
# --------------------------------------------------------------------

fn = '../data/np_data.csv'
if os.path.isfile(fn):
    df_data = pd.read_csv(fn,header=[0],parse_dates=[0])
   
else:
    sys.exit('Could not open data file {}̈́'.format(fn))


# --------------------------------------------------------------------
# -- temp. data                                               --
# --------------------------------------------------------------------

fn = '../data/temp.csv'
if os.path.isfile(fn):
    dtemp = pd.read_csv(fn,header=[0],parse_dates=[0])
   
else:
    sys.exit('Could not open data file {}̈́'.format(fn))


# --------------------------------------------------------------------
# --  price data                                              --
# --   first date: 2014-01-13                                       --
# --   last  date: 2020-02-01                                       --
# --------------------------------------------------------------------

fn = '../data/eprice.csv'
if os.path.isfile(fn):
    eprice = pd.read_csv(fn,header=[0])
   
else:
    sys.exit('Could not open data file {}̈́'.format(fn))


# --------------------------------------------------------------------
# -- combine dataframes (and save as CSV file)                      --
# --------------------------------------------------------------------

#

df= df_data.merge(dtemp, on='time',how='left')      ## This is where I get the error.

print(df.info())
print(eprice.info())

#
# add eprice
df = df.merge(eprice, on='date', how='left')

#
# eprice available only available on trading days
#   fills in missing values, last observation is used
  df = df.fillna(method='ffill')

#
# keep only the relevant time period
df = df[df.date > '2014-01-23']
df = df[df.date < '2020-02-01']


df.to_csv('../data/my_data.csv',index=False)

The datasets that have been imported look normal with expected number of columns and observations. The version that I have in Panda is 1.0.3

Edit:

this is the output (df) when I first merge df_data and dtemp.

                           time  price_sys  price_no1  ...  temp_no3  temp_no4  temp_no5
0 2014-01-23 00:00:00+00:00      32.08      32.08  ...       NaN       NaN       NaN
1 2014-01-24 00:00:00+00:00      31.56      31.60  ...      -2.5      -8.7       2.5
2 2014-01-24 00:00:00+00:00      30.96      31.02  ...      -2.5      -8.7       2.5
3 2014-01-24 00:00:00+00:00      30.84      30.79  ...      -2.5      -8.7       2.5
4 2014-01-24 00:00:00+00:00      31.58      31.10  ...      -2.5      -8.7       2.5

[5 rows x 25 columns]

This is the output for eprice before I merge:

    <bound method NDFrame.head of                      date  gas price  oil price  coal price  carbon price
0     2014-01-24 00:00:00      66.00     107.88       79.42          6.89
1     2014-01-27 00:00:00      64.20     106.69       79.43          7.04
2     2014-01-28 00:00:00      63.75     107.41       79.29          7.20
3     2014-01-29 00:00:00      63.20     107.85       78.52          7.21
4     2014-01-30 00:00:00      62.60     107.95       78.18          7.46
                  ...        ...        ...         ...           ...
1608  2020-03-25 00:00:00      22.30      27.39       67.81         17.51
1609  2020-03-26 00:00:00      21.55      26.34       70.35         17.35
1610  2020-03-27 00:00:00      18.90      24.93       72.46         16.39
1611  2020-03-30 00:00:00      19.20      22.76       71.63         17.06
1612  2020-03-31 00:00:00      18.00      22.74       71.13         17.68

[1613 rows x 5 columns]>

This is what happends when I merge df and eprice:

    <bound method NDFrame.head of                      date  gas price  oil price  coal price  carbon price
0     2014-01-24 00:00:00      66.00     107.88       79.42          6.89
1     2014-01-27 00:00:00      64.20     106.69       79.43          7.04
2     2014-01-28 00:00:00      63.75     107.41       79.29          7.20
3     2014-01-29 00:00:00      63.20     107.85       78.52          7.21
4     2014-01-30 00:00:00      62.60     107.95       78.18          7.46
                  ...        ...        ...         ...           ...
1608  2020-03-25 00:00:00      22.30      27.39       67.81         17.51
1609  2020-03-26 00:00:00      21.55      26.34       70.35         17.35
1610  2020-03-27 00:00:00      18.90      24.93       72.46         16.39
1611  2020-03-30 00:00:00      19.20      22.76       71.63         17.06
1612  2020-03-31 00:00:00      18.00      22.74       71.13         17.68

[1613 rows x 5 columns]>
                       time  price_sys  ...  coal price  carbon price
0 2014-01-23 00:00:00+00:00      32.08  ...         NaN           NaN
1 2014-01-24 00:00:00+00:00      31.56  ...         NaN           NaN
2 2014-01-24 00:00:00+00:00      30.96  ...         NaN           NaN
3 2014-01-24 00:00:00+00:00      30.84  ...         NaN           NaN
4 2014-01-24 00:00:00+00:00      31.58  ...         NaN           NaN

[5 rows x 29 columns]

Solution

  • Try doing df['Time'] = pd.to_datetime(df['Time'], utc = True) on both the time columns before joining (or rather the one without UTC needs to go through this!)