I have just started exploring pandas. I tried applying logarithmic scaling to a DataFrame column without affecting the source DataFrame. I passed the existing DataFrame (data_source) to the DataFrame constructor, thinking that it would create a copy.
import numpy as np
import pandas as pd
data_source = pd.read_csv("abc.csv")
log_data = pd.DataFrame(data = data_source).apply(lambda x: np.log(x + 1))
I think it works properly, but is this a recommended/correct way of applying scaling to a copied DataFrame? How is it different from the DataFrame.copy method?
pd.DataFrame(data = data_source) does not make a copy. This is documented in the description of the copy argument to the constructor:
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
This is also easily observed by trying to mutate the result:
>>> x = pandas.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
>>> y = pandas.DataFrame(x)
>>> x
   x    y
0  1  1.0
1  2  2.0
2  3  3.0
>>> y
   x    y
0  1  1.0
1  2  2.0
2  3  3.0
>>> y.iloc[0, 0] = 2
>>> x
   x    y
0  2  1.0
1  2  2.0
2  3  3.0
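For comparison, here is a minimal sketch of the same mutation test run against an explicit copy. The variable names original and independent are illustrative and not from the post. Whether the constructor shares data with its input can depend on the pandas version and its copy-on-write settings, so the sketch only checks the behaviour of copy(), which defaults to a deep copy.

```python
import pandas as pd

original = pd.DataFrame({"x": [1, 2, 3], "y": [1.0, 2.0, 3.0]})

# copy() defaults to deep=True, so the new frame owns its own data.
independent = original.copy()

# Mutate the copy in the same way as the constructor experiment.
independent.iloc[0, 0] = 2

print(original.iloc[0, 0])     # value in the original frame
print(independent.iloc[0, 0])  # value in the mutated copy
```

If the two printed values differ, the write to the copy did not reach the original.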
If you want a copy, call the copy method. You don't need a copy here, though: apply already returns a new DataFrame, and better yet, you can call numpy.log or numpy.log1p on DataFrames directly:
>>> x = pandas.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
>>> numpy.log1p(x)
          x         y
0  0.693147  0.693147
1  1.098612  1.098612
2  1.386294  1.386294
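To check that no explicit copy is needed, the following sketch compares numpy.log1p with the apply version from the question and confirms that the source frame is left untouched. The small in-memory frame stands in for the CSV file, and the names before and via_apply are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the data loaded from the CSV file in the question.
data_source = pd.DataFrame({"x": [1, 2, 3], "y": [1.0, 2.0, 3.0]})
before = data_source.copy()  # snapshot used only for the comparison below

# Both calls return a new DataFrame rather than modifying data_source.
log_data = np.log1p(data_source)
via_apply = data_source.apply(lambda col: np.log(col + 1))

print(data_source.equals(before))        # source frame unchanged?
print(np.allclose(log_data, via_apply))  # both approaches agree?
print(isinstance(log_data, pd.DataFrame))
```

numpy.log1p(x) computes log(1 + x) and is more accurate than numpy.log(x + 1) for values of x close to zero, which is one reason to prefer it for this kind of scaling.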