Tags: python, pandas, machine-learning, clone

How to create a copy of an existing pandas DataFrame?


I have just started exploring pandas. I tried applying logarithmic scaling to a DataFrame column without affecting the source DataFrame. I passed the existing DataFrame (data_source) to the DataFrame constructor, thinking that it would create a copy.

import numpy as np
import pandas as pd
data_source = pd.read_csv("abc.csv")
log_data = pd.DataFrame(data = data_source).apply(lambda x: np.log(x + 1))

I think it works properly, but is this a recommended/correct way of applying scaling to a copied DataFrame? How is it different from the 'DataFrame.copy' function?


Solution

  • pd.DataFrame(data = data_source) does not make a copy. This is stated in the documentation for the constructor's copy argument:

    copy : boolean, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input

    This is also easily observed by trying to mutate the result:

    >>> import pandas
    >>> x = pandas.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
    >>> y = pandas.DataFrame(x)
    >>> x
       x    y
    0  1  1.0
    1  2  2.0
    2  3  3.0
    >>> y
       x    y
    0  1  1.0
    1  2  2.0
    2  3  3.0
    >>> y.iloc[0, 0] = 2
    >>> x
       x    y
    0  2  1.0
    1  2  2.0
    2  3  3.0
    

    If you want a copy, call the DataFrame's copy method. You don't need one here, though: apply already returns a new DataFrame, and better yet, you can call numpy.log or numpy.log1p on a DataFrame directly:

    >>> import numpy
    >>> x = pandas.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})
    >>> numpy.log1p(x)
              x         y
    0  0.693147  0.693147
    1  1.098612  1.098612
    2  1.386294  1.386294
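
As a rough sketch of how this differs from DataFrame.copy, which the question asks about, the snippet below checks that an explicit copy is independent of its source and that numpy.log1p leaves the source untouched. The names source, independent, log_data and log_x_only are illustrative and not from the original post, and using assign to scale a single column is one possible approach rather than the only one.

```python
import numpy as np
import pandas as pd

source = pd.DataFrame({'x': [1, 2, 3], 'y': [1., 2., 3.]})

# DataFrame.copy() (deep=True by default) returns an independent object:
# writing to the copy leaves the source untouched.
independent = source.copy()
independent.iloc[0, 0] = 100
source_unchanged_by_copy = int(source.iloc[0, 0]) == 1

# np.log1p returns a new DataFrame, so no explicit copy is needed
# before transforming.
log_data = np.log1p(source)
source_unchanged_by_log = int(source.iloc[0, 0]) == 1

# To scale only one column and keep the others as they are, assign()
# also returns a new DataFrame instead of modifying the source in place.
log_x_only = source.assign(x=np.log1p(source['x']))

print(source_unchanged_by_copy, source_unchanged_by_log)
print(log_x_only)
```

Because both log1p and assign return new objects, the explicit copy is only needed when you intend to modify the result in place afterwards.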