Glitch in Pandas? Cannot overwrite value


I tried running a script I developed previously, which has run fine many times using pandas.

My dataframe has a custom index (unique string values, each identifying an individual protein) and file names as the columns. I then use an iterative procedure to assign counts to some cells in the dataframe. So, let's say I have a default dictionary (my_dict) with a given arbitrary key, and the value is [filename, protein, count].

I have a sorted list of filenames, and a sorted list of proteins, called all_filenames and all_proteins, respectively.

 from collections import defaultdict

 import pandas as pd

 # Rows are proteins, columns are file names; every cell starts as NaN.
 df = pd.DataFrame(index=all_proteins, columns=all_filenames)

 my_dict = defaultdict(list)

 # ... (Assign values to the dictionary)

 for key in my_dict:
     my_filename = my_dict[key][0]
     my_protein = my_dict[key][1]
     my_count = my_dict[key][2]

     # Chained indexing: select the column, then assign into the result.
     df[my_filename][my_protein] = my_count

However, whenever I print df in this case, it comes back entirely blank (all NaN, with the proper index and filenames), which normally doesn't happen.

So to test, I did the following on the dataframe:

>>> my_filename in df.columns.tolist()
True
>>> my_protein in df.index.tolist()
True
>>> df[my_filename][my_protein]
nan
>>> my_count
3.0
>>> type(my_count)
<type 'numpy.float64'>
>>> 
>>> df[my_filename][my_protein] = my_count
>>> df[my_filename][my_protein]
nan
>>> 

I've tried df[my_filename].ix[my_protein], df[my_filename].loc[my_protein], and even creating a custom index.

Normally this script works fine. My file names are typically something like beta_maxi070214_08, so no spaces or non-ASCII characters.

My protein names are all standard: each name is either in the UniProtKB database or is a linkage between two proteins (e.g., ACACA-ACACB).

I'm not really sure what's going on. Does anyone have any suggestions?

EDIT: Here is an example:

>>> my_filename 
'beta_orbi080714_05'
>>> my_protein 
'ACACA:K1316-ACACA:K1363'
>>> my_count 
3.0 
>>> type(my_count) 
<type 'numpy.float64'>
>>> df[my_filename][my_protein] = my_count
>>> df[my_filename][my_protein]
nan
>>> 

Solution

  • Try: df.loc[my_protein, my_filename] = my_count

    The reason for this (from my understanding) is that df['x']['y'] = value is chained indexing: df['x'] is evaluated first and may return a copy of the column rather than a view. So you ARE changing a value, but you're changing it on a temporary copy that is never written back to the data frame. A single indexing call, with the row label first and the column label second, assigns into the frame itself.

    Edit: DSM notes, .loc and .iloc are generally preferred to .ix, which has hard-to-explain semantics and was removed in pandas 1.0. And there's a section of the docs devoted to explaining the view vs. copy issues involved: http://pandas.pydata.org/pandas-docs/stable/indexing.html#returning-a-view-versus-a-copy
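
    A minimal sketch of the single-call assignment with .loc, assuming the layout described in the question (proteins as the index, file names as the columns). The sample file names, protein names, dictionary keys, and counts are hypothetical stand-ins taken from the question's examples. A plain dict replaces the defaultdict, and dtype=float is added so the frame holds numeric NaN cells from the start.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the question's inputs.
all_filenames = ["beta_maxi070214_08", "beta_orbi080714_05"]
all_proteins = ["ACACA-ACACB", "ACACA:K1316-ACACA:K1363"]

# Arbitrary key -> [filename, protein, count], as described in the question.
my_dict = {
    "a": ["beta_orbi080714_05", "ACACA:K1316-ACACA:K1363", np.float64(3.0)],
    "b": ["beta_maxi070214_08", "ACACA-ACACB", np.float64(1.0)],
}

# Rows are proteins, columns are file names; every cell starts as NaN.
df = pd.DataFrame(index=all_proteins, columns=all_filenames, dtype=float)

for my_filename, my_protein, my_count in my_dict.values():
    # One indexing call on the frame itself: row label first, then column label.
    df.loc[my_protein, my_filename] = my_count

print(df)
```

    For a single scalar cell, df.at[my_protein, my_filename] = my_count is an equivalent label-based accessor.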