I am seeing strange behavior from code that has worked many times before on smaller datasets. I am parsing VCF files with pandas `read_table`. VCF files have a header and then 9 fixed columns plus however many columns of individuals. Previously, when I iterated with `for row in genomes_df.itertuples():`, I could access a column such as "SVLEN" with `row.SVLEN`, and `type(row)` was a pandas namedtuple object. Today I ran my script on a larger file of the same VCF format (350 columns vs. 10 previously), and it raises `AttributeError: 'tuple' object has no attribute 'SVLEN'`, because `type(row)` is now a plain tuple!
What is going on here? The column names are different (`NWD107911.mark_dupes` vs `NWD107911`), but I checked that there are no spaces in the names (I read in another post that spaces can cause different behavior).
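For context, here is a minimal illustration of the attribute access that works on a frame with only a few columns (the data here is made up, not from my VCF):

```python
import pandas as pd

# Small frame: itertuples() yields namedtuples, so attribute access works.
df = pd.DataFrame({"SVLEN": [100, -250], "CHROM": ["chr1", "chr2"]})

for row in df.itertuples():
    print(row.SVLEN)  # works: row is a namedtuple with an SVLEN field
```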
It's mentioned in the `itertuples` documentation:

> With a large number of columns (>255), regular tuples are returned.
and you can see it in the source code here:

```python
# Python 3 supports at most 255 arguments to constructor, and
# things get slow with this many fields in Python 2
if name is not None and len(self.columns) + index < 256:
    # `rename` is unsupported in Python 2.6
    try:
        itertuple = collections.namedtuple(name,
                                           fields + list(self.columns),
                                           rename=True)
        return map(itertuple._make, zip(*arrays))
    except Exception:
        pass
```
Note: the 255-argument limit on CPython calls (and hence on `namedtuple` construction) was lifted in Python 3.7, so this fallback could be removed in future versions of pandas running on Python 3.7+.
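On versions affected by this limit, two workarounds are possible: select only the columns you need before calling `itertuples` (keeping the column count under the threshold), or fall back to positional access, which works on plain tuples and namedtuples alike. A sketch with a made-up wide frame standing in for the VCF data:

```python
import pandas as pd

# Illustrative wide frame (300+ columns), standing in for the real VCF data.
wide = pd.DataFrame({f"col{i}": [i] for i in range(300)})
wide["SVLEN"] = [42]

# Workaround 1: subset to the needed columns so namedtuples are returned.
for row in wide[["SVLEN"]].itertuples():
    print(row.SVLEN)

# Workaround 2: positional access; the index occupies position 0,
# so column j of the frame sits at tuple position j + 1.
pos = wide.columns.get_loc("SVLEN") + 1
for row in wide.itertuples():
    print(row[pos])
```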