
Pandas itertuples returns inconsistent type, either Pandas or tuple


I am seeing behavior I have never encountered before from code that I have run many times on smaller datasets. I am parsing VCF files into a pandas DataFrame with read_table. VCF files have a header followed by 9 fixed columns plus one column per individual. Previously, when I iterated over the rows with for row in genomes_df.itertuples():, I could access a column such as "SVLEN" with row.SVLEN, and type(row) was a Pandas object. Today I ran the script on a larger file in the same VCF format (350 columns instead of 10), and it fails with AttributeError: 'tuple' object has no attribute 'SVLEN' because type(row) is now a tuple!

What is going on here? The column names are different (NWD107911.mark_dupes vs NWD107911), but I checked that there are no spaces in the names (I read in another post that spaces can cause different behavior).


Solution

  • It's mentioned in the itertuples documentation:

    With a large number of columns (>255), regular tuples are returned.

    and you can see it in the pandas source code:

            # Python 3 supports at most 255 arguments to constructor, and
            # things get slow with this many fields in Python 2
            if name is not None and len(self.columns) + index < 256:
                # `rename` is unsupported in Python 2.6
                try:
                    itertuple = collections.namedtuple(name,
                                                       fields + list(self.columns),
                                                       rename=True)
                    return map(itertuple._make, zip(*arrays))
                except Exception:
                    pass
    

    Note: The limit of 255 arguments to a CPython call (which also capped the number of namedtuple fields) was removed in Python 3.7, so this behavior could change in future versions of pandas (running on Python 3.7+).
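
One way to make the loop independent of whether itertuples yields namedtuples or plain tuples is to look columns up by position instead of by attribute. The following is only a minimal sketch: the helper column_positions and the made-up frame (9 fixed columns plus 350 sample columns, with invented SVLEN values) are assumptions for illustration, not part of pandas or of the original script.

```python
import pandas as pd


def column_positions(df, index=True):
    """Map each column name to its position in the tuples from df.itertuples().

    Hypothetical helper: with index=True (the itertuples default) the first
    tuple element is the row index, so every column is shifted by one.
    """
    offset = 1 if index else 0
    return {name: i + offset for i, name in enumerate(df.columns)}


# Made-up wide frame standing in for the parsed VCF: 9 fixed columns
# plus 350 per-individual columns, i.e. more than 255 columns in total.
fixed = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "SVLEN"]
samples = [f"NWD{i:06d}.mark_dupes" for i in range(350)]
genomes_df = pd.DataFrame(0, index=range(3), columns=fixed + samples)
genomes_df["SVLEN"] = [100, -250, 4000]

pos = column_positions(genomes_df)

# Integer indexing works on both namedtuples and plain tuples,
# so this loop does not depend on which one itertuples returns.
svlens = [row[pos["SVLEN"]] for row in genomes_df.itertuples()]
```

Positional lookup also sidesteps the fact that names such as NWD107911.mark_dupes are not valid Python identifiers: with rename=True, as in the source excerpt above, namedtuple replaces such names with positional ones of the form _N, so they cannot be reached as row.NWD107911.mark_dupes even when a namedtuple is returned.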