Search code examples
pythonpandasdataframegff

Cell value to column name in pandas


I have the following pandas dataframe (it's a gff file):

df = pd.DataFrame.from_dict({'scaffold name': {0: 'Tname16C00001.1',
  1: 'Tname16C00001.1',
  2: 'Tname16C00001.1',
  3: 'Tname16C00001.1',
  4: 'Tname16C00001.1',
  5: 'Tname16C00001.1',
  6: 'Tname16C00001.1',
  7: 'Tname16C00001.1',
  8: 'Tname16C00001.1',
  9: 'Tname16C00001.1'},
 'source': {0: 'annotation',
  1: 'feature',
  2: 'feature',
  3: 'feature',
  4: 'feature',
  5: 'feature',
  6: 'feature',
  7: 'feature',
  8: 'feature',
  9: 'feature'},
 'type': {0: 'remark',
  1: 'region',
  2: 'gene',
  3: 'CDS',
  4: 'gene',
  5: 'CDS',
  6: 'gene',
  7: 'CDS',
  8: 'gene',
  9: 'CDS'},
 'start': {0: 1,
  1: 1,
  2: 943,
  3: 943,
  4: 1964,
  5: 1964,
  6: 2386,
  7: 2386,
  8: 3998,
  9: 3998},
 'stop': {0: 12018,
  1: 12018,
  2: 1992,
  3: 1992,
  4: 2179,
  5: 2179,
  6: 3897,
  7: 3897,
  8: 5152,
  9: 5152},
 'score': {0: '.',
  1: '.',
  2: '.',
  3: '.',
  4: '.',
  5: '.',
  6: '.',
  7: '.',
  8: '.',
  9: '.'},
 'strand': {0: '.',
  1: '+',
  2: '+',
  3: '+',
  4: '-',
  5: '-',
  6: '+',
  7: '+',
  8: '+',
  9: '+'},
 'phase': {0: '.',
  1: '.',
  2: '.',
  3: '0',
  4: '.',
  5: '0',
  6: '.',
  7: '0',
  8: '.',
  9: '0'},
 'attr': {0: 'accession=Tname16C00001.1;comment=Annotations were generated from the MicroScope annotation platform. Additional results are available at http://www.genoscope.cns.fr/agc/microscope . This file is not suitable for direct databank submission. To contact us: mage%40genoscope.cns.fr .%0AMicroscope genomic region coordinates: 1..12018;data_file_division=BCT;date=05-NOV-2019;organism=Genus Species Strain;source=Genus Species Strain;topology=linear',
  1: 'Is_circular=false;Note=whole genome shotgun linear WGS contig 1;db_xref=taxon:1907535,MaGe/Organism_id:12155,MaGe/Species_code:Tname16,MaGe/Sequence_id:16744,MaGe/Scaffold_id:1,MaGe/Contig_id:1,MaGe/Contig_label:NNNNNODE_1_length_11870_cov_199.017943;mol_type=genomic DNA;organism=Candidatus Thiosymbion hypermnestrae;strain=Strain',
  2: 'locus_tag=Tname16_v1_10001',
  3: 'ID=71429338;db_xref=MaGe:71429338;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10001;note=Evidence 5 : Unknown function;product=protein of unknown function;transl_table=11;translation=MG',
  4: 'locus_tag=Tname16_v1_10002',
  5: 'ID=71429339;db_xref=MaGe:71429339;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10002;note=Evidence 5 : Unknown function;product=protein of unknown function;transl_table=11;translation=MI',
  6: 'gene=wcaJ;locus_tag=Tname16_v1_10003',
  7: 'ID=71429340;db_xref=MaGe:71429340;ec_number=2.7.8.31;gene=wcaJ;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10003;product=UDP-glucose:undecaprenyl-phosphate glucose-1-phosphate transferase;transl_table=11;translation=MY',
  8: 'gene=rffE;locus_tag=Tname16_v1_10004',
  9: 'ID=71429341;db_xref=MaGe:71429341;ec_number=5.1.3.14;function=1.6.4 : Enterobacterial common antigen %28surface glycolipid%29,6.1 : Membrane,6.3 : Surface antigens %28ECA%2C O antigen of LPS%29,7.1 : Cytoplasm;gene=rffE;inference=ab initio prediction:AMIGene:2.0;locus_tag=Tname16_v1_10004;note=Evidence 2a : Function from experimental evidences in other organisms%3B PubMedId 11106477%2C 7559340%2C 8170390%2C 8226648%3B Product type e : enzyme;product=UDP-N-acetyl glucosamine-2-epimerase;transl_table=11;translation=MT'}})

The values in the column attr are actually additional columns, but the file format gff doesn't allow that. I want to split the text in this column into multiple columns. The values are in a broader sense dictionaries, meaning that each key has a value, separated by = (e.g. accession=Tname16C00001.1), and each key-value pairs are separated by ;.

I started by splitting each key-value pair into two columns per df row:

s = df['attr'].str.split(';').apply(pd.Series, 1).stack()
s.index = s.index.droplevel(-1)
s.name = 'attr'
del df['attr']
df.join(s)
df.join(s.apply(lambda x: pd.Series(x.split('='))))

This gives me the following df, with duplicated row indeces:

    scaffold name   source  type    start   stop    score   strand  phase   0   1
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   accession   Tname16C00001.1
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   comment Annotations were generated from the MicroScope...
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   data_file_division  BCT
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   date    05-NOV-2019
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   organism    Genus Species Strain
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   source  Genus Species Strain
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   topology    linear
1   Tname16C00001.1 feature region  1   12018   .   +   .   Is_circular false
1   Tname16C00001.1 feature region  1   12018   .   +   .   Note    whole genome shotgun linear WGS contig 1
1   Tname16C00001.1 feature region  1   12018   .   +   .   db_xref taxon:1907535,MaGe/Organism_id:12155,MaGe/Spec...
1   Tname16C00001.1 feature region  1   12018   .   +   .   mol_type    genomic DNA
1   Tname16C00001.1 feature region  1   12018   .   +   .   organism    Candidatus Thiosymbion hypermnestrae
1   Tname16C00001.1 feature region  1   12018   .   +   .   strain  Strain
2   Tname16C00001.1 feature gene    943 1992    .   +   .   locus_tag   Tname16_v1_10001
...

Now, how do I transpose and summarize each "key-value" pair in column 0 and 1, respectively, for each index? All empty cells can have NaN (that will be many).

My desired output should be:

    scaffold name   source  type    start   stop    score   strand  phase   accession   comment data_file_division  date    organism    source  topology    Is_circular Note    db_xref mol_type    organism    strain  locus_tag
0   Tname16C00001.1 annotation  remark  1   12018   .   .   .   Tname16C00001.1 Annotations were generated from the MicroScope...   BCT 05.Nov.19   Genus Species Strain    Genus Species Strain    linear  NaN NaN NaN NaN NaN NaN NaN
1   Tname16C00001.1 feature region  1   12018   .   +   .   NaN NaN NaN NaN NaN NaN NaN FALSE   whole genome shotgun linear WGS contig 1    taxon:1907535,MaGe/Organism_id:12155,MaGe/Spec...   genomic DNA Candidatus Thiosymbion hypermnestrae    Strain  NaN
2   Tname16C00001.1 feature gene    943 1992    .   +   .   NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Tname16_v1_10001
...

Solution

  • You can create list of dictionaries in list comprehension, then new DataFrame with changed columns names by DataFrame.add_prefix for avoid duplicated columns names if added to original by DataFrame.join:

    L = [dict([y.split('=') for y in x.split(';')]) for x in df.pop('attr')]
    df = df.join(pd.DataFrame(L, index=df.index).add_prefix('attr.'))
    

    print (df)
    
         scaffold name      source    type  start   stop score strand phase  \
    0  Tname16C00001.1  annotation  remark      1  12018     .      .     .   
    1  Tname16C00001.1     feature  region      1  12018     .      +     .   
    2  Tname16C00001.1     feature    gene    943   1992     .      +     .   
    3  Tname16C00001.1     feature     CDS    943   1992     .      +     0   
    4  Tname16C00001.1     feature    gene   1964   2179     .      -     .   
    5  Tname16C00001.1     feature     CDS   1964   2179     .      -     0   
    6  Tname16C00001.1     feature    gene   2386   3897     .      +     .   
    7  Tname16C00001.1     feature     CDS   2386   3897     .      +     0   
    8  Tname16C00001.1     feature    gene   3998   5152     .      +     .   
    9  Tname16C00001.1     feature     CDS   3998   5152     .      +     0   
    
        attr.accession                                       attr.comment  ...  \
    0  Tname16C00001.1  Annotations were generated from the MicroScope...  ...   
    1              NaN                                                NaN  ...   
    2              NaN                                                NaN  ...   
    3              NaN                                                NaN  ...   
    4              NaN                                                NaN  ...   
    5              NaN                                                NaN  ...   
    6              NaN                                                NaN  ...   
    7              NaN                                                NaN  ...   
    8              NaN                                                NaN  ...   
    9              NaN                                                NaN  ...   
    
         attr.locus_tag   attr.ID                    attr.inference  \
    0               NaN       NaN                               NaN   
    1               NaN       NaN                               NaN   
    2  Tname16_v1_10001       NaN                               NaN   
    3  Tname16_v1_10001  71429338  ab initio prediction:AMIGene:2.0   
    4  Tname16_v1_10002       NaN                               NaN   
    5  Tname16_v1_10002  71429339  ab initio prediction:AMIGene:2.0   
    6  Tname16_v1_10003       NaN                               NaN   
    7  Tname16_v1_10003  71429340  ab initio prediction:AMIGene:2.0   
    8  Tname16_v1_10004       NaN                               NaN   
    9  Tname16_v1_10004  71429341  ab initio prediction:AMIGene:2.0   
    
                                               attr.note  \
    0                                                NaN   
    1                                                NaN   
    2                                                NaN   
    3                      Evidence 5 : Unknown function   
    4                                                NaN   
    5                      Evidence 5 : Unknown function   
    6                                                NaN   
    7                                                NaN   
    8                                                NaN   
    9  Evidence 2a : Function from experimental evide...   
    
                                            attr.product attr.transl_table  \
    0                                                NaN               NaN   
    1                                                NaN               NaN   
    2                                                NaN               NaN   
    3                        protein of unknown function                11   
    4                                                NaN               NaN   
    5                        protein of unknown function                11   
    6                                                NaN               NaN   
    7  UDP-glucose:undecaprenyl-phosphate glucose-1-p...                11   
    8                                                NaN               NaN   
    9               UDP-N-acetyl glucosamine-2-epimerase                11   
    
      attr.translation attr.gene attr.ec_number  \
    0              NaN       NaN            NaN   
    1              NaN       NaN            NaN   
    2              NaN       NaN            NaN   
    3               MG       NaN            NaN   
    4              NaN       NaN            NaN   
    5               MI       NaN            NaN   
    6              NaN      wcaJ            NaN   
    7               MY      wcaJ       2.7.8.31   
    8              NaN      rffE            NaN   
    9               MT      rffE       5.1.3.14   
    
                                           attr.function  
    0                                                NaN  
    1                                                NaN  
    2                                                NaN  
    3                                                NaN  
    4                                                NaN  
    5                                                NaN  
    6                                                NaN  
    7                                                NaN  
    8                                                NaN  
    9  1.6.4 : Enterobacterial common antigen %28surf...  
    
    [10 rows x 30 columns]