Search code examples
pythonpandascsvexport-to-csvlarge-files

Problem either with number of characters exceeding cell limit, or storing lists of variable length


The problem:

I have lists of genes expressed in 53 different tissues. Originally, this data was stored in a maximal array of the genes, with 'NaN' where there was no expression. I am trying to create new lists for each tissue that just have the genes expressed, as it was very inefficient to be searching through this array every time I was running my script. I have a code that finds the genes for each tissue as required, but I do not know how to store the ouptut.

I was using pandas data frame, and then converting to csv. But this does not accept lists of varying length, unless I put this list as a single item. However, then when I save the data frame to a csv, it tries to squeeze this very long list (all genes exprssed for one tissue) into a single cell. I get an error of the string length exceeding the excel character-per-cell limit.

Therefore I need a way of either dealing with this limit, or stroing my lists in a different way. I would rather just have one file for all lists.

My code:

import csv
import pandas as pd
import math
import numpy as np


#Import list of tissues:
df = pd.read_csv(r'E-MTAB-5214-query-results.tsv', skiprows = [0,1,2,3], sep='\t')
tissuedict=df.to_dict()
tissuelist = list(tissuedict.keys())[2:]

all_genes = [gene for key,gene in tissuedict['Gene Name'].items()]

data = []


for tissue in tissuelist:

    #Create array to keep track of the protein mRnaS in tissue that are not present in the network
    #initiate with first tissue, protein

    nanInd = [key for key,value in tissuedict[tissue].items() if math.isnan(value)]

    tissueExpression = np.delete(all_genes, nanInd)

    datatis = [tissue, tissueExpression.tolist()]

    print(datatis)

    data.append(datatis)


print(data)


df = pd.DataFrame(data)

df.to_csv(r'tissue_expression_data.csv')

Link to data (either one):

https://github.com/joanna-lada/gene_data/blob/master/E-MTAB-5214-query-results.tsv

https://raw.githubusercontent.com/joanna-lada/gene_data/master/E-MTAB-5214-query-results.tsv


Solution

  • IIUC you need lists of the gene names found in each tissue. This writes these lists as columns into a csv:

    import pandas as pd
    
    df = pd.read_csv('E-MTAB-5214-query-results.tsv', skiprows = [0,1,2,3], sep='\t')
    df = df.drop(columns='Gene ID').set_index('Gene Name')
    
    res = pd.DataFrame()
    for c in df.columns:
        res = pd.concat([res, pd.Series(df[c].dropna().index, name=c)], axis=1)
    
    res.to_csv('E-MTAB-5214-query-results.csv', index=False)
    

    (Writing them as rows would have been easier, but Excel can't import so many columns) Don't open the csv in Excel directly, but use a blank worksheet and import the csv (Data - External data, From text), otherwise you can't separate them into Excel columns in one run (at least in Excel 2010).

    enter image description here