Search code examples
pythonpandasdataframeconcatenationread-csv

Gathering multiple text files containing multiple tables to a Pandas dataframe


I'm trying to gather multiple csv files from one folder into a dataframe. With this prior question we realized the real issue is that some csv files (summary files) contain more than one table. As a result, the current solution's product (code below) skips a significant portion of the data.

Is there any reasonable way to gather multiple files, each possibly containing multiple tables?

Alternatively, if this makes it easier, I have, and could use, separate text files for each of the tables contained in the larger summary files.

Anyhow, what I seek, is that a single row of the generated dataframe should contain the data from the three separate text files / three tables inside the summary file.

Here is my code for just adding the text files from their folder.

import pandas as pd 
import os
import glob

#define path to dir containing the summary text files
files_folder = "/data/TB/WA_dirty_prep_reports/"

#create a df list using list comprehension
files = [pd.read_csv(file, sep='\t', on_bad_lines='skip') for file in glob.glob(os.path.join(files_folder,"*txt"))] 

#concatanate the list of df's into one df
files_df = pd.concat(files)


print(files_df)

Here is an example file

Sample Summary:
Sample ID: 18RF0375-MI-TBWGS-NextSeq-2019-63_S54
Sample Name: 18RF0375
Coverage Drop: 0
Pipeline Version: Var
Date: 1/12/2023 9:55
Target Coverage Summary:
Gene Name Start End Flag
gyrB 6571 6762 No deletion
gyrA 7360 7583 No deletion
rpoB 760307 761286 No deletion
mmpR 778989 779487 No deletion
rplC1 800808 801462 No deletion
atpE 1461044 1461290 No deletion
rrs 1473245 1473331 No deletion
rrl 1473657 1476796 No deletion
fabG1 1673409 1674052 No deletion
inhA 1674201 1675012 No deletion
tlyA 1917939 1918747 No deletion
katG 2153888 2156112 No deletion
pncA 2288676 2289272 No deletion
eis 2714123 2715372 No deletion
ahpC 2726093 2726194 No deletion
pepQ 2859299 2860418 No deletion
embB 4246586 4249653 No deletion
ethA 4326003 4327474 No deletion
Variant Summary:
POS Gene Name Nucleotide Change Amino acid Change Read Depth Percent Alt Allele Annotation
1474639 rrl c.982G>A NA 109 98.9 rRNA
1476311 rrl c.2654_2655delGTGinsCCA NA 100 11.7 rRNA
1476369 rrl c.2712C>T NA 110 10 rRNA
1476463 rrl c.2806C>T NA 84 10.7 rRNA
1476481 rrl c.2824T>C NA 100 13 rRNA
1476506 rrl c.2849T>C NA 92 12.7 rRNA
1673425 fabG1 upstream c.-15C>T NA 90 98.6 Non-Coding
2154724 katG c.1388G>T p.Arg463Leu 70 98.2 Non-synonymous
4247646 embB c.1133A>C p.Glu378Ala 110 98.8 Non-synonymous
Interpretations Summary:
Drug Variant Interpretation
INH fabG1 upstream_c.-15C>T,katG_p.Arg463Leu INH-R
RIF No reportable variant detected RIF-S
PZA No reportable variant detected PZA-S
FQ No reportable variant detected FQ-S
EMB embB_p.Glu378Ala EMB-S

Solution

  • New answer using the posted tables. I have pasted your input in a notepad, saved it as csv and tuned my previous answer with it.


    There are 2 strategies, right?

    • Either open your file and run a pre-screening, documenting positions of known table delimiters, then run as many pd.read_csv as needed, into a list, using a comprehension,
    • Or read the whole file into a dataframe and work from there to split the tables.

    In both cases, the empty rows looks like usable table delimiters.

    Here is the way of pandas:


    Step 1. Read your csv file with an excess of columns. You know your largest table's size, right?

    df_read = pd.read_csv(file,                      # your example, pasted back into a csv
                          sep='\t',                  # tabs for you
                          names=range(20),           # columns in excess. 
                          header=None,               # skip no line on top
                          skip_blank_lines=False     # skip no empty line -we'll use those
                         ).dropna(how='all', axis=1) # Drop excess columns
    

    Used these:

    Step 2. Use the empty rows as table delimiters

    # Flag empty rows
    nullsearch = df_read.isnull().T.all() 
    # Assign group index, and add it as an extra column
    df_read['group'] = nullsearch.diff().ne(0).where(~nullsearch).astype(float).cumsum().fillna(0)
    

    Used these:

    Step 3. Split the dataframe into a list containing all detected tables

    df_list = [df_read[df_read.group==i]
               .dropna(how='all', axis=1)
               .drop(columns='group')
               for i in df_read.group.unique() if i]
    

    There are your four tables, three plus the file header, accessible as

    df_list[0] # your file header
    df_list[1]
    df_list[2]
    df_list[3]
    

    File header:

                       0                                      1
    0         Sample ID:  18RF0375-MI-TBWGS-NextSeq-2019-63_S54
    1       Sample Name:                               18RF0375
    2     Coverage Drop:                                      0
    3  Pipeline Version:                                    Var
    4              Date:                       01/12/2023 09:55
    

    Here is for example table 2:

                       0               1                        2                  3           4                   5               6
    29  Variant Summary:             NaN                      NaN                NaN         NaN                 NaN             NaN
    30               POS       Gene Name        Nucleotide Change  Amino acid Change  Read Depth  Percent Alt Allele      Annotation
    31           1474639             rrl                 c.982G>A                NaN         109                98.9            rRNA
    32           1476311             rrl  c.2654_2655delGTGinsCCA                NaN         100                11.7            rRNA
    33           1476369             rrl                c.2712C>T                NaN         110                  10            rRNA
    34           1476463             rrl                c.2806C>T                NaN          84                10.7            rRNA
    35           1476481             rrl                c.2824T>C                NaN         100                  13            rRNA
    36           1476506             rrl                c.2849T>C                NaN          92                12.7            rRNA
    37           1673425  fabG1 upstream                 c.-15C>T                NaN          90                98.6      Non-Coding
    38           2154724            katG                c.1388G>T        p.Arg463Leu          70                98.2  Non-synonymous
    39           4247646            embB                c.1133A>C        p.Glu378Ala         110                98.8  Non-synonymous
    

    Based on what looks similar in your tables except file header, some formatting could look like this:

    def postprocess(df):
        '''
        Format listed dataframes
        - set row 1 as column names
        - drop row 0 and 1
        - reset the index
        '''
        return df.set_axis(labels=df.iloc[1,:].values, axis=1).iloc[2:,:].reset_index(drop=True)
    
    Variant_Summary = postprocess(df_list[2])
    
            POS       Gene Name        Nucleotide Change Amino acid Change Read Depth Percent Alt Allele      Annotation
    0   1474639             rrl                 c.982G>A               NaN        109               98.9            rRNA
    1   1476311             rrl  c.2654_2655delGTGinsCCA               NaN        100               11.7            rRNA
    2   1476369             rrl                c.2712C>T               NaN        110                 10            rRNA
    3   1476463             rrl                c.2806C>T               NaN         84               10.7            rRNA
    4   1476481             rrl                c.2824T>C               NaN        100                 13            rRNA
    5   1476506             rrl                c.2849T>C               NaN         92               12.7            rRNA
    6   1673425  fabG1 upstream                 c.-15C>T               NaN         90               98.6      Non-Coding
    7   2154724            katG                c.1388G>T       p.Arg463Leu         70               98.2  Non-synonymous
    8   4247646            embB                c.1133A>C       p.Glu378Ala        110               98.8  Non-synonymous