python pandas dataframe concatenation read-csv

Gathering multiple text files containing multiple tables to a Pandas dataframe

I'm trying to gather multiple csv files from one folder into a dataframe. With this prior question we realized the real issue is that some csv files (summary files) contain more than one table. As a result, the current solution's product (code below) skips a significant portion of the data.

Is there any reasonable way to gather multiple files, each possibly containing multiple tables?

Alternatively, if this makes it easier, I have, and could use, separate text files for each of the tables contained in the larger summary files.

Anyhow, what I seek, is that a single row of the generated dataframe should contain the data from the three separate text files / three tables inside the summary file.

Here is my code for just adding the text files from their folder.

import pandas as pd 
import os
import glob

#define path to dir containing the summary text files
files_folder = "/data/TB/WA_dirty_prep_reports/"

#create a df list using list comprehension
files = [pd.read_csv(file, sep='\t', on_bad_lines='skip') for file in glob.glob(os.path.join(files_folder,"*txt"))] 

#concatanate the list of df's into one df
files_df = pd.concat(files)


print(files_df)

Here is an example file

Sample Summary:
Sample ID:	18RF0375-MI-TBWGS-NextSeq-2019-63_S54
Sample Name:	18RF0375
Coverage Drop:	0
Pipeline Version:	Var
Date:	1/12/2023 9:55


Target Coverage Summary:
Gene Name	Start	End	Flag
gyrB	6571	6762	No deletion
gyrA	7360	7583	No deletion
rpoB	760307	761286	No deletion
mmpR	778989	779487	No deletion
rplC1	800808	801462	No deletion
atpE	1461044	1461290	No deletion
rrs	1473245	1473331	No deletion
rrl	1473657	1476796	No deletion
fabG1	1673409	1674052	No deletion
inhA	1674201	1675012	No deletion
tlyA	1917939	1918747	No deletion
katG	2153888	2156112	No deletion
pncA	2288676	2289272	No deletion
eis	2714123	2715372	No deletion
ahpC	2726093	2726194	No deletion
pepQ	2859299	2860418	No deletion
embB	4246586	4249653	No deletion
ethA	4326003	4327474	No deletion


Variant Summary:
POS	Gene Name	Nucleotide Change	Amino acid Change	Read Depth	Percent Alt Allele	Annotation
1474639	rrl	c.982G>A	NA	109	98.9	rRNA
1476311	rrl	c.2654_2655delGTGinsCCA	NA	100	11.7	rRNA
1476369	rrl	c.2712C>T	NA	110	10	rRNA
1476463	rrl	c.2806C>T	NA	84	10.7	rRNA
1476481	rrl	c.2824T>C	NA	100	13	rRNA
1476506	rrl	c.2849T>C	NA	92	12.7	rRNA
1673425	fabG1 upstream	c.-15C>T	NA	90	98.6	Non-Coding
2154724	katG	c.1388G>T	p.Arg463Leu	70	98.2	Non-synonymous
4247646	embB	c.1133A>C	p.Glu378Ala	110	98.8	Non-synonymous


Interpretations Summary:
Drug	Variant	Interpretation
INH	fabG1 upstream_c.-15C>T,katG_p.Arg463Leu	INH-R
RIF	No reportable variant detected	RIF-S
PZA	No reportable variant detected	PZA-S
FQ	No reportable variant detected	FQ-S
EMB	embB_p.Glu378Ala	EMB-S

Solution

New answer using the posted tables. I have pasted your input in a notepad, saved it as csv and tuned my previous answer with it.

There are 2 strategies, right?

Either open your file and run a pre-screening, documenting positions of known table delimiters, then run as many pd.read_csv as needed, into a list, using a comprehension,
Or read the whole file into a dataframe and work from there to split the tables.

In both cases, the empty rows looks like usable table delimiters.

Here is the way of pandas:

Step 1. Read your csv file with an excess of columns. You know your largest table's size, right?

df_read = pd.read_csv(file,                      # your example, pasted back into a csv
                      sep='\t',                  # tabs for you
                      names=range(20),           # columns in excess. 
                      header=None,               # skip no line on top
                      skip_blank_lines=False     # skip no empty line -we'll use those
                     ).dropna(how='all', axis=1) # Drop excess columns

Used these:

Step 2. Use the empty rows as table delimiters

# Flag empty rows
nullsearch = df_read.isnull().T.all() 
# Assign group index, and add it as an extra column
df_read['group'] = nullsearch.diff().ne(0).where(~nullsearch).astype(float).cumsum().fillna(0)

Used these:

Step 3. Split the dataframe into a list containing all detected tables

df_list = [df_read[df_read.group==i]
           .dropna(how='all', axis=1)
           .drop(columns='group')
           for i in df_read.group.unique() if i]

There are your four tables, three plus the file header, accessible as

df_list[0] # your file header
df_list[1]
df_list[2]
df_list[3]

File header:

                   0                                      1
0         Sample ID:  18RF0375-MI-TBWGS-NextSeq-2019-63_S54
1       Sample Name:                               18RF0375
2     Coverage Drop:                                      0
3  Pipeline Version:                                    Var
4              Date:                       01/12/2023 09:55

Here is for example table 2:

                   0               1                        2                  3           4                   5               6
29  Variant Summary:             NaN                      NaN                NaN         NaN                 NaN             NaN
30               POS       Gene Name        Nucleotide Change  Amino acid Change  Read Depth  Percent Alt Allele      Annotation
31           1474639             rrl                 c.982G>A                NaN         109                98.9            rRNA
32           1476311             rrl  c.2654_2655delGTGinsCCA                NaN         100                11.7            rRNA
33           1476369             rrl                c.2712C>T                NaN         110                  10            rRNA
34           1476463             rrl                c.2806C>T                NaN          84                10.7            rRNA
35           1476481             rrl                c.2824T>C                NaN         100                  13            rRNA
36           1476506             rrl                c.2849T>C                NaN          92                12.7            rRNA
37           1673425  fabG1 upstream                 c.-15C>T                NaN          90                98.6      Non-Coding
38           2154724            katG                c.1388G>T        p.Arg463Leu          70                98.2  Non-synonymous
39           4247646            embB                c.1133A>C        p.Glu378Ala         110                98.8  Non-synonymous

Based on what looks similar in your tables except file header, some formatting could look like this:

def postprocess(df):
    '''
    Format listed dataframes
    - set row 1 as column names
    - drop row 0 and 1
    - reset the index
    '''
    return df.set_axis(labels=df.iloc[1,:].values, axis=1).iloc[2:,:].reset_index(drop=True)

Variant_Summary = postprocess(df_list[2])

        POS       Gene Name        Nucleotide Change Amino acid Change Read Depth Percent Alt Allele      Annotation
0   1474639             rrl                 c.982G>A               NaN        109               98.9            rRNA
1   1476311             rrl  c.2654_2655delGTGinsCCA               NaN        100               11.7            rRNA
2   1476369             rrl                c.2712C>T               NaN        110                 10            rRNA
3   1476463             rrl                c.2806C>T               NaN         84               10.7            rRNA
4   1476481             rrl                c.2824T>C               NaN        100                 13            rRNA
5   1476506             rrl                c.2849T>C               NaN         92               12.7            rRNA
6   1673425  fabG1 upstream                 c.-15C>T               NaN         90               98.6      Non-Coding
7   2154724            katG                c.1388G>T       p.Arg463Leu         70               98.2  Non-synonymous
8   4247646            embB                c.1133A>C       p.Glu378Ala        110               98.8  Non-synonymous