Search code examples
pythonpandasclipboardcopy-paste

Parse prettyprinted tabular data with pandas


What is the best way to copy a table that contains different delimeters, spaces in column names etc. The function pd.read_clipboard() cannot manage this task on its own.

Example 1:

| Age Category | A | B  | C  | D |
|--------------|---|----|----|---|
| 21-26        | 2 | 2  | 4  | 1 |
| 26-31        | 7 | 11 | 12 | 5 |
| 31-36        | 3 | 5  | 5  | 2 |
| 36-41        | 2 | 4  | 1  | 7 |
| 41-46        | 0 | 1  | 3  | 2 |
| 46-51        | 0 | 0  | 2  | 3 |

Expected result:

 Age Category  A  B   C   D    
 21-26         2  2   4   1 
 26-31         7  11  12  5 
 31-36         3  5   5   2 
 36-41         2  4   1   7 
 41-46         0  1   3   2 
 46-51         0  0   2   3

EDIT:

Example 2:

+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+

Expected result:

   id firstName  lastName
0   1      Mark     Brown
1   2       Tom  Anderson
2   3    Joshua  Peterson

I look for a universal approach that can be applied to the most common table types.


Solution

  • One option is to bite the bullet and just preprocess your data. This isn't all that bad, there's only so many cases pd.read_csv can handle in its arguments, and if you want to be exhaustive with the cases you handle you'll eventually end up turning to regex.

    To handle most of the common cases of prettyprinted tables, I'd just write a loop to filter out/replace characters in lines, then read in the output using a relatively simpler read_csv call.

    import os 
    
    def load(filename):
        with open(filename) as fin, open('temp.txt', 'w') as fout:
            for line in fin:
                if not line.strip()[:2] in {'|-', '+-'}: # filter step
                    fout.write(line.strip().strip('|').replace('|', ',')+'\n')
    
        df = pd.read_csv('temp.txt', sep=r'\s*,\s*', engine='python')
        os.unlink('temp.txt') # cleanup
    
        return df
    

    df1 = load('data1.txt')
    df2 = load('data2.txt')
    
    df1
    
      Age Category  A   B   C
    0        21-26  2   2   4
    1        26-31  7  11  12
    2        31-36  3   5   5
    3        36-41  2   4   1
    4        41-46  0   1   3
    5        46-51  0   0   2
    
    df2
    
       id firstName  lastName
    0   1      Mark     Brown
    1   2       Tom  Anderson
    2   3    Joshua  Peterson