What is the best way to copy a table that contains different delimiters, spaces in column names, etc.? The function pd.read_clipboard()
cannot manage this task on its own.
Example 1:
| Age Category | A | B | C | D |
|--------------|---|----|----|---|
| 21-26 | 2 | 2 | 4 | 1 |
| 26-31 | 7 | 11 | 12 | 5 |
| 31-36 | 3 | 5 | 5 | 2 |
| 36-41 | 2 | 4 | 1 | 7 |
| 41-46 | 0 | 1 | 3 | 2 |
| 46-51 | 0 | 0 | 2 | 3 |
Expected result:
Age Category A B C D
21-26 2 2 4 1
26-31 7 11 12 5
31-36 3 5 5 2
36-41 2 4 1 7
41-46 0 1 3 2
46-51 0 0 2 3
EDIT:
Example 2:
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
| 1| Mark| Brown|
| 2| Tom|Anderson|
| 3| Joshua|Peterson|
+---+---------+--------+
Expected result:
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
I am looking for a universal approach that can be applied to the most common table types.
One option is to bite the bullet and just preprocess your data. This isn't all that bad: there are only so many cases pd.read_csv
can handle through its arguments, and if you want to be exhaustive about the cases you cover, you'll eventually end up turning to regex anyway.
To handle most of the common pretty-printed table formats, I'd write a loop that filters out or replaces characters line by line, then read the cleaned output with a relatively simple read_csv
call.
import os
import pandas as pd

def load(filename):
    with open(filename) as fin, open('temp.txt', 'w') as fout:
        for line in fin:
            # Skip decorative separator rows such as "|---|---|" or "+---+---+"
            if line.strip()[:2] not in {'|-', '+-'}:
                # Drop the outer pipes and turn the inner ones into commas
                fout.write(line.strip().strip('|').replace('|', ',') + '\n')
    # The regex separator swallows the whitespace padding around each cell
    df = pd.read_csv('temp.txt', sep=r'\s*,\s*', engine='python')
    os.unlink('temp.txt')  # clean up the temporary file
    return df
df1 = load('data1.txt')
df2 = load('data2.txt')
df1
Age Category A B C D
0 21-26 2 2 4 1
1 26-31 7 11 12 5
2 31-36 3 5 5 2
3 36-41 2 4 1 7
4 41-46 0 1 3 2
5 46-51 0 0 2 3
df2
id firstName lastName
0 1 Mark Brown
1 2 Tom Anderson
2 3 Joshua Peterson
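Since the question mentions pd.read_clipboard, here is a sketch of the same cleaning step applied to an in-memory string instead of a file, feeding the result to pd.read_csv through io.StringIO so no temporary file is needed. The raw string could come from the clipboard (for example via pyperclip, if you have it installed) or from anywhere else; the literal string below is just for illustration.

import io
import pandas as pd

def load_text(raw):
    # Same filtering as load(), but on a raw string instead of a file
    cleaned = '\n'.join(
        line.strip().strip('|').replace('|', ',')
        for line in raw.splitlines()
        if line.strip() and line.strip()[:2] not in {'|-', '+-'}
    )
    return pd.read_csv(io.StringIO(cleaned), sep=r'\s*,\s*', engine='python')

# Example with the second table pasted as a literal string;
# raw could equally come from a clipboard helper such as pyperclip.paste().
raw = """\
+---+---------+--------+
| id|firstName|lastName|
+---+---------+--------+
|  1|     Mark|   Brown|
|  2|      Tom|Anderson|
|  3|   Joshua|Peterson|
+---+---------+--------+
"""
print(load_text(raw))

The design choice is the same as above: strip the decorative border rows, normalize the remaining delimiters to commas, and let read_csv's regex separator absorb the cell padding.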