Search code examples
pythonparsingprettytable

Parsing (reading) prettytable text tables


I couldn't find any information about reading ascii tables (prettytable-looking). I need to parse some tables that look like this:

+--------+--------+-------------------+
|               Planets               |
+--------+--------+-------------------+
| Planet | R (km) | mass (x 10^29 kg) |
+--------+--------+-------------------+
|  Sun   | 696000 |    1989100000     |
|(Solar) |        |                   |
+--------+--------+-------------------+
| Earth  |  6371  |      5973.6       |
+--------+--------+-------------------+
|  Moon  |  1737  |       73.5        |
+--------+--------+-------------------+
|  Mars  |  3390  |      641.85       |
+--------+--------+-------------------+

As you can see, this table contains heading and subtitle, but it isn't a main problem. Here how I tried to parse it (all tries was in Python, maybe exist more suitable language):

  • module petl - doesn't work, coud not read +--------+--------+-------------------+
  • np.fromtxt or smth doesnt work too, because there's no array
  • module asciitable doesn't work,
asciitable.read("sample.txt",delimiter='|',guess=False,numpy=False, quotechar="'")
InconsistentTableError: Number of header columns (1) inconsistent with data columns (9) at data line 0
Header values: ['+----------+------------------------------------------------------------+---------------+-----------------+----------------+--------------------+-----------------------+']

and of course I tried all the combinations of parametrs

  • And yes, tried simple (but not) way and used regular expressions. But in case if "row" have more than 1 line there's too hard to catch all exceptions.
  • Also tried simple split, but it unhappy way too...

I heard about numpy substitute by 0 and 1, but its too hard with my table. Please help.


Solution

  • I think you could do this using python, with a few passes you can convert this into something that suits your needs.

    table_str = """
    +--------+--------+-------------------+
    |               Planets               |
    +--------+--------+-------------------+
    | Planet | R (km) | mass (x 10^29 kg) |
    +--------+--------+-------------------+
    |  Sun   | 696000 |    1989100000     |
    |(Solar) |        |                   |
    +--------+--------+-------------------+
    | Earth  |  6371  |      5973.6       |
    +--------+--------+-------------------+
    |  Moon  |  1737  |       73.5        |
    +--------+--------+-------------------+
    |  Mars  |  3390  |      641.85       |
    +--------+--------+-------------------+
    """
    
    table_list = [
        [item.strip() for item in line.split('|') if item]  # maintain the number of columns in rows.
        for line in table_str.strip().split('\n')
        if '+-' not in line  # discard +-
    ]
    
    column_count = len(table_list[1])
    row_idx_to_remove = list()
    for row_idx, row in enumerate(table_list[2:], start=2):
        if any(not c for c in row):  # A multiline column entry
            row_idx_to_remove.append(row_idx)
            for col_idx, col in enumerate(table_list[row_idx - 1]):
                table_list[row_idx - 1][col_idx] += row[col_idx]
    
    # Remove the rows that have already been merged into previous ones.
    for idx in row_idx_to_remove:
        del table_list[idx]
    
    table_dict = {'Name': table_list[0][0], 'Column Names': table_list[1], 'Rows': table_list[2:]}
    
    print(table_dict)
    print(table_list)
    

    Output:

    {'Name': 'Planets', 'Column Names': ['Planet', 'R (km)', 'mass (x 10^29 kg)'], 'Rows': [['Sun(Solar)', '696000', '1989100000'], ['Earth', '6371', '5973.6'], ['Moon', '1737', '73.5'], ['Mars', '3390', '641.85']]}
    [['Planets'], ['Planet', 'R (km)', 'mass (x 10^29 kg)'], ['Sun(Solar)', '696000', '1989100000'], ['Earth', '6371', '5973.6'], ['Moon', '1737', '73.5'], ['Mars', '3390', '641.85']]
    

    This implementation does not handle empty columns in rows nor does it handle the case where every column in a row has more than one line.

    You could also make it use re.split, and split on lines containing +-, then you could handle any rows that would contain the edge cases of this implementation.

    Edit: Here is an implementation that uses re.split:

    import re
    
    
    def parse_ascii_table(ascii_table):
        table_re_list = re.split(r'\+[+-]+', ascii_table)
        table_list = [l.strip().replace('\n', '') for l in table_re_list if l.strip()]
        table = {'Title': table_list[0].replace('|', '').strip(),
                 'Column Names': [ch.strip() for ch in table_list[1].split('|') if ch.strip()],
                 'Rows': list()}
        for row in table_list[2:]:
            joined_row = ['' for _ in range(len(row))]
            for lines in [line for line in row.split('||')]:
                line_part = [i.strip() for i in lines.split('|') if i]
                joined_row = [i + j for i, j in zip(joined_row, line_part)]
            table['Rows'].append(joined_row)
        return table
    
    
    table_str = """
    +--------+--------+-------------------+
    |               Planets               |
    +--------+--------+-------------------+
    | Planet | R (km) | mass (x 10^29 kg) |
    +--------+--------+-------------------+
    |  Sun   | 696000 |    1989100000     |
    |(Solar) |        |                   |
    +--------+--------+-------------------+
    | Earth  |  6371  |      5973.6       |
    +--------+--------+-------------------+
    |  Moon  |  1737  |       73.5        |
    +--------+--------+-------------------+
    |  Mars  |  3390  |      641.85       |
    +--------+--------+-------------------+
    """
    print(parse_ascii_table(table_str))
    

    Output:

    {'Title': 'Planets', 'Column Names': ['Planet', 'R (km)', 'mass (x 10^29 kg)'], 'Rows': [['Sun(Solar)', '696000', '1989100000'], ['Earth', '6371', '5973.6'], ['Moon', '1737', '73.5'], ['Mars', '3390', '641.85']]}