Search code examples
pythonnlpmarkdown

How to Efficiently Convert a Markdown Table to a DataFrame in Python?


I need to convert a markdown table into a pandas DataFrame. I've managed to do this using the pd.read_csv function with '|' as the separator, but it seems like there's some additional cleanup required. Specifically, I need to remove the row containing '-----', which is used for table separation, and I also want to get rid of the last column.

Here's a simplified example of what I'm doing:

import pandas as pd
from io import StringIO

# The text containing the table
text = """
| Some Title | Some Description             | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game           | 5           |
| Bloodborne | This one is even better      | 2           |
| Sekiro     | This one is also pretty good | 110101      |
"""

# Use StringIO to create a file-like object from the text
text_file = StringIO(text)

# Read the table using pandas read_csv with '|' as the separator
df = pd.read_csv(text_file, sep='|', skipinitialspace=True)

# Remove leading/trailing whitespace from column names
df.columns = df.columns.str.strip()

# Remove the index column
df = df.iloc[:, 1:]

Is there a more elegant and efficient way to convert a markdown table into a DataFrame without needing to perform these additional cleanup steps? I'd appreciate any suggestions or insights on improving this process.


Solution

  • Like this

    import re
    import pandas as pd
    
    text = """
    | Some Title | Some Description             | Some Number |
    |------------|------------------------------|-------------|  
    | Dark Souls | This is a fun game           | 5           |
    | Bloodborne | This one is even better      | 2           |
    | Sekiro     | This one is also pretty good | 110101      |
    """
    
    pattern = r"\| ([\w\s]+) \| ([\w\s]+) \| ([\w\s]+) \|"
    
    # Use the findall function to extract all rows that match the pattern
    matches = re.findall(pattern, text)
    
    # Extract the header and data rows
    header = matches[0]
    data = matches[1:]
    
    # Create a pandas DataFrame using the extracted header and data rows
    df = pd.DataFrame(data, columns=header)
    
    # Optionally, convert numerical columns to appropriate types
    df['Some Number'] = df['Some Number'].astype(int)
    
    print(df)