How to Efficiently Convert a Markdown Table to a DataFrame in Python?

I need to convert a markdown table into a pandas DataFrame. I've managed to do this using the pd.read_csv function with '|' as the separator, but it seems like there's some additional cleanup required. Specifically, I need to remove the row containing '-----', which is used for table separation, and I also want to get rid of the last column.

Here's a simplified example of what I'm doing:

import pandas as pd
from io import StringIO

# The text containing the table
text = """
| Some Title | Some Description             | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game           | 5           |
| Bloodborne | This one is even better      | 2           |
| Sekiro     | This one is also pretty good | 110101      |
"""

# Use StringIO to create a file-like object from the text
text_file = StringIO(text)

# Read the table using pandas read_csv with '|' as the separator
df = pd.read_csv(text_file, sep='|', skipinitialspace=True)

# Remove leading/trailing whitespace from column names
df.columns = df.columns.str.strip()

# Remove the index column
df = df.iloc[:, 1:]

Is there a more elegant and efficient way to convert a markdown table into a DataFrame without needing to perform these additional cleanup steps? I'd appreciate any suggestions or insights on improving this process.

Solution

Like this

import re
import pandas as pd

text = """
| Some Title | Some Description             | Some Number |
|------------|------------------------------|-------------|  
| Dark Souls | This is a fun game           | 5           |
| Bloodborne | This one is even better      | 2           |
| Sekiro     | This one is also pretty good | 110101      |
"""

pattern = r"\| ([\w\s]+) \| ([\w\s]+) \| ([\w\s]+) \|"

# Use the findall function to extract all rows that match the pattern
matches = re.findall(pattern, text)

# Extract the header and data rows
header = matches[0]
data = matches[1:]

# Create a pandas DataFrame using the extracted header and data rows
df = pd.DataFrame(data, columns=header)

# Optionally, convert numerical columns to appropriate types
df['Some Number'] = df['Some Number'].astype(int)

print(df)