I need to convert a markdown table into a pandas DataFrame. I've managed to do this using the pd.read_csv function with '|' as the separator, but it requires some additional cleanup. Specifically, I need to remove the row containing '-----', which separates the header from the body, and I also want to get rid of the empty last column created by the trailing '|'.
Here's a simplified example of what I'm doing:
import pandas as pd
from io import StringIO
# The text containing the table
text = """
| Some Title | Some Description | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game | 5 |
| Bloodborne | This one is even better | 2 |
| Sekiro | This one is also pretty good | 110101 |
"""
# Use StringIO to create a file-like object from the text
text_file = StringIO(text)
# Read the table using pandas read_csv with '|' as the separator
df = pd.read_csv(text_file, sep='|', skipinitialspace=True)
# Remove leading/trailing whitespace from column names
df.columns = df.columns.str.strip()
# Drop the empty first column created by the leading '|'
df = df.iloc[:, 1:]
Is there a more elegant and efficient way to convert a markdown table into a DataFrame without needing to perform these additional cleanup steps? I'd appreciate any suggestions or insights on improving this process.
One option is to match each row with a regular expression, like this:
import re
import pandas as pd
text = """
| Some Title | Some Description | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game | 5 |
| Bloodborne | This one is even better | 2 |
| Sekiro | This one is also pretty good | 110101 |
"""
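# One capture group per column (three here); the '-----' row never matches
# because '-' is neither a word character (\w) nor whitespace (\s)
pattern = r"\| ([\w\s]+) \| ([\w\s]+) \| ([\w\s]+) \|"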
# Use the findall function to extract all rows that match the pattern
matches = re.findall(pattern, text)
# Extract the header and data rows
header = matches[0]
data = matches[1:]
# Create a pandas DataFrame using the extracted header and data rows
df = pd.DataFrame(data, columns=header)
# Optionally, convert numerical columns to appropriate types
df['Some Number'] = df['Some Number'].astype(int)
print(df)
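The pattern above is tied to exactly three columns and to cells made only of word characters and whitespace. If your tables vary, a possible alternative is to split each row on '|' instead. The following is only a minimal sketch: the helper name markdown_table_to_df is hypothetical, and it assumes a well-formed table where every row has the same number of cells and no cell contains an escaped '|'.

```python
import pandas as pd


def markdown_table_to_df(text):
    """Parse a simple pipe-delimited markdown table (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        # Drop the outer pipes, then split into cells and trim the padding
        inner = line[1:-1] if line.endswith("|") else line[1:]
        cells = [cell.strip() for cell in inner.split("|")]
        # Skip the header separator row, e.g. |-----|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    # Assumption: the first remaining row is the header
    header, *data = rows
    return pd.DataFrame(data, columns=header)


text = """
| Some Title | Some Description | Some Number |
|------------|------------------------------|-------------|
| Dark Souls | This is a fun game | 5 |
| Bloodborne | This one is even better | 2 |
| Sekiro | This one is also pretty good | 110101 |
"""

df = markdown_table_to_df(text)
# All cells start out as strings; convert numeric columns explicitly
df["Some Number"] = pd.to_numeric(df["Some Number"])
print(df)
```

This works for any number of columns and also trims alignment padding inside the cells, at the cost of a few more lines than the regex version.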