python pandas csv markdown delimiter

How to delimit text stored in variable to create a dataframe in python?


I parsed a PDF file using AI models and got markdown output, which is stored in the variable doc_parsed. Below is a sample of its contents, printed with print(doc_parsed[2].text[:1000]):

# Details

|Name|Mr. XYZ|
|---|---|
|Age/Sex|XX YRS/X|
|Id.|01x40xxxxx|
|Refered By|Self|
|Collection On|xx/Aug/20xx 0x:x0AM|
|Collected By|xxxxxxx|
|Sample Rec. On|xx/Aug/20xx xx:x0 AM|
|Collection Mode|HOME COLLECTION|
|Reporting On|xx/Aug/20xx 0x:xx PM|
|BarCode|xxxxxx|

# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|Electrolyte Profile, Serum| | | |
|SODIUM (Na+)|136.2|136 - 145|mmol/L|
|POTASSIUM (K+)|4.23|3.5 - 5.5|mmol/L|
|CHLORIDE(Cl-)|106.24|98.0 - 107|mmol/L|
|TOTAL CALCIUM (Ca)|9.00|8.6-10.2|mg/dL|
|IONIZED CALCIUM|4.52|4.4 - 5.4|mg/dl|
|NON-IONIZED CALCIUM|4.49|4.4 - 5.4|mg/dl|
|pH.(Method : ISE Direct)|7.39|7.35 - 7.45| |

ISSUE: I have tried several ways to split this into DataFrame columns, using | as the delimiter, with pd.read_csv() and pd.read_table(), but none worked.

import pandas as pd
import io

pd.read_table(doc_parsed[2].text[:1000], sep="|")

ValueError: Invalid file path or buffer object type: <class 'llama_index.core.schema.Document'>

import io 

input_text = io.StringIO(print(doc_parsed[2].text[:1000]))

pd.read_csv(input_text,header=None, delimiter="|", 
            usecols = ["Parameter Name", "Result","Unit","Reference Range"])   

EmptyDataError: No columns to parse from file

pd.read_csv(input_text,header=None, delimiter="|")   

EmptyDataError: No columns to parse from file

Appreciate any help here.


Solution

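  • The EmptyDataError most likely comes from io.StringIO(print(...)): print() writes to the console and returns None, so the buffer handed to pandas is empty. Build the buffer from the string itself. The leading and trailing | on each row also produce empty "Unnamed" columns, which need to be dropped.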

    Potential solution: try the following approach, where sample_text is a string holding a single markdown table:

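    import io
    import pandas as pd

    # sample_text: a str holding one markdown table (not the return value of print())
    input_text = io.StringIO(sample_text)
    df = pd.read_csv(input_text, sep="|", skipinitialspace=True)
    # Drop the empty "Unnamed" columns created by the leading and trailing "|"
    df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
    # Drop the |---|---| separator row, which is read as the first data row
    df = df.iloc[1:].reset_index(drop=True)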
    

    Using this method, I was able to successfully parse the markdown into a DataFrame.


    I hope this gives you a clean DataFrame with your data nicely organized into columns.
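
If the text holds more than one table, as in the question's sample with its "# Details" and "# Test Results" sections, a single read_csv call may not cope with the changing column counts. Here is a minimal sketch of one alternative: split on the headings and build one DataFrame per table by hand. The function name parse_markdown_tables and the shortened sample_text are made up for illustration, and the sketch assumes every row within a table has the same number of cells.

```python
import pandas as pd

# Shortened stand-in for doc_parsed[2].text (illustrative data only)
sample_text = """# Details

|Name|Mr. XYZ|
|---|---|
|Age/Sex|XX YRS/X|

# Test Results

|Test Name|Result|Biological Ref. Int.|Unit|
|---|---|---|---|
|Electrolyte Profile, Serum| | | |
|SODIUM (Na+)|136.2|136 - 145|mmol/L|
|POTASSIUM (K+)|4.23|3.5 - 5.5|mmol/L|
"""


def parse_markdown_tables(text):
    """Return {heading: DataFrame} for each pipe table found in text."""
    rows_by_heading = {}
    heading = "untitled"  # used if a table appears before any heading
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#"):
            heading = line.lstrip("#").strip()
        elif len(line) > 1 and line.startswith("|") and line.endswith("|"):
            # Cut the outer pipes, then split the rest into cells
            cells = [cell.strip() for cell in line[1:-1].split("|")]
            # Skip the |---|---| separator row under the header
            if all(cell and set(cell) <= set("-:") for cell in cells):
                continue
            rows_by_heading.setdefault(heading, []).append(cells)
    # First row of each table is the header, the rest is data
    return {
        name: pd.DataFrame(rows[1:], columns=rows[0])
        for name, rows in rows_by_heading.items()
    }


tables = parse_markdown_tables(sample_text)
print(tables["Test Results"])
```

With the real data, sample_text would be doc_parsed[2].text. All cells come back as strings, so something like pd.to_numeric(tables["Test Results"]["Result"], errors="coerce") is one way to get numbers; the empty cell in the "Electrolyte Profile, Serum" group row then becomes NaN.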