
How to parse a PDF in R and correctly extract space- or tab-separated pieces of text into columns of a data frame?


I am reading a PDF in R using library(pdftools):

library(tidyverse)
library(pdftools)
library(lubridate)

pdf_rowwise <- strsplit(pdf_text("V://path//sample.pdf"), split = "\n")
class(pdf_rowwise[[1]][8:18])

output: [1] "character"

Now taking a sample from this PDF:

pdf_rowwise[[1]][8:18]
 [1] "Test Name                                           Result                   Biological Ref. Int.               Unit"  
 [2] ""                                                                                                                      
 [3] "                                              100 TEST AAROGYA 2.0"                                                    
 [4] "                                              THYROID PROFILE,Serum"                                                   
 [5] "TOTAL TRI IODOTHYRONINE - T3                         0.89                    0.80-2.0                           ng/ml" 
 [6] "   (Method : CLIA)"                                                                                                    
 [7] ""                                                                                                                      
 [8] "TOTAL THYROXINE - T4                                 8.64                    6.09 - 12.23                       ug/dL" 
 [9] "   (Method : CLIA)"                                                                                                    
[10] ""                                                                                                                      
[11] "THYROID STIMULATING HORMONE - TSH                    5.660H                  0.35 - 5.50                        uIU/mL"

I have also saved the above output as a text file at https://raw.githubusercontent.com/johnsnow09/stackover_doubts/main/sample_pdf_text.txt

Either the text above or the text file can be used as the data source. From it I am trying to extract the data on lines 5, 8 and 11 into a data frame with 3 or 4 columns.

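Desired output: a data frame with one row per test (lines 5, 8 and 11), with the Test Name, Result, Biological Ref. Int. and Unit values in separate columns.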

I have tried the snippets below, but none of them works for me:

strsplit(pdf_rowwise[[1]][8:18], split = "\t")
pdf_rowwise[[1]][8:18] %>% as_tibble()
# this combines everything into a 1-column data frame
# the snippets below don't work either
strsplit(pdf_rowwise[[1]][8:18], split = "\t") %>% as_tibble()
strsplit(pdf_rowwise[[1]][8:18], split = "\t") %>% list2DF()
str_split_fixed(pdf_rowwise[[1]][8:18],"                         ",2)
# not giving what I expected

I am new to this sort of parsing and extraction, so I am not sure which libraries and functions are best suited for this work.

UPDATE: I am also trying tabulapdf and have noticed \r in its output. Could this be of any use for column separation?

library(tabulapdf)

strsplit(tabulapdf::extract_text("V:path//sample.pdf"),'\n')

[[1]]
  [1] "100 TEST AAROGYA 2.0\r"                                                                                                                                                            
  [2] "THYROID PROFILE,Serum\r"                                                                                                                                                           
  [3] "TOTAL TRI IODOTHYRONINE - T3\r"                                                                                                                                                    
  [4] "(Method : CLIA)\r"                                                                                                                                                                 
  [5] "0.89 0.80-2.0 ng/ml\r"                                                                                                                                                             
  [6] "TOTAL THYROXINE - T4\r"                                                                                                                                                            
  [7] "(Method : CLIA)\r"                                                                                                                                                                 
  [8] "8.64 6.09 - 12.23 ug/dL\r"                                                                                                                                                         
  [9] "THYROID STIMULATING HORMONE - TSH\r"                                                                                                                                               
 [10] "(Method : CLIA)\r"                                                                                                                                                                 
 [11] "5.660H 0.35 - 5.50 uIU/mL\r"

Sample text form:

tabulapdf::extract_text("V:path//sample.pdf")
[1] "100 TEST AAROGYA 2.0\r\nTHYROID PROFILE,Serum\r\nTOTAL TRI IODOTHYRONINE - T3\r\n(Method : CLIA)\r\n0.89 0.80-2.0 ng/ml\r\nTOTAL THYROXINE - T4\r\n(Method : CLIA)\r\n8.64 6.09 - 12.23 ug/dL\r\nTHYROID STIMULATING HORMONE - TSH\r\n(Method : CLIA)\r\n5.660H 0.35 - 5.50 uIU/mL\r\nPregnancy reference ranges for TSH\r\n1st Trimester :  0.10 - 2.50\r\n2nd Trimester : 0.20 - 3.00\r\n3rd Trimester :  0.30 - 3.00\r\nReference: Guidelines of American Thyroid Association for the Diagnosis and Management of Thyroid Disease During Pregnancy\r\nand Postpartum, Thyroid, 2011, 21; 1-46\r\nCOMMENTS:\r\nThe levels of Thyroid hormones (T3, T4 & FT3, FT4) are low in case of Primary, Secondary and Tertiary hypothyroidism and\r\nsometimes in nonthyroidal illness also.
# pdf text read results
pdf_text("V://path//sample.pdf")

output:

Test Name                                           Result                   Biological Ref. Int.               Unit\n\n                                              100 TEST AAROGYA 2.0\n                                              THYROID PROFILE,Serum\nTOTAL TRI IODOTHYRONINE - T3                         0.89                    0.80-2.0                           ng/ml\n   (Method : CLIA)\n\nTOTAL THYROXINE - T4                                 8.64                    6.09 - 12.23                       ug/dL\n   (Method : CLIA)\n\nTHYROID STIMULATING HORMONE - TSH                    5.660H                  0.35 - 5.50                        uIU/mL\n   (Method : CLIA)\n\nPregnancy reference ranges for TSH\n1st Trimester : 0.10 - 2.50\n2nd Trimester : 0.20 - 3.00\n3rd Trimester : 0.30 - 3.00\nReference: Guidelines of American Thyroid Association for the Diagnosis and Management of Thyroid Disease During Pregnancy\nand Postpartum, Thyroid, 2011, 21; 1-46\n\nCOMMENTS:\nThe levels of Thyroid hormones (T3, T4 & FT3, FT4) are low in case of Primary, Secondary and Tertiary hypothyroidism and\nsometimes in nonthyroidal illness also. Increase levels are found in Grave’s disease, Hyperthyroidism and Thyroid Hormone\nresistance. TSH levels are raised in Primary Hypothyroidism and are low in Hyperthyroidism and secondary hypothyroidism.\n\nNOTE:\nTSH levels are subject to circadian variation, reaction peak levels between 2-4 am and at a minimum between 6-10 pm. The\nvariation is of the day has influence on the measured serum TSH concentrations.\nTSH values <0.03 uIU/ml need to be clinically correlated due to presence of a rare TSH variant in some individuals.\n\n\n\n\n                                                                                                               Page 1 of 18\n"

Solution

  • Using the input shown in the Note at the end, split each line on runs of 4 or more spaces to get a list, keep only the list elements that have exactly 4 fields, paste each element's fields together with comma separators (a comma does not appear in any of the retained lines), convert the list to a character vector, and read it in using read.csv. No packages are used. The `_` placeholder in the last step requires R 4.2 or later.

    txt |>
      strsplit("    +") |>
      Filter(f = \(x) length(x) == 4) |>
      lapply(paste, collapse = ",") |>
      do.call(what = "c") |>
      read.csv(text = _, check.names = FALSE)
    

    giving

                              Test Name Result Biological Ref. Int.   Unit
    1      TOTAL TRI IODOTHYRONINE - T3   0.89             0.80-2.0  ng/ml
    2              TOTAL THYROXINE - T4   8.64         6.09 - 12.23  ug/dL
    3 THYROID STIMULATING HORMONE - TSH 5.660H          0.35 - 5.50 uIU/mL
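
The result above keeps every column as text, so a value such as 5.660H still carries its trailing letter and each reference interval is a single string. As a minimal sketch of one possible follow-up, the code below derives numeric columns from it. It assumes that a trailing letter on a result is a flag and that every interval has the form "low - high" with non-negative bounds; the names res, value, flag, ref_low and ref_high are made up for illustration, and the data frame is rebuilt by hand so the example runs on its own.

```r
# Stand-in for the data frame shown above, rebuilt by hand
res <- data.frame(
  `Test Name` = c("TOTAL TRI IODOTHYRONINE - T3", "TOTAL THYROXINE - T4",
                  "THYROID STIMULATING HORMONE - TSH"),
  Result = c("0.89", "8.64", "5.660H"),
  `Biological Ref. Int.` = c("0.80-2.0", "6.09 - 12.23", "0.35 - 5.50"),
  Unit = c("ng/ml", "ug/dL", "uIU/mL"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Numeric part of the result: drop any trailing letters (assumed to be a flag)
res$value <- as.numeric(sub("[A-Za-z]+$", "", res$Result))

# Whatever follows the leading digits and dots, or "" when nothing does
res$flag <- sub("^[0-9.]+", "", res$Result)

# Split "low - high" on the hyphen, tolerating optional spaces around it
bounds <- strsplit(res$`Biological Ref. Int.`, " *- *")
res$ref_low  <- as.numeric(sapply(bounds, `[`, 1))
res$ref_high <- as.numeric(sapply(bounds, `[`, 2))

res[, c("Test Name", "value", "flag", "ref_low", "ref_high", "Unit")]
```

Splitting on the hyphen would not work for intervals with negative bounds or for one-sided limits such as "<5.0", so those would need their own handling.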
    

    Note

    Input used

    txt <- c("Test Name                                           Result                   Biological Ref. Int.               Unit", 
    "", "                                              100 TEST AAROGYA 2.0", 
    "                                              THYROID PROFILE,Serum", 
    "TOTAL TRI IODOTHYRONINE - T3                         0.89                    0.80-2.0                           ng/ml", 
    "   (Method : CLIA)", "", "TOTAL THYROXINE - T4                                 8.64                    6.09 - 12.23                       ug/dL", 
    "   (Method : CLIA)", "", "THYROID STIMULATING HORMONE - TSH                    5.660H                  0.35 - 5.50                        uIU/mL"
    )
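
The input above is already a vector of lines, whereas pdf_text() returns one long string per page. As a rough sketch of how the same idea could be applied to such a string, the hypothetical helper parse_page() below (its name and its n_fields argument are not from the original) splits a page into lines, splits each line on runs of 4 or more spaces, and binds the lines that have the expected number of fields into a data frame. It assumes the first such line is the header. The pattern "\r?\n" also absorbs the \r seen in the tabulapdf output, which is only the carriage-return half of a Windows line ending and not a column separator. The page string here is an abbreviated version of the question's sample.

```r
# Hypothetical helper: turn one page of text into a data frame.
# `page` is a single string; `n_fields` is the number of columns expected.
parse_page <- function(page, n_fields = 4) {
  # Split into lines, accepting both LF and CRLF line endings
  lines <- strsplit(page, "\r?\n")[[1]]
  # Break each line into fields on runs of 4 or more spaces
  fields <- strsplit(lines, " {4,}")
  # Keep only the lines that have exactly the expected number of fields
  rows <- Filter(function(x) length(x) == n_fields, fields)
  if (length(rows) < 2) stop("no header plus data rows found")
  # Assumption: the first matching line is the header
  out <- as.data.frame(do.call(rbind, rows[-1]), stringsAsFactors = FALSE)
  names(out) <- rows[[1]]
  out
}

# Abbreviated sample page, shaped like the pdf_text() output in the question
page <- paste(
  "Test Name                 Result      Biological Ref. Int.      Unit",
  "",
  "                          THYROID PROFILE,Serum",
  "TOTAL THYROXINE - T4      8.64        6.09 - 12.23              ug/dL",
  "   (Method : CLIA)",
  "THYROID STIMULATING HORMONE - TSH    5.660H      0.35 - 5.50    uIU/mL",
  sep = "\n"
)

parse_page(page)
```

With a real file, something like lapply(pdf_text(path), parse_page) would apply it page by page, though pages that contain no such table would hit the stop() call and need handling. All columns come back as character, which avoids guessing types for values such as 5.660H.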