I have a bunch of files (*.pdb files generated by AlphaFold, for those who work in bio/life science) that I want to read into DataFrames in Julia.
They are typically a few 1000 lines of something like this:
MODEL 1
ATOM 1 N MET B 1 21.976 -98.993 -39.513 1.00 27.24 N
ATOM 2 CA MET B 1 23.354 -99.175 -39.063 1.00 27.24 C
ATOM 3 C MET B 1 23.767 -98.056 -38.113 1.00 27.24 C
ATOM 4 CB MET B 1 24.308 -99.226 -40.258 1.00 27.24 C
ATOM 5 O MET B 1 23.315 -96.918 -38.253 1.00 27.24 O
ATOM 6 CG MET B 1 24.512-100.624 -40.820 1.00 27.24 C
I read them to DataFrames with CSV.read using whitespace delimiter and ignorerepeated;
headers = ["Type", "Index", "Atom", "Amino acid type", "Chain", "Amino acid number", "Position X", "Position Y", "Position Z", "Something", "pIDDT", "Atom type"]
types = Dict(:"Type"=>String, :"Index"=>Int64, :"Atom"=>String, :"Amino acid type"=>String, :"Chain"=>Char, :"Amino acid number"=>Int64, :"Position X"=>Float64, :"Position Y"=>Float64, :"Position Z"=>Float64, :"Something"=>Float64, :"pIDDT"=>Float64, :"Atom type"=>Char)
df = CSV.read(file, DataFrame; header=headers, skipto=2, delim=' ', ignorerepeated=true, types=types)
The problem is with some rows, where a whitespace is "missing". In the last line in the example file above, there is no whitespace between column 7 and 8, since the value i column 8 (-100.624
) takes up the space in front of it.
This results in something like the below, where the row (now row 6) is offset with some data missing:
Row │ Type Index Atom Amino acid type Chain Amino acid number Position X Position Y Position Z Something pIDDT Atom type
│ String Int64? String? String? Char? Int64? Float64? Float64? Float64? Float64? Float64? Char?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ ATOM 1 N MET B 1 21.976 -98.993 -39.513 1.0 27.24 N
2 │ ATOM 2 CA MET B 1 23.354 -99.175 -39.063 1.0 27.24 C
3 │ ATOM 3 C MET B 1 23.767 -98.056 -38.113 1.0 27.24 C
4 │ ATOM 4 CB MET B 1 24.308 -99.226 -40.258 1.0 27.24 C
5 │ ATOM 5 O MET B 1 23.315 -96.918 -38.253 1.0 27.24 O
6 │ ATOM 6 CG MET B 1 missing -40.82 1.0 27.24 missing missing
I was thinking of pre-formatting the file (line for line, if there's a -
with no whitespace in front, add whitespace), but is there a better way?
With CSV.jl I think there is no better solution than pre-formatting. However, since your file is small you probably can just do the following (so that pre-formatting is done as a pre-processing step in RAM):
io = replace(read(file, String), r"(\d)-" => s"\1 -") |> IOBuffer
df = CSV.read(io, DataFrame; header=headers, skipto=2, delim=' ', ignorerepeated=true, types=types)