I have a DataFrame in the following format:
julia> using DataFrames, CSV
julia> str = """Roll,id,name,tag,date1,date2,date3,date4,date5
0001443646-19-000093,1443646,Rohit,Student,20170331,20180331,20190331
0001443646-20-000086,1443646,Rohit,Student,20190331,20200331,20180331
0001443646-21-000079,1443646,Rohit,Student,20190331,20210331,20200331
0001683168-20-001021,1622879,Mohit,Teacher,20191231
0001683168-21-001161,1622879,Mohit,Teacher,20201231,20191231
"""
"Roll,id,name,tag,date1,date2,date3,date4,date5\n0001443646-19-000093,1443646,Rohit,Student,20170331,20180331,20190331\n0001443646-20-000086,1443646,Rohit,Student,20190331,20200331,20180331\n0001443646-21-000079,1443646,Rohit,Student,20190331,20210331,20200331\n0001683168-20-001021,1622879,Mohit,Teacher,20191231\n0001683168-21-001161,1622879,Mohit,Teacher,20201231,20191231\n"
julia> df = CSV.read(IOBuffer(str), DataFrame)
5×9 DataFrame
Row │ Roll id name tag date1 date2 date3 date4 date5
│ String Int64 String String Int64 Int64? Int64? Missing Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0001443646-19-000093 1443646 Rohit Student 20170331 20180331 20190331 missing missing
2 │ 0001443646-20-000086 1443646 Rohit Student 20190331 20200331 20180331 missing missing
3 │ 0001443646-21-000079 1443646 Rohit Student 20190331 20210331 20200331 missing missing
4 │ 0001683168-20-001021 1622879 Mohit Teacher 20191231 missing missing missing missing
5 │ 0001683168-21-001161 1622879 Mohit Teacher 20201231 20191231 missing missing missing
Now, first, let's talk about df
and let's talk about 1st name in :name
column which is Rohit
. So, the id
and tags
are the same for name
Rohit
but the string in Roll is different and there is some duplicates date for that particular name. So, now let's have a look at simplified DataFrame df1
and I will talk about how I manipulated it manually:
Now, the target DataFrame I want to achieve is:
julia> df1 = CSV.read(IOBuffer(str), DataFrame)
2×9 DataFrame
Row │ Roll id name tag date1 date2 date3 date4 date5
│ String Int64 String String Int64 Int64 Int64? Int64? Int64?
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0001443646-21-000079 1443646 Rohit Student 20210331 20210331 20190331 20180331 20170331
2 │ 0001683168-21-001161 1622879 Mohit Teacher 20201231 20191231 missing missing missing
Lets focus on some points:
:Roll
column. In this column you will see there are three different roll strings, for example, 0001443646-21-000079
now here in the between 21
is our higher precedence to consider in comparison to the other two strings 0001443646-20-000086
, and 0001443646-19-000093
where we will get 20
, and 19
respectively. So according to 0001443646-21-000079
the extracted dates will be:20190331,20210331,20200331,missing,missing
But now I will make a decreasing order again (means we will put recent years first followed by past years), for examples,
20210331,20200331,20190331,missing,missing
same for the other two strings:
0001443646-19-000093,1443646,Rohit,Student,20190331,20180331,20170331,missing,missing
0001443646-20-000086,1443646,Rohit,Student,20200331,20190331,20180331,missing,missing
Now, as we will consider higher precedence Roll first so if we will encounter duplicates from other two roll string we will not consider it. The final format for name Rohit
will look like this:
0001443646-21-000079,1443646,Rohit,Student,20210331,20210331,20190331,20180331,20170331
If you will see in other two Roll strings we are having duplicates. Now, I want to convert the whole idea into the script format. I am having some difficulties with it. Can someone will help me?
I assume you are using DataFrames.jl 1.2.2 and CSV.jl 0.9.2 (current versions).
Then do:
julia> df = CSV.read(IOBuffer(str), DataFrame, stringtype=String)
5×9 DataFrame
Row │ Roll id name tag date1 date2 date3 date4 date5
│ String Int64 String String Int64 String String String String
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
1 │ 0001443646-19-000093 1443646 Rohit Student 20170331 20180331 20190331 missing missing
2 │ 0001443646-20-000086 1443646 Rohit Student 20190331 20200331 20180331 missing missing
3 │ 0001443646-21-000079 1443646 Rohit Student 20190331 20210331 20200331 missing missing
4 │ 0001683168-20-001021 1622879 Mohit Teacher 20191231 missing missing missing missing
5 │ 0001683168-21-001161 1622879 Mohit Teacher 20201231 20191231 missing missing missing
julia> df.date1 = string.(df.date1) # need to do it as in your example you are mixing string and integer columns later
5-element Vector{String}:
"20170331"
"20190331"
"20190331"
"20191231"
"20201231"
julia> combine(groupby(df, :name)) do sdf
df2 = sort(sdf, by=x->parse(Int, split(x,"-")[2]), rev=true)
dfr = first(df2)
pos=findfirst(==("missing"), Vector(dfr[r"date"]))
for i in 2:nrow(df2), n in names(df2, r"date")
if !(df2[i, n] in Vector(dfr[r"date"])) && df2[i, n] != "missing"
pos > 5 && error("unsupported data")
dfr[r"date"][pos] = df2[i, n]
pos += 1
end
end
dfr[r"date"][1:pos-1] = sort(Vector(dfr[r"date"][1:pos-1]), rev=true)
return dfr
end
2×9 DataFrame
Row │ name Roll id tag date1 date2 date3 date4 date5
│ String String Int64 String String String String String String
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Rohit 0001443646-21-000079 1443646 Student 20210331 20200331 20190331 20180331 20170331
2 │ Mohit 0001683168-21-001161 1622879 Teacher 20201231 20191231 missing missing missing
This is not an optimized code - I was just writing along reading your description. I hope it shows you how you can approach the task.
If you have missing
not "missing"
do e.g.:
julia> df.date4 = missings(Int, 5)
5-element Vector{Union{Missing, Int64}}:
missing
missing
missing
missing
missing
julia> df.date5 = missings(Int, 5)
5-element Vector{Union{Missing, Int64}}:
missing
missing
missing
missing
missing
julia> combine(groupby(df, :name)) do sdf
df2 = sort(sdf, by=x->parse(Int, split(x,"-")[2]), rev=true)
dfr = first(df2)
pos=findfirst(ismissing, Vector(dfr[r"date"]))
for i in 2:nrow(df2), n in names(df2, r"date")
if !ismissing(df2[i, n]) && !(any(x -> isequal(df2[i, n], x), dfr[r"date"]))
pos > 5 && error("unsupported data")
dfr[r"date"][pos] = df2[i, n]
pos += 1
end
end
dfr[r"date"][1:pos-1] = sort(Vector(dfr[r"date"][1:pos-1]), rev=true)
return dfr
end
2×9 DataFrame
Row │ name Roll id tag date1 date2 date3 date4 date5
│ String7 String31 Int64 String7 Int64 Int64? Int64? Int64? Int64?
─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Rohit 0001443646-21-000079 1443646 Student 20210331 20200331 20190331 20180331 20170331
2 │ Mohit 0001683168-21-001161 1622879 Teacher 20201231 20191231 missing missing missing