Search code examples
juliadataframes.jl

I have a DataFrame, I need a required format from it. I am putting more details below


I have a DataFrame in the following format:

julia> using DataFrames, CSV

julia> str = """Roll,id,name,tag,date1,date2,date3,date4,date5
       0001443646-19-000093,1443646,Rohit,Student,20170331,20180331,20190331
       0001443646-20-000086,1443646,Rohit,Student,20190331,20200331,20180331
       0001443646-21-000079,1443646,Rohit,Student,20190331,20210331,20200331
       0001683168-20-001021,1622879,Mohit,Teacher,20191231
       0001683168-21-001161,1622879,Mohit,Teacher,20201231,20191231
       """
"Roll,id,name,tag,date1,date2,date3,date4,date5\n0001443646-19-000093,1443646,Rohit,Student,20170331,20180331,20190331\n0001443646-20-000086,1443646,Rohit,Student,20190331,20200331,20180331\n0001443646-21-000079,1443646,Rohit,Student,20190331,20210331,20200331\n0001683168-20-001021,1622879,Mohit,Teacher,20191231\n0001683168-21-001161,1622879,Mohit,Teacher,20201231,20191231\n"

julia> df = CSV.read(IOBuffer(str), DataFrame)
5×9 DataFrame
 Row │ Roll                  id       name    tag      date1     date2     date3     date4    date5
     │ String                Int64    String  String   Int64     Int64?    Int64?    Missing  Missing
─────┼────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0001443646-19-000093  1443646  Rohit   Student  20170331  20180331  20190331  missing  missing
   2 │ 0001443646-20-000086  1443646  Rohit   Student  20190331  20200331  20180331  missing  missing
   3 │ 0001443646-21-000079  1443646  Rohit   Student  20190331  20210331  20200331  missing  missing
   4 │ 0001683168-20-001021  1622879  Mohit   Teacher  20191231   missing   missing  missing  missing
   5 │ 0001683168-21-001161  1622879  Mohit   Teacher  20201231  20191231   missing  missing  missing

Now, first, let's talk about df and let's talk about 1st name in :name column which is Rohit. So, the id and tags are the same for name Rohit but the string in Roll is different and there is some duplicates date for that particular name. So, now let's have a look at simplified DataFrame df1 and I will talk about how I manipulated it manually:

Now, the target DataFrame I want to achieve is:

julia> df1 = CSV.read(IOBuffer(str), DataFrame)
2×9 DataFrame
 Row │ Roll                  id       name    tag      date1     date2     date3     date4     date5
     │ String                Int64    String  String   Int64     Int64     Int64?    Int64?    Int64?
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ 0001443646-21-000079  1443646  Rohit   Student  20210331  20210331  20190331  20180331  20170331
   2 │ 0001683168-21-001161  1622879  Mohit   Teacher  20201231  20191231   missing   missing   missing

Lets focus on some points:

  1. First lets focus on :Roll column. In this column you will see there are three different roll strings, for example, 0001443646-21-000079 now here in the between 21 is our higher precedence to consider in comparison to the other two strings 0001443646-20-000086, and 0001443646-19-000093 where we will get 20, and 19 respectively. So according to 0001443646-21-000079 the extracted dates will be:
20190331,20210331,20200331,missing,missing

But now I will make a decreasing order again (means we will put recent years first followed by past years), for examples,

20210331,20200331,20190331,missing,missing

same for the other two strings:

0001443646-19-000093,1443646,Rohit,Student,20190331,20180331,20170331,missing,missing
0001443646-20-000086,1443646,Rohit,Student,20200331,20190331,20180331,missing,missing

Now, as we will consider higher precedence Roll first so if we will encounter duplicates from other two roll string we will not consider it. The final format for name Rohit will look like this:

0001443646-21-000079,1443646,Rohit,Student,20210331,20210331,20190331,20180331,20170331

If you will see in other two Roll strings we are having duplicates. Now, I want to convert the whole idea into the script format. I am having some difficulties with it. Can someone will help me?


Solution

  • I assume you are using DataFrames.jl 1.2.2 and CSV.jl 0.9.2 (current versions).

    Then do:

    julia> df = CSV.read(IOBuffer(str), DataFrame, stringtype=String)
    5×9 DataFrame
     Row │ Roll                  id       name    tag      date1     date2     date3     date4    date5
         │ String                Int64    String  String   Int64     String    String    String   String
    ─────┼────────────────────────────────────────────────────────────────────────────────────────────────
       1 │ 0001443646-19-000093  1443646  Rohit   Student  20170331  20180331  20190331  missing  missing
       2 │ 0001443646-20-000086  1443646  Rohit   Student  20190331  20200331  20180331  missing  missing
       3 │ 0001443646-21-000079  1443646  Rohit   Student  20190331  20210331  20200331  missing  missing
       4 │ 0001683168-20-001021  1622879  Mohit   Teacher  20191231  missing   missing   missing  missing
       5 │ 0001683168-21-001161  1622879  Mohit   Teacher  20201231  20191231  missing   missing  missing
    
    julia> df.date1 = string.(df.date1) # need to do it as in your example you are mixing string and integer columns later
    5-element Vector{String}:
     "20170331"
     "20190331"
     "20190331"
     "20191231"
     "20201231"
    
    julia> combine(groupby(df, :name)) do sdf
           df2 = sort(sdf, by=x->parse(Int, split(x,"-")[2]), rev=true)
           dfr = first(df2)
           pos=findfirst(==("missing"), Vector(dfr[r"date"]))
           for i in 2:nrow(df2), n in names(df2, r"date")
               if !(df2[i, n] in Vector(dfr[r"date"])) && df2[i, n] != "missing"
                   pos > 5 && error("unsupported data")
                   dfr[r"date"][pos] = df2[i, n]
                   pos += 1
               end
           end
           dfr[r"date"][1:pos-1] = sort(Vector(dfr[r"date"][1:pos-1]), rev=true)
           return dfr
           end
    2×9 DataFrame
     Row │ name    Roll                  id       tag      date1     date2     date3     date4     date5
         │ String  String                Int64    String   String    String    String    String    String
    ─────┼──────────────────────────────────────────────────────────────────────────────────────────────────
       1 │ Rohit   0001443646-21-000079  1443646  Student  20210331  20200331  20190331  20180331  20170331
       2 │ Mohit   0001683168-21-001161  1622879  Teacher  20201231  20191231  missing   missing   missing
    

    This is not an optimized code - I was just writing along reading your description. I hope it shows you how you can approach the task.


    If you have missing not "missing" do e.g.:

    julia> df.date4 = missings(Int, 5)
    5-element Vector{Union{Missing, Int64}}:
     missing
     missing
     missing
     missing
     missing
    
    julia> df.date5 = missings(Int, 5)
    5-element Vector{Union{Missing, Int64}}:
     missing
     missing
     missing
     missing
     missing
    
    julia> combine(groupby(df, :name)) do sdf
               df2 = sort(sdf, by=x->parse(Int, split(x,"-")[2]), rev=true)
               dfr = first(df2)
               pos=findfirst(ismissing, Vector(dfr[r"date"]))
               for i in 2:nrow(df2), n in names(df2, r"date")
                   if !ismissing(df2[i, n]) && !(any(x -> isequal(df2[i, n], x), dfr[r"date"]))
                       pos > 5 && error("unsupported data")
                       dfr[r"date"][pos] = df2[i, n]
                       pos += 1
                   end
               end
               dfr[r"date"][1:pos-1] = sort(Vector(dfr[r"date"][1:pos-1]), rev=true)
               return dfr
           end
    2×9 DataFrame
     Row │ name     Roll                  id       tag      date1     date2     date3     date4     date5
         │ String7  String31              Int64    String7  Int64     Int64?    Int64?    Int64?    Int64?
    ─────┼───────────────────────────────────────────────────────────────────────────────────────────────────
       1 │ Rohit    0001443646-21-000079  1443646  Student  20210331  20200331  20190331  20180331  20170331
       2 │ Mohit    0001683168-21-001161  1622879  Teacher  20201231  20191231   missing   missing   missing