Search code examples
pythonrpandasdataframedplyr

How to swap the columns according to the data type?


The sample data is as below (The data is fake, not real data):

Key Death indicator Date Death Exact date of death Death Cause
00 Alive
02 Death hos Y 25/9/2011 N00
03 Alive
09 Death hos Y J189 28/8/2015
07 Death nonhos 12/6/2018 Y C20

From the table, you can see the types of data within the same columns are not consistent. Date Death should be in date format; Exact date of death should only contain Y, N or blank; Death Cause should be in string (i.e. ICD code).

I forgot to mention one important thing, the date format may not be consistent, e.g. '01-05-2010','01 May 2010' can also appear in the date columns. I tried to perform some basic data cleaning:

Python:

import pandas as pd

death_y_n = death['Date Death'][pd.to_datetime(death['Date Death'], \
                                                                 format='%d/%m/%Y',
                                                                 errors = 'coerce')\
                                                                 .isnull()]

death_disease_case = death['Exact date of death'][~((death['Exact date of death'].isin(['Y','N']))\
                                                      |(death['Exact date of death'].isnull()))]

death['Death Cause'][~pd.to_datetime(\
                                      death['Death Cause'], \
                                      format='%d/%m/%Y', errors = 'coerce')\
                                      .isnull()] = \
                                      death_disease_case

death['Date Death'][pd.to_datetime(\
                                      death['Date Death'], \
                                      format='%d/%m/%Y', errors = 'coerce')\
                                      .isnull()] = \
                                      death_to_date[pd.to_datetime(\
                                                                   death['Date Death'], \
                                                                   format='%d/%m/%Y', errors = 'coerce')\
                                                                   .isnull()]

death['Exact date of death'][~death['Exact date of death'].isin(['Y','N'])] = \
                                  death_y_n[~death['Exact date of death'].isin(['Y','N'])]

death['Death Cause'][pd.to_datetime(\
                                      death['Date Death'], \
                                      format='%d/%m/%Y', errors = 'coerce')\
                                      .isnull()] = \
                                       death_y_n[pd.to_datetime(\
                                      death['Date Death'], \
                                      format='mixed', errors = 'coerce')\
                                      .isnull()]

R:

library(tidyverse)
library(magrittr)
library(anytime)
library(Hmisc)

death_to_date = anytime(death$`Death Cause`) %>% as.character

death_y_n = death$`Date Death`[is.na(as_date(death$`Date Death`))]

death_disease_case = death$`Exact date of death`[death$`Exact date of death` %nin% c('Y','N')]

death$`Death Cause`[!is.na(as_date(death$`Death Cause` ))] = death_disease_case[!is.na(as_date(death$`Death Cause` ))]

death$`Date of Registered Death`[is.na(as_date(death$`Date Death`))] = death_to_date[is.na(as_date(death$`Date Death`))]

death$`Exact date of death`[death$`Exact date of death` %nin% c('Y','N')] = death_y_n[death$`Exact date of death` %nin% c('Y','N')]

However, due to multiple formats of date, some date formats cannot be parsed successfully. Is there a method to swap the columns without using to_datetime()/anytime()?

I am new to Python, if there are any mistakes I made, please point them out! Thank you.

Updated:

My python solution:

import pandas as pd

#for death date variable save as exact date of death:'Y'/'N'
death_to_date_index_exact = (death['Date Death'].isin(['Y','N']))
death_to_date_exact = death['Date Death'][death_to_date_index_exact]

#for death cause variable save as date of death
death_cause_index_date = (~death['Death Cause'].str.contains('^[A-Za-z].*[0-9]$',na=True))
death_cause_date = death['Death Cause'][death_cause_index_date]

#for exact date of death variable save as death cause
death_exact_index_cause = (death['Exact date of death'].str.contains('^[A-Za-z].*[0-9]$',na=False))
death_exact_cause = death['Exact date of death'][death_exact_index_cause]

death['Date Death'][death_cause_index_date] = death_cause_date
death['Exact date of death'][death_to_date_index_exact] = death_to_date_exact
death['Death Cause'][death_exact_index_cause] = death_exact_cause

#Convert the date in death cause into empty
death['Death Cause'][~death['Death Cause'].str.contains('^[A-Za-z].*[0-9]$',na=True)] = np.nan

Solution

  • There's a few different ways you can do this. The simplest way to me, which doesn't require creating a new vector three times, is to create a named list in R, then unnest it wider, as follows:

    library(tidyverse)
      
    df |>
      mutate(sorted = pmap(list(Date.Death, Exact.date.of.death, Death.Cause), \(...) {out <- list()
      for (x in c(...)) { # for each row of the three columns above
        if (x %in% c("Y", "N", NA)) { # we do this one first, otherwise the NAs throw things off
          out$Exact.date.of.death <- x 
        } else if (grepl("/", x)) {
          out$Date.Death <- x
        } else {
          out$Death.Cause <- x
        }
      }
      out
      }), .keep = "unused") |> # drop the columns we used (i.e. Date.Death, Exact.date.of.death, Death.Cause)
      unnest_wider(sorted) # then unnest the named list column, making new versions of them!
    

    Output:

    # A tibble: 5 × 5
      Key   Death.indicator Exact.date.of.death Date.Death Death.Cause
      <chr> <chr>           <chr>               <chr>      <chr>      
    1 Alive NA              NA                  NA         NA         
    2 Death hos             Y                   25/9/2011  N00        
    3 Alive NA              NA                  NA         NA         
    4 Death hos             Y                   28/8/2015  J189       
    5 Death nonhos          Y                   12/6/2018  C20        
    

    Data:

    df <- read.table(text="Key  Death.indicator     Date.Death  Exact.date.of.death     Death.Cause
    00  Alive NA NA NA NA
    02  Death hos   Y   25/9/2011   N00
    03  Alive NA NA NA NA       
    09  Death hos   Y   J189    28/8/2015
    07  Death nonhos    12/6/2018   Y   C20", header=TRUE)
    

    Notes:

    1. You've said you've had trouble parsing your dates. It would be helpful if you could give some examples of the dates you are struggling to parse.
    2. Your data still needs some cleaning - for instance, "Y" and "N" should be TRUE and FALSE. Also if the data is death data, then it would make sense to remove the Alive ones, they add nothing. Also, blanks should NA.