The sample data is as below (The data is fake, not real data):
Key | Death indicator | Date Death | Exact date of death | Death Cause |
---|---|---|---|---|
00 | Alive | |||
02 | Death hos | Y | 25/9/2011 | N00 |
03 | Alive | |||
09 | Death hos | Y | J189 | 28/8/2015 |
07 | Death nonhos | 12/6/2018 | Y | C20 |
From the table, you can see the types of data within the same columns are not consistent. Date Death
should be in date format; Exact date of death
should only contain Y
, N
or blank; Death Cause
should be in string (i.e. ICD code).
I forgot to mention one important thing, the date format may not be consistent, e.g. '01-05-2010','01 May 2010' can also appear in the date columns. I tried to perform some basic data cleaning:
Python:
import pandas as pd
death_y_n = death['Date Death'][pd.to_datetime(death['Date Death'], \
format='%d/%m/%Y',
errors = 'coerce')\
.isnull()]
death_disease_case = death['Exact date of death'][~((death['Exact date of death'].isin(['Y','N']))\
|(death['Exact date of death'].isnull()))]
death['Death Cause'][~pd.to_datetime(\
death['Death Cause'], \
format='%d/%m/%Y', errors = 'coerce')\
.isnull()] = \
death_disease_case
death['Date Death'][pd.to_datetime(\
death['Date Death'], \
format='%d/%m/%Y', errors = 'coerce')\
.isnull()] = \
death_to_date[pd.to_datetime(\
death['Date Death'], \
format='%d/%m/%Y', errors = 'coerce')\
.isnull()]
death['Exact date of death'][~death['Exact date of death'].isin(['Y','N'])] = \
death_y_n[~death['Exact date of death'].isin(['Y','N'])]
death['Death Cause'][pd.to_datetime(\
death['Date Death'], \
format='%d/%m/%Y', errors = 'coerce')\
.isnull()] = \
death_y_n[pd.to_datetime(\
death['Date Death'], \
format='mixed', errors = 'coerce')\
.isnull()]
R:
library(tidyverse)
library(magrittr)
library(anytime)
library(Hmisc)
death_to_date = anytime(death$`Death Cause`) %>% as.character
death_y_n = death$`Date Death`[is.na(as_date(death$`Date Death`))]
death_disease_case = death$`Exact date of death`[death$`Exact date of death` %nin% c('Y','N')]
death$`Death Cause`[!is.na(as_date(death$`Death Cause` ))] = death_disease_case[!is.na(as_date(death$`Death Cause` ))]
death$`Date of Registered Death`[is.na(as_date(death$`Date Death`))] = death_to_date[is.na(as_date(death$`Date Death`))]
death$`Exact date of death`[death$`Exact date of death` %nin% c('Y','N')] = death_y_n[death$`Exact date of death` %nin% c('Y','N')]
However, due to multiple formats of date, some date formats cannot be parsed successfully. Is there a method to swap the columns without using to_datetime()
/anytime()
?
I am new to Python, if there are any mistakes I made, please point them out! Thank you.
Updated:
My python solution:
import pandas as pd
#for death date variable save as exact date of death:'Y'/'N'
death_to_date_index_exact = (death['Date Death'].isin(['Y','N']))
death_to_date_exact = death['Date Death'][death_to_date_index_exact]
#for death cause variable save as date of death
death_cause_index_date = (~death['Death Cause'].str.contains('^[A-Za-z].*[0-9]$',na=True))
death_cause_date = death['Death Cause'][death_cause_index_date]
#for exact date of death variable save as death cause
death_exact_index_cause = (death['Exact date of death'].str.contains('^[A-Za-z].*[0-9]$',na=False))
death_exact_cause = death['Exact date of death'][death_exact_index_cause]
death['Date Death'][death_cause_index_date] = death_cause_date
death['Exact date of death'][death_to_date_index_exact] = death_to_date_exact
death['Death Cause'][death_exact_index_cause] = death_exact_cause
#Convert the date in death cause into empty
death['Death Cause'][~death['Death Cause'].str.contains('^[A-Za-z].*[0-9]$',na=True)] = np.nan
There's a few different ways you can do this. The simplest way to me, which doesn't require creating a new vector three times, is to create a named list in R, then unnest it wider, as follows:
library(tidyverse)
df |>
mutate(sorted = pmap(list(Date.Death, Exact.date.of.death, Death.Cause), \(...) {out <- list()
for (x in c(...)) { # for each row of the three columns above
if (x %in% c("Y", "N", NA)) { # we do this one first, otherwise the NAs throw things off
out$Exact.date.of.death <- x
} else if (grepl("/", x)) {
out$Date.Death <- x
} else {
out$Death.Cause <- x
}
}
out
}), .keep = "unused") |> # drop the columns we used (i.e. Date.Death, Exact.date.of.death, Death.Cause)
unnest_wider(sorted) # then unnest the named list column, making new versions of them!
Output:
# A tibble: 5 × 5
Key Death.indicator Exact.date.of.death Date.Death Death.Cause
<chr> <chr> <chr> <chr> <chr>
1 Alive NA NA NA NA
2 Death hos Y 25/9/2011 N00
3 Alive NA NA NA NA
4 Death hos Y 28/8/2015 J189
5 Death nonhos Y 12/6/2018 C20
Data:
df <- read.table(text="Key Death.indicator Date.Death Exact.date.of.death Death.Cause
00 Alive NA NA NA NA
02 Death hos Y 25/9/2011 N00
03 Alive NA NA NA NA
09 Death hos Y J189 28/8/2015
07 Death nonhos 12/6/2018 Y C20", header=TRUE)
Notes:
"Y"
and "N"
should be TRUE
and FALSE
. Also if the data is death data, then it would make sense to remove the Alive ones, they add nothing. Also, blanks should NA
.