python r python-3.x logistic-regression non-linear-regression

Automated way to get binary variables from a database

I have a dengue database with several variables, among them "Cases", which gives the number of dengue cases in a given period. I want to fit a logistic regression model to these data, so I need to turn this integer variable into a binary one: 0 for places that had no dengue cases in that period, and 1 for places that had at least one case. Since there are 35,628 rows, I want to do this in an automated way rather than by hand. Does anyone have an idea of how to do this? I'm new to programming and I'm trying to implement it in R. If you know of a package that does this, that would help a lot. Each neighborhood is identified by a number.

I appreciate any help and thank you very much.

neighborhood	Dates	Cases	precipitation	Temperature
0	Jan/14	10	149,6	33,25
1	Fev/14	0	254	30,1
2	Mar/14	6	150	25,4
3	Apr/14	0	244,1	32,5
4	May/14	3	44,3	33,2

Solution

R

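Pick whichever of the two you prefer: the first gives a logical (TRUE/FALSE) column, and the second uses unary + to coerce that logical to integer 0/1.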

dat$CasesBin1 <- (dat$Cases > 0)
dat$CasesBin2 <- +(dat$Cases > 0)
dat
#   neighborhood  Dates Cases precipitation Temperature CasesBin1 CasesBin2
# 1            0 Jan/14    10         149.6       33.25      TRUE         1
# 2            1 Fev/14     0         254.0       30.10     FALSE         0
# 3            2 Mar/14     6         150.0       25.40      TRUE         1
# 4            3 Apr/14     0         244.1       32.50     FALSE         0
# 5            4 May/14     3          44.3       33.20      TRUE         1

In R at least, most logistic regression tools I've used work fine with either integer (0/1) or logical, but you may need to verify with the tools you will use.

Data:

dat <- structure(list(neighborhood = 0:4, Dates = c("Jan/14", "Fev/14", "Mar/14", "Apr/14", "May/14"), Cases = c(10L, 0L, 6L, 0L, 3L), precipitation = c(149.6, 254, 150, 244.1, 44.3), Temperature = c(33.25, 30.1, 25.4, 32.5, 33.2)), class = "data.frame", row.names = c(NA, -5L))

Python

In [13]: dat
Out[13]: 
   neighborhood   Dates  Cases  precipitation  Temperature
0             0  Jan/14     10          149.6        33.25
1             1  Fev/14      0          254.0        30.10
2             2  Mar/14      6          150.0        25.40
3             3  Apr/14      0          244.1        32.50
4             4  May/14      3           44.3        33.20

In [17]: dat['CasesBin1'] = dat['Cases'].apply(lambda x: (x > 0))
In [18]: dat['CasesBin2'] = dat['Cases'].apply(lambda x: int(x > 0))
In [19]: dat
Out[19]: 
   neighborhood   Dates  Cases  ...  Temperature  CasesBin1  CasesBin2
0             0  Jan/14     10  ...        33.25       True          1
1             1  Fev/14      0  ...        30.10      False          0
2             2  Mar/14      6  ...        25.40       True          1
3             3  Apr/14      0  ...        32.50      False          0
4             4  May/14      3  ...        33.20       True          1

[5 rows x 7 columns]
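
The same two columns can also be built without `apply`, by comparing the whole column at once. A minimal sketch, rebuilding the five example rows inline so that it runs on its own (on the real data, only the last two assignments would be needed):

```python
import pandas as pd

# Same five rows as the example data used in this answer.
dat = pd.DataFrame({
    "neighborhood": [0, 1, 2, 3, 4],
    "Dates": ["Jan/14", "Fev/14", "Mar/14", "Apr/14", "May/14"],
    "Cases": [10, 0, 6, 0, 3],
    "precipitation": [149.6, 254.0, 150.0, 244.1, 44.3],
    "Temperature": [33.25, 30.1, 25.4, 32.5, 33.2],
})

# Comparing the whole column at once yields a boolean Series.
dat["CasesBin1"] = dat["Cases"] > 0
# astype(int) converts True/False to 1/0.
dat["CasesBin2"] = (dat["Cases"] > 0).astype(int)

print(dat[["Cases", "CasesBin1", "CasesBin2"]])
```

With tens of thousands of rows this form is usually faster than a row-by-row lambda, and the result is the same.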

Data:

In [11]: js
Out[11]: '[{"neighborhood":0,"Dates":"Jan/14","Cases":10,"precipitation":149.6,"Temperature":33.25},{"neighborhood":1,"Dates":"Fev/14","Cases":0,"precipitation":254,"Temperature":30.1},{"neighborhood":2,"Dates":"Mar/14","Cases":6,"precipitation":150,"Temperature":25.4},{"neighborhood":3,"Dates":"Apr/14","Cases":0,"precipitation":244.1,"Temperature":32.5},{"neighborhood":4,"Dates":"May/14","Cases":3,"precipitation":44.3,"Temperature":33.2}]'
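In [12]: dat = pd.read_json(io.StringIO(js))  # needs `import io` and `import pandas as pd`; recent pandas versions deprecate passing a literal JSON string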
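
The table in the question is tab-separated and uses decimal commas (149,6 rather than 149.6). If the real data is exported as text in that same layout, which is an assumption on my part, it could be read and binarized roughly like this. The in-memory string stands in for the actual file, whose name and location are not given in the question:

```python
import io

import pandas as pd

# The question's sample rows: tab-separated, with decimal commas.
# This string is a stand-in for the real export file (assumed layout).
raw = (
    "neighborhood\tDates\tCases\tprecipitation\tTemperature\n"
    "0\tJan/14\t10\t149,6\t33,25\n"
    "1\tFev/14\t0\t254\t30,1\n"
    "2\tMar/14\t6\t150\t25,4\n"
    "3\tApr/14\t0\t244,1\t32,5\n"
    "4\tMay/14\t3\t44,3\t33,2\n"
)

# sep="\t" splits on tabs; decimal="," parses 149,6 as 149.6.
dat = pd.read_csv(io.StringIO(raw), sep="\t", decimal=",")

# 1 where at least one case was recorded, 0 otherwise.
dat["CasesBin"] = (dat["Cases"] > 0).astype(int)

print(dat)
```

For a file on disk, the `io.StringIO(raw)` argument would be replaced by the path to the file.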