Tags: python, r, python-3.x, logistic-regression, non-linear-regression

Automated way to get binary variables from a database


I am working with a database on dengue. One of its variables, "Cases", gives the number of dengue cases in a given period. I want to fit a logistic regression model to these data, so I need to turn this integer count into a binary variable: 0 for places that had no dengue cases in that period, and 1 for places that had at least one case. Since there are 35628 rows, I want to do this in an automated way rather than by hand. Does anyone have an idea how to put this into practice? I'm new to programming and I'm trying to implement it in R. If you know of a package that does this, that would help a lot. Each neighborhood is identified by a number.

I appreciate any help and thank you very much.

neighborhood Dates Cases precipitation Temperature
0 Jan/14 10 149,6 33,25
1 Fev/14 0 254 30,1
2 Mar/14 6 150 25,4
3 Apr/14 0 244,1 32,5
4 May/14 3 44,3 33,2



Solution

  • R

    Pick either of these (the first gives a logical column, the second an integer 0/1 column):

    dat$CasesBin1 <- (dat$Cases > 0)
    dat$CasesBin2 <- +(dat$Cases > 0)
    dat
    #   neighborhood  Dates Cases precipitation Temperature CasesBin1 CasesBin2
    # 1            0 Jan/14    10         149.6       33.25      TRUE         1
    # 2            1 Fev/14     0         254.0       30.10     FALSE         0
    # 3            2 Mar/14     6         150.0       25.40      TRUE         1
    # 4            3 Apr/14     0         244.1       32.50     FALSE         0
    # 5            4 May/14     3          44.3       33.20      TRUE         1
    

    In R at least, most logistic regression tools I've used work fine with either integer (0/1) or logical, but you may need to verify with the tools you will use.

    Data:

    dat <- structure(list(neighborhood = 0:4, Dates = c("Jan/14", "Fev/14", "Mar/14", "Apr/14", "May/14"), Cases = c(10L, 0L, 6L, 0L, 3L), precipitation = c(149.6, 254, 150, 244.1, 44.3), Temperature = c(33.25, 30.1, 25.4, 32.5, 33.2)), class = "data.frame", row.names = c(NA, -5L))
    

  • Python

    In [13]: dat
    Out[13]: 
       neighborhood   Dates  Cases  precipitation  Temperature
    0             0  Jan/14     10          149.6        33.25
    1             1  Fev/14      0          254.0        30.10
    2             2  Mar/14      6          150.0        25.40
    3             3  Apr/14      0          244.1        32.50
    4             4  May/14      3           44.3        33.20
    
    In [17]: dat['CasesBin1'] = dat['Cases'] > 0
    In [18]: dat['CasesBin2'] = (dat['Cases'] > 0).astype(int)
    In [19]: dat
    Out[19]: 
       neighborhood   Dates  Cases  ...  Temperature  CasesBin1  CasesBin2
    0             0  Jan/14     10  ...        33.25       True          1
    1             1  Fev/14      0  ...        30.10      False          0
    2             2  Mar/14      6  ...        25.40       True          1
    3             3  Apr/14      0  ...        32.50      False          0
    4             4  May/14      3  ...        33.20       True          1
    
    [5 rows x 7 columns]
    

    Data:

    In [11]: js
    Out[11]: '[{"neighborhood":0,"Dates":"Jan/14","Cases":10,"precipitation":149.6,"Temperature":33.25},{"neighborhood":1,"Dates":"Fev/14","Cases":0,"precipitation":254,"Temperature":30.1},{"neighborhood":2,"Dates":"Mar/14","Cases":6,"precipitation":150,"Temperature":25.4},{"neighborhood":3,"Dates":"Apr/14","Cases":0,"precipitation":244.1,"Temperature":32.5},{"neighborhood":4,"Dates":"May/14","Cases":3,"precipitation":44.3,"Temperature":33.2}]'
    In [12]: dat = pd.read_json(io.StringIO(js))  # requires `import io`; recent pandas versions deprecate passing a raw JSON string
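
    For anyone who wants to run the Python version as a script rather than in an interactive session, here is a minimal sketch. It assumes the real data is whitespace-separated text that uses decimal commas, as in the sample rows in the question; the `raw` string and the `CasesBin` column name are only for illustration, and the actual file may need different `read_csv` options.

```python
import io

import pandas as pd

# Sample rows from the question, kept with their decimal commas.
# (Stand-in for the real file, whose exact format is an assumption here.)
raw = """neighborhood Dates Cases precipitation Temperature
0 Jan/14 10 149,6 33,25
1 Fev/14 0 254 30,1
2 Mar/14 6 150 25,4
3 Apr/14 0 244,1 32,5
4 May/14 3 44,3 33,2
"""

# sep=r"\s+" splits on runs of whitespace; decimal="," reads "149,6" as 149.6.
dat = pd.read_csv(io.StringIO(raw), sep=r"\s+", decimal=",")

# Vectorized comparison over the whole column, then cast True/False to 1/0.
dat["CasesBin"] = (dat["Cases"] > 0).astype(int)

print(dat[["neighborhood", "Cases", "CasesBin"]])
```

    The comparison is applied to the whole column at once, so the same two lines should work unchanged on all 35628 rows.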