Search code examples
stata

How to define variable types while loading dataset using column names rather than column index?


When I need to load a dataset where some specific columns must be read as string I type:

import delimited "data.csv", stringcols(13 15 16)

Where 13, 15 and 16 are the number of columns in my dataset that I want to read as string. However, I usually don't know the columns' numbers beforehand and it would be nice to be able to do the same thing using the columns' names.

I tried:

import delimited "data.csv", stringcols(var1 var2 var3)

But stringcols doesn't accept non-numeric arguments. Is there a way to specify columns types using the columns' names rather than numbers?

In python I can do this using:

df=pd.read_csv("data.csv", dtype={k:str for k in ["var1", 'var2', 'var3']})

I'm looking for a similar method in Stata.


Solution

  • I actually encounter this problem a lot. As Nick said, the current version of import delimited does not support variable names.

    A workaround is to import all columns as strings by specifying the stringcols(_all) option and then manually convert those variables of interest.

    For example,

    import delimited data.csv, stringcols(_all) clear
    destring var4 var5 var6, replace
    

    If you have a very long list of variables and want to destring all variables apart from say var1 var2 var3, you can use the ds command (thanks to Nick for the suggestion):

    import delimited data.csv, stringcols(_all) clear
    ds var1 var2 var3, not //specify a list of variables to be excluded 
    destring `r(varlist)', replace 
    

    If you are working with a very large number of unstructured Excel files, you may find the excelclean command useful.