When I need to load a dataset where some specific columns must be read as string I type:
import delimited "data.csv", stringcols(13 15 16)
Where 13, 15 and 16 are the number of columns in my dataset that I want to read as string. However, I usually don't know the columns' numbers beforehand and it would be nice to be able to do the same thing using the columns' names.
I tried:
import delimited "data.csv", stringcols(var1 var2 var3)
But stringcols
doesn't accept non-numeric arguments. Is there a way to specify columns types using the columns' names rather than numbers?
In python I can do this using:
df=pd.read_csv("data.csv", dtype={k:str for k in ["var1", 'var2', 'var3']})
I'm looking for a similar method in Stata.
I actually encounter this problem a lot. As Nick said, the current version of import delimited
does not support variable names.
A workaround is to import all columns as strings by specifying the stringcols(_all)
option and then manually convert those variables of interest.
For example,
import delimited data.csv, stringcols(_all) clear
destring var4 var5 var6, replace
If you have a very long list of variables and want to destring
all variables apart from say var1 var2 var3
, you can use the ds
command (thanks to Nick for the suggestion):
import delimited data.csv, stringcols(_all) clear
ds var1 var2 var3, not //specify a list of variables to be excluded
destring `r(varlist)', replace
If you are working with a very large number of unstructured Excel files, you may find the excelclean command useful.