Search code examples
importstatadelimitercsv

Importing text file with length delimiter


I have a text file, which contains only numbers.

For example:

2001 31110
199910 311

Its layout can be explained as follows:

1~4th numbers : Year
5~6th numbers : Month
7~8th numbers : Day
9th number : Sex
10th number : Married

However, I can't decide how to import this file into Stata.

For instance, if I use the command:

import delimited input.txt, delimiter(??)

What should I write in delimiter?

I don't necessarily need to use the above. I just want to import the data using whatever method.


Solution

  • The answer depends on what you want to do with the data later.

    My understanding is that the spaces indicate a single digit for date-related numbers and that in the text file, only month or day can be single digit but not both. In addition, sex and married are binary indicators taking values 0 and 1.

    Assuming the above are correct and the data below are included in a file data.txt:

    2001 31110
    199910 311
    1983 41201
    2012121500
    

    Here's one way to do it:

    clear 
    import delimited data.txt, delimiter(" ") stringcols(_all)  
    
    list
    
         +--------------------+
         |         v1      v2 |
         |--------------------|
      1. |       2001   31110 |
      2. |     199910     311 |
      3. |       1983   41201 |
      4. | 2012121500         |
         +--------------------+
    
    replace v2 = "0" + v2 if v2 != ""
    generate v3 = v1 + v2
    
    generate year = substr(v3, 1, 4)
    generate month = substr(v3, 5, 2)
    generate day = substr(v3, 7, 2)
    
    generate date = substr(v3, 1, 8)
    
    generate sex = substr(v3, 9, 1)
    generate married = substr(v3, 10, 1)
    
    list
         +----------------------------------------------------------------------------------+
         |         v1       v2           v3   year   month   day       date   sex   married |
         |----------------------------------------------------------------------------------|
      1. |       2001   031110   2001031110   2001      03    11   20010311     1         0 |
      2. |     199910     0311   1999100311   1999      10    03   19991003     1         1 |
      3. |       1983   041201   1983041201   1983      04    12   19830412     0         1 |
      4. | 2012121500            2012121500   2012      12    15   20121215     0         0 |
         +----------------------------------------------------------------------------------+
    

    You basically import everything in a maximum of two string variables, with a single space " " acting as a separator. The single-digit months or days are changed to two digits by adding a 0 at the front. Then, after you extract the relevant parts of the strings using the substr() function, you can simply convert the resulting variables to numeric as needed.

    For example:

    destring year month day sex married, replace
    
    generate date2 = daily(date, "YMD")
    format date2 %tdDD-NN-CCYY
    
    . list date2 
    
         +------------+
         |      date2 |
         |------------|
      1. | 11-03-2001 |
      2. | 03-10-1999 |
      3. | 12-04-1983 |
      4. | 15-12-2012 |
         +------------+
    

    If in your text file both month and day contain single digits, you follow the same logic as above but you will need to deal with a third variable as well after you import the data.