Search code examples
importstatadelimitertxt

Stata: importing .txt with inconsistent delimiters


I have a .txt file with relatively odd delimiters. Data looks something like this:

|ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,|     |,|FOURTH QUARTER REPORT|,||
|ABC5|,|Name2, extraname|,|NameRaw2|,,|XY2|,266539.0,|pac  |,|MID-YEAR REPORT|,||
|ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ|

So there is the issue of having some variables without any pipes like the sixth variable here which is just an amount without pipes and some variables only come without pipes if they are empty like the fourth variable here which is either ,, or ,|y|,. Some of the variables also have a comma, so I can not use a comma as the delimiter. So basically there is two problems:

  1. Delimiters are commas but commas also show up inside string values
  2. Some variables are within pipes, some are not and some are only when they are not empty

I am looking for a way to solve this in Stata. Does anyone have an idea how to?


Solution

  • If the full dataset is messier than this example, I really don't want to know. But this seems to make some sense of it.

    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str100 whatever
    "|ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,|     |,|FOURTH QUARTER REPORT|,||"
    "|ABC5|,|Name2, extraname|,|NameRaw2|,,|XY2|,266539.0,|pac  |,|MID-YEAR REPORT|,||"
    "|ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ|"
    end
    
    gen work = whatever
    replace work = subinstr(work, ",,", ",||,", .)
    
    forval j = 1/5 {
        gen work`j' = substr(work, 1, strpos(work, "|,") + 1)
        replace work = subinstr(work, work`j', "", 1)
    }
    
    gen work6 = substr(work, 1, strpos(work, ","))
    replace work = subinstr(work, work6, "", 1)
    
    forval j = 7/8 {
        gen work`j' = substr(work, 1, strpos(work, "|,") + 1)
        replace work = subinstr(work, work`j', "", 1)
    }
    
    gen work9 = work  
    drop work 
    
    forval j = 1/9 { 
        replace work`j' = trim(subinstr(work`j', "|", "", .)) 
        replace work`j' = substr(work`j', 1, length(work`j') - 1) if substr(work`j', -1, 1) == ","
    }
    
    list 
    
        +-----------------------------------------------------------------------------------+
      1. |                                                                          whatever |
         |    |ABC4|,|Name1|,|NameRaw1|,|y|,|XY1|,10000.0,|     |,|FOURTH QUARTER REPORT|,|| |
         |-----------------------------------------------------------------------------------|
         | work1  |            work2  |    work3  |  work4  |  work5  |     work6  |  work7  |
         |  ABC4  |            Name1  | NameRaw1  |      y  |    XY1  |   10000.0  |         |
         |-----------------------------------------------------------------------------------|
         |                              work8              |              work9              |
         |              FOURTH QUARTER REPORT              |                                 |
         +-----------------------------------------------------------------------------------+
    
         +-----------------------------------------------------------------------------------+
      2. |                                                                          whatever |
         | |ABC5|,|Name2, extraname|,|NameRaw2|,,|XY2|,266539.0,|pac  |,|MID-YEAR REPORT|,|| |
         |-----------------------------------------------------------------------------------|
         | work1  |            work2  |    work3  |  work4  |  work5  |     work6  |  work7  |
         |  ABC5  | Name2, extraname  | NameRaw2  |         |    XY2  |  266539.0  |  pac    |
         |-----------------------------------------------------------------------------------|
         |                              work8              |              work9              |
         |                    MID-YEAR REPORT              |                                 |
         +-----------------------------------------------------------------------------------+
    
         +-----------------------------------------------------------------------------------+
      3. |                                                                          whatever |
         |      |ABC6|,|Name3|,|NameRaw3|,|y|,|X,Y3|,60000.0,|name |,|YEAR-END REPORT|,|XYZ| |
         |-----------------------------------------------------------------------------------|
         | work1  |            work2  |    work3  |  work4  |  work5  |     work6  |  work7  |
         |  ABC6  |            Name3  | NameRaw3  |      y  |   X,Y3  |   60000.0  |  name   |
         |-----------------------------------------------------------------------------------|
         |                              work8              |              work9              |
         |                    YEAR-END REPORT              |                XYZ              |
         +-----------------------------------------------------------------------------------+