Search code examples
notepad++

Line feeder in unwanted places causing problem in reading data


My data has over 50,000+ observations. The problem is line feeder LF is inserted all over the place causing import into statistical software like STATA a nightmare. I have tried many different options in STATA and finally given up on STATA. Now, after spending half a day in Notepad++, I found out that manually deleting "LF" wherever the data is pushed into a new row fixes the problem.

I tried using 'replace all' 'LF' with nothing(empty cell) but it results in all data clumping in one single line (softwares(Excel, STATA) interprets this as a lengthy variables list).

I am hoping someone might have encountered issues like this and have found a solution. If anyone could share those solutions it would be awesome!

Messy data in notepad++ displays as:

"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger",.LF
"pizza"LF
"pasta"LF
"B2qwX"|"F"|"salad".LF
"keto diet",LF
""LF

I want a clean data as:

"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger","pizza""pasta"LF
"B2qwX"|"F"|"salad""keto diet",""LF

Please help!


Solution

  • This will remove linefeed when the next line doesn't contain a pipe character.

    • Ctrl+H
    • Find what: \n(?!.*\|)
    • Replace with: LEAVE EMPTY
    • CHECK Wrap around
    • CHECK Regular expression
    • UNCHECK . matches newline
    • Replace all

    Explanation:

    \n          # linefeed
    (?!         # negative lookahead, make we haven't, after:
        .*          # 0 or more any character but newline
        \|          # a pipe character
    )           # end lookahead
    

    Screenshot (before):

    enter image description here

    Screenshot (after):

    enter image description here