My data has over 50,000+ observations. The problem is line feeder LF
is inserted all over the place causing import into statistical software like STATA a nightmare. I have tried many different options in STATA and finally given up on STATA. Now, after spending half a day in Notepad++, I found out that manually deleting "LF" wherever the data is pushed into a new row fixes the problem.
I tried using 'replace all' 'LF' with nothing(empty cell) but it results in all data clumping in one single line (softwares(Excel, STATA) interprets this as a lengthy variables list).
I am hoping someone might have encountered issues like this and have found a solution. If anyone could share those solutions it would be awesome!
Messy data in notepad++ displays as:
"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger",.LF
"pizza"LF
"pasta"LF
"B2qwX"|"F"|"salad".LF
"keto diet",LF
""LF
I want a clean data as:
"id"|"sex"|"choice"LF
"1aef2"|"M"|"burger","pizza""pasta"LF
"B2qwX"|"F"|"salad""keto diet",""LF
Please help!
This will remove linefeed when the next line doesn't contain a pipe character.
\n(?!.*\|)
LEAVE EMPTY
. matches newline
Explanation:
\n # linefeed
(?! # negative lookahead, make we haven't, after:
.* # 0 or more any character but newline
\| # a pipe character
) # end lookahead
Screenshot (before):
Screenshot (after):