I would like to remove the text between commas and dashes in a long string of variable labels saved as a single comma-separated string. Here's a minimal example of my string:
myvarlabels <- ("participant number, How much do you like the following products-green tea, How much do you like the following products-beer,\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\"")
Importantly, the variable labels appear in two different forms and should be shortened in the following way:
I tried to use gsub and regular expressions to identify and then delete the text between the commas and the dash (i.e., replacing the text with "").
Does anyone have a suggestion as to how I could use gsub to remove the text between the commas that mark the start of a new column and the dashes that precede the text I want to keep, while preserving the double quotes?
EDIT 1
To be more precise, the data include three types of comma-separated chunks of text. They all specify what information the corresponding variables contain:
1. short descriptions including one or more words (e.g., participant number)
2. longer descriptions where the relevant information only appears after a dash (e.g., How much do you like the following products-green tea)
3. same as above but with commas present somewhere before the dash (e.g., How much, if anything at all, would you ...); this is why this type of chunk of text is preceded and followed by \" (otherwise they are not correctly read)
The four types of text sequences are all preceded and followed by commas and can appear in any order.
Here's a new minimal example that more accurately reflects the real data than my first example:
(myvarlabels3 <- ("participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email,telephone number"))
Cath's code (Edit 2) works up to a certain point. When I add more of the "simple" type 1 sequences of text at the beginning of the string or when I add a text sequence specified under 4. in the above list, the code doesn't work properly anymore.
However, when Cath's code from Edit 2 is run in two steps, then it works perfectly:
myvarlabels3 <- gsub("((?<=,\")[^-]*[^-]+-)|((?<=,\")[^-],*[^-]+-)", "", myvarlabels3, perl=TRUE) # step 1: shorten the text sequences specified under 3. and 4. in the list above
[1] "participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"Indian spices\",\"Japanese, Chinese, and Indian beer\",email,telephone number"
gsub("((?<=,)[^-\",]+-)", "", myvarlabels3, perl=TRUE) # step 2: shorten the text sequences specified as 2. in the above list
[1] "participant number,age,gender,body mass index,green tea,beer,outdoor temperature,season,\"Indian spices\",\"Japanese, Chinese, and Indian beer\",email,telephone number"
I think it would probably be possible to only use one line of code but I couldn't figure out how. Anyway, this will greatly facilitate my workflow when I import messy csv files from Qualtrics.
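Since the labels come from a CSV header, another route might be to split the string into fields first and strip the prefix from each field afterwards. The following is only a sketch using base R's scan() on a shortened, made-up variant of the example (labels_csv is a hypothetical name). It assumes that the first dash in a label always separates the question stem from the part to keep, and it returns a character vector without the surrounding double quotes rather than one string.

```r
# Shortened example string (assumption: same structure as the real header line)
labels_csv <- "participant number,age,How much do you like the following products-green tea,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email"

# Split on commas while treating double-quoted chunks as single fields
fields <- scan(text = labels_csv, what = character(), sep = ",",
               quote = "\"", quiet = TRUE)

# Drop everything up to and including the first dash;
# fields without a dash are left unchanged
short_labels <- sub("^[^-]*-", "", fields)
print(short_labels)
```

A label that legitimately contains a dash but has no question stem in front of it would be truncated by this, so it is worth checking the result against the original header.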
I'm not sure I understand what your desired output is, but you can try spotting the "start of a new column" based on "How much" and then go until you "meet" a dash:
gsub("(^[^,]+, )|(How much[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "green tea, beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""
EDIT
Considering your patterns, you can try the following:
gsub("((?<=, )[^-\"]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels, perl=TRUE)
[1] "participant number, green tea, beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""
I use two alternative patterns, one for each of the two forms you described, with lookbehinds to specify what must precede the match but should itself be kept.
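To illustrate what the lookbehind does, here is a tiny sketch on a made-up string (x is just an illustrative name, not part of your data). Without the lookbehind the comma is part of the match and gets deleted; with it, the comma is only required to be there and is kept.

```r
x <- "a,foo-bar"

# Comma is part of the match, so it is removed together with "foo-"
without_lookbehind <- gsub(",[^-,]+-", "", x)

# Comma is only asserted via (?<=,), so it stays in the result
# (lookbehinds require perl = TRUE)
with_lookbehind <- gsub("(?<=,)[^-,]+-", "", x, perl = TRUE)

print(without_lookbehind)  # "abar"
print(with_lookbehind)     # "a,bar"
```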
EDIT2
If there is no space between the comma and a question that doesn't begin with a quote, you can do:
myvarlabels_2 <- ("participant number,How much do you like the following products-green tea, How much do you like the following products-beer,\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian green tea\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\"")
gsub("((?<=,)[^-\",]+-)|((?<=,\")[^-]*,[^-]+-)", "", myvarlabels_2, perl=TRUE)
[1] "participant number,green tea,beer,\"Japanese, Chinese, and Indian green tea\",\"Japanese, Chinese, and Indian beer\""
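If quoted questions can also appear without a comma before the dash, the two alternatives can probably still be kept in a single call by letting the quoted branch match any text up to the first dash. This is only a sketch that I checked against the longer example string reproduced below; it assumes every field to shorten is preceded by a comma (so a question at the very start of the string would not be touched) and that the first dash in a field always ends the part to drop.

```r
myvarlabels3 <- "participant number,age,gender,body mass index,How much do you like the following products-green tea,How much do you like the following products-beer,outdoor temperature,season,\"How much experience do you have with the following products-Indian spices\",\"How much, if anything at all, would you be willing to pay for these products if they were ...-Japanese, Chinese, and Indian beer\",email,telephone number"

# Alternative 1: quoted field, i.e. right after ,"  -> drop text (commas allowed) up to the first dash
# Alternative 2: unquoted field, i.e. right after , -> drop comma-free, quote-free text up to the first dash
one_pass <- gsub("((?<=,\")[^-\"]+-)|((?<=,)[^-\",]+-)", "", myvarlabels3, perl = TRUE)
print(one_pass)
```

The quoted branch is listed first only for readability; the two branches cannot match at the same position because one requires a quote right before the match and the other excludes quotes from the matched text.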