ruby csv parsing string-parsing

Parse pipe-delimited data in Ruby


I am trying to parse a pipe-delimited file. Something like

parsed_string = "field1|field2".split("|") -> ["field1", "field2"]

is easy. But how could I parse something where the pipes surround each field, and some fields contain pipes themselves, like

"|field1|field2_has_a_|_in_it|field3|field4 is really ||| happy|" -> ["field1", "field2_has_a_|_in_it", "field3", "field4 is really ||| happy"]?


Solution

  • With only one sample string, it's hard to know whether there are edge cases where this won't work. For the sample you gave us, though, you can use a positive lookbehind and a positive lookahead to split only on pipes that are preceded by an alphanumeric character (or the start of the string) and followed by an alphanumeric character (or the end of the string):

    string = "|field1|field2_has_a_|_in_it|field3|field4 is really ||| happy|"
    string.split(/(?<=\p{Alnum}|\A)\|(?=\p{Alnum}|\z)/).reject(&:empty?)
    # => ["field1", "field2_has_a_|_in_it", "field3", "field4 is really ||| happy"]
    

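    A quick rundown of that regex: (?<=\p{Alnum}|\A) is a positive lookbehind that checks whether the previous character is alphanumeric or the pipe sits at the start of the string. \| matches a single literal pipe character. (?=\p{Alnum}|\z) is a positive lookahead that checks whether the next character is alphanumeric or the pipe sits at the end of the string. Note that \p{Alnum} does not match an underscore, which is why the pipe inside field2_has_a_|_in_it is left alone. The leading pipe produces an empty first element, which reject(&:empty?) removes.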

    This only works if every pipe you want to split on has an alphanumeric character (or a string boundary) on both sides, and every pipe you want to keep has a non-alphanumeric character on at least one side. If, for example, a run of three pipes sometimes needs to be split and sometimes doesn't, things get a lot more complicated very quickly.
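
As a rough sketch of the approach and its caveats, the same regex can be written in free-spacing (/x) mode with a comment on each part and wrapped in a small helper. The names FIELD_PIPE and split_pipe_fields are made up for this illustration, and the two extra input strings are invented to show where the assumption breaks down; they are not from the question.

```ruby
# Same pattern as above, in free-spacing mode so each part can be annotated.
# FIELD_PIPE and split_pipe_fields are hypothetical names for this sketch.
FIELD_PIPE = /
  (?<=\p{Alnum}|\A)  # lookbehind: previous char is alphanumeric, or start of string
  \|                 # the literal pipe we split on
  (?=\p{Alnum}|\z)   # lookahead: next char is alphanumeric, or end of string
/x

def split_pipe_fields(line)
  # The leading pipe yields an empty first element; drop it.
  line.split(FIELD_PIPE).reject(&:empty?)
end

sample = "|field1|field2_has_a_|_in_it|field3|field4 is really ||| happy|"
p split_pipe_fields(sample)
# => ["field1", "field2_has_a_|_in_it", "field3", "field4 is really ||| happy"]

# Invented input: a pipe between two alphanumeric characters is always
# treated as a delimiter, even if it was meant to be part of a field.
p split_pipe_fields("|id42|x|y|")
# => ["id42", "x", "y"]

# Invented input: a field ending in punctuation hides the delimiter after it,
# so two fields are returned as one.
p split_pipe_fields("|a!|b|")
# => ["a!|b"]
```

If inputs like the last two can occur in the real file, the format is ambiguous as it stands and the producer of the file would probably need to escape or quote embedded pipes.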