Search code examples
regexnotepad++

Restructure CSV data with Notepad++, Regex


I have a CSV file with following headers and (sample) data:

StopName,RouteName,Travel_Direction,Latitude,Longitude
StreetA @ StreetB,1 NameA,DirectionA,Lat,Long
StreetC @ StreetD,1 NameA,DirectionA,Lat,Long
...
StreetE @ StreetF,1 NameA,DirectionB,Lat,Long
StreetG @ StreetH,1 NameA,DirectionB,Lat,Long
...
StreetI @ StreetJ,2 NameB,DirectionC,Lat,Long
StreetK @ StreetL,2 NameB,DirectionC,Lat,Long
...
StreetM @ StreetN,2 NameB,DirectionD,Lat,Long
StreetO @ StreetP,2 NameB,DirectionD,Lat,Long
.
.
.

I am wanting to use regex (currently in Notepad++) to get the following results:

1 NameA - DirectionA=[[StreetA @ StreetB,[Lat,Long]], [StreetC @ StreetD,[Lat,Long]], ...]
1 NameA - DirectionB=[[StreetD @ StreetE,[Lat,Long]], [StreetF @ StreetG,[Lat,Long]], ...]
2 NameB - DirectionC=[[StreetH @ StreetI,[Lat,Long]], [StreetJ @ StreetK,[Lat,Long]], ...]
2 NameB - DirectionD=[[StreetL @ StreetM,[Lat,Long]], [StreetN @ StreetO,[Lat,Long]], ...]
.
.
.

With the Regex and Substitution,

RgX: ^([^,]*),([^,]*),([^,]*),(.*)
Sub: $2 - $3=[$1,[\4]]

Demo: https://regex101.com/r/gS9hD6/1

I have gotten this far:

1 NameA - DirectionA=[StreetA @ StreetB,[Lat,Long]]
1 NameA - DirectionA=[StreetC @ StreetD,[Lat,Long]]
1 NameA - DirectionB=[StreetE @ StreetF,[Lat,Long]]
1 NameA - DirectionB=[StreetG @ StreetH,[Lat,Long]]
2 NameB - DirectionC=[StreetI @ StreetJ,[Lat,Long]]
2 NameB - DirectionC=[StreetK @ StreetL,[Lat,Long]]
2 NameB - DirectionD=[StreetM @ StreetN,[Lat,Long]]
2 NameB - DirectionD=[StreetO @ StreetP,[Lat,Long]]

In a new regex, I tried splitting the above result on "=", but didn't know where to go from there.

I think one way to get the desired results would be to keep first unique instance of what's before "=", replace new line with "," and enclose it with a [..] to make it an array form.

Edit: There are about 10k stops (total), but only about 100 unique routes.

Edit 2: (maybe I am asking for too many changes now)

For first regex:

  • What if I want to use "\n" instead of "="?

At beginning of 2nd regex replacement,

  • What if I have only RouteName and StopName columns, like this: 1 NameA - DirectionA=[StreetA @ StreetB, ...]?
  • Similarly, what if I only have RouteName and Coordinates, like this: 1 NameA - DirectionA=[[Lat,Long]]?

Solution

  • Steps

    1. First replacement:

    • Find what: ^([^,]*),([^,]*),([^,]*),(.*)
    • Replace with: \2 - \3=[[\1,[\4]]]
    • Replace All

    2. Second replacement:

    • Find what: ^[\S\s]*?^([^][]*=)\[\[.*\]\]\K\]\R\1\[(.*)\]$
    • Replace with: , \2]
    • Replace All

    3. Repeat step 2 until there are no more occurences.

    • This means that if there are 100 instances (Stops) for the same key (Route - Direction pair), you will have to click Replace All 7 times (ceiling(log2(N))).

    Description

    I modified your regex in step 1 to add an extra pair of brackets that will enclose the whole set.

    For step 2, it finds a pair of lines for the same Direction, appending the last to the previous one.

    ^[\S\s]*?^([^][]*=)     #Group 1: captures "1 NameA - DirA="
    \[\[.*\]\]              #matches the set of Stops - "[[StA @ StB,[Lat,Long]], ..."
    \K                      #keeps the text matched so far out of the match
    \]\R                    #closing "]" and newline
    \1                      #match next line (if the same route)
    \[(.*)\]$               #and capture the Stop (Group 2)
    

    regex101 Demo for step 1
    regex101 Demo for step 2