I have a sheet containing the following data: URL A, URL B, similar score in percent.
If URL A is 98% similar to URL B, it means that URL B is 98% similar to URL A, and listed as well.
I want to find and eliminate these duplicates/reversed entries. For now, I have tried having two extra columns concatenating URL A+URL B in one, and URL B+URL A in one. This way I have unique identifiers.
After this I'm kinda stuck, because I'm dealing with a lot of variables, as data is in two different rows, and two different columns. I might be looking into a script, taking the A+B value, iterating through the B+A value until it finds a match, and somehow marks this (or simply just deletes it), since my knowledge of formulas for highlighting these duplicates are falling short.
This sheet shows the concept - the first 100 rows (it's about 11K in total): https://docs.google.com/spreadsheets/d/1YKsguAn1lYjV4FlP_6_TlKGvFcpFAEzn7bpAyOEmozQ/edit?usp=sharing
Any suggestions for what I should look into?
I ended up with a solution where I sorted by URL A and implemented this formula:
=IF(A2<B2,A2&B2,B2&A2)
This way I had the concatenation the same way for the real one and the opposite. I didn't know you could use "<" on strings.
After this, I could delete duplicated values in the column with the formula above.