Search code examples
regexgoogle-sheetsfiltergoogle-sheets-formulaarray-formulas

How to make Google Sheets search a range and return -all- rows that are "partial duplicates"


I'm trying to return "duplicates" from a range. In this case a duplicate is when there exists more than one row that has the same data in the first and last column (the data in the middle columns needs to be returned, but is irrelevant in terms of having useful data for the search to be performed on).

For a small example data set and desired output see this sheet.

My current incomplete solution path is as follows:

I use

=QUERY({SourceData!A2:E,ARRAYFORMULA(IF(LEN(SourceData!A2:A),COUNTIFS(SourceData!A2:A&SourceData!E2:E,SourceData!A2:A&SourceData!E2:E,ROW(SourceData!A2:A),"<="&ROW(SourceData!A2:A)),))},"select Col1, Col2, Col3, Col4, Col5 where Col6 > 1")

where the ARRAYFORMULA appends a rolling count column to the end of the range and then QUERY the rows of the original range where the rolling count is above 1.

However, this only gives me the subsequent rows and not the first of the duplicates. (In the example it only gives me the second row of the matching pair and not the first.)

I'm tempted to limit the QUERY output to just column 1 and then wrap that output in a JOIN to make the output conditions of another QUERY. But given the size of the actual data set and the sheer number of IMPORTRANGEs and QUERYs I've already got going I'm starting to worry about efficiency. (I've got 12 Google Sheet documents all importing from a 13th Google Sheet document then the 13th document pulls and combines data from the 12 other sheets and spits subsets of the combined data set back to each of the 12 other documents.) The whole thing won't be usable if a user has to wait multiple minutes while all the functions resolve. Plus I'm sure someone out there has a more elegant way of getting this done that would be helpfully enlightening to an amateur such as me.

Advice is appreciated! Thank you for your time.


Solution

  • try:

    ={SourceData!A1:E1;
     ARRAYFORMULA(FILTER(SourceData!A2:E, REGEXMATCH(SourceData!A2:A&SourceData!E2:E, 
     TEXTJOIN("|", 1, FILTER(SourceData!A2:A&SourceData!E2:E, 
     COUNTIFS(SourceData!A2:A&SourceData!E2:E, SourceData!A2:A&SourceData!E2:E, 
     ROW(SourceData!A2:A), "<="&ROW(SourceData!A2:A))>=2)))))}
    

    0