Search code examples
google-sheetsduplicatesgoogle-sheets-formulamultiple-columns

How to query/filter duplicate rows with multiple criteria?


I'm trying to query/filter rows from a dataset structured like this:

Creator Title Barcode Inv. No.
springer Cellbio 014678 POL02P14x
springer Cellbio 026938 POL02P26r
springer Cellbio 038745
nature Cellular 026672 POL02P26h
elsevier Biomed 026678 POL02P26g
elsevier Biomed 026678 POL02P26g
spring Cellbit POL02P147
spring Cellbit 026938 POL02P26j
spring Cellbit 038745

I need to return all rows where the value/string in column B(title) is duplicate and when in those duplicate rows at least one string/value in column C(barcode) starts with 014 and at least one starts with 026. If the criteria is not met in column C the next check would be similar in column D (Inv. no.): at least one value string starts with POL02P14 and at least one starts with POL026.

So the basic logic would be something like this:

Select all rows where B is duplicate and ((at least one value in C starts with x and one with y) or ( at least one value in D starts with z and one with W)).

So the desired output should be like this:

Creator Title Barcode Inv. No.
springer Cellbio 014678 POL02P14x
springer Cellbio 026938 POL02P26r
springer Cellbio 038745
spring Cellbit POL02P147
spring Cellbit 026938 POL02P26j
spring Cellbit 038745

Here is a sample spreadsheet more similar to the actual dataset which is fairly large:

https://docs.google.com/spreadsheets/d/1xj5LnOxIwEmcjnXD0trmvcCKJIGIcfDkARV80Hx5Fvc/edit?usp=sharing

Tried adapting formulas with similar logic but always getting errors or unexpected results either the query logic/syntax is wrong or there is filter/array dimension mismatch.

Some examples(the column references are mixed up here because i was trying to reduce the number of columns) :

=FILTER(query(list!A1:AR, "Select * where C starts with 'POL02P'"), list!B1:B<>"",COUNTIF(list!B1:B,list!B1:B)>1)
={results!A1:AR1;array_constrain(

query(
{Filter({results!A2:AR,results!AR2:AR},REGEXMATCH(results!D2:D, "^POL02P14|POL02P26"));
countif(index(Filter({results!A2:AR,results!AR2:AR},REGEXMATCH(results!D2:D, "^POL02P14|POL02P26")),0,45),
index(Filter({results!A2:AR,results!AR2:AR},REGEXMATCH(results!D2:D, "^POL02P14|POL02P26")),0,45))}
,"Select * where Col46>1")
,9^9,44)}
=query(FILTER({list!A2:A&list!J2:J,list!A2:J,
   iferror(
   vlookup(list!A2:A&list!J2:J,query(query(filter(list!A2:A&
   list!J2:J,REGEXMATCH(list!C2:C, "^POL02P14|POL02P26")),
   "select Col4, count(Col4) where Col4 <> '' group by Col4"),
   "select Col4 where Col2 >1 "),1,false))},REGEXMATCH(list!C2:C, "^POL02P14|POL02P26")),
   "select Col1, Col2, Col3, Col5, Col6, Col7, Col8, Col9, Col10, Col11 where Col12 <> ''
   order by Col3 asc, Col11 asc")

Solution

  • Please try this out in your sample sheet:

    ={results!A1:AR1;FILTER(results!A2:AR,REGEXMATCH(results!B2:B,JOIN("|","^"&LAMBDA(z,LAMBDA(x,y,z,{filter(filter(x,y="014"),xmatch(filter(x,y="014"),filter(x,y="026")));filter(filter(x,z="POL02P14"),xmatch(filter(x,z="POL02P14"),filter(x,z="POL02P26")))})(INDEX(z,,1),INDEX(z,,2),INDEX(z,,3)))((UNIQUE(FILTER({results!B2:B,LEFT(results!C2:C,3),LEFT(results!D2:D,8)},results!B2:B<>"",results!D2:D<>""))))&"$")))}
    

    formula logic at a glance:

    • filter Col_B (Title) in 4 ways (matches to 014, 026, POL02P14, POL02P26)
    • capture the Col_B which has both 014 and 026
    • capture the Col_B which has both POL02P14 and POL02P26
    • Shortlist the Col_B which is TRUE for either step 2 OR step 3 above
    • Once the list is finalised join them all for regexmatch with Col_B for the final output.

    enter image description here