Tags: azure, azure-data-factory, kql, azure-data-explorer

How to use output from a source in where clause on DataFlow ADF


I have a dataflow whose source is a Kusto query. This source returns over 100M rows, which I partition using a Kusto hash function driven from a ForEach activity.

In my Kusto query, one of the where clauses filters on the SubscriptionId field using values from a blob in my ADLS Gen2 account.

But instead of hardcoding that list in my code, I would like to get those subscriptions from another Kusto query as a second source (image below), without using the Lookup or Join transformations. I need to filter the data directly in the source rather than projecting everything and then filtering; otherwise it might break, use too much memory, etc.

So, is it possible to filter my main Kusto query in a where clause using the results of ANOTHER Kusto source (aka CPCDevices)?

This is my dataflow:

[screenshot: the dataflow]

This is my main Kusto query as example:

Table
| project column1, column2, column3
| where SubscriptionId in ({CPCDevices})
//| another things on my query here..

The CPCDevices in the where clause comes from the CPCDevices source.

UPDATE

After adding the second source, this is the updated dataflow with the QueryOutput derived columns, as suggested in the solution.

[screenshot: the updated dataflow]


Solution

  • You can try a cache sink here, though I haven't tested this with large data.

    This is my Table1 sample data:

    Id  Name    Age
    1   Rakesh  22
    2   Laddu   22
    3   Virat   35
    4   ABD     40
    5   MSD     42
    

    Table2 sample data:

    Table2Id    Marks
    1           94
    3           100
    5           98
    

    Here, I am projecting the columns Id, Name from Table1 with the condition that Id is in Table2Id.

    First add a cache sink to the Table2 source.

    [screenshot: cache sink added to the Table2 source]

    Now, use the cache sink's outputs array in the source query of Table1, like below.

    "Table1 | project Id, Name | where Id in (dynamic({map(sink1#outputs(),#item.Table2Id)}))"
    
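    With the Table2 sample data above, the interpolated query should resolve to something like the sketch below (assuming sink1 is the name of the cache sink, as in this example, so that map(sink1#outputs(),#item.Table2Id) yields the array of Table2Id values):

        Table1
        | project Id, Name
        | where Id in (dynamic([1, 3, 5]))

    The map() expression extracts only the Table2Id property from each cached row, so the dynamic() literal contains just the Id values to filter on.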

    [screenshot: Table1 source query expression]

    Make sure the write order is 1 for the cache sink, so it executes before the Table1 source query is evaluated.

    [screenshot: write order set to 1 on the cache sink]

    Result of Table1 source:

    [screenshot: Table1 source data preview]

    If the above approach does not work for large data, then a join might be the best option. Join the two tables in the source KQL script itself, and then project from the result.
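    A minimal sketch of that join-in-source alternative, using the sample tables above (assuming an inner join is the desired semantics, i.e. only rows whose Id appears in Table2):

        Table1
        | join kind=inner (Table2) on $left.Id == $right.Table2Id
        | project Id, Name

    Doing the join inside the Kusto query pushes the filtering down to the cluster, so the dataflow source only ever receives the already-filtered rows.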