Search code examples
pythondataframeazure

AMBIGUOUS_REFERENCE error when trying to aggregate a dataframe in azure


Have a dataframe with the following columns year, month, loc_code, usg_type, id_code, usg

trying to aggregate in SQL it would have been

select year, month, loc_code, usg_type, count(distinct id_code) as id_count, sum(usg) as traffic
group by all

In Python I've tried the following:

    step4_df = (
        step3_df
            .groupBy("year", "month", "loc_code", "usg_type")
            .agg(
                countDistinct("id_code").alias("id_count"),
                sum("usg").alias("traffic")
            )
    )

But I'm getting a AnalysisException: [AMBIGUOUS_REFERENCE] Reference id_code is ambiguous error. Followed by a list of suggested tables that have nothing to do with I am trying to sum/count.

How do I go about resolving it? Thank you


Solution

  • The error message you mentioned,

    AnalysisException: [AMBIGUOUS_REFERENCE] Reference id_code

    Usually When performing a join operation on columns that have the same name in both DataFrames, it is necessary to disambiguate the join condition by specifying the column name as a sequence using Seq("join_column_name"). This helps to avoid any ambiguity and ensures that the join is performed correctly.

    is ambiguous, occurs when there is confusion about which column to reference, possibly due to similar column names in joined DataFrames or a mismatch in DataFrame references.

    I have tried the below approach:

    columns = ["year", "month", "loc_code", "usg_type", "id_code", "usg"]
    spark = SparkSession.builder.appName("example").getOrCreate()
    step3_df = spark.createDataFrame(data, schema=columns)
    step4_df = (
        step3_df
            .groupBy("year", "month", "loc_code", "usg_type")
            .agg(
                countDistinct("id_code").alias("id_count"),
                sum("usg").alias("traffic")
            )
    )
    step4_df.show()
    

    Results:

    +----+-----+--------+--------+--------+-------+
    |year|month|loc_code|usg_type|id_count|traffic|
    +----+-----+--------+--------+--------+-------+
    |2021|    2|    LOC2|   TypeA|       1|    250|
    |2021|    2|    LOC2|   TypeB|       1|    450|
    |2021|    1|    LOC1|   TypeA|       2|    250|
    |2021|    1|    LOC1|   TypeB|       1|    200|
    +----+-----+--------+--------+--------+-------+