I have a DataFrame with the following columns: year, month, loc_code, usg_type, id_code, usg.
If I were aggregating in SQL, it would have been:
select year, month, loc_code, usg_type, count(distinct id_code) as id_count, sum(usg) as traffic
group by all
In Python I've tried the following:
step4_df = (
step3_df
.groupBy("year", "month", "loc_code", "usg_type")
.agg(
countDistinct("id_code").alias("id_count"),
sum("usg").alias("traffic")
)
)
But I'm getting an AnalysisException: [AMBIGUOUS_REFERENCE] Reference id_code
is ambiguous error, followed by a list of suggested columns that have nothing to do with what I am trying to sum/count.
How do I go about resolving it? Thank you
The error message you mentioned,
AnalysisException: [AMBIGUOUS_REFERENCE] Reference id_code is ambiguous
occurs when Spark cannot tell which column you are referring to, typically because identically named columns exist in joined DataFrames, or because of a mismatch in DataFrame references.
When you join on columns that have the same name in both DataFrames, you can disambiguate by passing the join column name(s) as a list of strings, e.g. on=["join_column_name"] in PySpark (the equivalent of Scala's Seq("join_column_name")). Spark then keeps a single copy of the join column instead of two, so later references to it are no longer ambiguous.
I have tried the below approach, with sample data chosen to reproduce the output shown:
from pyspark.sql import SparkSession
from pyspark.sql.functions import countDistinct, sum

# sample data, assumed for illustration
data = [
    ("2021", "1", "LOC1", "TypeA", "ID1", 100),
    ("2021", "1", "LOC1", "TypeA", "ID2", 150),
    ("2021", "1", "LOC1", "TypeB", "ID1", 200),
    ("2021", "2", "LOC2", "TypeA", "ID3", 250),
    ("2021", "2", "LOC2", "TypeB", "ID3", 450),
]
columns = ["year", "month", "loc_code", "usg_type", "id_code", "usg"]

spark = SparkSession.builder.appName("example").getOrCreate()
step3_df = spark.createDataFrame(data, schema=columns)

step4_df = (
    step3_df
    .groupBy("year", "month", "loc_code", "usg_type")
    .agg(
        countDistinct("id_code").alias("id_count"),
        sum("usg").alias("traffic")
    )
)
step4_df.show()
Results:
+----+-----+--------+--------+--------+-------+
|year|month|loc_code|usg_type|id_count|traffic|
+----+-----+--------+--------+--------+-------+
|2021| 2| LOC2| TypeA| 1| 250|
|2021| 2| LOC2| TypeB| 1| 450|
|2021| 1| LOC1| TypeA| 2| 250|
|2021| 1| LOC1| TypeB| 1| 200|
+----+-----+--------+--------+--------+-------+