hive pyspark apache-spark-sql amazon-emr

Cannot have map type columns in DataFrame which calls set operations


: org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map

I have a Hive table with a column of type MAP<Float, Float>. I get the above error when I try to insert into this table from Spark. The insert works fine without the 'distinct'.

create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>) 
location 's3://mybucket/test_insert2';

insert into test_insert2 
select distinct 'a' as test_col, map(0,0) as map_col

Solution

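  • Spark SQL cannot compare map-type columns, so 'distinct' (like intersect and except) is rejected during analysis. One workaround is to convert the DataFrame to an RDD with .rdd and then call .distinct on it, which compares whole Row objects instead.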

    Example:

    spark.sql("""select 'a' as test_col, map(0,0) as map_col
                 union all
                 select 'a' as test_col, map(0,0) as map_col""").rdd.distinct.collect
    

    Result:

    Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])