
Java Spark: VectorAssembler doesn't accept String and null values


I have a big project with Spark using Java. I read a CSV file with more than 1,000,000 rows, and one column is a String.

When I try to use a VectorAssembler to feed an ML algorithm, I get an error because the column "Moon" is a String.

So I'm trying to convert this String to an Integer with this:

Dataset<Row> moons = typedMoons.withColumn("Moon", typedMoons.col("Moon").cast("Integer"));

But when I do this, I get null values in that column.

So I'm trying to use na.fill() in Java:

        Dataset<Row> typedMoonsfinal = typedMoons.na().fill("Moon", typedMoons.col("Moon"));

But I'm not using fill() correctly.

Any recommendations for solving this problem, or other approaches?

Thanks so much, and regards.


Solution

  • You can't just cast a string to an int unless it is the string representation of a number, like "1234". "Moon" is not a number, which is why your cast produces null.

    What you need to do is use a StringIndexer to map your string labels to numbers. If you pass your string column through a StringIndexer, it creates a new numeric column (the indices are stored as doubles) in which every occurrence of the same string gets the same index. So all the rows with the "Moon" value will have, for example, the value 1 in the new column, and all the rows with the "Sun" value will have, for example, the value 2.

    You can use this new numeric column in your VectorAssembler.
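
    The approach described above could look roughly like the sketch below. It is only an illustration under assumptions: the output column name "MoonIndex", the second feature column "Radius", the class name, and the sample rows are all made up here, since the question does not show the real schema. Setting handleInvalid to "keep" is an optional choice that puts null or unseen values in an extra index bucket instead of failing (available in recent Spark versions); drop it if you prefer an error.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class MoonIndexerSketch {

    // Maps the string column "Moon" to a numeric index column, then assembles
    // the numeric columns into a single feature vector.
    // "MoonIndex", "Radius" and "features" are hypothetical column names.
    static Dataset<Row> indexAndAssemble(Dataset<Row> typedMoons) {
        StringIndexer indexer = new StringIndexer()
                .setInputCol("Moon")
                .setOutputCol("MoonIndex")
                // Assumption: keep nulls/unseen labels in an extra bucket.
                .setHandleInvalid("keep");

        // fit() learns the label-to-index mapping; transform() adds the new column.
        Dataset<Row> indexed = indexer.fit(typedMoons).transform(typedMoons);

        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[] {"MoonIndex", "Radius"})
                .setOutputCol("features");
        return assembler.transform(indexed);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("MoonIndexerSketch")
                .master("local[*]")
                .getOrCreate();

        // Tiny made-up dataset standing in for the CSV from the question.
        StructType schema = new StructType()
                .add("Moon", DataTypes.StringType)
                .add("Radius", DataTypes.DoubleType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("Moon", 1.0),
                RowFactory.create("Sun", 2.0),
                RowFactory.create("Moon", 3.0));
        Dataset<Row> typedMoons = spark.createDataFrame(rows, schema);

        indexAndAssemble(typedMoons).show(false);
        spark.stop();
    }
}
```

    Note that, by default, StringIndexer assigns indices by label frequency, with the most frequent label getting index 0, so the actual numbers depend on your data.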