
Can I use a CSV in Spark MLlib?


I'm new to Spark's MLlib Python API. I have my data in CSV format like so:

Label   0   1   2   3   4   5   6   7   8   9   ... 758 759 760 761 762 763 764 765 766 767
0   -0.168307   -0.277797   -0.248202   -0.069546   0.176131    -0.152401   0.12664 -0.401460   0.125926    0.279061    ... -0.289871   0.207264    -0.140448   -0.426980   -0.328994   0.328007    0.486793    0.222587    0.650064    -0.513640
3   -0.313138   -0.045043   0.279587    -0.402598   -0.165238   -0.464669   0.09019 0.008703    0.074541    0.142638    ... -0.094025   0.036567    -0.059926   -0.492336   -0.006370   0.108954    0.350182    -0.144818   0.306949    -0.216190
2   -0.379293   -0.340999   0.319142    0.024552    0.142129    0.042989    -0.60938    0.052103    -0.293400   0.162741    ... 0.108854    -0.025618   0.149078    -0.917385   0.110629    0.146427

Can I use this as-is by loading it with df = spark.read.format("csv").option("header", "true").load("file.csv")? I'm trying to train a Random Forest model. I've searched, but couldn't find much on the topic. I don't want to try it without being sure it will work, because the cluster I use has long queue times.


Solution

  • Yes! You'll want to infer the schema too; otherwise every column is read as a string.

    df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("file.csv")
    

    If you have many files with the same column names and data types, save the schema to reuse. Supplying it also skips the extra pass over the data that schema inference needs.

    schema = df.schema
    

    The next time you read a CSV file with the same columns, pass the saved schema to the reader with .schema() (it is a reader method, not an option):

    df = spark.read.format("csv").option("header", "true").schema(schema).load("file.csv")