Tags: python, pyspark, conda, azure-machine-learning-service

How to install and use mmlspark on a local machine with Conda Python?


How to install and use MMLSpark on a local machine with Intel Python 3.6?

import numpy as np
import pandas as pd
import pyspark
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
            .config("spark.jars.packages", "Azure:mmlspark:0.13") \
            .getOrCreate()

import mmlspark
from mmlspark import TrainClassifier
from pyspark.ml.classification import LogisticRegression
from mmlspark import ComputeModelStatistics, TrainedClassifierModel


# Download the Adult Census Income dataset if it is not already present.
import os
import urllib.request  # "import urllib" alone does not load the request submodule

dataFilePath = "AdultCensusIncome.csv"
if not os.path.isfile(dataFilePath):
    urllib.request.urlretrieve("https://mmlspark.azureedge.net/datasets/" + dataFilePath, dataFilePath)

# The leading spaces in the column names are intentional; they match the CSV header.
data = spark.createDataFrame(pd.read_csv(dataFilePath, dtype={" hours-per-week": np.float64}))
data = data.select([" education", " marital-status", " hours-per-week", " income"])
train, test = data.randomSplit([0.75, 0.25], seed=123)
train.limit(10).toPandas()

# Train a logistic regression classifier, then compute metrics on the test split.
model = TrainClassifier(model=LogisticRegression(), labelCol=" income", numFeatures=256).fit(train)
prediction = model.transform(test)
metrics = ComputeModelStatistics().transform(prediction)
metrics.limit(10).toPandas()

MMLSpark does not work. Could someone help with this issue?


Solution

  • Your question doesn't describe the problem clearly, but if you are looking for installation commands, see below.

    Install pyspark first:

    pip install pyspark
    

    To install MMLSpark on an existing HDInsight Spark Cluster, you can execute a script action on the cluster head and worker nodes. For instructions on running script actions, see this guide.

    The script action url is: https://mmlspark.azureedge.net/buildartifacts/0.13/install-mmlspark.sh.

    If you're using the Azure Portal to run the script action, go to Script actions → Submit new in the Overview section of your cluster blade. In the Bash script URI field, enter the script action URL given above. Set the remaining options as shown in the screenshot in the original docs.

    Submit, and the cluster should finish configuring within 10 minutes or so.

    Source (original docs): https://github.com/Azure/mmlspark
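For the local-machine case in the question, the steps above (install pyspark, then let Spark fetch the MMLSpark package) can be sketched roughly as follows. This is a minimal sketch, not a verified fix: the helper names `is_importable` and `start_local_session` are hypothetical, the `local[*]` master is an assumption for a single-machine setup, and the package coordinate is simply the one used in the question. It also assumes Java is installed and the machine can reach the package repository.

```python
import importlib.util

# Package coordinate copied from the question; other versions are not covered here.
MMLSPARK_COORDINATE = "Azure:mmlspark:0.13"


def is_importable(module_name):
    """Return True if a top-level module can be found on the current sys.path."""
    return importlib.util.find_spec(module_name) is not None


def start_local_session(app_name="MyApp", package=MMLSPARK_COORDINATE):
    """Start a local Spark session that asks Spark to fetch the MMLSpark package.

    pyspark is imported lazily so this file can be loaded even when pyspark
    is missing, which makes the failure message below reachable.
    """
    if not is_importable("pyspark"):
        raise RuntimeError("pyspark is not installed; run `pip install pyspark` first")
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # assumption: run Spark on this machine only
        .appName(app_name)
        .config("spark.jars.packages", package)
        .getOrCreate()
    )
```

A possible way to use it: call `spark = start_local_session()` and only then run `import mmlspark`, in the same order as the question's code. As far as I understand, the `mmlspark` Python module is shipped inside the downloaded package, so it is expected to be importable only after the session has started. If `is_importable("mmlspark")` is still false at that point, the package download is the likely place to look.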