Tags: cassandra, apache-spark, datastax, datastax-enterprise

Enable Spark on Same Node As Cassandra


I am trying to test out Spark so I can summarize some data I have in Cassandra. I've been through the DataStax tutorials, but they are vague about how you actually enable Spark. The only indication I can find is that it comes enabled automatically when you select an "Analytics" node during install. However, I have an existing Cassandra node, and I don't want to use a different machine for testing, since I am just evaluating everything on my laptop.

Is it possible to just enable Spark on the same node and deal with any performance implications? If so how can I enable it so that it can be tested?

I see the folders for Spark there (although I'm not positive all the files are present), but when I check whether a Spark master is set, it says that no Spark nodes are enabled:

dsetool sparkmaster

I am using Linux Mint (Ubuntu-based).

I'm just looking for a quick and dirty way to get my data averaged and so forth. Spark seems like the way to go since it's a massive amount of data, but I want to avoid paying to host multiple machines, at least for now while testing.


Solution

  • Yes. Spark can run on the same node as Cassandra, and it is able to interact with the cluster even if it is not enabled on all of the nodes.

    Package install

    Edit the /etc/default/dse file and set the appropriate flags for
    the type of node you want:
    ...
    
    Spark nodes:
    SPARK_ENABLED=1
    HADOOP_ENABLED=0
    SOLR_ENABLED=0
    

    Then restart the DSE service:

    http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseServ.html
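    As a sketch of that package-install edit, here is the flag change applied to a local copy of the file, so it can be dry-run on a machine without DSE installed (on a real install you would edit /etc/default/dse itself, as root, and the service name is assumed to be dse):

    ```shell
    # Dry-run sketch: flip SPARK_ENABLED in a local stand-in for /etc/default/dse.
    printf 'SPARK_ENABLED=0\nHADOOP_ENABLED=0\nSOLR_ENABLED=0\n' > dse-flags.conf

    # Enable Spark, leaving the other flags untouched.
    sed -i 's/^SPARK_ENABLED=.*/SPARK_ENABLED=1/' dse-flags.conf

    # Show the result of the edit.
    grep '^SPARK_ENABLED' dse-flags.conf
    ```

    On a real package install, follow the same edit on /etc/default/dse with something like `sudo service dse restart` afterwards (service name assumed).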

    Tar Install

    Stop DSE on the node, then restart it using the following command
    from the install directory:
    ...
    Spark only node: $ bin/dse cassandra -k - Starts Spark trackers on a cluster of Analytics nodes.
    

    http://docs.datastax.com/en/datastax_enterprise/4.5/datastax_enterprise/reference/refDseStandalone.html
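    For a tar install, the stop-and-restart sequence might look like the following sketch. Here DSE_HOME is a hypothetical install path, and the script guards against the binary being absent so it can be dry-run on a machine without DSE:

    ```shell
    # Hedged sketch: restart a tarball DSE install as a Spark (Analytics) node.
    # DSE_HOME is a hypothetical path; adjust it to your actual install directory.
    DSE_HOME="${DSE_HOME:-/opt/dse}"

    if [ -x "$DSE_HOME/bin/dse" ]; then
        "$DSE_HOME/bin/dse" cassandra-stop   # stop the running node
        "$DSE_HOME/bin/dse" cassandra -k     # restart with Spark trackers enabled
    else
        echo "dse binary not found at $DSE_HOME/bin/dse"
    fi
    ```

    Once the node is back up, `dsetool sparkmaster` should report a Spark master instead of saying no Spark nodes are enabled.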