macos · apache-spark · hadoop · pyspark · homebrew

Hadoop 3.3.0: RPC response has invalid length


I just installed PySpark via Homebrew and I'm currently trying to put data into HDFS.

The Problem

Any interaction with Hadoop is failing.

I followed a tutorial to set up Hadoop 3.3.0 on macOS.

It somehow didn't work out, even though the only things I changed were some versions (specific JDK, MySQL, etc.).

Whenever I try to run any command related to Hadoop, I receive this:

▶ hadoop fs -ls /
2021-05-12 07:45:44,647 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
ls: RPC response has invalid length

Running this code in a notebook:

from pyspark.sql.session import SparkSession

# https://saagie.zendesk.com/hc/en-us/articles/360029759552-PySpark-Read-and-Write-Files-from-HDFS
sparkSession = SparkSession.builder.appName("example-pyspark-read-and-write").getOrCreate()
# Create data
data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = sparkSession.createDataFrame(data)

# Write into HDFS
df.write.csv("hdfs://localhost:9000/cluster/example.csv")
# Read from HDFS
df_load = sparkSession.read.csv("hdfs://localhost:9000/cluster/example.csv")
df_load.show()

# Stop the session when done
sparkSession.stop()
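One extra sanity check I'd add here (my suggestion, not part of the original notebook): ask Spark's Hadoop configuration which default filesystem it actually resolves, to confirm the session really targets the hdfs://localhost:9000 endpoint from core-site.xml. Note that _jsc is an internal PySpark handle, so treat this as a debugging sketch only.

# Hedged debugging sketch: print the default filesystem Spark's Hadoop
# configuration resolves; _jsc is internal, so use for interactive inspection only.
hadoop_conf = sparkSession.sparkContext._jsc.hadoopConfiguration()
print(hadoop_conf.get("fs.defaultFS"))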

throws this:

Py4JJavaError                             Traceback (most recent call last)
<ipython-input-5-e25cae5a6cac> in <module>
      8 
      9 # Write into HDFS
---> 10 df.write.csv("hdfs://localhost:9000/cluster/example.csv")
     11 # Read from HDFS
     12 df_load = sparkSession.read.csv("hdfs://localhost:9000/cluster/example.csv")

/usr/local/Cellar/apache-spark/3.1.1/libexec/python/pyspark/sql/readwriter.py in csv(self, path, mode, compression, sep, quote, escape, header, nullValue, escapeQuotes, quoteAll, dateFormat, timestampFormat, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, charToEscapeQuoteEscaping, encoding, emptyValue, lineSep)
   1369                        charToEscapeQuoteEscaping=charToEscapeQuoteEscaping,
   1370                        encoding=encoding, emptyValue=emptyValue, lineSep=lineSep)
-> 1371         self._jwrite.csv(path)
   1372 
   1373     def orc(self, path, mode=None, partitionBy=None, compression=None):

/usr/local/lib/python3.9/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1307 
   1308         answer = self.gateway_client.send_command(command)
-> 1309         return_value = get_return_value(
   1310             answer, self.gateway_client, self.target_id, self.name)
   1311 

/usr/local/Cellar/apache-spark/3.1.1/libexec/python/pyspark/sql/utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

/usr/local/lib/python3.9/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling o99.csv.
: java.io.IOException: Failed on local exception: org.apache.hadoop.ipc.RpcException: RPC response has invalid length; Host Details : local host is: "blkpingu16-MBP.fritz.box/192.xxx.xxx.xx"; destination host is: "localhost":9000; 
    at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:816)
    at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
    ...
    at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:979)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:567)
    ...
    at java.base/java.lang.Thread.run(Thread.java:830)
Caused by: org.apache.hadoop.ipc.RpcException: RPC response has invalid length
    at org.apache.hadoop.ipc.Client$IpcStreams.readResponse(Client.java:1827)
    at org.apache.hadoop.ipc.Client$Connection.receiveRpcResponse(Client.java:1173)
    at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1069)

There it is: RPC response has invalid length

I have configured and verified all my paths in various config files like

core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>ipc.maximum.data.length</name>
    <value>134217728</value>
  </property>
</configuration>

.zshrc

JAVA_HOME="/Library/Java/JavaVirtualMachines/jdk1.8.0_201.jdk/Contents/Home"

...

## JAVA env variables
export JAVA_HOME="/Library/Java/JavaVirtualMachines/adoptopenjdk-8.jdk/Contents/Home"
export PATH=$PATH:$JAVA_HOME/bin

## HADOOP env variables
export HADOOP_HOME="/usr/local/Cellar/hadoop/3.3.0/libexec"
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export HADOOP_CLASSPATH=${JAVA_HOME}/lib/tools.jar

## HIVE env variables
export HIVE_HOME=/usr/local/Cellar/hive/3.1.2_3/libexec
export PATH=$PATH:/$HIVE_HOME/bin

## MySQL ENV
export PATH=$PATH:/usr/local/Cellar/mysql/8.0.23_1/bin

hdfs-site.xml

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

hadoop-env.sh

export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_141.jdk/Contents/Home

If I start Hadoop, it seems to start all daemons:

▶ $HADOOP_HOME/sbin/start-all.sh
Starting namenodes on [localhost]
Starting datanodes
Starting secondary namenodes [blkpingu16-MBP.fritz.box]
2021-05-12 08:18:15,786 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting resourcemanager
Starting nodemanagers

jps shows that some Hadoop processes are running, along with some Spark processes:

▶ jps
166 Jps
99750 ResourceManager
99544 SecondaryNameNode
99851 NodeManager
98154 SparkSubmit
99405 DataNode
39326 Master

  • http://localhost:8088/cluster is available and shows the Hadoop dashboard (YARN, according to the tutorial I followed)
  • http://localhost:8080 is available and shows the Spark dashboard
  • http://localhost:9870 is not available (it should show the NameNode web UI)

My main problem is that I don't know why the NameNode is not there (it should be, yet it doesn't show up in jps), and consequently why I can't communicate with HDFS, either via the command line (to put data in) or from notebooks (to read data out). Something Hadoop-related is broken and I can't figure out how to fix it.
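What I would check next (a hedged suggestion on my part, not something from the tutorial): the NameNode writes a log file when it starts and dies, and that log usually names the exact failure. Assuming the default log location under $HADOOP_HOME/logs, something like this should surface it:

▶ ls $HADOOP_HOME/logs | grep -i namenode
▶ grep -iE "ERROR|Exception" $HADOOP_HOME/logs/hadoop-*-namenode-*.log | tail -n 20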


Solution

  • I faced the same issue today and would like to note my fix here in case anyone hits the same thing. A quick jps showed me that the NameNode process was not there, although no warning or error showed up.

    As I discovered in the NameNode's .log file, there was a java.net.BindException: Problem binding to [localhost:9000], which made me think that port 9000 was already taken by another process. I checked the open ports (sudo lsof -i -P -n | grep LISTEN, for anyone who needs it), and indeed port 9000 was held by a Python process (I was only running PySpark at the time).

    The solution is pretty straightforward: change the port number in the fs.defaultFS field in etc/hadoop/core-site.xml to another port that is not in use (mine is 9900), as sketched below.
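    A minimal sketch of that change, assuming the same core-site.xml shown in the question (9900 is just the example port from above; any free port works):

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <!-- moved off 9000 because another process (PySpark) was already listening there -->
        <value>hdfs://localhost:9900</value>
      </property>
    </configuration>

    After changing it, restart the Hadoop daemons (stop-all.sh, then start-all.sh) and remember to point every hdfs:// URI at the new port, e.g. hdfs://localhost:9900/cluster/example.csv in the notebook code above.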