pyspark load data from url
url = "https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv"
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
spark.read.csv(SparkFiles.get("cities.csv"), header=True)
However, the following error occurred:
spark.read.csv(SparkFiles.get("cities.csv"), header=True)
[Stage 0:>
(0 + 1) / 1]20/06/30 19:10:57 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: File /tmp/spark-1ee8b00f-8657-4cdc-8d7b-e3bc473bbce7/userFiles-f9e0a88d-8678-48c4-a21b-c06ce76d528b/cities.csv exists and does not match contents of https://github.com/jokecamp/FootballData/blob/master/openFootballData/cities.csv
20/06/30 19:10:57 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 499, in csv
return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
File "/usr/local/lib/python3.6/dist-packages/py4j/java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/jsh2936/spark-3.0.0-preview2-bin-hadoop2.7/python/pyspark/sql/utils.py", line 98, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.6/dist-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o31.csv.``
How should i solve the problem?
The problem is with your url..
In order to read data from github you have to pass the raw
url instead.
On the data page click on raw and then copy that url to get the data
url = 'https://raw.githubusercontent.com/jokecamp/FootballData/master/openFootballData/cities.csv'
from pyspark import SparkFiles
spark.sparkContext.addFile(url)
df = spark.read.csv(SparkFiles.get("cities.csv"), header=True)