I have to read data into Databricks with Python/PySpark, but the format is not the usual CSV or JSON, so I end up iterating over the rows in a for loop. As a result it's very slow.
The data looks like this, for millions of rows. The format is not the same on every row, although there are certain common formats:
HEADER0123 a bunch of spaces ACCTNUM999787666 more numbers ABC2XYZ some text
To parse it, I read from S3 as text and extract the fields by character position:
raw_text = (spark.read
    .format("text")
    .option("mode", "PERMISSIVE")
    .option("header", "false")
    .option("inferSchema", "false")
    .load(my_path))
my_list = []
rows = raw_text.collect()
for row in rows:
    line = row[0].strip()
    header = line[0:6]
    acct = line[6:9]
    my_list.append((header, acct))  # append a tuple, not two arguments
etc.
Then later I create DataFrames from the list:
df = spark.createDataFrame(my_list, "header string, acct string")
Even though I have experience with Spark DataFrames, this is the only approach I can think of given the unusual format.
Is there a way to leverage Spark to process this kind of data? Or a way that doesn't require a for loop?
You're looking for substring(). This:
my_list = []
rows = raw_text.collect()
for row in rows:
    line = row[0].strip()
    header = line[0:6]
    acct = line[6:9]
    my_list.append((header, acct))

df = spark.createDataFrame(my_list, "header string, acct string")
is the same as:
from pyspark.sql import functions as F

df = (
    raw_text
    # substring() is 1-based: (column, start position, length)
    .withColumn('header', F.substring('value', 1, 6))
    .withColumn('acct', F.substring('value', 7, 3))
    .drop('value')
)

(Your loop also calls strip(); if lines can have leading whitespace, apply F.ltrim('value') first.)
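Since you say the rows come in a few common formats, you can extend the same idea without collecting anything to the driver. A rough sketch, assuming (purely for illustration) that the first 6 characters identify the record type; the layout names, offsets, and lengths below are made-up placeholders for your real ones:

from pyspark.sql import functions as F

# Hypothetical fixed-width layouts keyed by record type.
# Each entry is (column name, 1-based start, length) -- placeholders only.
layouts = {
    'HEADER': [('header', 1, 6), ('seq', 7, 4)],
    'DETAIL': [('header', 1, 6), ('acct', 7, 9), ('code', 16, 7)],
}

rec_type = F.substring('value', 1, 6)

dfs = {}
for rtype, fields in layouts.items():
    part = raw_text.filter(rec_type == rtype)
    for name, start, length in fields:
        part = part.withColumn(name, F.substring('value', start, length))
    dfs[rtype] = part.drop('value')

The for loop here only runs over the handful of formats to build the query plan; the millions of rows are sliced on the Spark executors, not on the driver.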
Also note that if each line in your input file is fixed length, with the header and account fields separated by spaces, you can still read it as a CSV:
spark.read.option("delimiter", " ").csv(file)
# or
spark.read.csv(file, sep=' ')
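One caveat with the CSV route: sep=' ' treats every single space as a delimiter, so the runs of spaces in your sample would become a string of empty columns, and the substring() approach is safer there. If the file genuinely is single-space separated, you can also name the columns as you read; the names here are hypothetical and must match the number of parsed columns:

df = spark.read.csv(file, sep=' ').toDF('header', 'acct', 'code')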