apache-spark, pyspark, azure-databricks

pyspark function understanding - conversion factor


I'm coding in PySpark on Apache Spark, Databricks.

I have a DataFrame, DF, which contains the following columns: [A, B, C, D, E, F, G, H, I, J].

The following validates that the DataFrame has the required columns:

has_columns(very_large_dataframe, ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])

There is a requirement to apply a conversion factor of 2.5 to column F, i.e. a value of 2 with a conversion factor of 2.5 becomes 5.

The full context of the code is as follows:

very_large_dataframe: 250 GB of CSV files from the client, which must have only 10 columns [A, B, C, D, E, F, G, H, I, J]:

  • [A, B] contain string data
  • [C, D, E, F, G, H, I, J] contain decimals with precision 5, scale 2 (i.e. 125.75)
  • [A, B, C, D, E] should not be null
  • [F, G, H, I, J] may be null

very_large_dataset_location = '/Sourced/location_1'
very_large_dataframe = spark.read.csv(very_large_dataset_location, header=True, sep="\t")
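
Since the expected types are known, one option (an assumption on my part; not shown in the original code) is to read the CSV with an explicit schema rather than leaving every column as a string, e.g. using DecimalType(5, 2) for the numeric columns:

from pyspark.sql.types import StructType, StructField, StringType, DecimalType

# Hypothetical explicit schema matching the stated constraints:
# A, B are strings; C..J are decimals with precision 5, scale 2.
schema = StructType(
    [StructField(c, StringType(), nullable=False) for c in ['A', 'B']]
    + [StructField(c, DecimalType(5, 2), nullable=False) for c in ['C', 'D', 'E']]
    + [StructField(c, DecimalType(5, 2), nullable=True) for c in ['F', 'G', 'H', 'I', 'J']]
)

very_large_dataframe = spark.read.csv(very_large_dataset_location, header=True, sep="\t", schema=schema)

Note that Spark does not enforce nullability on CSV reads, so the not-null columns still need a separate check.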

Validate the column count:

if column_count(very_large_dataframe) != 10:
    raise Exception('Incorrect column count: ' + str(column_count(very_large_dataframe)))

Validate that the DataFrame has all required columns:

has_columns(very_large_dataframe, ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
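
The column_count and has_columns helpers are not defined in the question; a minimal sketch of what they might look like (names and behavior are assumptions on my part):

def column_count(df):
    # Number of columns in the DataFrame.
    return len(df.columns)

def has_columns(df, required):
    # Raise if any required column is missing from the DataFrame.
    missing = set(required) - set(df.columns)
    if missing:
        raise Exception('Missing required columns: ' + ', '.join(sorted(missing)))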

However, I have never come across applying a conversion factor to a column.

Is anyone familiar with applying a conversion factor with PySpark? (or any language for that matter)


Solution

  • A conversion factor is simply the number you multiply by: for example, 2 × 2.5 = 5, which means the value 2 is multiplied by 2.5.

    So applying a conversion factor of 2.5 to the value 2 gives 5.

    This is my understanding.
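
In PySpark this is a plain column multiplication. A minimal sketch, assuming the DataFrame and column names from the question:

from pyspark.sql.functions import col

# Multiply column F by the conversion factor 2.5;
# withColumn replaces the existing F with the scaled values.
very_large_dataframe = very_large_dataframe.withColumn('F', col('F') * 2.5)

For example, a row where F is 2 comes out with F = 5.0. If F was read as a string (the CSV default without an explicit schema), cast it first, e.g. col('F').cast('decimal(5,2)') * 2.5.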