Search code examples
pythonsqlpysparkairflowdata-engineering

pyspark statistical window function keeps calculating NULL values


Here is my table showing all the columns. What I want to basically happen is to be able to take the closing_price column and calculate a at least the moving standard deviation of about 7 days

+------+---------------+-------------+-------------+-------------+------------+-------------+------------+-------------------+----------------+----------------+------------------+-------------------+-------------------+
|volume|volume_weighted|opening_price|closing_price|highest_price|lowest_price|    timestamp|transactions|       new_datetime|moving_max_close|moving_min_close|  moving_avg_close|        range_close|moving_median_close|
+------+---------------+-------------+-------------+-------------+------------+-------------+------------+-------------------+----------------+----------------+------------------+-------------------+-------------------+
|     1|         55.388|       55.388|       55.388|       55.388|      55.388|1704067200000|           1|2024-01-01 08:00:00|          55.388|          55.388|            55.388|                0.0|             55.388|
|     1|         55.388|       55.388|       55.388|       55.388|      55.388|1704139200000|           1|2024-01-02 04:00:00|          55.388|          55.388|            55.388|                0.0|             55.388|
|   533|        55.5054|       55.405|       55.573|       55.586|      55.242|1704153600000|         533|2024-01-02 08:00:00|          55.573|          55.388|55.449666666666666|0.18500000000000227|             55.388|
|   473|        55.5836|        55.58|       55.672|       55.685|      55.445|1704168000000|         473|2024-01-02 12:00:00|          55.672|          55.388|          55.50525| 0.2839999999999989|            55.4805|
|   942|        55.6157|       55.679|       55.618|       55.683|      55.505|1704182400000|         942|2024-01-02 16:00:00|          55.672|          55.388|           55.5278| 0.2839999999999989|             55.573|
|   662|        55.6114|       55.619|       55.592|       55.662|       55.48|1704196800000|         662|2024-01-02 20:00:00|          55.672|          55.388|           55.5385| 0.2839999999999989| 55.582499999999996|
|   155|        55.5755|       55.589|       55.583|         55.6|       55.48|1704211200000|         155|2024-01-03 00:00:00|          55.672|          55.388| 55.57099999999999| 0.2839999999999989|            55.5875|
|    52|        55.5663|       55.588|       55.563|         55.6|       55.48|1704225600000|          52|2024-01-03 04:00:00|          55.672|          55.563| 55.60016666666667|0.10899999999999466|            55.5875|
|   504|        55.7318|       55.564|       55.681|       55.805|       55.55|1704240000000|         504|2024-01-03 08:00:00|          55.681|          55.563| 55.61816666666667|  0.117999999999995| 55.605000000000004|
|   798|        55.6446|       55.679|       55.583|       55.739|       55.48|1704254400000|         798|2024-01-03 12:00:00|          55.681|          55.563| 55.60333333333333|  0.117999999999995|            55.5875|
|  1012|        55.6643|       55.558|       55.751|       55.825|       54.92|1704268800000|        1012|2024-01-03 16:00:00|          55.751|          55.563| 55.62549999999999|0.18799999999999528|            55.5875|
|   684|        55.7647|       55.754|       55.802|       55.845|      55.593|1704283200000|         684|2024-01-03 20:00:00|          55.802|          55.563|55.660500000000006|0.23899999999999721|             55.632|
|   547|        55.7301|       55.815|       55.738|       55.838|        55.6|1704297600000|         547|2024-01-04 00:00:00|          55.802|          55.563| 55.68633333333333|0.23899999999999721|            55.7095|
|   184|        55.7401|       55.739|       55.673|       55.775|       55.63|1704312000000|         184|2024-01-04 04:00:00|          55.802|          55.583| 55.70466666666666| 0.2190000000000012|            55.7095|
|   639|        55.6488|       55.653|       55.568|        55.75|      55.495|1704326400000|         639|2024-01-04 08:00:00|          55.802|          55.568|55.685833333333335|0.23400000000000176|            55.7055|
|  1134|        55.5318|       55.569|       55.525|       55.615|      55.365|1704340800000|        1134|2024-01-04 12:00:00|          55.802|          55.525| 55.67616666666666|  0.277000000000001|            55.7055|
|   723|        55.4923|       55.522|       55.536|       55.565|      53.779|1704355200000|         723|2024-01-04 16:00:00|          55.802|          55.525| 55.64033333333333|  0.277000000000001|            55.6205|
|   410|        55.5717|       55.531|       55.592|       55.615|       55.44|1704369600000|         410|2024-01-04 20:00:00|          55.738|          55.525| 55.60533333333333|0.21300000000000097|              55.58|
|   138|        55.5721|       55.589|       55.574|       55.589|      55.475|1704384000000|         138|2024-01-05 00:00:00|          55.673|          55.525|            55.578|0.14800000000000324|             55.571|
|    80|        55.5543|       55.572|       55.523|         55.6|       55.52|1704398400000|          80|2024-01-05 04:00:00|          55.592|          55.523|55.553000000000004|0.06899999999999551|             55.552|
+------+---------------+-------------+-------------+-------------+------------+-------------+------------+-------------------+----------------+----------------+------------------+-------------------+-------------------+

This is my code specifying the window I'll be setting for the .over() method of the stddev() function. However whenever I do calculate it, it shows the following values

from pyspark.sql.window import Window
from pyspark.sql.functions import skewness, kurtosis, stddev, avg

# ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW which would be a 7 day moving standard deviation
window = Window.orderBy("new_datetime").rowsBetween(5, Window.currentRow)

# window = Window.orderBy("timestamp").rangeBetween(-6, 0)

result.select(skewness(result.closing_price).over(window=window).alias('skew_close')).show()

What could be the reason why the stddev() fails to calculate the moving standard deviation of the closing price values of over 7 days? Is it because of my window function? Am I missing something here?

+----------+
|skew_close|
+----------+
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
|      NULL|
+----------+

Solution

  • Use negative numbers for the rows before the current row in rowsBetween. It's NULL because there are no rows between start=5 and end=0

    window = Window.orderBy("new_datetime").rowsBetween(-5, Window.currentRow)