I'm attempting to verify something in my Spark dataset. I take a set of columns I'm using in a clustering algorithm, produce a feature column, then normalize the data. Spark handles all of this just fine. However, I noticed that when I convert the filtered results of my data into a pandas dataframe, Spark's toPandas, head, and first functions all appear to round the values. This makes them hard to plot and useless to any functions outside of Spark. Is there a way to turn this off?
So I do the following:
from pyspark.ml.feature import VectorAssembler, Normalizer

assembler_t = VectorAssembler(inputCols=['unixTime', 'value_hrf', 'value_raw'],
                              outputCol='features', handleInvalid='keep')
result_t = assembler_t.transform(result)

normalizer_t = Normalizer(inputCol='features', outputCol='normalized_features')
result_t = normalizer_t.transform(result_t)
The result of result_t.select('normalized_features').show()
is a table that looks like this:
| normalized_features|
|--------------------|
|[0.99999999999980...|
|[0.99999999999980...|
|[0.99999999999980...|
|[0.99999999999979...|
You can see the first problem: the values are so close together that many decimal places are required to tell them apart. Also, there are three values per row and only part of the first one is shown. So I figured I'd look at just one: result_t.first()['normalized_features']
Unfortunately, this produces the following:
DenseVector([1.0, -0.0, -0.0])
The same thing happens if I use the head or toPandas functions. All I'm trying to figure out is whether the values produced are unique, and whether I can get them output with their full decimal places. The built-in KMeans algorithm produces output suggesting they're unique, but I want to be sure. Also, since I'm using algorithms that don't exist in Spark as well, I need to export the pre-run filtered data to a pandas DF to use them. When I do this, the normalized data is useless, so I have to use something like a min-max scaler to get something similar. However, I'm worried this will introduce bias into the data, since normalization isn't the same from algorithm to algorithm.
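To sanity-check the uniqueness question without Spark, here is a minimal numpy sketch of the L2 (p=2) normalization that Spark's Normalizer performs by default; the row values are made up to stand in for unixTime/value_hrf/value_raw. It shows that the stored doubles remain distinct even when the printed form looks rounded:

```python
import numpy as np

# Two rows that differ only slightly (hypothetical stand-in data).
rows = np.array([[1.6e9, 10.0, 10.0],
                 [1.6e9, 10.1, 10.1]])

# L2 normalization: divide each row by its Euclidean norm,
# which is what Spark's Normalizer does with p=2.
normalized = rows / np.linalg.norm(rows, axis=1, keepdims=True)

# At default print precision the rows can look identical,
# but the underlying float64 values are still distinct.
assert not np.array_equal(normalized[0], normalized[1])

# Widening the print precision reveals the difference.
np.set_printoptions(precision=17)
print(normalized)
```

So the rounding is a display artifact, not a loss of precision in the stored vectors.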
You can print the full pandas dataframe without any row or column truncation using the following function.

import pandas as pd

def print_pandas(dataframe_given):
    with pd.option_context('display.max_rows', None,
                           'display.max_columns', None,
                           'expand_frame_repr', False,
                           'display.max_colwidth', None):
        print(dataframe_given)
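Note that the max_rows/max_columns/max_colwidth options only stop pandas from truncating wide output; the number of decimal places shown is controlled separately. A small sketch (plain floats rather than Spark vectors, values made up) showing that the data was never rounded and that display.float_format can reveal the full doubles:

```python
import pandas as pd

# Two values that look identical at pandas' default display precision.
df = pd.DataFrame({'x': [0.999999999999801, 0.999999999999802]})

# Widen the displayed precision for this block only.
with pd.option_context('display.float_format', lambda v: f'{v:.15f}'):
    print(df)

# The stored doubles are distinct -- only the display was rounded.
assert df['x'].iloc[0] != df['x'].iloc[1]
```

The same idea applies after toPandas: the values in the dataframe keep full double precision, so any rounding you see is purely in how they are printed.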