
Pulling a value from a Spark dataframe without it rounding the value


I'm attempting to verify something in my Spark dataset. I'm taking a set of columns I'm using inside a clustering algorithm, producing a feature column, then normalizing the data. Spark does all this just fine. However, I noticed that when I convert the results of filtering my data into a pandas dataframe, Spark's toPandas, head, and first functions all round the values. This means they're hard to plot and useless to any functions outside of Spark. Is there a way to shut this off?

So I do the following:

    from pyspark.ml.feature import VectorAssembler, Normalizer

    # Assemble the input columns into a single 'features' vector column
    assembler_t = VectorAssembler(inputCols=['unixTime', 'value_hrf', 'value_raw'],
                                  outputCol='features', handleInvalid='keep')
    result_t = assembler_t.transform(result)
    normalizer_t = Normalizer(inputCol='features', outputCol='normalized_features')
    result_t = normalizer_t.transform(result_t)
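For context, Normalizer rescales each row vector to unit norm (p=2 by default), so when one component such as a Unix timestamp dwarfs the others, every normalized vector lands very close to [1.0, ~0, ~0]. Here is a minimal sketch of that computation in plain numpy, using hypothetical values:

    import numpy as np

    # Hypothetical unixTime, value_hrf, value_raw for a single row
    row = np.array([1.6e9, 23.5, 487.0])
    normalized = row / np.linalg.norm(row)  # L2 normalization, what Normalizer(p=2.0) does
    print(normalized)  # ~[1.0, 1.5e-08, 3.0e-07]: the first component dominates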

The result of result_t.select('normalized_features').show() is a table that looks like this:

    | normalized_features|
    |--------------------|
    |[0.99999999999980...|
    |[0.99999999999980...|
    |[0.99999999999980...|
    |[0.99999999999979...|

You see the first problem: the values are so close together that a large number of decimal places is needed to tell them apart. Also, there are three values per row and only one is shown. So I figured I'd look at just one: result_t.first()['normalized_features']

Unfortunately, this produces the following: DenseVector([1.0, -0.0, -0.0])
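For what it's worth, the rounding here appears to live only in the printed representation: a DenseVector stores full float64 values, and its repr rounds them for display. A quick check, assuming the result_t built above:

    vec = result_t.first()['normalized_features']
    # toArray() returns the underlying numpy float64 array; Python's float repr
    # then shows each value at full precision.
    for x in vec.toArray():
        print(repr(float(x)))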

The same thing happens if I use the head or toPandas functions. All I'm trying to figure out is whether the values produced are unique, and whether I can get them output with their full decimal places. The built-in KMeans algorithm seems to produce output indicating they're unique, but I want to be sure. Also, since I'm using algorithms that don't exist in Spark, I need to output the pre-run filtered data into a pandas DF to use them. When I do this, the normalized data is useless, so I have to use something like a min-max scaler to get something similar. However, I'm worried this will introduce bias into the data, since normalization isn't the same from algorithm to algorithm.


Solution

  • You can print the full pandas dataframe without any row or column truncation using the following function.

    import pandas as pd

    def print_pandas(dataframe_given):
        # Temporarily lift pandas' display limits so no rows, columns,
        # or cell contents are elided
        with pd.option_context('display.max_rows', None, 'display.max_columns', None,
                               'expand_frame_repr', False, 'display.max_colwidth', None):
            print(dataframe_given)
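  • The options above only control how much pandas prints; the dataframe itself keeps full float64 precision either way. To move the vector column into plain float columns before calling toPandas, something like vector_to_array (available in Spark 3.0+) should work. A sketch, assuming the result_t dataframe from the question and three elements per vector:

    from pyspark.ml.functions import vector_to_array
    from pyspark.sql.functions import col

    # Convert the vector to an array<double>, then split it into scalar
    # columns so toPandas() receives ordinary float64 values.
    arr = result_t.withColumn('nf', vector_to_array('normalized_features'))
    pdf = arr.select(*[col('nf')[i].alias(f'nf_{i}') for i in range(3)]).toPandas()

    with pd.option_context('display.precision', 17):  # print full float64 digits
        print(pdf.head())

  • On the Spark side, result_t.select('normalized_features').show(truncate=False) prints the full cell contents instead of cutting them off at 20 characters, which addresses the truncated table in the question.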