Plotting Julia DataFrame columns that have whitespace in their names with Matplotlib


I have DataFrames with whitespace in their column names, because the CSV files they were generated from had whitespace in the names as well. The DataFrames were generated with the lines:

csvnames::Array{String,1} = filter(x -> endswith(x, ".csv"), readdir(CSV_DIR))
dfs::Dict{String, DataFrame} = Dict( csvnames[i] => CSV.File(CSV_DIR * csvnames[i]) |> DataFrame for i in 1:length(csvnames))

The DataFrames have column names such as "Tehtävä 1", but none of the following expressions work when I try to access the column (here ecols is a dataframe):

  1. plot = axes.plot(ecols[Symbol("Tehtävä 1")]) produces the error TypeError("float() argument must be a string or a number, not 'PyCall.jlwrap'")

  2. plot = axes.plot(ecols[:Tehtävä_1]) produces the error ERROR: LoadError: ArgumentError: column name :Tehtävä_1 not found in the data frame; existing most similar names are: :Tehtävä 1

  3. plot = axes.plot(ecols[:Tehtävä 1]) raises the error ERROR: LoadError: MethodError: no method matching typed_hcat(::DataFrame, ::Symbol, ::Int64)

It therefore seems that I have no way of plotting DataFrame columns that have spaces in their names. Printing them works just fine, as the line

println(ecols[Symbol("Tehtävä 1")])

produces an array of floats, [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], as it is supposed to. Is Matplotlib simply incompatible with DataFrames that have whitespace in their column names? If so, how could I remove all whitespace from the column names of a Julia DataFrame?

EDIT

I forgot to mention one very crucial point: the DataFrame contains missing values, which Matplotlib can't comprehend. This was causing error 1. I would still very much like to know if there is a way of getting rid of any whitespace in the table column names, possibly during the construction of the DataFrame.


Solution

  • The first approach works just fine, but it seems you are not using PyPlot.jl correctly: in particular, you create a variable called plot, which shadows the plot function from PyPlot.jl.

    To see that it works, run:

    julia> df = DataFrame(Symbol("Tehtävä 1") => 1.0:5.0)
    5×1 DataFrame
    │ Row │ Tehtävä 1 │
    │     │ Float64   │
    ├─────┼───────────┤
    │ 1   │ 1.0       │
    │ 2   │ 2.0       │
    │ 3   │ 3.0       │
    │ 4   │ 4.0       │
    │ 5   │ 5.0       │
    
    julia> plot(df[Symbol("Tehtävä 1")])
    1-element Array{PyCall.PyObject,1}:
     PyObject <matplotlib.lines.Line2D object at 0x000000003F9EE0B8>
    

    and a plot is shown as expected.
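The question's edit notes that the real column contains missing values, which is what triggered error 1. Below is a minimal sketch of two ways to turn such a column into plain floats before handing it to plot, using only Base Julia. The vector col is a made-up stand-in for the real column, not data from the question.

```julia
# Hypothetical column with gaps, standing in for ecols[Symbol("Tehtävä 1")]
col = [1.0, missing, 3.0, missing, 5.0]

# Option 1: drop the missing entries; the result is a plain Vector{Float64}
dropped = collect(skipmissing(col))

# Option 2: keep every position and replace missing with NaN,
# so the remaining points stay at their original indices
filled = coalesce.(col, NaN)
```

Either dropped or filled can then be passed to plot. Matplotlib leaves a gap in a line wherever it meets NaN, so the second option preserves the x positions of the remaining points.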

    EDIT

    If you want to remove whitespace from the column names of the data frame df, write:

    names!(df, Symbol.(replace.(string.(names(df)), Ref(r"\s"=>""))))
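In more recent DataFrames.jl releases names! is no longer available and rename! takes its place. The following is a hedged sketch, assuming a version in which column names can be given as strings and rename! accepts a function that maps each name to a new one. The data frame here is a made-up example.

```julia
using DataFrames

# Made-up data frame with whitespace in its column names
df = DataFrame("Tehtävä 1" => 1.0:5.0, "Tehtävä 2" => 2.0:6.0)

# Apply the same regex replacement to every column name, in place
rename!(name -> replace(name, r"\s" => ""), df)

# The cleaned names are valid identifiers, so plain symbol literals work
col = df[!, :Tehtävä1]
```

To handle this while reading the file instead, CSV.jl documents a normalizenames keyword for CSV.File that rewrites column names into valid identifiers. Check the documentation of your installed version for its exact behaviour.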