Tags: dataframe, mapreduce, julia

Use mapreduce to read CSVs (where not all columns match) and combine into DataFrame


I'm using Julia 1.4.2.

I want to use mapreduce() to:

  1. Read a bunch of CSVs, then

  2. Combine them into one big DataFrame.

First the preliminaries:

using CSV, DataFrames

# Create CSVs
df1 = DataFrame([['a', 'b', 'c'], [1, 2, 3]],
                ["name", "id"])
df2 = DataFrame([['d', 'e', 'f'], [4, 5, 6]],
                ["name", "id"])
# NOTE: This df has an extra column not present in the other two
df3 = DataFrame([['x', 'y', 'z'], [7, 8, 9], [11, 22, 33]],
                ["name", "id", "num"])
CSV.write("df1.csv", df1)
CSV.write("df2.csv", df2)
CSV.write("df3.csv", df3)

# Get Vector of file paths for the above-created CSVs.
# Regex because there might be other files in working directory.
files = filter(x -> occursin(r"df\d\.csv$", x),
               readdir(join=true))

If I call map() and reduce() separately, I get what I want:

# Import the above-created CSVs as a Vector of DataFrames
dfs = map(x -> CSV.File(x) |> DataFrame,
          files)

# Combine them into one big DataFrame
df = reduce(vcat, dfs, cols=:union)

(NOTE: df3 has an extra column not present in the other two, so I need the cols=:union argument.)

However, I want to condense the above map() and reduce() calls into a mapreduce() call. Here's what I've tried:

df = mapreduce(x -> CSV.File(x) |> DataFrame,
               x -> vcat(x, cols=:union),
               files)
# MethodError: no method matching (::var"#16#18")(::DataFrame, ::DataFrame)

df = mapreduce(x -> CSV.File(x) |> DataFrame,
               vcat,
               files,
               cols=:union)
# MethodError: no method matching _mapreduce_dim(::var"#21#22", ::typeof(vcat), ::NamedTuple{(:cols,),Tuple{Symbol}}, ::Array{String,1}, ::Colon)

The root of my problem is that I don't understand the documentation for mapreduce(). With reduce(op, itr) I can pass the keyword argument directly, as in reduce(vcat, dfs, cols=:union). How can I pass keyword arguments to the binary function (the op argument) in mapreduce(f, op, itrs...)?


Solution

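  • op must be a function of two arguments, because it combines the accumulated result with the next mapped element. Your first attempt fails because x -> vcat(x, cols=:union) accepts only one argument; the second fails because mapreduce() does not forward keyword arguments such as cols to op. Fix the keyword inside a two-argument anonymous function instead: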

    df = mapreduce(x -> CSV.File(x) |> DataFrame,
                   (x, y) -> vcat(x, y; cols=:union),
                   files)
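
To see the pattern without touching the file system, here is a minimal sketch that uses in-memory data frames in place of the CSV files. The names tables, vcat_union, combined and expected are made up for illustration, and copy merely stands in for the CSV-reading step.

```julia
using DataFrames

# In-memory stand-ins for the CSV files; df3 has an extra column "num".
df1 = DataFrame(name = ['a', 'b', 'c'], id = [1, 2, 3])
df3 = DataFrame(name = ['x', 'y', 'z'], id = [7, 8, 9], num = [11, 22, 33])
tables = [df1, df3]

# Hypothetical helper: fixes cols=:union so that mapreduce
# receives a plain two-argument binary function.
vcat_union(x, y) = vcat(x, y; cols = :union)

# copy plays the role of the mapping step (reading a file in the question).
combined = mapreduce(copy, vcat_union, tables)

# The two-step form from the question, for comparison.
expected = reduce(vcat, map(copy, tables); cols = :union)
```

Both forms should produce the same 6×3 table, with missing in the num column for the rows that came from df1. With many files, the separate map() and reduce() calls may still be preferable: reduce(vcat, dfs; cols=:union) is a dedicated DataFrames method that concatenates all tables at once, whereas mapreduce() calls vcat pairwise and builds an intermediate DataFrame at each step.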