Search code examples
dataframefilterjuliasubstring

Find rows in Julia dataframe with substrings in several columns


I'm new to Julia (and StackOverflow).
I want to extract rows, originally from an Excel file, that contain a substring in more than one column.

I found the following answer which solves the first of three aspects of my issue: Finding rows in a Julia dataframe where substrings are NOT found

Example extended with more columns:

using DataFrames
df1 = DataFrame(col1=[1,62,13,24], col2=["ab","bc","cd","de"], col3=["de","bc","cd","de"], col4=["de","bc","cd","de"])

What I additionally like to do is:
1.) How do I specify a (consecutive) range of columns to look through ("second column":"last used column")?
I can list the columns explicitly, but I would like to mention them relatively. I can't rely on the names to be the same with new version of the Excel file.

df2 = filter(r -> any(occursin.(["e"], [r.col3,r.col4])), df1)

2.) Select only rows that contain more than one occurance of the substring. That's what I tried:

df2 = filter(r -> (sum(occursin.(["e"], r.col2)) > 1), df1)

which returns only an empty DataFrame.
empty DataFrame(https://i.sstatic.net/MSxBp.png)

Is a DataFrame the "Julian" way to represent my Excel file at all? In my mind coming from the "old" programming language world using column names contained in the Excel file as designators in the script source code feels odd.


Solution

  • You most likely want this:

    julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col4)]) > 1, df1)
    2×4 DataFrame
     Row │ col1   col2    col3    col4
         │ Int64  String  String  String
    ─────┼───────────────────────────────
       1 │     1  ab      de      de
       2 │    24  de      de      de
    
    julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col3)]) > 1, df1)
    1×4 DataFrame
     Row │ col1   col2    col3    col4
         │ Int64  String  String  String
    ─────┼───────────────────────────────
       1 │    24  de      de      de
    

    Explanation:

    • r[Between(:col2, :col4)] keeps all columns between columns :col2 and :col4 (including them); in the second example I included columns only up to :col3 which drops one row as expected
    • contains("e") returns a function taking a string and returning true if this string contains "e", example contains("e")("abc") is false and contains("e")("abec") is true
    • count counts number of time the condition function (typically called predicate) passed as a first argument returns true