I'm new to Julia (and StackOverflow).
I want to extract rows, originally from an Excel file, that contain a substring in more than one column.
I found the following answer which solves the first of three aspects of my issue: Finding rows in a Julia dataframe where substrings are NOT found
Example extended with more columns:
using DataFrames
df1 = DataFrame(col1=[1,62,13,24], col2=["ab","bc","cd","de"], col3=["de","bc","cd","de"], col4=["de","bc","cd","de"])
What I additionally like to do is:
1.) How do I specify a (consecutive) range of columns to look through ("second column":"last used column")?
I can list the columns explicitly, but I would like to mention them relatively. I can't rely on the names to be the same with new version of the Excel file.
df2 = filter(r -> any(occursin.(["e"], [r.col3,r.col4])), df1)
2.) Select only rows that contain more than one occurance of the substring. That's what I tried:
df2 = filter(r -> (sum(occursin.(["e"], r.col2)) > 1), df1)
which returns only an empty DataFrame.
empty DataFrame(https://i.sstatic.net/MSxBp.png)
Is a DataFrame the "Julian" way to represent my Excel file at all? In my mind coming from the "old" programming language world using column names contained in the Excel file as designators in the script source code feels odd.
You most likely want this:
julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col4)]) > 1, df1)
2×4 DataFrame
Row │ col1 col2 col3 col4
│ Int64 String String String
─────┼───────────────────────────────
1 │ 1 ab de de
2 │ 24 de de de
julia> df2 = filter(r -> count(contains("e"), r[Between(:col2, :col3)]) > 1, df1)
1×4 DataFrame
Row │ col1 col2 col3 col4
│ Int64 String String String
─────┼───────────────────────────────
1 │ 24 de de de
Explanation:
r[Between(:col2, :col4)]
keeps all columns between columns :col2
and :col4
(including them); in the second example I included columns only up to :col3
which drops one row as expectedcontains("e")
returns a function taking a string and returning true
if this string contains "e"
, example contains("e")("abc")
is false
and contains("e")("abec")
is true
count
counts number of time the condition function (typically called predicate) passed as a first argument returns true