I have a dataset that contains some rows with invalid entries in one column.
I'm trying to select all distinct rows from column B and C, but column B has invalid entries. The string in column A contains the correct name for column B, so if I can figure out how to select all rows where str(b) is in str(a), then I should end up with only correct data.
Pseudo code might look something like this:
SELECT DISTINCT b,c FROM some.log WHERE date = 'today' AND str(b) IN str(a)
I know how to use Python to fix the issue, but column A is unique on every line. So if I select distinct rows and include column A, I'm essentially just selecting all rows, which results in a 60gb file.
The python code would look like this:
df = df[df.apply(lambda x: x.name in x.url, axis=1)]
Which would result in something like this:
df
a b c
/bobs/burgers/1234 bobs idx
/bobs/burgers/2234 fred idx
/cindys/sandwhiches/3234 cindy idx
df = df[df.apply(lambda x: x.name in x.url, axis=1)]
a b c
/bobs/burgers/1234 bobs idx
/cindys/sandwhiches/3234 cindy idx
Is it possible to do this filtering with Hive, so that there is no need to download the large files and process with python?
In addition to like
and rlike
, you can use these functions:
Using instr
:
select distinct b,c from some.log where date = current_date and a instr(a,b) > 0;
Using locate
:
select distinct b,c from some.log where date = current_date and locate(b, a) > 0;
See this for reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF