Search code examples
sqlstringhivehiveql

Hive select rows where string in column A contains string from column B


I have a dataset that contains some rows with invalid entries in one column.

I'm trying to select all distinct rows from column B and C, but column B has invalid entries. The string in column A contains the correct name for column B, so if I can figure out how to select all rows where str(b) is in str(a), then I should end up with only correct data.

Pseudo code might look something like this:

SELECT DISTINCT b,c FROM some.log WHERE date = 'today' AND str(b) IN str(a)

I know how to use Python to fix the issue, but column A is unique on every line. So if I select distinct rows and include column A, I'm essentially just selecting all rows, which results in a 60gb file.

The python code would look like this:

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

Which would result in something like this:

df

a                         b        c
/bobs/burgers/1234        bobs     idx 
/bobs/burgers/2234        fred     idx
/cindys/sandwhiches/3234  cindy    idx

df = df[df.apply(lambda x: x.name in x.url, axis=1)]

a                         b        c
/bobs/burgers/1234        bobs     idx
/cindys/sandwhiches/3234  cindy    idx

Is it possible to do this filtering with Hive, so that there is no need to download the large files and process with python?


Solution

  • In addition to like and rlike, you can use these functions:

    Using instr:

    select distinct b,c from some.log where date = current_date and a instr(a,b) > 0;
    

    Using locate:

    select distinct b,c from some.log where date = current_date and locate(b, a) > 0;
    

    See this for reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF