Search code examples
sqlregexhiveregexp-replace

HIVE - Removing everything between certain characters


I have some paths as strings that I would like to clean and make more neat.

Example of strings:
String 1: /keywordOneA/keywordTwoA/393r-mr49-j5n65_9e8e77g77b8
String 2: /keywordOneA/keywordTwoA/111-4444-jjjj_1b1b1b1b1b1b1b
String 3: /keywordOneA/keywordTwoB/393r-mr49-j5n65_9e8e77g77b8/keywordThreeA

I want them to yield:
String 1: /keywordOneA/keywordTwoA/
String 2: /keywordOneA/keywordTwoA/
String 3: /keywordOneA/keywordTwoB/keywordThreeA

Basically as long as it is just LETTERS between /.../ Then I want to keep these keywords otherwise I want to remove them. Is this possible somehow?

Also possible to do this stepwise using WITHclause.

This is what I have come up with, but starting to get stuck

select regexp_replace('/keywordonea/keywordtwob/393r-mr49 j5n65_9e8e77g77b8/keywordthreea','[0-9\/_.,!?-]','');

Solution

  • I think this will do what you want:

    select regexp_replace(val, '/[^/]*[^a-zA-Z0-9/][^/]*', '')
    

    I don't have Hive on-hand, but this works in Oracle.

    This regular expression is looking for characters following a forward slash. These characters are not forward slashes and at least one must be non-alphanumeric. Because regular expressions are greedy by default, this will match the characters to the next forward slash or to the end of the string.