I have some paths as strings that I would like to clean and make more neat.
Example of strings:
String 1: /keywordOneA/keywordTwoA/393r-mr49-j5n65_9e8e77g77b8
String 2: /keywordOneA/keywordTwoA/111-4444-jjjj_1b1b1b1b1b1b1b
String 3: /keywordOneA/keywordTwoB/393r-mr49-j5n65_9e8e77g77b8/keywordThreeA
I want them to yield:
String 1: /keywordOneA/keywordTwoA/
String 2: /keywordOneA/keywordTwoA/
String 3: /keywordOneA/keywordTwoB/keywordThreeA
Basically as long as it is just LETTERS between /
.../
Then I want to keep these keywords otherwise I want to remove them. Is this possible somehow?
Also possible to do this stepwise using WITH
clause.
This is what I have come up with, but starting to get stuck
select regexp_replace('/keywordonea/keywordtwob/393r-mr49 j5n65_9e8e77g77b8/keywordthreea','[0-9\/_.,!?-]','');
I think this will do what you want:
select regexp_replace(val, '/[^/]*[^a-zA-Z0-9/][^/]*', '')
I don't have Hive on-hand, but this works in Oracle.
This regular expression is looking for characters following a forward slash. These characters are not forward slashes and at least one must be non-alphanumeric. Because regular expressions are greedy by default, this will match the characters to the next forward slash or to the end of the string.