I have a query that counts all of the words in one column and gives me the frequency and frequency rank of the word as the result. For some reason, I keep getting a row that has no word in it. How do I get rid of it?
Table:
CREATE TABLE dummy (
created_at TIMESTAMPTZ,
tweet TEXT);
Insert:
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo squared');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo foo');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo foo');
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo bar');
Query:
select *
from (
select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
from dummy d
cross join lateral regexp_split_to_table(
regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g'),
'\s+'
) w(word)
group by created_day, word
) d
where created_day = CURRENT_DATE and word IS NOT NULL
order by rn
LIMIT 10;
Returns:
created_day | word | cnt | rn
------------------------+---------+-----+----
2020-12-18 00:00:00+00 | foo | 4 | 1
2020-12-18 00:00:00+00 | | 2 | 2
2020-12-18 00:00:00+00 | arm | 1 | 3
2020-12-18 00:00:00+00 | squared | 1 | 3
I'd like to get rid of the blank word:
created_day | word | cnt | rn
------------------------+---------+-----+----
2020-12-18 00:00:00+00 | foo | 4 | 1
2020-12-18 00:00:00+00 | arm | 1 | 2
2020-12-18 00:00:00+00 | squared | 1 | 3
The problem is with the inner regexp_replace()
; when the matching part is at the end of the string, you end up with a trailing space at the end of the strings. Basically, when applied to 'foo bar'
, it generates 'foo '
.
Then when parsed, this generates a final word whose value is the empty string (''
).
A simple workaround is to trim()
the output of regexp_replace()
, so basically replace:
cross join lateral regexp_split_to_table(
regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g'),
'\s+'
) w(word)
With:
cross join lateral regexp_split_to_table(
trim(regexp_replace(tweet, '\y(rt|co|https|bar|none)\y', '', 'g')),
'\s+'
) w(word)