ruby-on-rails ruby encoding utf-8 latin1

Ruby on Rails - Stripping strange characters out of body in rake task

On my company's Rails website, we have a Twitter area that displays tweets from our social media team. A rake task uses the Twitter gem to import any new tweets into the database on a regular basis, and the site displays them from there. URLs in the tweets are converted to HTML links using the auto_link helper.

This has always worked fine, until now. All of a sudden, the links are broken, and the word right before the URL is wrongly linked as well. So in an example tweet that should look like this: "Please be safe St. Louis. Heat warning extended through August http://bit.ly/...", the word August is linked and the URL that follows is broken, as if something between the last word and the link were breaking it...

I investigated the helpers, looked at the tweet's text field in the database to see if there was anything strange, and even used the Rails console to pull up the tweets manually, but everything looked okay. It wasn't until I went all the way into the tweet body's hex dump that I saw...

Please be safe S
t. Louis. Heat w
arning extended 
through¬†August.
¬†http://bit.ly/
r5fXlz #heatpoca
lypse

So the culprit was the ¬† showing up where the space should be. When I deleted the offending space and re-added it manually in the database, the issue cleared up.
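For anyone hitting the same thing: the pair ¬† is how the bytes C2 A0 render in a Mac Roman view, and C2 A0 is the UTF-8 encoding of U+00A0 (non-breaking space). Assuming that is what the hex viewer was showing, a small check like the one below can flag such characters without opening a hex editor. This is only a sketch for Ruby 1.9 or later; the helper name non_ascii_chars and the sample string are made up for illustration.

```ruby
# Hypothetical helper: list each non-ASCII character in a string
# together with its character index and Unicode code point.
def non_ascii_chars(text)
  found = []
  text.each_char.with_index do |char, index|
    found << [index, format("U+%04X", char.ord)] unless char.ascii_only?
  end
  found
end

# Assumed sample: a non-breaking space between the two words.
sample = "through\u00A0August"

p non_ascii_chars(sample) # => [[7, "U+00A0"]]
p "\u00A0".bytes.to_a     # => [194, 160], i.e. 0xC2 0xA0
```

A non-breaking space prints exactly like an ordinary space, which would explain why the body looked fine in the Rails console.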

The only problem is, I don't understand why the tweet body is being imported like that, especially when it looks fine via the Rails console. As this is an older database, I noticed it was still using latin1 encoding in some areas with utf8 in others, and I was certain that converting all of that to UTF-8 would fix it, but it did not.

I even tried running a sanitization helper on the body before it was imported, but that didn't work either.

I also tried a Ruby gsub to strip the ¬† out, but it didn't work.
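The exact gsub that was tried isn't shown, so the following is a guess at why it might have missed. If the stray character is a non-breaking space (U+00A0, an assumption based on the hex dump), then a pattern built on \s will not catch it, because \s in Ruby only covers ASCII whitespace. Matching the code point directly, or using [[:space:]], does. A sketch for Ruby 1.9 or later, with replace_nbsp as a hypothetical helper name:

```ruby
nbsp = "\u00A0" # non-breaking space, assumed to be the culprit

# \s only covers ASCII whitespace, so it does not match the non-breaking space.
matched_by_s = !(nbsp =~ /\s/).nil?

# The POSIX bracket class is Unicode-aware on UTF-8 strings and does match it.
matched_by_space_class = !(nbsp =~ /[[:space:]]/).nil?

# Hypothetical helper: swap every non-breaking space for an ordinary space.
def replace_nbsp(text)
  text.gsub("\u00A0", " ")
end

sample = "through\u00A0August\u00A0http://bit.ly/..."
cleaned = replace_nbsp(sample)

p matched_by_s           # => false
p matched_by_space_class # => true
p cleaned                # => "through August http://bit.ly/..."
```

Trying to gsub the literal two characters ¬† would also miss, since those only exist in the hex viewer's rendering and not in the string itself.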

Does anyone have any insight on how to solve this odd problem?

Solution

I was finally able to solve this by running the following specifically on the body string in the rake task...

require 'iconv'
Iconv.conv('ASCII//TRANSLIT', 'UTF8', tweet.body)

Odd, but it works. More information on using the above can be found here: ruby (1.8.7): How to get rid of non-printable chars while scraping?
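One caveat for newer setups: Iconv was deprecated in Ruby 1.9.3 and dropped from the standard library in 2.0 (it lives on as a separate gem). A rough stand-in using String#encode is sketched below. It is not a true equivalent, since it does no transliteration: every character without an ASCII counterpart simply becomes a space. The helper name to_ascii_with_spaces and the sample text are assumptions for illustration.

```ruby
# Hypothetical helper: force text down to ASCII without Iconv.
# Any character that cannot be represented in ASCII becomes a plain space.
def to_ascii_with_spaces(text)
  text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: " ")
end

# Assumed sample with a non-breaking space before "August".
ascii_body = to_ascii_with_spaces("Heat warning extended through\u00A0August")

p ascii_body # => "Heat warning extended through August"
```

If the tweets can contain accented letters or other non-ASCII text worth keeping, replacing only the non-breaking space is the safer choice.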