ruby regex rubygems rest-client

Replace all occurrences except the first in Ruby. The regular expression spans multiple lines


I am trying to download my last 3200 tweets in groups of 200 (across multiple pages) using the rest-client gem.

In the process, I end up adding the following lines multiple times to my file:

</statuses>
<?xml version="1.0" encoding="UTF-8"?>
<statuses type="array">

To fix this (the XML parsing fails otherwise), I want to replace all occurrences of the above string except the first after downloading the file. I am trying the following:

    tweets_page = RestClient.get("#{GET_STATUSES_URL}&page=#{page_number}")
    message = <<-MSG
    </statuses>
    <?xml version="1.0" encoding="UTF-8"?>
    <statuses type="array">
    MSG
    unless page_number == 1
      tweets_page.gsub!(message, "")
    end

What is wrong in the above? Is there a better way to do the same?


Solution

  • I believe it would be faster to download everything at once, split the response body on message, and put message back only after the first chunk. Something like this; I can't try it out, so treat it as an idea only.

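    # message is the three-line separator string defined in the question
    tweets_page = RestClient.get("#{GET_STATUSES_URL}").body
    tweets = tweets_page.split(message)
    # tweets[1..-1] is an Array, so join it before concatenating with Strings
    tweets_page = tweets[0] + message + tweets[1..-1].join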
    

    You could also easily break them up into groups of 200 that way.
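The split idea can be tried without any network call. The following is a minimal sketch on made-up sample data; the helper name keep_first_separator and the sample strings are assumptions for illustration, not part of the original answer.

```ruby
# Hypothetical helper (not from the original answer): keeps only the first
# occurrence of `separator` in `text` by splitting and re-joining.
def keep_first_separator(text, separator)
  # A limit of -1 keeps trailing empty strings, so a separator at the very
  # end of the text is not silently dropped.
  first, *rest = text.split(separator, -1)
  return text if rest.empty? # separator absent (or empty text): nothing to do

  first + separator + rest.join
end

# Made-up sample standing in for the concatenated download.
separator = "</statuses>\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<statuses type=\"array\">\n"
sample = "first\n#{separator}second\n#{separator}rest\n"

cleaned = keep_first_separator(sample, separator)
puts cleaned
```

Returning the text unchanged when the separator is absent avoids appending a stray separator to a single-page download.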

    If you want to do it with a gsub on the whole text, you could use the following:

    tweets_page = <<-MSG
    first
    </statuses>
    <?xml version="1.0" encoding="UTF-8"?>
    <statuses type="array">
    second
    </statuses>
    <?xml version="1.0" encoding="UTF-8"?>
    <statuses type="array">
    rest
    MSG
    
    message = <<-MSG
    </statuses>
    <?xml version="1.0" encoding="UTF-8"?>
    <statuses type="array">
    MSG
    
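    # A local flag replaces the instance variable, which would otherwise
    # stay set and strip every occurrence on a second run.
    seen_first = false
    new_str = tweets_page.gsub(message) do |match|
      if seen_first
        ""              # drop every later occurrence
      else
        seen_first = true
        match           # keep the first occurrence as is
      end
    end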
    
    p new_str
    

    This prints:

    "first\n</statuses>\n<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<statuses type=\"array\">\nsecond\nrest\n"