Tags: ruby-on-rails, ruby, ruby-on-rails-3, ruby-on-rails-3.2, rexml

Illegal character '&' in raw string REXML parsing


Hi, I am trying to parse an XML file using REXML. When the file contains an illegal character, parsing simply fails at that point.

Is there any way to replace or remove these kinds of characters?

The following element fails to parse with the error: Illegal character '&' in raw string

<head> Negative test for underlying BJSPRICEENG N4&N5
</head>


doc = REXML::Document.new(File.open(file_name,"r:iso-8859-1:utf-8"))

testfile.elements["head"].text





doc = REXML::Document.new(content)
dir_path = doc.elements["TestBed/TestDir"].attributes["path"].to_s
doc.elements.each("TestBed/TestDir") do |directory|
  directory.elements.each("file") do |testfile|
    # Text of the <head> element, which is where the illegal character appears
    t = testfile.elements["head"].text
  end
end




<file name="toptstocksensbybjs.m">
  <MCheck></MCheck>
  <TestExtension></TestExtension>
  <TestType></TestType>
  <fcn name="lvlTwoDocExample" linenumber="20">
    <head> P1><&
    </head>
  </fcn>
</file>

Solution

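  • In your case, the illegal & characters can be escaped (rather than removed) before parsing. Replace every & that does not already start an entity or character reference with &amp;:

    content = File.read(file_name, mode: "r:iso-8859-1:utf-8")
    # Leave &amp; &lt; &gt; &quot; &apos; and numeric references (&#38;, &#x26;) untouched
    content.gsub!(/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x\h+);)/, '&amp;')
    doc = REXML::Document.new(content)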
    

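    However, other illegal characters are much harder to fix this way, especially a stray <, which a regular expression cannot reliably distinguish from the start of a real tag. (>, ' and " are legal in element text; they only need escaping in certain positions, such as quotes inside attribute values.)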
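
As a rough illustration of the approach, here is a minimal sketch that escapes stray & characters and, heuristically, stray < characters before handing the text to REXML. The helper name `escape_stray_markup` and both constants are made up for this example and are not part of REXML. The `<` rule is only a guess based on the next character, so text such as `a<b` would still be misread as the start of a tag, and any custom entities declared in a DTD would be escaped as well.

```ruby
require "rexml/document"

# An "&" that does not start a predefined entity (&amp; etc.)
# or a numeric character reference (&#38; / &#x26;).
STRAY_AMPERSAND = /&(?!(?:amp|lt|gt|quot|apos|#\d+|#x\h+);)/

# Heuristic only: a "<" that is not followed by a character that could
# begin a tag name, a closing tag (/), a comment/CDATA/DOCTYPE (!),
# or a processing instruction (?).
STRAY_LESS_THAN = /<(?![A-Za-z_:\/!?])/

# Hypothetical helper (not a REXML API): escape stray "&" first, then stray "<".
def escape_stray_markup(xml)
  xml.gsub(STRAY_AMPERSAND, "&amp;").gsub(STRAY_LESS_THAN, "&lt;")
end

# The two <head> elements from the question.
first  = "<head> Negative test for underlying BJSPRICEENG N4&N5\n</head>"
second = "<head> P1><&\n</head>"

[first, second].each do |raw|
  doc = REXML::Document.new(escape_stray_markup(raw))
  # Element#text returns the unescaped text, so the original "&" and "<" come back.
  puts doc.elements["head"].text.inspect
end
```

Because the escaping happens on the raw string, the original characters are preserved in the parsed text instead of being dropped. If the source of the XML is under your control, fixing the generator so it emits valid XML in the first place is the more reliable option.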