python regex line-breaks

Python Regex sub doesn't replace line breaks


I have a robot that returns HTML like this:

<div class="std">
  <p>CAR:
    <span>Onix</span>
  </p>
  <p>MODEL: LTZ</p>
  <p>
    <span>COLOR:
    <span>Black</span>
  </p>
  <p>ACESSORIES:
    <span>ABS</span>
  </p>
  <p>
    <span>DESCRIPTION:</span>
    <span>The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.</span>
  </p>
  <p>TECHNICAL DETAIL:
    <span>The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..</span>
  </p>
</div>

I applied the code below to remove the HTML tags:

import re
cleanr    = re.compile(r'<.*?>')
cleantext = re.sub(cleanr, '\n', html_code).strip()

It returns:

CAR: Onix


MODEL: LTZ


COLOR:
Black



ACESSORIES:
ABS



DESCRIPTION:


The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.



TECHNICAL DETAIL:
The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..

Now I need to remove the line breaks to have something like this:

CAR: Onix
MODEL: LTZ
COLOR: Black
ACESSORIES: ABS
DESCRIPTION: The Chevrolet Onix is a subcompact car launched by American automaker Chevrolet in Brazil at the 2012 São Paulo International Motor Show[1] to succeed some versions of Chevrolet Celta. Offered initially as a five-door hatchback, a four-door sedan was launched in 2013 and called the Chevrolet Prisma.[2] The Onix is currently only sold in some South American countries part of Mercosur, including Brazil, Argentina, Colombia, Paraguay and Uruguay.
TECHNICAL DETAIL: The Onix is available in three trim levels (LS, LT and LTZ) with two 4-cylinder engines, the 1.0-litre producing 78 PS (57 kW; 77 bhp) (petrol)/ 80 PS (59 kW; 79 bhp) (ethanol) and 1.4-litre 98 PS (72 kW; 97 bhp) (petrol)/106 PS (78 kW; 105 bhp) (ethanol) offering automatic or five-speed manual transmission..

I tried the code below, but it doesn't match the line breaks correctly:

cleantext = re.sub(r':\s*[\r\n]*', ': ', cleantext)

I also tried this:

cleantext = cleantext.replace(': \n', ': ')

That doesn't work either. How can I do this?


Solution

  • I think there are two parts to your problem. The first is to join a label and its value when they are split across two lines, turning
    COLOR:
    Black

    into
    COLOR: Black

    The second is to remove all the empty lines.

    For the first part, you could replace your re.sub with the following (note the raw string for the replacement, so that \g is not treated as a string escape):
    cleantext = re.sub(r'(.*):\s*[\r\n](.*)', r'\g<1>: \g<2>', cleantext)

    Removing the empty lines is trickier to do with re.sub, so I would suggest filtering them out instead:
    cleantext = "\n".join([line for line in cleantext.split('\n') if line.strip() != ''])

    This should give you the expected output.
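
Putting the two steps together, here is a minimal sketch of how they might be combined. The function name tidy_fields and the shortened sample string are made up for illustration, and the sample assumes the text has no leading indentation, as in the output shown in the question.

```python
import re

def tidy_fields(text):
    """Join each 'LABEL:' line with the value after it, then drop blank lines."""
    # Step 1: a colon followed only by whitespace up to a line break is
    # merged with the next line that has content.
    joined = re.sub(r'(.*):\s*[\r\n](.*)', r'\g<1>: \g<2>', text)
    # Step 2: keep only the lines that contain something visible.
    return "\n".join(line for line in joined.split("\n") if line.strip())

# Hypothetical, shortened version of the text from the question.
sample = (
    "CAR: Onix\n\n\n"
    "MODEL: LTZ\n\n\n"
    "COLOR:\nBlack\n\n\n\n"
    "DESCRIPTION:\n\n\nA subcompact car."
)

print(tidy_fields(sample))
```

Lines such as "CAR: Onix" are left alone in step 1 because the colon is followed by text on the same line, so the pattern never reaches a line break. If the real input keeps the indentation from the HTML, the joined values may carry extra spaces, so stripping each line before step 1 could be worth trying.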