
Python - BeautifulSoup changes attribute positioning


Hi, I am trying to parse HTML. Here are a few lines of it:

<link rel="stylesheet" href="assets/css/fontawesome-min.css">
<link rel="stylesheet" href="assets/css/bootstrap.min.css">
<link rel="stylesheet" href="assets/css/xsIcon.css">

When I load this into BeautifulSoup and print it, the attributes are reordered alphabetically, as shown below:

<link href="assets/css/fontawesome-min.css" rel="stylesheet"/>
<link href="assets/css/bootstrap.min.css" rel="stylesheet"/>
<link href="assets/css/xsIcon.css" rel="stylesheet"/>

Notice the difference: originally rel came before href, but after simply loading the file and writing it back out, the order of the attributes has changed.

Is there any way to prevent this from happening? Thanks.


Solution

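  • According to the documentation, you can use a custom HTMLFormatter. The default formatter sorts each tag's attributes alphabetically on output; overriding its attributes() method to yield the items of tag.attrs as they are keeps the order in which they were parsed: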

    from bs4 import BeautifulSoup
    from bs4.formatter import HTMLFormatter
    
    
    txt = '''<link rel="stylesheet" href="assets/css/fontawesome-min.css">
    <link rel="stylesheet" href="assets/css/bootstrap.min.css">
    <link rel="stylesheet" href="assets/css/xsIcon.css">'''
    
    class UnsortedAttributes(HTMLFormatter):
        def attributes(self, tag):
            for k, v in tag.attrs.items():
                yield k, v
    
    soup = BeautifulSoup(txt, 'html.parser')
    
    # Default output: attributes are sorted alphabetically
    print(soup)
    
    print('-' * 80)
    
    # Output with the custom formatter: original attribute order is kept
    print(soup.encode(formatter=UnsortedAttributes()).decode('utf-8'))
    

    Prints:

    <link href="assets/css/fontawesome-min.css" rel="stylesheet"/>
    <link href="assets/css/bootstrap.min.css" rel="stylesheet"/>
    <link href="assets/css/xsIcon.css" rel="stylesheet"/>
    --------------------------------------------------------------------------------
    <link rel="stylesheet" href="assets/css/fontawesome-min.css"/>
    <link rel="stylesheet" href="assets/css/bootstrap.min.css"/>
    <link rel="stylesheet" href="assets/css/xsIcon.css"/>
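
Since the question is about loading a document and writing it back out, the same formatter can be wrapped in a small helper that returns the serialized string directly. This is only a sketch: the function name render_preserving_order is made up for illustration, and it assumes a Beautiful Soup version that ships bs4.formatter.HTMLFormatter and a Python version where dicts keep insertion order (3.7+).

```python
from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter


class UnsortedAttributes(HTMLFormatter):
    """Formatter that emits attributes in the order they were parsed."""

    def attributes(self, tag):
        # tag.attrs is a dict, so iteration follows insertion (parse) order.
        yield from tag.attrs.items()


def render_preserving_order(html: str) -> str:
    # Hypothetical helper name, used here for illustration only.
    soup = BeautifulSoup(html, "html.parser")
    # decode() returns a str, so no encode/decode round trip is needed.
    return soup.decode(formatter=UnsortedAttributes())


html = '<link rel="stylesheet" href="assets/css/xsIcon.css">'
result = render_preserving_order(html)
print(result)
```

The returned string can then be written to a file with an ordinary open(..., "w") call. soup.prettify() also accepts a formatter argument if indented output is wanted.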