Search code examples
htmlpandaskml

Extract HTML information from df variable


Dear stackoverflow community,

This is my first time asking a question here. Hope you could cut me some slack. Here is the description of a problem:

  1. I convert KML file to CSV using ogr2org
    ogr2ogr -f CSV output.csv 'some KML file'.kml
  2. I then read in the csv file in pandas
    data = pd.read_csv('output.csv')
  1. After deleting some bizarre columns I end up having the following:
In[1]: data.head(8)
Out[1]: 
                                          description   ID
0   <div class="googft-info-window">\n<b>ID:</b> 1...   1
1   <div class="googft-info-window">\n<b>ID:</b> 1...   10
2   <div class="googft-info-window">\n<b>ID:</b> 1...   100
3   <div class="googft-info-window">\n<b>ID:</b> 1...   1000
4   <div class="googft-info-window">\n<b>ID:</b> 1...   10000
5   <div class="googft-info-window">\n<b>ID:</b> 1...   10001
6   <div class="googft-info-window">\n<b>ID:</b> 1...   10002
7   <div class="googft-info-window">\n<b>ID:</b> 1...   10003
  1. The representative string under description variable is as follows:
    <div class=""googft-info-window"">
    <b>ID:</b> 1<br>
    <b>class:</b> 1<br>
    <b>fold:</b> 5
    </div>

My question is how can I extract clean information from the 'description' column and add it to the same row using all the <b>...<\b> as a column names.

Thank you. P.S. Please let me know how I can improve my questions in the future.


Solution

  • You could use str.extractall with...

    df[['ID1', 'class', 'fold']] = df['description'].str.extractall(r'</b>\s?(\d+)<').unstack()
    

    Or str.findall with something like this...

    df[['ID1', 'class', 'fold']] = df['description'].str.findall(r'</b>\s?(\w+)<', expand=True)
    

    These are based on a string like...

    <div class=""googft-info-window""><b>ID:</b> 1<br><b>class:</b> 1<br><b>fold:</b> 5</div>
    

    With the regex matching the three values you need from the string being extracted and passed to the three new columns on the left of the = sign.

    Regex: https://regex101.com/r/wVl2cI/1

    This is assuming you only have three values to find and the HTML is all the same.

    Outputs (example):

        description             ID1    class    fold
    0   <div class=""googft-..  1      1        1
    1   <div class=""googft-..  1      1        1
    2   <div class=""googft-..  5      5        5