Dear stackoverflow community,
This is my first time asking a question here. Hope you could cut me some slack. Here is the description of a problem:
ogr2ogr -f CSV output.csv 'some KML file'.kml
data = pd.read_csv('output.csv')
In[1]: data.head(8)
Out[1]:
description ID
0 <div class="googft-info-window">\n<b>ID:</b> 1... 1
1 <div class="googft-info-window">\n<b>ID:</b> 1... 10
2 <div class="googft-info-window">\n<b>ID:</b> 1... 100
3 <div class="googft-info-window">\n<b>ID:</b> 1... 1000
4 <div class="googft-info-window">\n<b>ID:</b> 1... 10000
5 <div class="googft-info-window">\n<b>ID:</b> 1... 10001
6 <div class="googft-info-window">\n<b>ID:</b> 1... 10002
7 <div class="googft-info-window">\n<b>ID:</b> 1... 10003
<div class=""googft-info-window"">
<b>ID:</b> 1<br>
<b>class:</b> 1<br>
<b>fold:</b> 5
</div>
My question is how can I extract clean information from the 'description' column and add it to the same row using all the <b>...<\b>
as a column names.
Thank you. P.S. Please let me know how I can improve my questions in the future.
You could use str.extractall
with...
df[['ID1', 'class', 'fold']] = df['description'].str.extractall(r'</b>\s?(\d+)<').unstack()
Or str.findall
with something like this...
df[['ID1', 'class', 'fold']] = df['description'].str.findall(r'</b>\s?(\w+)<', expand=True)
These are based on a string like...
<div class=""googft-info-window""><b>ID:</b> 1<br><b>class:</b> 1<br><b>fold:</b> 5</div>
With the regex matching the three values you need from the string being extracted and passed to the three new columns on the left of the =
sign.
Regex: https://regex101.com/r/wVl2cI/1
This is assuming you only have three values to find and the HTML is all the same.
Outputs (example):
description ID1 class fold
0 <div class=""googft-.. 1 1 1
1 <div class=""googft-.. 1 1 1
2 <div class=""googft-.. 5 5 5