I need help creating database tables from html tables. Right now I'm just manually doing it. I have a html doc that has all the data but I don't know what the best way to extract the data.
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>ATOMS Definition for Type tom.service.soc.SocRecord</title>
<style type="text/css">
body
{
line-height: 1.6em;
font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
font-size: 14px;
margin: 45px;
}
#box-table-a
{
font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
font-size: 12px;
margin: 5%;
width: 90%;
text-align: left;
border-collapse: collapse;
}
#box-table-a th
{
font-size: 13px;
font-weight: normal;
padding: 8px;
background: #b9c9fe;
border-top: 4px solid #aabcfe;
border-bottom: 1px solid #fff;
color: #039;
}
#box-table-a td
{
padding: 8px;
background: #e8edff;
border-bottom: 1px solid #fff;
color: #669;
border-top: 1px solid transparent;
}
#box-table-a tr:hover td
{
background: #d0dafd;
color: #339;
}
</style>
</head>
<body>
<table id="box-table-a" summary="Definition for tom.service.soc.SocRecord">
<thead>
<tr><th colspan="2">tom.service.soc.SocRecord</th></tr>
</thead>
<tbody>
<tr>
<td>Version</td>
<td>1</td>
</tr>
<tr>
<td>Description</td>
<td>[type is UNCLASSIFIED] Temporary dummy test object for SOC</td>
</tr>
</tbody>
</table>
<table id="box-table-a" summary="Fields Definition for Type tom.service.soc.SocRecord">
<thead>
<tr>
<th scope="col">Index</th>
<th scope="col">Name</th>
<th scope="col">Type</th>
<th scope="col">Range</th>
<th scope="col">Default</th>
<th scope="col" width="50%">Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>socID</td>
<td>String</td>
<td>
-
</td>
<td>""</td>
<td>
[ ] The UUID of the tracked object -- String for transmission purposes
</td>
</tr>
<tr>
<td>2</td>
<td>satID</td>
<td><a href="../../../../../tom/state/vcm/SatNumberType.html">SatNumberType</a></td>
<td>
</td>
<td></td>
<td>
[ ] The ID of the tracked object -- copy of the satelliteId in the VCM
</td>
</tr>
</tbody>
</table>
</body></html>
Here is the html. I would like help making a create postgres database table script like this from the html. And if it has a href then that links to another table.
CREATE TABLE soc.SocRecord(
socId TEXT, --[ ] The UUID of the tracked object -- String for transmission purposes
satId UUID, --[ ] The ID of the tracked object -- copy of the satId in the VCM
commonName TEXT, --[ ] The name of the tracked object -- may be blank -
--This field is optional in the current version of the message, check the set attribute before use.);
Edit
Tried out some more variants and found that using zip
resulted in no errors and returned the sql script.
for header, value in zip(headers, values):
I've updated the code bellow as well.
So your approach could constitute something like:
read .html
file
parse for table
parse for theader
and so on
I find idea pretty interesting, so I tried it out in python.
from bs4 import BeautifulSoup
# Specify the path to your HTML file
html_file_path = 'path/to/your/file.html'
here you could write another script to read all .html
files in a folder and execute against that instead.
# Read the contents of the HTML file
with open(html_file_path, 'r') as file:
html = file.read()
# Find all the tables in the HTML
tables = soup.find_all('table')
# Iterate over the tables
for table in tables:
# Find the table's ID attribute
table_id = table.get('id')
# Extract the table headers
headers = [th.get_text() for th in table.find('thead').find_all('th')]
# Create a dictionary to store the table data
table_data = {}
# Iterate over the table rows
for row in table.find('tbody').find_all('tr'):
# Extract the row cells
cells = row.find_all('td')
# Extract the cell values
values = [cell.get_text().strip() for cell in cells]
# Store the values with their corresponding headers in the dictionary
for header, value in zip(headers, values):
if header not in table_data:
table_data[header] = []
table_data[header].append(value)
# Generate the PostgreSQL table script
create_table_script = f"CREATE TABLE {table_id} (\n"
for header, values in table_data.items():
# Handle column names with spaces or special characters
column_name = header.lower().replace(' ', '_').replace('.', '_')
# Combine the column values into a comma-separated string
column_values = ', '.join([f"'{value}'" if isinstance(value, str) else str(value) for value in values])
# Append the column definition to the script
create_table_script += f" {column_name} {column_values},\n"
create_table_script = create_table_script.rstrip(',\n') + "\n);\n"
# Print the table script
print(create_table_script)
The result of the above is probably not going to be exactly what you want, but it gets you started.