Search code examples
htmlpostgresqlhtml-tabledatabase-table

Create database tables from html tables


I need help creating database tables from html tables. Right now I'm just manually doing it. I have a html doc that has all the data but I don't know what the best way to extract the data.

<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>ATOMS Definition for Type tom.service.soc.SocRecord</title>
<style type="text/css">
body
{
    line-height: 1.6em;
    font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
    font-size: 14px;
    margin: 45px;
}
#box-table-a
{
    font-family: "Lucida Sans Unicode", "Lucida Grande", Sans-Serif;
    font-size: 12px;
    margin: 5%;
    width: 90%;
    text-align: left;
    border-collapse: collapse;
}
#box-table-a th
{
    font-size: 13px;
    font-weight: normal;
    padding: 8px;
    background: #b9c9fe;
    border-top: 4px solid #aabcfe;
    border-bottom: 1px solid #fff;
    color: #039;
}
#box-table-a td
{
    padding: 8px;
    background: #e8edff;
    border-bottom: 1px solid #fff;
    color: #669;
    border-top: 1px solid transparent;
}
#box-table-a tr:hover td
{
    background: #d0dafd;
    color: #339;
}
</style>

</head>
<body>
<table id="box-table-a" summary="Definition for tom.service.soc.SocRecord">
    <thead>
        <tr><th colspan="2">tom.service.soc.SocRecord</th></tr>
    </thead>
    <tbody>
        <tr>
            <td>Version</td>
            <td>1</td>
        </tr>
        <tr>
            <td>Description</td>
            <td>[type is UNCLASSIFIED] Temporary dummy test object for SOC</td>
        </tr>
    </tbody>
</table>
<table id="box-table-a" summary="Fields Definition for Type tom.service.soc.SocRecord">
    <thead>
        <tr>
            <th scope="col">Index</th>
            <th scope="col">Name</th>
            <th scope="col">Type</th>
            <th scope="col">Range</th>
            <th scope="col">Default</th>
            <th scope="col" width="50%">Description</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>1</td>
            <td>socID</td>
            <td>String</td>
            <td>           
-
            </td>
            <td>""</td>
            <td>
                [ ] The UUID of the tracked object -- String for transmission purposes  
            </td>
        </tr>
        <tr>
            <td>2</td>
            <td>satID</td>
            <td><a href="../../../../../tom/state/vcm/SatNumberType.html">SatNumberType</a></td>
            <td>
            </td>
            <td></td>
            <td>
                [ ] The ID of the tracked object -- copy of the satelliteId in the VCM  
            </td>
        </tr>
    </tbody>
</table>

</body></html>

Here is the html. I would like help making a create postgres database table script like this from the html. And if it has a href then that links to another table.

CREATE TABLE soc.SocRecord( 
    socId TEXT, --[ ] The UUID of the tracked object -- String for transmission purposes
    satId UUID, --[ ] The ID of the tracked object -- copy of the satId in the VCM
    commonName TEXT, --[ ] The name of the tracked object -- may be blank - 
                     --This field is optional in the current version of the message, check the set attribute before use.);

Solution

  • Edit Tried out some more variants and found that using zip resulted in no errors and returned the sql script.

    for header, value in zip(headers, values):

    I've updated the code bellow as well.


    So your approach could constitute something like:

    read .html file parse for table parse for theader and so on

    I find idea pretty interesting, so I tried it out in python.

    from bs4 import BeautifulSoup
    
    # Specify the path to your HTML file
    html_file_path = 'path/to/your/file.html'
    

    here you could write another script to read all .html files in a folder and execute against that instead.

    # Read the contents of the HTML file
    with open(html_file_path, 'r') as file:
        html = file.read()
    
    # Find all the tables in the HTML
    tables = soup.find_all('table')
    
    # Iterate over the tables
    for table in tables:
        # Find the table's ID attribute
        table_id = table.get('id')
    
        # Extract the table headers
        headers = [th.get_text() for th in table.find('thead').find_all('th')]
    
        # Create a dictionary to store the table data
        table_data = {}
    
        # Iterate over the table rows
        for row in table.find('tbody').find_all('tr'):
            # Extract the row cells
            cells = row.find_all('td')
    
            # Extract the cell values
            values = [cell.get_text().strip() for cell in cells]
    
            # Store the values with their corresponding headers in the dictionary
            for header, value in zip(headers, values):
                if header not in table_data:
                    table_data[header] = []
                table_data[header].append(value)
    
        # Generate the PostgreSQL table script
        create_table_script = f"CREATE TABLE {table_id} (\n"
        for header, values in table_data.items():
            # Handle column names with spaces or special characters
            column_name = header.lower().replace(' ', '_').replace('.', '_')
    
            # Combine the column values into a comma-separated string
            column_values = ', '.join([f"'{value}'" if isinstance(value, str) else str(value) for value in values])
    
            # Append the column definition to the script
            create_table_script += f"    {column_name} {column_values},\n"
    
        create_table_script = create_table_script.rstrip(',\n') + "\n);\n"
    
        # Print the table script
        print(create_table_script)
    
    

    The result of the above is probably not going to be exactly what you want, but it gets you started.