Search code examples
sqlsql-servert-sqletldata-cleaning

Transfer and cleansing data from a html column in SQL Server to other relational table


I have a table with HTML data Like below :

SELECT 
    [ID],
    [title],
    [authors_html],
    [authors_text]
FROM 
    [wiley].[dbo].[library]

Sample of authors_html column data :

<div class="accordion-tabbed__tab-mobile accordion__closed">
   <a href="http://185.141.105.238/action/doSearch?ContribAuthorStored=PRICE%2C+M+A" class="author-name accordion-tabbed__control" data-id="a1" data-db-target-for="a1" aria-controls="a1" aria-haspopup="true" id="a1_Ctrl" role="button"><span>M. A. PRICE</span><i aria-hidden="true" class="icon-section_arrow_d"></i></a>
   <div class="author-info accordion-tabbed__content" data-db-target-of="a1" aria-labelledby="a1_Ctrl" role="region" id="a1">
      <p>Department of Mechanical and Manufacturing Engineering, The Queen's University of Belfast, Belfast BT95AH, U.K.</p>
      <a class="moreInfoLink" href="http://185.141.105.238/action/doSearch?ContribAuthorStored=PRICE%2C+M+A">Search for more papers by this author</a>
   </div>
</div>
<div class="accordion-tabbed__tab-mobile accordion__closed">
   <a href="http://185.141.105.238/action/doSearch?ContribAuthorStored=ARMSTRONG%2C+C+G" class="author-name accordion-tabbed__control" data-id="a2" data-db-target-for="a2" aria-controls="a2" aria-haspopup="true" id="a2_Ctrl" role="button"><span>C. G. ARMSTRONG</span><i aria-hidden="true" class="icon-section_arrow_d"></i></a>
   <div class="author-info accordion-tabbed__content" data-db-target-of="a2" aria-labelledby="a2_Ctrl" role="region" id="a2">
      <p>Department of Mechanical and Manufacturing Engineering, The Queen's University of Belfast, Belfast BT95AH, U.K.</p>
      <a class="moreInfoLink" href="http://185.141.105.238/action/doSearch?ContribAuthorStored=ARMSTRONG%2C+C+G">Search for more papers by this author</a>
   </div>
</div>

I need to transfer this column data to Researcher table :

ID Full_Name Email Tel URL Address Country Ostan University Madrak Field org Fax
1007 M. A. PRICE NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=PRICE%2C+M+A Department of Mechanical and Manufacturing Engineering, The Queen's University of Belfast, Belfast BT95AH, U.K. U.K. NULL The Queen's University of Belfast NULL NULL NULL
1008 C. G. ARMSTRONG NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=ARMSTRONG%2C+C+G Department of Mechanical and Manufacturing Engineering, The Queen's University of Belfast, Belfast BT95AH, U.K. U.K. NULL The Queen's University of Belfast NULL NULL NULL
1009 B. BOROOMAND NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=BOROOMAND%2C+B Institute for Numerical Methods in Engineering, University of Wales, Swansea, U.K. U.K. University of Wales NULL NULL NULL
1010 O. C. ZIENKIEWICZ NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=ZIENKIEWICZ%2C+O+C Institute for Numerical Methods in Engineering, University of Wales, Swansea, U.K. U.K. University of Wales NULL NULL NULL
1011 ZHAO‐PING JIAO NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=JIAO%2C+ZHAO-PING Department of Civil Engineering, South China Construction University, (West Campus), 510405 Guangzhou, China China NULL South China Construction University NULL NULL NULL
1012 THEODORE H. H. PIAN NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=PIAN%2C+THEODORE+H+H Department of Aeronautics and Astronautics, Massachussets Institute of Technology, Cambridge, MA, U.S.A. U.S.A. NULL NULL NULL
1013 SHENG YONG NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=YONG%2C+SHENG Department of Modern Mechanics, University of Science and Technology of China, Hei Fei, China China University of Science and Technology of China NULL NULL NULL
1014 MIN‐WEI HUANG NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=HUANG%2C+MIN-WEI Optimal Design Laboratory, College of Engineering, The University of Iowa, Iowa City, IA 52242, U.S.A. U.S.A. Iowa The University of Iowa NULL NULL NULL
1015 JASBIR S. ARORA NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=ARORA%2C+JASBIR+S Optimal Design Laboratory, College of Engineering, The University of Iowa, Iowa City, IA 52242, U.S.A. U.S.A. Iowa The University of Iowa NULL NULL NULL
1016 C. S. TSAI NULL http://185.141.105.238/action/doSearch?ContribAuthorStored=TSAI%2C+C+S Department of Civil Engineering, Feng Chia University, Taichung, Taiwan, Republic of China Republic of China Feng Chia University NULL NULL NULL

I tried to use xml potential for separate part of data( Note: Add a single-cot in below cod manually):

DECLARE @HtmlTbl TABLE (ID INT IDENTITY, Html XML)

INSERT INTO @HtmlTbl(Html) VALUES('<div class="accordion-tabbed__tab-mobile accordion__closed">
   <a href="http://185.141.105.238/action/doSearch?ContribAuthorStored=PRICE%2C+M+A" class="author-name accordion-tabbed__control" data-id="a1" data-db-target-for="a1" aria-controls="a1" aria-haspopup="true" id="a1_Ctrl" role="button"><span>M. A. PRICE</span><i aria-hidden="true" class="icon-section_arrow_d"></i></a>
   <div class="author-info accordion-tabbed__content" data-db-target-of="a1" aria-labelledby="a1_Ctrl" role="region" id="a1">
      <p>Department of Mechanical and Manufacturing Engineering, The Queen'+'s University of Belfast, Belfast BT95AH, U.K.</p>
      <a class="moreInfoLink" href="http://185.141.105.238/action/doSearch?ContribAuthorStored=PRICE%2C+M+A">Search for more papers by this author</a>
   </div>
</div>
<div class="accordion-tabbed__tab-mobile accordion__closed">
   <a href="http://185.141.105.238/action/doSearch?ContribAuthorStored=ARMSTRONG%2C+C+G" class="author-name accordion-tabbed__control" data-id="a2" data-db-target-for="a2" aria-controls="a2" aria-haspopup="true" id="a2_Ctrl" role="button"><span>C. G. ARMSTRONG</span><i aria-hidden="true" class="icon-section_arrow_d"></i></a>
   <div class="author-info accordion-tabbed__content" data-db-target-of="a2" aria-labelledby="a2_Ctrl" role="region" id="a2">
      <p>Department of Mechanical and Manufacturing Engineering, The Queen'+'s University of Belfast, Belfast BT95AH, U.K.</p>
      <a class="moreInfoLink" href="http://185.141.105.238/action/doSearch?ContribAuthorStored=ARMSTRONG%2C+C+G">Search for more papers by this author</a>
   </div>
</div>
COPY TO CLIPBOARD SELECT ALL')


--  SELECT
--    Html.query('//div')
--FROM @HtmlTbl 


SELECT
    C.value('(.)[1]', 'varchar(1000)')
FROM @HtmlTbl
CROSS APPLY Html.nodes('//div') AS T(C)

My destination table is Researcher and Company :

CREATE TABLE [dbo].[Researcher]
(
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [Full_Name] [nvarchar](50) NULL,
    [Email] [nvarchar](100) NULL,
    [Tel] [nvarchar](20) NULL,
    [URL] [nvarchar](max) NULL,
    [Address] [nvarchar](max) NULL,
    [Country] [nvarchar](100) NULL,
    [Ostan] [nvarchar](100) NULL,
    [University] [nvarchar](100) NULL,
    [Madrak] [nvarchar](100) NULL,
    [Field] [nvarchar](100) NULL,
    [org] [nvarchar](250) NULL,
    [Fax] [nvarchar](20) NULL
)

CREATE TABLE [dbo].[Company]
(
    [ID] [int] IDENTITY(1,1) NOT NULL,
    [Title] [nvarchar](255) NULL,
    [Type] [nvarchar](100) NULL,
    [Country] [nvarchar](100) NULL,
    [City] [nvarchar](100) NULL,
    [Address] [nvarchar](max) NULL,
    [Tel] [nvarchar](100) NULL,
    [Fax] [nvarchar](100) NULL,
    [PostCode] [nvarchar](20) NULL
)

I need transfer data from authors_html column to Company and Researcher and cleaning data.

If you need connect to sample database use this connection :

IP: 185.141.105.232
user: wiley
pass: wiley
DB: wiley

Solution

  • Here is your starting point.

    It is for SQL Server 2016 onwards.

    SQL

    DECLARE @HtmlTbl TABLE (ID INT IDENTITY, Html XML);
    INSERT INTO @HtmlTbl(Html) VALUES('<div class="accordion-tabbed__tab-mobile accordion__closed">
            <a href="http://185.141.105.238/action/doSearch?ContribAuthorStored=PRICE%2C+M+A"
               class="author-name accordion-tabbed__control" data-id="a1"
               data-db-target-for="a1" aria-controls="a1" aria-haspopup="true"
               id="a1_Ctrl" role="button">
                <span>M. A. PRICE</span>
                <i aria-hidden="true" class="icon-section_arrow_d"></i>
            </a>
            <div class="author-info accordion-tabbed__content"
                 data-db-target-of="a1" aria-labelledby="a1_Ctrl" role="region"
                 id="a1">
                <p>Department of Mechanical and Manufacturing Engineering, The Queen''s University of Belfast, Belfast BT95AH, U.K.</p>
                <a class="moreInfoLink"
                   href="http://185.141.105.238/action/doSearch?ContribAuthorStored=PRICE%2C+M+A">Search for more papers by this author</a>
            </div>
        </div>
        <div class="accordion-tabbed__tab-mobile accordion__closed">
            <a href="http://185.141.105.238/action/doSearch?ContribAuthorStored=ARMSTRONG%2C+C+G"
               class="author-name accordion-tabbed__control" data-id="a2"
               data-db-target-for="a2" aria-controls="a2" aria-haspopup="true"
               id="a2_Ctrl" role="button">
                <span>C. G. ARMSTRONG</span>
                <i aria-hidden="true" class="icon-section_arrow_d"></i>
            </a>
            <div class="author-info accordion-tabbed__content"
                 data-db-target-of="a2" aria-labelledby="a2_Ctrl" role="region"
                 id="a2">
                <p>Department of Mechanical and Manufacturing Engineering, The Queen''s University of Belfast, Belfast BT95AH, U.K.</p>
                <a class="moreInfoLink"
                   href="http://185.141.105.238/action/doSearch?ContribAuthorStored=ARMSTRONG%2C+C+G">Search for more papers by this author</a>
            </div>
        </div>');
    
    -- INSERT INTO dbo.Researcher (Full_Name, [URL], [Address], University, Country) -- uncommemnt when you are ready
    SELECT ID
        , c.value('(a/span/text())[1]', 'nvarchar(50)') AS Full_Name
        , c.value('(div/a/@href)[1]', 'nvarchar(max)') AS [URL]
        , c.value('(div/p/text())[1]', 'nvarchar(max)') AS [Address]
        , JSON_VALUE(x,'$[1]') AS University
        , JSON_VALUE(x,'$[3]') AS Country
        -- continue with the rest
    FROM @HtmlTbl
    CROSS APPLY Html.nodes('/div') AS t(c)
    CROSS APPLY (VALUES ('["' + REPLACE(c.value('(div/p/text())[1]', 'nvarchar(max)'),',','","') + '"]')) AS t2(x);
    

    Output

    +----+-----------------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+------------------------------------+---------+
    | ID |    Full_Name    |                                     URL                                     |                                                     Address                                                     |             University             | Country |
    +----+-----------------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+------------------------------------+---------+
    |  1 | M. A. PRICE     | http://185.141.105.238/action/doSearch?ContribAuthorStored=PRICE%2C+M+A     | Department of Mechanical and Manufacturing Engineering, The Queen's University of Belfast, Belfast BT95AH, U.K. |  The Queen's University of Belfast |  U.K.   |
    |  1 | C. G. ARMSTRONG | http://185.141.105.238/action/doSearch?ContribAuthorStored=ARMSTRONG%2C+C+G | Department of Mechanical and Manufacturing Engineering, The Queen's University of Belfast, Belfast BT95AH, U.K. |  The Queen's University of Belfast |  U.K.   |
    +----+-----------------+-----------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------+------------------------------------+---------+