
Split question and answer text by multiple bookends


I have a field containing multiple questions and answers. I need to extract each answer into its own column. Example text:

[Image: sample text with the question and answer bookends highlighted]

Sorry, I had to add it as a picture because the text kept disappearing.

I need to extract the text between the first instance of the yellow and green highlights (not including the highlighted sections) as the first column in the select list, the text between the second instance of the yellow and green highlights as the second column, and so on. There are 5 questions (between the pink and blue highlights) and 5 answers (between the yellow and green highlights).
I first tried using the text in the yellow and green highlights as bookends, but I got the same error message as shown below.

Then I tried the following code using the question as the first bookend:

SELECT distinct subjectidname
, title
, i.description
, SUBSTRING(i.description, CHARINDEX('<b>Please indicate your company''s export status:</b><br />', i.description), 
        CHARINDEX('<br /><br />',i.description) - 
        CHARINDEX('<b>Please indicate your company''s export status:</b><br />', i.description) + Len('<br /><br />'))

from FilteredIncident i

Both efforts resulted in an error message:

Msg 537, Level 16, State 3, Line 2 Invalid length parameter passed to the LEFT or SUBSTRING function.

And it also does not account for the 2nd, 3rd, 4th & 5th instances. What is the best way to extract the 5 answers from the description box containing a single line of text?


Solution

  • Start with a string splitter that can split on a multi-character delimiter and returns an index for each row:

    CREATE FUNCTION [dbo].[DelimitedSplit8K]
    --===== Define I/O parameters
            (@pString VARCHAR(8000), @pDelimiter VARCHAR(16))
    --WARNING!!! DO NOT USE MAX DATA-TYPES HERE!  IT WILL KILL PERFORMANCE!
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN
    --===== "Inline" CTE Driven "Tally Table" produces values from 1 up to 10,000...
         -- enough to cover VARCHAR(8000)
      WITH E1(N) AS (
                     SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                     SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
                     SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                    ),                          --10E+1 or 10 rows
           E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
           E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
     cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                         -- for both a performance gain and prevention of accidental "overruns"
                     SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                    ),
    cteStart(N1) AS (--==== This returns N + LEN(@pDelimiter) (starting position of each "element" just once for each delimiter)
                     SELECT 1 UNION ALL
                     SELECT t.N+ Len( @pDelimiter ) FROM cteTally t WHERE SUBSTRING(@pString,t.N, Len( @pDelimiter ) ) = @pDelimiter
                    ),
    cteLen(N1,L1) AS(--==== Return start and length (for use in substring)
                     SELECT s.N1,
                            ISNULL(NULLIF(CHARINDEX(@pDelimiter,@pString,s.N1),0)-s.N1 ,8000)
                       FROM cteStart s
                    )
    --===== Do the actual split. The ISNULL/NULLIF combo handles the length for the final element when no delimiter is found.
     SELECT ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
            Item       = SUBSTRING(@pString, l.N1, l.L1)
       FROM cteLen l;
    

    (Credit to Jeff Moden for years of successful string splitting.)

    Then pick the right substrings to split on:

    declare @QandA as NVarChar(1000) = '<b>Q1:</b><br />A1<br /><br /><b>Q2:</b><br />A2<br /><br /><b>Q3:</b><br />A3<br /><br /><b>Q4:</b><br />A4<br /><br />';
    
    -- A single split gets Q/A pairs:
    select ItemNumber, Item
      from dbo.DelimitedSplit8K( @QandA, '<br /><br />' )
      order by ItemNumber;
    
    -- A second split gets Q's and A's:
    with QAPairs as (
      select ItemNumber as QuestionNumber, Item as QA
        from dbo.DelimitedSplit8K( @QandA, '<br /><br />' ) )
      select QuestionNumber, QA, ItemNumber, Item, case when ItemNumber % 2 = 1 then 'Q' else 'A' end as 'Q/A'
        from QAPairs cross apply
          dbo.DelimitedSplit8K( QA, '<br />' );
    


    That ought to be a good start. There is a bit of cleanup left to do: for example, the result includes a spurious empty Q/A pair because the string ends with '<br /><br />', and a delimiter always implies an item on each side of it.
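
    To see where the empty pair comes from without running the splitter, a rough check is to count the delimiters: N delimiters always produce N + 1 items, and when the string ends with a delimiter the last item is empty. The variable names @Delimiter and @DelimiterCount below are made up for this sketch:

    ```sql
    declare @QandA as NVarChar(1000) = '<b>Q1:</b><br />A1<br /><br /><b>Q2:</b><br />A2<br /><br /><b>Q3:</b><br />A3<br /><br /><b>Q4:</b><br />A4<br /><br />';
    declare @Delimiter as NVarChar(16) = '<br /><br />';

    -- Count the delimiters by measuring how much shorter the string gets when they are removed.
    declare @DelimiterCount as Int =
      ( Len( @QandA ) - Len( Replace( @QandA, @Delimiter, '' ) ) ) / Len( @Delimiter );

    select @DelimiterCount as DelimiterCount,      -- 4
           @DelimiterCount + 1 as ItemsFromSplit,  -- 5, and the last one is empty
           case when Right( @QandA, Len( @Delimiter ) ) = @Delimiter
             then 1 else 0 end as EndsWithDelimiter; -- 1
    ```

    Note that Len ignores trailing spaces, so this check is only reliable when neither the text nor the delimiter ends in a space.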


    This example retrieves the data from a table and breaks down each row into its component questions and answers:

    -- Sample data.
    declare @QandAs as Table ( QandAId Int Identity, QandA NVarChar(1000) );
    insert into @QandAs ( QandA ) values
      ( '<b>Q1a:</b><br />A1a<br /><br /><b>Q2a:</b><br />A2a<br /><br /><b>Q3a:</b><br />A3a<br /><br /><b>Q4a:</b><br />A4a<br /><br />' ),
      ( '<b>Q1b:</b><br />A1b<br /><br /><b>Q2b:</b><br />A2b<br /><br /><b>Q3b:</b><br />A3b<br /><br /><b>Q4b:</b><br />A4b<br /><br />' );
    select * from @QandAs;
    
    -- A single split gets Q/A pairs:
    with QAPairs as (
      select QandAId, ItemNumber, Item, Row_Number() over ( partition by QandAId order by ItemNumber desc ) as RN
        from @QandAs cross apply
          dbo.DelimitedSplit8K( QandA, '<br /><br />' ) )
      select QandAId, ItemNumber, Item, RN
        from QAPairs
        where RN > 1 -- Eliminate the extraneous empty Q/A pair at the end of the string.
        order by QandAId, ItemNumber;
    
    -- A second split gets Q's and A's:
    with QAPairs as (
      select QandAId, ItemNumber as QuestionNumber, Item as QA, Row_Number() over ( partition by QandAId order by ItemNumber desc ) as RN
        from @QandAs cross apply
          dbo.DelimitedSplit8K( QandA, '<br /><br />' ) )
      select QandAId, QuestionNumber, QA, ItemNumber, Item, case when ItemNumber % 2 = 1 then 'Q' else 'A' end as 'Q/A'
        from QAPairs cross apply
          dbo.DelimitedSplit8K( QA, '<br />' )
        where RN > 1 -- Eliminate the extraneous empty Q/A pair at the end of the string.
        order by QandAId, QuestionNumber, ItemNumber;
    

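
    The question asked for each answer in its own column. One possible way to get there is conditional aggregation over the split results. This is only a sketch: it assumes dbo.DelimitedSplit8K from above already exists, that each row holds at most five Q/A pairs, and that no answer is empty. The column names Answer1 to Answer5 and the sample data are made up for illustration.

    ```sql
    -- Sample data (hypothetical): five Q/A pairs per row.
    declare @QandAs as Table ( QandAId Int Identity, QandA NVarChar(1000) );
    insert into @QandAs ( QandA ) values
      ( '<b>Q1a:</b><br />A1a<br /><br /><b>Q2a:</b><br />A2a<br /><br /><b>Q3a:</b><br />A3a<br /><br /><b>Q4a:</b><br />A4a<br /><br /><b>Q5a:</b><br />A5a<br /><br />' ),
      ( '<b>Q1b:</b><br />A1b<br /><br /><b>Q2b:</b><br />A2b<br /><br /><b>Q3b:</b><br />A3b<br /><br /><b>Q4b:</b><br />A4b<br /><br /><b>Q5b:</b><br />A5b<br /><br />' );

    with QAPairs as (
      -- First split: one row per Q/A pair.
      select QandAId, ItemNumber as QuestionNumber, Item as QA
        from @QandAs cross apply
          dbo.DelimitedSplit8K( QandA, '<br /><br />' )
        where Item <> '' ), -- Drop the empty pair after the final delimiter.
    Answers as (
      -- Second split: item 1 is the question, item 2 is the answer.
      select QandAId, QuestionNumber, Item as Answer
        from QAPairs cross apply
          dbo.DelimitedSplit8K( QA, '<br />' )
        where ItemNumber = 2 )
    -- Collapse the answer rows into one row per QandAId.
    select QandAId,
           max( case when QuestionNumber = 1 then Answer end ) as Answer1,
           max( case when QuestionNumber = 2 then Answer end ) as Answer2,
           max( case when QuestionNumber = 3 then Answer end ) as Answer3,
           max( case when QuestionNumber = 4 then Answer end ) as Answer4,
           max( case when QuestionNumber = 5 then Answer end ) as Answer5
      from Answers
      group by QandAId
      order by QandAId;
    ```

    Filtering on Item <> '' is used here in place of the RN > 1 test; either should remove the trailing empty pair for strings shaped like the sample data. Against the real table, @QandAs and QandA would be replaced by FilteredIncident and description, keeping in mind that the splitter truncates input beyond 8000 characters.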