Search code examples
sql-serverparsingselectfindcharindex

SQL Parse out multiple substrings


I have a very long and complex string that comes in with New line breaks - I am having a hard time parsing. I need to be able to create a select query with a column for each of the below fields.

The ideal would be find the new line break - for each line - go back to the : everything before the colon should be the name of the column, and everything between : and new ling break should be the data in the field.

All the data is returned as strings so i'm just building a select statement for each of the below lines. I'm not sure if that is at all possible.

The second alternative, to hard code and say something like CHARINDEX ( 'Home Phone:' ,notes, 0) Where I find the home phone string and then pull everything between the : and the new ling break after specifying the string.

In this case each select item in my query will say - find string "Home Phone" and pull what comes after the colon, or find string "School Name" etc.

This is what the data looks like (in one all string called notes):

Home Phone: 1234567890  
Cell Phone: 1234567890  
Date of Birth: 01/01/1971 
School Name: James Jones High  School 
Address:123 Main Street 
School City: Queens  
School State: PA  
School Zip: 32112 
Years Teaching: 12  
Grade Levels: Middle School  
Total Students: 120  
Subject: Music:   
How did they hear:  Other, provide more info: Former partner teacher in the Middle School 
Type: Public/Charter   
Question 1: aaaaaaaa aaaaaaaaaaaaaaaaa aaaaaaaaaaaaaa aaaaaaaaaaaaaa aaaaaaaaa aaaaaaa aaaa aaa aaaaaaaa aaaaaa aaaaaaaa  aaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaa aaaaaaa aaaaaaaa aaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaa aaaaaa aaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaa aaaaaaaa aaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaa aaaaaaaaaaaaaaa aaaaaaaaaaa aaaaaaaa aaaaaaaaaaaaa aaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaa aaaaaa aaaaaa aaaaaaa aaaaaaaa aaaaaaaaaaaaaa aaaaaaaaaaa aaaaa aaaaaa aaaaaa aaaaaaaaaaaa aaaaaaaaaaaa aaa aaaa aaaaa aaaaaaaaaa aaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaa aaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaaa aaaaaaaaaaa aaaaaaaaa aaaaaaaaaaaa.   
Question 2: bbb bbbbbbbbbb bbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbb bbbbbbbbbbbb bbbbbbb bbb bbbbbbbbbb bbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbb bbb bbbbbbbbb bbbbbbb bbbbbb bbbbbb bbbbbbb  bbb bbbbbbbbbb bbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbb bbbbbbbbbbbb bbbbbbb bbbbbbbbbbbb bbbbbbb bbbbbbbbbb bbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbb bbbbbbbbbbbb bbbbbbbbbbbb bbbbbbb bbb bbbbbbbbbb bbbbbbbbbbbbbbbbb bbbbbbbbbbbbbb 
Question 3: ccccccccccccccccccccccc cccccccc ccccccccccc cccccccccccccccccccccc ccc ccccccccc cccccccccccccc ccccccccccccccccccccc cccccccccccccccccccccc cccccccccccccccccc ccccccccccc ccccccccccccc ccccccccccccccccc cccccccc

so the output looks something like this (with all of the long questions answered as well in each field).

Home Phone  Cell Phone  Date of Birth:  …   Type:               Question 1 :                Question 2:    Question 3: 
1234567890  1234567890  1/1/1971            Public/Charter      aaaaaaaa aaaaaaaaaaaaa.     bbb bbbbbbbbbb ccccccccccccccccccccccc 

I'm not sure if that makes sense -- but any and all suggestions are really appreciated.

Code to pull the substring and the new line char -- but this is hard coded. I can't figure out how to do it dynamically.

SELECT  ltrim(rtrim(CHARINDEX ( 'Home Phone:' ,notes, 0) + LEN('Home Phone: '))) as 'beggining',
        ltrim(rtrim(CHARINDEX ( CHAR(10) ,notes, 0)))   as 'ending',
        SUBSTRING(notes,(CHARINDEX ( 'Home Phone:' ,notes, 0) + LEN('Home Phone: ')),(LEN('Home Phone: '))) as 'home phone',    
FROM    table a 

Thank you!


Solution

  • A lot of this credit (90%) should go to Alex K who provided an indepth answer regarding finding the nth occurrence of a character

    SQL Server - find nth occurrence in a string

    I took that answer, adjusted it for your question and then applied a PIVOT to break it into the desired rows/columns. This method should be able to create the desired output for as many unique question sets as you need provided they always have the same logic (each question/answer separated by a line break).

    --Creates temporary table for testing, ID column and second set of data
    --used to ensure query works for each unique set of questions
    IF OBJECT_ID('tempdb..#Results') IS NOT NULL
        DROP TABLE #Results
    
    CREATE TABLE #Results 
        (ID INT IDENTITY(1,1) NOT NULL,
        Notes NVARCHAR(4000) NOT NULL)
    INSERT INTO #Results
        (Notes)
    VALUES
        ('Home Phone: 1234567890  
        Cell Phone: 1234567890  
        Date of Birth: 01/01/1971 
        School Name: James Jones High  School 
        Address:123 Main Street 
        School City: Queens  
        School State: PA  
        School Zip: 32112 
        Years Teaching: 12  
        Grade Levels: Middle School  
        Total Students: 120  
        Subject: Music:   
        How did they hear:  Other, provide more info: Former partner teacher in the Middle School 
        Type: Public/Charter '),
        ('Home Phone: test  
        Cell Phone: test 
        Date of Birth: test
        School Name: test
        Address:test 
        School City: test 
        School State: test  
        School Zip: test 
        Years Teaching: test 
        Grade Levels: test 
        Total Students: test
        Subject: test   
        How did they hear:  test 
        Type: test ');
    
    --Recursive CTE to determine the position of each successive line break
    --Used CHARINDEX to search CHAR(13) and CHAR(10) and find line breaks and carriage returns
    WITH cte
    AS
    
        (SELECT ID, Notes, 1 AS Starts, CHARINDEX(CHAR(13)+CHAR(10),Notes) AS Pos
        FROM #Results
        UNION ALL
        SELECT ID, Notes, Pos +1, CHARINDEX(CHAR(13)+CHAR(10),Notes,Pos+1) AS Pos
        FROM cte
        WHERE
            pos >0),
    
    --2nd CTE breaks each question set into it's own row
    cte2
    AS
        (SELECT ID, Notes,Starts, Pos,
            SUBSTRING(Notes, Starts,
                CASE
                    WHEN pos > 0 THEN (pos - starts)
                    ELSE LEN(notes)
                END) AS Token
        FROM cte),
    
    --3rd CTE cleans up the data, separating the Questions/Answers into separate columns
    --REPLACE is used to remove Line Break (CHAR(10)), output was then showing a TAB so used
    --double REPLACE and removed CHAR(9) (tab)
    --LTRIM removes leading space
    cte3
    AS
        (SELECT ID, 
            LTRIM(REPLACE(REPLACE(SUBSTRING(Token,CHARINDEX(CHAR(13)+CHAR(10),Token),CHARINDEX(':',Token)),CHAR(10),''),CHAR(9),'')) AS Question, 
            LTRIM(SUBSTRING(Token,CHARINDEX(':',Token)+1,4000)) AS Answer
        FROM cte2)
    
    --Pivot separates each Question/Answer row into it's own column
    SELECT *
    FROM
        (SELECT ID, Question, Answer
        FROM cte3) AS a
    PIVOT
        (MAX(Answer)
        FOR [Question] IN([Address],[Cell Phone],[Date of Birth],[Grade Levels],[Home Phone],[How did they hear],
                            [School City],[School Name],[School State],[School Zip],[Subject],[Total Students],[Type],[Years Teaching])) AS pvt
    

    I put comments on each section to hopefully explain my logic but let me know if you have any questions.

    EDIT: Dynamic Pivot

    It is possible to use dynamic SQL to create a PIVOT that will automatically pick up on all the "Question" columns and adjust accordingly. I do not believe it can be done in one step since I had to use the multiple CTEs. What I would do is take the above steps used to create CTE, CTE2, and CTE3 (basically everything before the PIVOT query) and CREATE a VIEW of those steps, then with that view do the following (for my example the view is called "Questionaire")

    DECLARE @columns AS NVARCHAR(MAX)
    DECLARE @query AS NVARCHAR(MAX)
    
    SET @columns =  STUFF((SELECT DISTINCT ',' + QUOTENAME(q.question)
            FROM questionaire AS q
            FOR XML PATH(''), TYPE
            ).value('.','NVARCHAR(MAX)')
            ,1,1,'')
    
    SET @query =    'SELECT ID, '+ @columns +' FROM
            (
                SELECT ID, Answer, Question
                FROM questionaire
            ) AS a
            PIVOT
            (
                MAX(Answer)
                FOR Question IN(' +@columns+')
            ) AS p'
    EXECUTE(@query)