Search code examples
sql-servert-sqlsubstring

Parse TEXT datatype column to grab values based on TAGS


I have the following string stored in a TEXT datatype which I want to extract the values for

Date:  
Queue:  
File Name: 

and return them in their own columns.

STRING:

If you are able to, please correct the issue and resubmit the file.        

Date: 10/8/2024  
Queue: ENTRY 
File Name: TEST_FILE.PDF

Columns:

Date           Queue          File Name
-------------------------------------------
10/8/2024      ENTRY          TEST_FILE.PDF

I have come up with the following code but have been unable to exclude additional information that comes back.

I get the following data returned:

Date                      Queue                  File Name
--------------------------------------------------------------
10/8/2024    Queue:       ENTRY    File Na       TEST_FILE.PDF
SELECT
    SUBSTRING(CAST(em.body AS NVARCHAR(300)), 
              CHARINDEX('Date:', CAST(em.body AS NVARCHAR(300))) + 6, 
              (CHARINDEX('Queue:', CAST(em.body AS NVARCHAR(300))) - CHARINDEX('Date:', CAST(em.body AS NVARCHAR(300))))) 'Date',
    SUBSTRING(CAST(em.body AS NVARCHAR(300)), 
              CHARINDEX('Queue:', CAST(em.body AS NVARCHAR(300))) + 7, 
              (CHARINDEX('File Name:', CAST(em.body AS NVARCHAR(300))) - CHARINDEX('Queue:', CAST(em.body AS NVARCHAR(300))))) 'Queue',
    RIGHT(CAST(em.body AS NVARCHAR(300)), (LEN(CAST(em.body AS NVARCHAR(300))) - 10) - CHARINDEX('File Name:', CAST(em.body AS NVARCHAR(300)))) 'File Name'
FROM
    email em WITH(NOLOCK)

I know I need to decrease the length value for the SUBSTRING calls, but no matter where I put in a value to decrease them, I get the following error:

Msg 537, Level 16, State 3, Line 2
Invalid length parameter passed to the LEFT or SUBSTRING function


Solution

  • If you find yourself with no choice but to do messy string extractions then the trick is to methodically build up your logic testing each bit at a time.

    Personally I like to use a DRY approach, even though its not typical for SQL, because it reduces the chance of the mistakes which can occur when you repeat logic. This can be done with use of the CROSS APPLY operator.

    This is far from the most concise, but is easier (IMO) to build and maintain.

    You can see that all I define are the 3 identification strings specified and everything else is derived from that.

    CREATE TABLE Email (Body TEXT);
    INSERT INTO Email (Body)
    VALUES
    ('If you are able to, please correct the issue and resubmit the file.        
    
    Date: 10/8/2024  
    Queue: ENTRY 
    File Name: TEST_FILE.PDF');
    
    SELECT
      -- Extract the string segments we require
      SUBSTRING(em.body, c3.DateEndIdx, c3.QueueStartIdx - c3.DateEndIdx) [Date]
      , SUBSTRING(em.body, c3.QueueEndIdx, c3.FileNameStartIdx - c3.QueueEndIdx) Queue
      , SUBSTRING(em.body, c3.FileNameEndIdx, c3.EndOfText - c3.FileNameEndIdx) FileName
    FROM (
      -- Convert to VARCHAR in order to use all string functions
      SELECT CONVERT(VARCHAR(MAX), Body) Body
      FROM Email
    ) em
    -- Capture the strings we are trying to find
    CROSS APPLY (
      VALUES (
        'Date:'
        , 'Queue:'
        , 'File Name:'
      )
    ) c1 (DateLabel, QueueLabel, FileNameLabel)
    -- Find the starts and ends of the strings we are trying to find
    CROSS APPLY (
      VALUES (
        CHARINDEX(c1.DateLabel, em.body)
        , CHARINDEX(c1.QueueLabel, em.body)
        , CHARINDEX(c1.FileNameLabel, em.body)
      )
    ) c2 (DateIdx, QueueIdx, FileNameIdx)
    CROSS APPLY (
      VALUES (
        c2.DateIdx
        , c2.DateIdx + LEN(c1.DateLabel)
        , c2.QueueIdx
        , c2.QueueIdx + LEN(c1.QueueLabel)
        , c2.FileNameIdx
        , c2.FileNameIdx + LEN(c1.FileNameLabel)
        , len(em.body) + 1
      )
    ) c3 (DateStartIdx, DateEndIdx, QueueStartIdx, QueueEndIdx, FileNameStartIdx, FileNameEndIdx, EndOfText);
    
    Date Queue FileName
    10/8/2024
    ENTRY
    TEST_FILE.PDF

    db<>fiddle