Search code examples
sql-serverxmlpymssql

pymssql : How to get distinct rows when a column has XML values


I am trying to use distinct in a query for a table that has XML values. Here is my query

select distinct pID,docXml from docTable where docXml is not null and pID not like '%xyz%'

but i get

pymssql.OperationalError: (421, 'The xml data type cannot be selected as DISTINCT because it is not comparable.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n')

how to solve this? what does it mean?

the data in the docXml column looks like

<document>
    <pages>

    <page>   
       <paragraph>XBV</paragraph>

       <paragraph>GFH</paragraph>
    </page>

    <page>
       <paragraph>ash</paragraph>

       <paragraph>lplp</paragraph>
    </page>

    </pages>
</document>

note that if i dont use distinct, then i am easily able to get all the data. but since there are duplicates, i am only interested in unique rows.


Solution

  • you could try

    CONVERT(NVARCHAR(MAX), docXml) 
    

    in the select statement (and perhaps put the whole thing in a CTE then convert it back to XML from the CTE) e.g.

    ; WITH T AS (select distinct pID,CONVERT(NVARCHAR(MAX), docXml) docXml from docTable where docXml is not null and pID not like '%xyz%')
    SELECT pID, CONVERT(XML, docXml) FROM T
    

    alternatively, if your pID is distinct in itself (such that each pID relates to a single docXml), you could use a CTE and ROW_NUMBER() to return docXml in its XML format. e.g.

    ; WITH T AS (select pID
        ,docXml
        ,ROW_NUMBER() OVER (PARTITION BY pID ORDER BY pID) RN
        from docTable 
        where docXml is not null 
        and pID not like '%xyz%')
    SELECT pID, docXml FROM T WHERE RN = 1
    

    EDIT: as for why it's doing it, SQL doesn't like sorting CLOBs