pymssql : How to get distinct rows when a column has XML values

I am trying to use distinct in a query for a table that has XML values. Here is my query

select distinct pID,docXml from docTable where docXml is not null and pID not like '%xyz%'

but i get

pymssql.OperationalError: (421, 'The xml data type cannot be selected as DISTINCT because it is not comparable.DB-Lib error message 20018, severity 16:\nGeneral SQL Server error: Check messages from the SQL Server\n')

how to solve this? what does it mean?

the data in the docXml column looks like

<document>
    <pages>

    <page>   
       <paragraph>XBV</paragraph>

       <paragraph>GFH</paragraph>
    </page>

    <page>
       <paragraph>ash</paragraph>

       <paragraph>lplp</paragraph>
    </page>

    </pages>
</document>

note that if i dont use distinct, then i am easily able to get all the data. but since there are duplicates, i am only interested in unique rows.

Solution

you could try

CONVERT(NVARCHAR(MAX), docXml)

in the select statement (and perhaps put the whole thing in a CTE then convert it back to XML from the CTE) e.g.

; WITH T AS (select distinct pID,CONVERT(NVARCHAR(MAX), docXml) docXml from docTable where docXml is not null and pID not like '%xyz%')
SELECT pID, CONVERT(XML, docXml) FROM T

alternatively, if your pID is distinct in itself (such that each pID relates to a single docXml), you could use a CTE and ROW_NUMBER() to return docXml in its XML format. e.g.

; WITH T AS (select pID
    ,docXml
    ,ROW_NUMBER() OVER (PARTITION BY pID ORDER BY pID) RN
    from docTable 
    where docXml is not null 
    and pID not like '%xyz%')
SELECT pID, docXml FROM T WHERE RN = 1

EDIT: as for why it's doing it, SQL doesn't like sorting CLOBs