Search code examples
pythonsqlsql-serverpandaspypyodbc

Why does .decode() result in NaN values for byte strings that resulted from the pd.read_sql_query function?


I have a MS SQL database that I want to query with Python. I use the following snippet:

cnxn = pypyodbc.connect("Driver={SQL Server Native Client 11.0};"
                        "Server=xxxx;"
                        "Database=xxxx;"
                        "Trusted_Connection=yes;")

query = 'select * from orders'
df = pd.read_sql_query(query)

This returns the following dataframe:

                      id    ordernumber
0  b'xxxx-xxxx-xxxx-xxxx'   123456789
1  b'xxxx-xxxx-xxxx-xxxx'   123456789
2  b'xxxx-xxxx-xxxx-xxxx'   123456789

The columns are both of type object. The id column is supposed to be a string, but it is a byte string in this output. How do I transform this column to a regular string object? I tried the following (from here):

df['id'].str.decode('utf-8')

... but that transforms the entire id column into np.nan values. How do I properly transform my DataFrame id column into a regular string object?

Dict to recreate this DataFrame:

{'id': {0: "b'DE91EBA3-313D-463C-B948-00010AA26136'",
  1: "b'316E587F-7FDD-4FBA-8778-0001E7783025'",
  2: "b'F6E50A95-A3E8-45D3-8E79-000210CCA14A'",
  3: "b'EE20A958-0CD6-4144-9743-00024D3E703F'",
  4: "b'58AFA9B9-4B10-47D3-8840-000AAD9CBF00'"},
 'ordernumber': {0: '500358152',
  1: '500489603',
  2: '500454759',
  3: '500512969',
  4: '500517135'}}

Solution

  • You don't get bytes b'DE91EBA3-313D-463C-B948-00010AA26136'
    but string "b'DE91EBA3-313D-463C-B948-00010AA26136'" and this makes difference.

    You have to only remove b' ' from string

    df['id'].str.replace("b'", "").str.replace("'", "")
    

    or simpler

    df['id'].str[2:-1]
    

    Minimal working code:

    import pandas as pd
    
    df = pd.DataFrame({
      'id': {
        0: "b'DE91EBA3-313D-463C-B948-00010AA26136'",
        1: "b'316E587F-7FDD-4FBA-8778-0001E7783025'",
        2: "b'F6E50A95-A3E8-45D3-8E79-000210CCA14A'",
        3: "b'EE20A958-0CD6-4144-9743-00024D3E703F'",
        4: "b'58AFA9B9-4B10-47D3-8840-000AAD9CBF00'"
      },
      'ordernumber': {
        0: '500358152',
        1: '500489603',
        2: '500454759',
        3: '500512969',
        4: '500517135'
      }
    })
    
    #df['id'] = df['id'].str[2:-1]
    df['id'] = df['id'].str.replace("b'", "").str.replace("'", "")
    
    print(df)
    

    Result:

                                         id ordernumber
    0  DE91EBA3-313D-463C-B948-00010AA26136   500358152
    1  316E587F-7FDD-4FBA-8778-0001E7783025   500489603
    2  F6E50A95-A3E8-45D3-8E79-000210CCA14A   500454759
    3  EE20A958-0CD6-4144-9743-00024D3E703F   500512969
    4  58AFA9B9-4B10-47D3-8840-000AAD9CBF00   500517135