Tags: python, set, comparison, string-comparison

String in set gives weird results


My code reads the header of a CSV file and converts it into a lookup table of column_name => column_index:

import csv

class CSVOutput:
  def __init__(self, csv_file, required_columns):
    csv_reader = csv.reader(csv_file)

    # Construct lookup table for header
    self.header = {}
    for idx, column in enumerate(next(csv_reader)):
      print(f"{column.lower().strip()} == key: {column.lower().strip() == 'key'}")
      print(f"{column.lower().strip()} is key: {column.lower().strip() is 'key'}")
      self.header[column.lower().strip()] = idx

    print(self.header)

    # Load the row data into memory/index it against key
    key_idx = self.header['key']

with open("test.csv") as csv_file:
    data = CSVOutput(csv_file, {})

When I run this, I get the following output and error:

{'key': 0, 'col1': 1, 'col2': 2}

key == key: False
key is key: False
col1 == key: False
col1 is key: False
col2 == key: False
col2 is key: False

Traceback (most recent call last):
  File "D:\compare.py", line 74, in <module>
    actual_data = CSVOutput(act_csv, required_columns)
  File "D:\compare.py", line 40, in __init__
    key_idx = self.header['key']
KeyError: 'key'

Basically, the literal 'key' and the 'key' loaded from the file do not compare equal. I've looked at the source file in Notepad++ with "show all symbols" on, but I can't see any difference. I've also looked at the CSV file in a hex editor, and the file starts with the three bytes EF BB BF immediately before Key. I'm not sure whether that's the source of my problem, but if it is, why isn't strip() getting rid of it, and how do I handle it?

Any ideas?


Solution

  • EF BB BF

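    These bytes are the UTF-8 byte order mark (BOM). strip() only removes whitespace, and the decoded BOM is not whitespace, so it stays attached to the first column name and that name never equals 'key'. Opening the file with the utf-8-sig encoding makes Python drop the BOM while decoding. Pass it through the encoding argument of open like this: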

    with open("test.csv", encoding="utf-8-sig") as csv_file:
        data = CSVOutput(csv_file, {})
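
To see the effect without touching a real file, here is a minimal sketch that decodes the same bytes both ways. The sample bytes and the read_header helper are made up for illustration, and it assumes the file would otherwise be decoded as plain UTF-8; with a different default encoding (for example a Windows code page) the BOM bytes would show up as other stray characters, but the lookup would fail in the same way.

```python
import csv
import io

# Hypothetical file contents: a UTF-8 BOM (EF BB BF) followed by the header row.
raw = b"\xef\xbb\xbfKey,Col1,Col2\r\n1,a,b\r\n"


def read_header(data: bytes, encoding: str) -> dict:
    """Decode the bytes and map each normalized column name to its index."""
    text = io.StringIO(data.decode(encoding), newline="")
    reader = csv.reader(text)
    return {column.lower().strip(): idx for idx, column in enumerate(next(reader))}


plain = read_header(raw, "utf-8")      # BOM survives as '\ufeff' on the first name
fixed = read_header(raw, "utf-8-sig")  # BOM is dropped during decoding

print(repr(list(plain)[0]))   # first column name, with the BOM still attached
print("key" in plain)         # lookup by 'key' fails
print("key" in fixed)         # lookup by 'key' succeeds
print("\ufeff".isspace())     # the BOM character is not whitespace, so strip() keeps it
```

Printing the column name with repr() is also a quick way to spot invisible characters like this in the original script, since a plain print() renders the BOM as nothing.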