Search code examples
csvsortingfilterproductparent

CSV Find Nearly the same value and make them identical and filter the values


Community!

Story: I am trying to upload a CSV file with a huge batch of products to my e-commerce shop. But there are many very similar products, but all with every column slightly different. And luckily the plugin I use can handle this, but it needs the same title for the entire product range or some reference to its parent product. The reference is sadly not there.

Now I want to know how I can find values in a CSV file that are nearly the same (in SQL there was something called '%LIKE%') to structure the table appropriately. I can hardly describe what I want to achieve, but here is an example for what I'm looking for.

I basically want to transform this table:

+---------------+---------------+---------------+---------------+
|      ID       |     Title     |      EAN      |      ...      |
+---------------+---------------+---------------+---------------+
|      1        | AquaMat 3.6ft |   1234567890  |      ...      |
+---------------+---------------+---------------+---------------+
|      2        | AquaMat 3.8ft |   1234567891  |      ...      |
+---------------+---------------+---------------+---------------+
|      3        |  AquaMat 4ft  |   1234567892  |      ...      |
+---------------+---------------+---------------+---------------+

into this:

+---------------+---------------+---------------+---------------+
|      ID       |     Title     |      EAN      |      ...      |
+---------------+---------------+---------------+---------------+
|      1        |    AquaMat    |   1234567890  |      ...      |
+---------------+---------------+---------------+---------------+
|      2        |    AquaMat    |   1234567891  |      ...      |
+---------------+---------------+---------------+---------------+
|      3        |    AquaMat    |   1234567892  |      ...      |
+---------------+---------------+---------------+---------------+

The extra data can be scraped. Can I do this with Excel? With Macros? With Python?

Thank you for taking time and reading this.

If you have any questions, than feel free to ask.

EDIT:

The Title column contains products with completely different names and might even contain more whitespaces. And some products might have 1 attribute but others have up to 3 attributes. But this can be sorted manually.

And with nearly the same I mean as you can see in the table. The Title's are basically the same but not identical. I want to remove the attributes from them. Also, there are no other columns with any more details, only numbers and the attributes that I am trying to cut of the title!!!


Solution

  • Here's an idea using Python and .split():

    import csv
    
    with open('testfile.csv', 'r', encoding="utf-8-sig") as inputfile:
    
     csv_reader = csv.reader(inputfile, delimiter=',')
       with open('outputfile.csv', 'w', newline='') as outputfile:
          w = csv.writer(outputfile)
          header=['ID','Title','EAN','Product','Attr1','Attr2','Attr3']
          w.writerow(header)
          for row in csv_reader:
              if row[0]=='ID':
                  header_row=True
                  pass
              else:
                  header_row=False
                  list=row[1].split()
                  for item in list:
                      # if you want, you can add some other conditions on the attribute (item) in here
                      row.append(item)
                  if not header_row:
                      print('row: {}'.format(row))
                      w.writerow(row)