python duplicates blob google-cloud-storage

Removing entire string duplicates from a list

I am running into an issue when trying to remove duplicates from a list.

def my_list_bucket(self, bucketName, limit=sys.maxsize): #delimiter='/'):
    a_bucket = self.storage_client.lookup_bucket(bucketName)
    bucket_iterator = a_bucket.list_blobs()
    for resource in bucket_iterator:
        path_parts = resource.name.split('/')
        date_folder = path_parts[0]
        publisher_folder = path_parts[1]
        desired_path = date_folder + '/' + publisher_folder + '/'
        new_list = []
        for path in desired_path:
            if desired_path not in new_list:
                new_list.append(desired_path)
        print(new_list)
        limit = limit - 1
        if limit <= 0:
            break

These are the results I get:
20230130/adelphic/
20230130/adelphic/
20230130/adelphic/
20230130/adelphic/
20230130/instacart/
20230130/instacart/
20230130/instacart/
20230130/instacart/

It's not removing the duplicates; they are still in the output.

The result I want is:
20230130/adelphic/
20230130/instacart/

I have tried new_list = list(set(publisher_folder)) and it returns:
'i', 'p', 'a', 'c', 'd', 'h', 'e', 'l'
'i', 'p', 'a', 'c', 'd', 'h', 'e', 'l'
'i', 'p', 'a', 'c', 'd', 'h', 'e', 'l'

Solution

When you do:

for path in desired_path:

it is essentially:

for character in desired_path:

because desired_path is a string that looks like "20230130/adelphic/", and iterating over a string yields its characters one at a time.

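So the inner loop runs once per character, but every pass tests and appends the whole string, which means desired_path is added exactly once. On top of that, new_list = [] is re-created for every blob, so the list never remembers paths from earlier blobs and each blob prints its own one-element list.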
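A small sketch of that behaviour, using only the example values from the question (no bucket needed). It is meant as an illustration of why set(publisher_folder) returned single letters, not as part of the fix:

```python
desired_path = "20230130/adelphic/"

# Iterating over a string yields one character per step.
chars = [ch for ch in desired_path]

# set() of a string is therefore a set of its characters, which is why
# list(set(publisher_folder)) came back as single letters.
letters = set("adelphic")

# Wrapping the string in a list makes the whole string the single element.
whole = set([desired_path])
```

The element order of a set is not guaranteed, which also explains why the letters came back shuffled.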

I assume what you want is a list of the distinct strings, which might be done with a set:

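import sys

def my_list_bucket(self, bucketName, limit=sys.maxsize): #delimiter='/'):
    a_bucket = self.storage_client.lookup_bucket(bucketName)
    # A set keeps each "date/publisher/" prefix only once.
    new_list = set()
    for resource in a_bucket.list_blobs():
        # Keep only the first two path components of the blob name.
        new_list.add(f"{'/'.join(resource.name.split('/')[:2])}/")
        limit -= 1
        # <= 0 also stops when the caller passes limit=0 or a negative value.
        if limit <= 0:
            break
    new_list = list(new_list)
    print(new_list)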

or potentially:

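from itertools import islice

def my_list_bucket(self, bucketName, limit=sys.maxsize): #delimiter='/'):
    a_bucket = self.storage_client.lookup_bucket(bucketName)
    # list_blobs() returns an iterator, which cannot be sliced with [:limit];
    # islice() takes at most `limit` items from it instead.
    new_list = list(set(
        f"{'/'.join(resource.name.split('/')[:2])}/"
        for resource in islice(a_bucket.list_blobs(), limit)
    ))
    print(new_list)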
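One caveat: a set does not keep insertion order, so the result may not come back in the order shown in the question. If the order matters, a sketch along these lines might help. It works on a plain list of blob names so it can be run without a bucket; the helper name distinct_prefixes, its depth parameter, and the file names in sample_names are all made up for illustration.

```python
from itertools import islice

def distinct_prefixes(names, depth=2, limit=None):
    """Return the distinct leading `depth` path components, in first-seen order.

    Hypothetical helper: `names` is any iterable of blob names, and
    `limit` caps how many names are examined (None means all of them).
    """
    prefixes = (
        "/".join(name.split("/")[:depth]) + "/"
        for name in islice(names, limit)
    )
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(prefixes))

# Invented sample data shaped like the paths in the question.
sample_names = [
    "20230130/adelphic/file1.csv",
    "20230130/adelphic/file2.csv",
    "20230130/instacart/file1.csv",
    "20230130/instacart/file2.csv",
]

result = distinct_prefixes(sample_names)
print(result)  # ['20230130/adelphic/', '20230130/instacart/']
```

With a real bucket, the same helper could presumably be fed (blob.name for blob in a_bucket.list_blobs()) instead of the sample list.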