Removing entire string duplicates from a list


I am running into an issue when trying to remove duplicates from a list.

def my_list_bucket(self, bucketName,  limit=sys.maxsize): #delimiter='/'):
    a_bucket = self.storage_client.lookup_bucket(bucketName)
    bucket_iterator = a_bucket.list_blobs()
    for resource in bucket_iterator:
        path_parts = resource.name.split('/')
        date_folder = path_parts[0]
        publisher_folder = path_parts[1]
        desired_path = date_folder + '/' + publisher_folder + '/'
        new_list = []
        for path in desired_path:
            if desired_path not in new_list:
                new_list.append(desired_path)
        print(new_list)
        limit = limit - 1
        if limit <= 0:
            break

These are the results I get:
20230130/adelphic/
20230130/adelphic/
20230130/adelphic/
20230130/adelphic/
20230130/instacart/
20230130/instacart/
20230130/instacart/
20230130/instacart/

It's not removing the duplicates; they are still in the list.

The results I want are:
20230130/adelphic/
20230130/instacart/

I have tried new_list = list(set(publisher_folder)) and it returns:
'i', 'p', 'a', 'c', 'd', 'h', 'e', 'l'
'i', 'p', 'a', 'c', 'd', 'h', 'e', 'l'
'i', 'p', 'a', 'c', 'd', 'h', 'e', 'l'


Solution

  • When you do:

    for path in desired_path:
    

    it is essentially:

    for character in desired_path:
    

    because desired_path is a string such as "20230130/adelphic/", and iterating over a string yields its individual characters.

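    So your inner loop runs once per character, but each pass tests and appends the whole string desired_path, which means it is added exactly once. Because new_list is re-created for every blob, each printed list holds only that blob's path, and nothing is ever compared across blobs. That is why the duplicates remain.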
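
To see the string behaviour in isolation, here is a small sketch that uses only the sample path from the question (no bucket access). It suggests why looping over a string, or passing one to set(), works character by character, while a set built from a list of strings deduplicates the whole strings. The variable names chars and paths are illustrative, not from the original code.

```python
desired_path = "20230130/adelphic/"

# Iterating over a string yields one character at a time.
chars = [path for path in desired_path]
print(chars[:4])  # ['2', '0', '2', '3']

# set() on a string therefore collects distinct characters, not distinct strings.
print(sorted(set("adelphic")))  # ['a', 'c', 'd', 'e', 'h', 'i', 'l', 'p']

# set() on a list of strings deduplicates the strings themselves.
paths = [desired_path, desired_path, "20230130/instacart/"]
print(sorted(set(paths)))  # ['20230130/adelphic/', '20230130/instacart/']
```

The second print matches the eight letters returned by list(set(publisher_folder)) in the question, just in sorted order.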

    I assume you want a list of the distinct path strings, which could be built like this:

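    import sys

    def my_list_bucket(self, bucketName, limit=sys.maxsize): #delimiter='/'):
        a_bucket = self.storage_client.lookup_bucket(bucketName)
        # Create the set once, outside the loop, so it accumulates across blobs.
        new_list = set()
        for resource in a_bucket.list_blobs():
            # Keep only the first two path components, e.g. "20230130/adelphic/".
            new_list.add(f"{ '/'.join(resource.name.split('/')[:2]) }/")
            limit -= 1
            if limit <= 0:
                break
        # A set is unordered; sort it for a stable, readable listing.
        new_list = sorted(new_list)
        print(new_list)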
    

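    or, more compactly (the iterator returned by list_blobs() cannot be sliced with [:limit], so itertools.islice applies the limit instead):

    import sys
    from itertools import islice

    def my_list_bucket(self, bucketName, limit=sys.maxsize): #delimiter='/'):
        a_bucket = self.storage_client.lookup_bucket(bucketName)
        new_list = list(set(
            f"{ '/'.join(resource.name.split('/')[:2]) }/"
            for resource in islice(a_bucket.list_blobs(), limit)
        ))
        print(new_list)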
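
If the order in which the folders first appear matters, a set will not preserve it. The following is a minimal sketch of an order-preserving variant that can be tried without a bucket. The helper distinct_prefixes, its depth and limit parameters, and the sample object names are all hypothetical and only stand in for the resource.name values from the question.

```python
from itertools import islice

def distinct_prefixes(names, depth=2, limit=None):
    """Return the distinct leading `depth` folders of each name, in first-seen order."""
    prefixes = (
        "/".join(name.split("/")[:depth]) + "/"
        for name in islice(names, limit)  # limit=None means no cap
    )
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(prefixes))

# Made-up object names standing in for resource.name values.
sample_names = [
    "20230130/adelphic/a.csv",
    "20230130/adelphic/b.csv",
    "20230130/instacart/a.csv",
    "20230130/instacart/b.csv",
]
print(distinct_prefixes(sample_names))
# ['20230130/adelphic/', '20230130/instacart/']
print(distinct_prefixes(sample_names, limit=2))
# ['20230130/adelphic/']
```

With a real bucket, the same helper could presumably be fed a generator such as (blob.name for blob in a_bucket.list_blobs()), so the blobs are still consumed lazily.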