python, dictionary, glob

How can I read files into dictionary keys in a way that makes sense?


I have about a thousand files that are named in a semi-sensible way like the following:

aaa.ba.ca.01
aaa.ba.ca.02
aaa.ba.ca.03

aaa.ba.da.01
aaa.ba.da.02
aaa.ba.da.03

and so on. Let's say each file contains 2 columns of numbers which I need to read into two dictionaries: wavelength and flux. The reading part is easy for me; the hard part is that I need to load these dictionaries so that they store the information like:

wavelength['aaa.ba.ca.01'] (which is the wavelengths of that one file)

wavelength['aaa.ba.ca'] (which is the wavelengths of all subfiles, i.e. ...ca.01, ...ca.02, and ...ca.03, in order)

wavelength['aaa.ba'] (which also includes all wavelengths of all "subfiles" as well -- again in order).

and so on. The filenames are well-behaved (the sections are separated by periods, the grouping hierarchy always runs in the same direction, etc.), but a filename can be anywhere from 4 to 8 sections long.

My question: is there some sensible way to have Python glob the names of the files and, by parsing strings or some other magic, get the data into these dictionaries? I've hit a brick wall.


Solution

  • A simple, though not efficient, way to do this is to subclass Python's dictionary so that, when given an incomplete key, it concatenates the contents of all matching keys in alphabetical order.

    More efficient designs are possible: the major drawback of this one is that it sorts and checks every existing dictionary key on each partial-key request. Still, it is so simple to implement that it is worth a try:

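    class MultiDict(dict):
        def __getitem__(self, key):
            # An exact key returns its stored value unchanged.
            if key in self:
                return dict.__getitem__(self, key)
            # Otherwise concatenate the values of every key that starts
            # with the partial key, in sorted order. An unmatched key
            # yields an empty list rather than raising KeyError.
            result = []
            for complete_key in sorted(self.keys()):
                if complete_key.startswith(key):
                    result.extend(self[complete_key])
            return result

    # Example
    a = MultiDict()
    a["a0"] = [1]
    a["a1"] = [2]
    print(a["a"])  # [1, 2]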
    

    To get the data into the dictionary, just iterate over all the files with glob or os.listdir and read the desired contents, as a list, into a MultiDict item, using the filename as the key.
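The loading step described above could be sketched roughly as follows. This is only an illustration under some assumptions: `load_spectra` is a hypothetical helper name, the files are assumed to hold two whitespace-separated numeric columns (wavelength, then flux) with no header, and the `MultiDict` class is repeated so the snippet runs on its own.

```python
import glob
import os


class MultiDict(dict):
    """Dict that concatenates all values whose keys start with a partial key."""

    def __getitem__(self, key):
        if key in self:
            return dict.__getitem__(self, key)
        result = []
        for complete_key in sorted(self.keys()):
            if complete_key.startswith(key):
                result.extend(dict.__getitem__(self, complete_key))
        return result


def load_spectra(directory, pattern="*"):
    """Read every matching two-column file into wavelength/flux MultiDicts.

    Hypothetical helper: assumes whitespace-separated columns
    (wavelength, flux), one pair per line, no header.
    """
    wavelength = MultiDict()
    flux = MultiDict()
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        if not os.path.isfile(path):
            continue  # ignore subdirectories that match the pattern
        name = os.path.basename(path)  # the filename is the dictionary key
        waves, fluxes = [], []
        with open(path) as handle:
            for line in handle:
                columns = line.split()
                if len(columns) < 2:
                    continue  # skip blank or malformed lines
                waves.append(float(columns[0]))
                fluxes.append(float(columns[1]))
        wavelength[name] = waves
        flux[name] = fluxes
    return wavelength, flux
```

After `wavelength, flux = load_spectra("some_directory")`, a lookup such as `wavelength["aaa.ba.ca"]` would return the wavelengths of `aaa.ba.ca.01`, `aaa.ba.ca.02`, and so on, in sorted filename order. Note that the match is a plain string prefix, so a partial key like `"aaa.ba.c"` would also match those files; passing `"aaa.ba.ca."` with the trailing period is one way to restrict a lookup to whole sections.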