I need to load data from a CSV file or an Excel sheet (rows and columns) into a two-dimensional Python dict. For example, if the data in an Excel sheet looks like this:
   name  age  gender  location
1  Jim   18   male    China
2  Ross  18   male    China
3  Cara  19   female  Japan
4  Ted   18   male    China
Then the output python dict should look like this:
data = {
    1: {'name': 'Jim', 'age': 18, 'gender': 'male', 'location': 'China'},
    2: {'name': 'Ross', 'age': 18, 'gender': 'male', 'location': 'China'},
    3: {'name': 'Cara', 'age': 19, 'gender': 'female', 'location': 'Japan'},
    4: {'name': 'Ted', 'age': 18, 'gender': 'male', 'location': 'China'}
}
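For context, I currently build this dict along these lines (a rough sketch assuming a CSV with a header row; people.csv and the int() conversion are just placeholders for my real loading code):

import csv

data = {}
with open('people.csv', newline='') as f:
    reader = csv.DictReader(f)              # first row is treated as the header
    for i, row in enumerate(reader, start=1):
        row['age'] = int(row['age'])        # convert the numeric column
        data[i] = row                       # one inner dict per row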
You can see that there is a lot of duplicated information in this 2-d dict (and my real data has the same problem), so I came up with the idea of building a dict with shared storage. To be specific, in the example above I want my 2-d dict to keep only one copy of {'age': 18, 'gender': 'male', 'location': 'China'} across multiple rows (those rows do not need to be adjacent). If I call data[1]['age'] and data[2]['age'], the lookup should happen in the same extracted, shared sub-dict.
I have read the source code of the Python dict, and I know a dict only stores pointers to its keys and values (and for small ints and some strings, different entries may point to the same object). So when I say I want to save only one copy, I mean one copy of the pointers.
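Here is a quick illustration of the kind of sharing I mean (it relies on CPython caching small integers, so it is only an illustration):

row1 = {'age': 18}
row2 = {'age': 18}
print(row1['age'] is row2['age'])   # True: CPython caches small ints,
                                    # so both dicts hold the same pointer
print(row1 is row2)                 # False: the dicts themselves are
                                    # still two separate objects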
Any ideas on how to design such a dict? Thanks very much!
EDIT
Sorry, I forgot to mention: the data in this 2-d dict will be read-only.
I guess you are asking about a data-compression solution, which means thinking about both the memory size of the objects and the references to them. An integer has about the smallest footprint of any Python value (and small ints are cached, so repeated values cost only a reference), so I would try to map everything to integers unless that becomes too inconvenient. Also, lists and tuples are smaller than dictionaries and allow direct, fast indexing by position.
Here is an alternative implementation that might spark some ideas:
import sys
data = {
    1: {'name': 'Jim', 'age': 18, 'gender': 'male', 'location': 'China'},
    2: {'name': 'Ross', 'age': 18, 'gender': 'male', 'location': 'China'},
    3: {'name': 'Cara', 'age': 19, 'gender': 'female', 'location': 'Japan'},
    4: {'name': 'Ted', 'age': 18, 'gender': 'male', 'location': 'China'}
}
In [43]: sys.getsizeof(data)
Out[43]: 280 # bytes
data_list = [
    ('Jim', 18, 0, 'CH'),   # 'CH' => 'China'
    ('Ross', 18, 0, 'CH'),  # 0 => male, 1 => female
    ('Cara', 19, 1, 'JP'),  # 'JP' => 'Japan'
    ('Ted', 18, 0, 'CH')
]
In [44]: sys.getsizeof(data_list)
Out[44]: 104 # bytes
_name, _age, _gender, _location = 0, 1, 2, 3
In [45]: data_list[2][_age] # access as 2D array instead of 2-level dict
Out[45]: 19
The approach above will be a little slower to access, but it starts paying off once the strings get long. Sharing references probably won't save you much unless each record is fairly large. Finally, if you replace all the values with integers instead of name strings and country codes, plain Python lists will compress the data quite a bit, as sketched below.
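For example, one way to push that further (just a sketch; the lookup-table names here are mine, not part of any library) is to store every distinct string exactly once and keep each row as a tuple of small integers:

# Lookup tables: each distinct string is stored exactly once.
names = ['Jim', 'Ross', 'Cara', 'Ted']
genders = ['male', 'female']              # index 0 = male, 1 = female
locations = ['China', 'Japan']            # index 0 = China, 1 = Japan

# Each row is a tuple of small integers: (name_idx, age, gender_idx, loc_idx)
rows = [
    (0, 18, 0, 0),   # Jim
    (1, 18, 0, 0),   # Ross
    (2, 19, 1, 1),   # Cara
    (3, 18, 0, 0)    # Ted
]

_name, _age, _gender, _location = 0, 1, 2, 3

def field(row_id, col):
    """Decode one field back to its original value (row_id is 1-based)."""
    value = rows[row_id - 1][col]
    if col == _name:
        return names[value]
    if col == _gender:
        return genders[value]
    if col == _location:
        return locations[value]
    return value                          # age is already an int

print(field(3, _age))        # 19
print(field(1, _location))   # 'China'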
If you really want to get into choosing numeric codes that give the best compression, look into Huffman coding, e.g. this page: http://www.geeksforgeeks.org/greedy-algorithms-set-3-huffman-coding