I have a data structure which essentially amounts to a nested dictionary. Let’s say it looks like this:
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}
Now, creating and maintaining this is pretty painful; every time I have a new state/county/profession I have to create the lower-layer dictionaries via obnoxious try/except blocks. Moreover, I have to write annoying nested iterators if I want to go over all the values.
I could also use tuples as keys, like such:
{('new jersey', 'mercer county', 'plumbers'): 3,
 ('new jersey', 'mercer county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'programmers'): 81,
 ('new jersey', 'middlesex county', 'salesmen'): 62,
 ('new york', 'queens county', 'plumbers'): 9,
 ('new york', 'queens county', 'salesmen'): 36}
This makes iterating over the values very simple and natural, but it is more syntactically painful to do things like aggregations and looking at subsets of the dictionary (e.g. if I just want to go state-by-state).
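For illustration, here is a minimal sketch of what aggregation and subsetting look like with the tuple-keyed form. The names `flat`, `totals_by_state` and `new_york` are made up for this example and are not part of the original question.

```python
from collections import defaultdict

# The flat, tuple-keyed data from the example above (hypothetical name: flat).
flat = {
    ('new jersey', 'mercer county', 'plumbers'): 3,
    ('new jersey', 'mercer county', 'programmers'): 81,
    ('new jersey', 'middlesex county', 'programmers'): 81,
    ('new jersey', 'middlesex county', 'salesmen'): 62,
    ('new york', 'queens county', 'plumbers'): 9,
    ('new york', 'queens county', 'salesmen'): 36,
}

# Aggregate state-by-state: every key has to be unpacked to reach the state.
totals_by_state = defaultdict(int)
for (state, county, profession), count in flat.items():
    totals_by_state[state] += count

# Selecting one state's subset means filtering on the first key component.
new_york = {key: count for key, count in flat.items() if key[0] == 'new york'}

print(dict(totals_by_state))
print(new_york)
```

Both operations work, but each one scans every key, which is the syntactic overhead described above.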
Basically, sometimes I want to think of a nested dictionary as a flat dictionary, and sometimes I want to think of it indeed as a complex hierarchy. I could wrap this all in a class, but it seems like someone might have done this already. Alternatively, it seems like there might be some really elegant syntactical constructions to do this.
How could I do this better?
Addendum: I’m aware of setdefault() but it doesn’t really make for clean syntax. Also, each sub-dictionary you create still needs to have setdefault() manually set.
Answer #1:
What is the best way to implement nested dictionaries in Python?
This is a bad idea, don’t do it. Instead, use a regular dictionary and use dict.setdefault where apropos, so when keys are missing under normal usage you get the expected KeyError. If you insist on getting this behavior, here’s how to shoot yourself in the foot:
Implement __missing__ on a dict subclass to set and return a new instance.
This approach has been available (and documented) since Python 2.5, and (particularly valuable to me) it pretty prints just like a normal dict, instead of the ugly printing of an autovivified defaultdict:
class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()  # retain local pointer to value
        return value                      # faster to return than dict lookup
(Note self[key] is on the left-hand side of assignment, so there’s no recursion here.)
Say you have some data:
data = {('new jersey', 'mercer county', 'plumbers'): 3,
        ('new jersey', 'mercer county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'programmers'): 81,
        ('new jersey', 'middlesex county', 'salesmen'): 62,
        ('new york', 'queens county', 'plumbers'): 9,
        ('new york', 'queens county', 'salesmen'): 36}
Here’s our usage code:
vividict = Vividict()
for (state, county, occupation), number in data.items():
    vividict[state][county][occupation] = number
And now:
>>> import pprint
>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}
Criticism
A criticism of this type of container is that if the user misspells a key, our code could fail silently:
>>> vividict['new york']['queens counyt']
{}
And additionally now we’d have a misspelled county in our data:
>>> pprint.pprint(vividict, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36},
              'queens counyt': {}}}
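One way to soften this problem, sketched here as an aside rather than as part of the original answer, is to use dict.get() or the in operator for read-only lookups. Neither goes through __getitem__, so neither triggers __missing__, and no key is created. The variable names below are illustrative only.

```python
class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

d = Vividict()
d['new york']['queens county']['plumbers'] = 9

# Read-only checks: neither .get() nor "in" calls __missing__,
# so the misspelled key is not added to the data.
typo = d['new york'].get('queens counyt')
has_typo = 'queens counyt' in d['new york']

print(typo, has_typo)        # None False
print(list(d['new york']))   # ['queens county']
```

This only helps if readers of the structure remember to use it; plain subscripting still autovivifies.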
Explanation:
We’re just providing another nested instance of our class Vividict whenever a key is accessed but missing. (Keeping a local reference to the new value is useful because it avoids an additional lookup on the dict; unfortunately, assignment is a statement in Python, so we can’t return the value in the same line that sets it.)
Note, these are the same semantics as the most upvoted answer but in half the lines of code – nosklo’s implementation:
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value
Demonstration of Usage
Below is just an example of how this dict could be easily used to create a nested dict structure on the fly. This can quickly create a hierarchical tree structure as deeply as you might want to go.
import pprint

class Vividict(dict):
    def __missing__(self, key):
        value = self[key] = type(self)()
        return value

d = Vividict()
d['foo']['bar']
d['foo']['baz']
d['fizz']['buzz']
d['primary']['secondary']['tertiary']['quaternary']
pprint.pprint(d)
Which outputs:
{'fizz': {'buzz': {}},
 'foo': {'bar': {}, 'baz': {}},
 'primary': {'secondary': {'tertiary': {'quaternary': {}}}}}
And as the last line shows, it pretty prints beautifully and in order for manual inspection. So if you want to visually inspect your data, implementing __missing__ to set a new instance of its class to the key and return it is a far better solution than the alternatives.
Other alternatives, for contrast:
dict.setdefault
Although the asker thinks this isn’t clean, I find it preferable to the Vividict myself.
d = {}  # or dict()
for (state, county, occupation), number in data.items():
    d.setdefault(state, {}).setdefault(county, {})[occupation] = number
and now:
>>> pprint.pprint(d, width=40)
{'new jersey': {'mercer county': {'plumbers': 3,
                                  'programmers': 81},
                'middlesex county': {'programmers': 81,
                                     'salesmen': 62}},
 'new york': {'queens county': {'plumbers': 9,
                                'salesmen': 36}}}
A misspelling would fail noisily, and not clutter our data with bad information:
>>> d['new york']['queens counyt']
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 'queens counyt'
Additionally, I think setdefault works great when used in loops and you don’t know what you’re going to get for keys, but repetitive usage becomes quite burdensome, and I don’t think anyone would want to keep up the following:
d = dict()
d.setdefault('foo', {}).setdefault('bar', {})
d.setdefault('foo', {}).setdefault('baz', {})
d.setdefault('fizz', {}).setdefault('buzz', {})
d.setdefault('primary', {}).setdefault('secondary', {}).setdefault('tertiary', {}).setdefault('quaternary', {})
Another criticism is that setdefault requires a new instance whether it is used or not. However, Python (or at least CPython) is rather smart about handling unused and unreferenced new instances, for example, it reuses the location in memory:
>>> id({}), id({}), id({})
(523575344, 523575344, 523575344)
An auto-vivified defaultdict
This is a neat-looking implementation, and usage in a script that you’re not inspecting the data on would be as useful as implementing __missing__:
from collections import defaultdict

def vivdict():
    return defaultdict(vivdict)
But if you need to inspect your data, the result of an auto-vivified defaultdict populated with data in the same way looks like this:
>>> d = vivdict(); d['foo']['bar']; d['foo']['baz']; d['fizz']['buzz']; d['primary']['secondary']['tertiary']['quaternary']; import pprint;
>>> pprint.pprint(d)
defaultdict(<function vivdict at 0x17B01870>, {'foo': defaultdict(<function vivdict
at 0x17B01870>, {'baz': defaultdict(<function vivdict at 0x17B01870>, {}), 'bar':
defaultdict(<function vivdict at 0x17B01870>, {})}), 'primary': defaultdict(<function
vivdict at 0x17B01870>, {'secondary': defaultdict(<function vivdict at 0x17B01870>,
{'tertiary': defaultdict(<function vivdict at 0x17B01870>, {'quaternary': defaultdict(
<function vivdict at 0x17B01870>, {})})})}), 'fizz': defaultdict(<function vivdict at
0x17B01870>, {'buzz': defaultdict(<function vivdict at 0x17B01870>, {})})})
This output is quite inelegant, and the results are quite unreadable. The solution typically given is to recursively convert back to a dict for manual inspection. This non-trivial solution is left as an exercise for the reader.
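As a rough sketch of that exercise (not taken from the answer itself), a small recursive function can build a plain-dict copy for inspection. The helper name to_plain_dict is hypothetical, and the sketch assumes the only nested containers are defaultdict instances.

```python
from collections import defaultdict
import pprint

def vivdict():
    return defaultdict(vivdict)

def to_plain_dict(d):
    """Return a copy of d in which every nested defaultdict is a plain dict."""
    return {
        key: to_plain_dict(value) if isinstance(value, defaultdict) else value
        for key, value in d.items()
    }

d = vivdict()
d['foo']['bar']
d['primary']['secondary'] = 1

plain = to_plain_dict(d)
pprint.pprint(plain)
```

Because it returns a copy, the original structure keeps autovivifying while the copy raises KeyError on missing keys.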
Performance
Finally, let’s look at performance. I’m subtracting the costs of instantiation.
>>> import timeit
>>> min(timeit.repeat(lambda: {}.setdefault('foo', {}))) - min(timeit.repeat(lambda: {}))
0.13612580299377441
>>> min(timeit.repeat(lambda: vivdict()['foo'])) - min(timeit.repeat(lambda: vivdict()))
0.2936999797821045
>>> min(timeit.repeat(lambda: Vividict()['foo'])) - min(timeit.repeat(lambda: Vividict()))
0.5354437828063965
>>> min(timeit.repeat(lambda: AutoVivification()['foo'])) - min(timeit.repeat(lambda: AutoVivification()))
2.138362169265747
Based on performance, dict.setdefault works the best. I’d highly recommend it for production code, in cases where you care about execution speed.
If you need this for interactive use (in an IPython notebook, perhaps) then performance doesn’t really matter, in which case I’d go with Vividict for readability of the output. Compared to the AutoVivification object (which uses __getitem__ instead of __missing__, which was made for this purpose) it is far superior.
Conclusion
Implementing __missing__ on a subclassed dict to set and return a new instance is slightly more difficult than alternatives but has the benefits of
- easy instantiation
- easy data population
- easy data viewing
and because it is less complicated and more performant than modifying __getitem__, it should be preferred to that method.
Nevertheless, it has drawbacks:
- Bad lookups will fail silently.
- The bad lookup will remain in the dictionary.
Thus I personally prefer setdefault to the other solutions, and have in every situation where I have needed this sort of behavior.
Answer #2:
class AutoVivification(dict):
    """Implementation of perl's autovivification feature."""
    def __getitem__(self, item):
        try:
            return dict.__getitem__(self, item)
        except KeyError:
            value = self[item] = type(self)()
            return value
Testing:
a = AutoVivification()
a[1][2][3] = 4
a[1][3][3] = 5
a[1][2]['test'] = 6
print(a)
Output:
{1: {2: {3: 4, 'test': 6}, 3: {3: 5}}}
Answer #3:
Just because I haven’t seen one this small, here’s a dict that gets as nested as you like, no sweat:
from collections import defaultdict

# yo dawg, i heard you liked dicts
def yodict():
    return defaultdict(yodict)
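A short usage sketch, assuming the asker's state/county/profession data (the variable name counts is made up for this example):

```python
from collections import defaultdict

def yodict():
    return defaultdict(yodict)

counts = yodict()
# Intermediate levels are created on first access, so no setup is needed.
counts['new jersey']['mercer county']['plumbers'] = 3
counts['new york']['queens county']['salesmen'] = 36

print(counts['new jersey']['mercer county']['plumbers'])
```

As with any autovivifying structure, reading a misspelled key through subscripting creates it rather than raising KeyError.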
Answer #4:
You could create a YAML file and read it in using PyYAML.
Step 1: Create a YAML file, “employment.yml”:
new jersey:
  mercer county:
    plumbers: 3
    programmers: 81
  middlesex county:
    salesmen: 62
    programmers: 81
new york:
  queens county:
    plumbers: 9
    salesmen: 36
Step 2: Read it in Python
import yaml

# The with block closes the file even if parsing raises an error.
with open("employment.yml") as file_handle:
    my_shnazzy_dictionary = yaml.safe_load(file_handle)
and now my_shnazzy_dictionary has all your values. If you needed to do this on the fly, you can create the YAML as a string and feed that into yaml.safe_load(...).
Answer #5:
Since you have a star-schema design, you might want to structure it more like a relational table and less like a dictionary.
import collections

class Jobs(object):
    def __init__(self, state, county, title, count):
        self.state = state
        self.county = county
        self.title = title
        self.count = count

facts = [
    Jobs('new jersey', 'mercer county', 'plumbers', 3),
    # ... further rows elided
]

def groupBy(facts, name):
    # Sum the counts of all facts that share the same value of the named attribute.
    total = collections.defaultdict(int)
    for f in facts:
        key = getattr(f, name)
        total[key] += f.count
    return total
That kind of thing can go a long way to creating a data warehouse-like design without the SQL overheads.
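To make the idea concrete, here is a self-contained sketch using the asker's six data points. The Job dataclass and the group_by helper are hypothetical stand-ins for this example, not the answer's own code.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Job:
    """One fact row: a head count for a profession in a county (illustrative type)."""
    state: str
    county: str
    title: str
    count: int

facts = [
    Job('new jersey', 'mercer county', 'plumbers', 3),
    Job('new jersey', 'mercer county', 'programmers', 81),
    Job('new jersey', 'middlesex county', 'programmers', 81),
    Job('new jersey', 'middlesex county', 'salesmen', 62),
    Job('new york', 'queens county', 'plumbers', 9),
    Job('new york', 'queens county', 'salesmen', 36),
]

def group_by(rows, name):
    """Total the counts of rows that share the same value of attribute `name`."""
    totals = defaultdict(int)
    for row in rows:
        totals[getattr(row, name)] += row.count
    return dict(totals)

by_state = group_by(facts, 'state')
by_title = group_by(facts, 'title')
print(by_state)
print(by_title)
```

The same list of rows supports grouping by any column, which is what makes the flat, table-like layout convenient for aggregation.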
Answer #6:
If the number of nesting levels is small, I use collections.defaultdict for this:
from collections import defaultdict
def nested_dict_factory():
    return defaultdict(int)

def nested_dict_factory2():
    return defaultdict(nested_dict_factory)

db = defaultdict(nested_dict_factory2)
db['new jersey']['mercer county']['plumbers'] = 3
db['new jersey']['mercer county']['programmers'] = 81
Using defaultdict like this avoids a lot of messy setdefault(), get(), etc.
Answer #7:
This is a function that returns a nested dictionary of arbitrary depth:
from collections import defaultdict
def make_dict():
    return defaultdict(make_dict)
Use it like this:
d=defaultdict(make_dict)
d["food"]["meat"]="beef"
d["food"]["veggie"]="corn"
d["food"]["sweets"]="ice cream"
d["animal"]["pet"]["dog"]="collie"
d["animal"]["pet"]["cat"]="tabby"
d["animal"]["farm animal"]="chicken"
Iterate through everything with something like this:
def iter_all(d, depth=1):
    for k, v in d.items():
        print("-" * depth, k)
        if isinstance(v, defaultdict):
            iter_all(v, depth + 1)
        else:
            print("-" * (depth + 1), v)

iter_all(d)
This prints out:
- food
-- meat
--- beef
-- veggie
--- corn
-- sweets
--- ice cream
- animal
-- pet
--- dog
---- collie
--- cat
---- tabby
-- farm animal
--- chicken
You might eventually want to make it so that new items cannot be added to the dict. It’s easy to recursively convert all these defaultdicts to normal dicts.
def dictify(d):
    for k, v in d.items():
        if isinstance(v, defaultdict):
            d[k] = dictify(v)
    return dict(d)
Answer #8:
I find setdefault quite useful; it checks if a key is present and adds it if not:
d = {}
d.setdefault('new jersey', {}).setdefault('mercer county', {})['plumbers'] = 3
setdefault always returns the value stored under the given key (inserting the default first if the key was missing), so you are actually updating the values of 'd' in place.
When it comes to iterating, I’m sure you could write a generator easily enough if one doesn’t already exist in Python:
def iterateStates(d):
    # Let's count up the total number of "plumbers" / "dentists" / etc.
    # across all counties and states
    job_totals = {}
    # I guess this is the annoying nested stuff you were talking about?
    for (state, counties) in d.items():
        for (county, jobs) in counties.items():
            for (job, num) in jobs.items():
                # If job isn't already in job_totals, default it to zero
                job_totals[job] = job_totals.get(job, 0) + num
    # Now return an iterator of (job, number) tuples
    return iter(job_totals.items())

# Display all jobs
for (job, num) in iterateStates(d):
    print("There are %d %s in total" % (num, job))
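If the nesting depth is not fixed, a recursive generator is one possible way to avoid writing a loop per level. The sketch below is an illustration under that assumption; the names iter_flat, nested and flat are invented for the example.

```python
def iter_flat(d, prefix=()):
    """Yield (key_path, value) pairs for every non-dict leaf of a nested dict.

    Empty sub-dictionaries contain no leaves and therefore yield nothing.
    """
    for key, value in d.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from iter_flat(value, path)
        else:
            yield path, value

nested = {}
nested.setdefault('new jersey', {}).setdefault('mercer county', {})['plumbers'] = 3
nested.setdefault('new jersey', {}).setdefault('mercer county', {})['programmers'] = 81
nested.setdefault('new york', {}).setdefault('queens county', {})['plumbers'] = 9

# The flat view uses tuples of keys, like the tuple-keyed form in the question.
flat = dict(iter_flat(nested))
for path, value in flat.items():
    print(path, value)
```

This gives the flat view on demand while the data itself stays nested, so both ways of thinking about the structure remain available.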