# How is set() implemented?

### Question

I’ve seen people say that `set` objects in Python have O(1) membership-checking. How are they implemented internally to allow this? What sort of data structure do they use? What other implications does that implementation have?

Every answer here was really enlightening, but I can only accept one, so I’ll go with the closest answer to my original question. Thanks all for the info!

> Indeed, CPython’s sets are implemented as something like dictionaries
> with dummy values (the keys being the members of the set), with some
> optimization(s) that exploit this lack of values

So basically a `set` uses a hash table as its underlying data structure. This explains the O(1) membership checking, since looking up an item in a hash table is an O(1) operation on average.

If you are so inclined you can even browse the CPython source code for set which, according to Achim Domma, is mostly a cut-and-paste from the `dict` implementation.
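One practical implication of hash-based storage can be sketched in a few lines. This is only an illustration of observable behavior, not of CPython internals, and the variable names are made up for the example: membership is decided through `hash()` and `==`, so every member has to be hashable.

```python
# Membership is decided by hashing the candidate and comparing it
# with whatever is stored under a matching hash.
members = {"apple", "banana", "cherry"}
found = "apple" in members       # hashable and present
missing = "durian" in members    # hashable but absent

# A list is mutable and defines no hash, so it cannot be a set member.
try:
    members.add(["a", "list"])
    list_rejected = False
except TypeError:
    list_rejected = True

print(found, missing, list_rejected)  # True False True
```

Immutable containers such as tuples or `frozenset` objects can be stored instead, provided their own contents are hashable.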

When people say sets have O(1) membership-checking, they are talking about the average case. In the worst case (when all hashed values collide) membership-checking is O(n). See the Python wiki on time complexity.
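The worst case can be provoked on purpose. The sketch below uses a hypothetical `Key` class (not part of any library) whose hash can be forced to a constant, and counts how many `==` comparisons a single failed lookup needs. Exact counts are implementation details, so treat the printed numbers as indicative only.

```python
class Key:
    """Hypothetical key type that counts how often == is evaluated."""
    eq_calls = 0

    def __init__(self, value, constant_hash):
        self.value = value
        self.constant_hash = constant_hash

    def __hash__(self):
        # With constant_hash=True every key gets the same hash value.
        return 1 if self.constant_hash else hash(self.value)

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.value == other.value


def comparisons_for_missing_lookup(n, constant_hash):
    """Count == calls made while testing an absent key against n members."""
    members = {Key(i, constant_hash) for i in range(n)}
    Key.eq_calls = 0  # ignore comparisons made while building the set
    _ = Key(-1, constant_hash) in members  # the lookup being measured
    return Key.eq_calls


n = 200
colliding = comparisons_for_missing_lookup(n, constant_hash=True)
spread = comparisons_for_missing_lookup(n, constant_hash=False)
print("all hashes collide:", colliding, "comparisons")
print("hashes spread out: ", spread, "comparisons")
```

When every hash collides, the lookup has to compare the candidate against each stored key before it can answer "not present", which is the O(n) behavior described above.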

The Wikipedia article says the best case time complexity for a hash table that does not resize is `O(1 + k/n)`. This result does not directly apply to Python sets since Python sets use a hash table that resizes.

A little further on the Wikipedia article says that for the average case, and assuming a simple uniform hashing function, the time complexity is `O(1/(1-k/n))`, where `k/n` can be bounded by a constant `c<1`.

Big-O refers only to asymptotic behavior as n → ∞. Since `k/n` can be bounded by a constant `c<1`, independent of n, `O(1/(1-k/n))` is no bigger than `O(1/(1-c))`, which is equivalent to `O(constant)` = `O(1)`.

So, assuming simple uniform hashing, membership-checking for Python sets is `O(1)` on average.
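To make the bound concrete, here is a small numeric sketch. The helper name `expected_probes` is made up for this example, and the load factors are arbitrary illustrative values, not the thresholds CPython actually uses.

```python
def expected_probes(load_factor):
    """Average-case bound 1/(1 - k/n) under simple uniform hashing."""
    if not 0 <= load_factor < 1:
        raise ValueError("load factor must be in [0, 1)")
    return 1 / (1 - load_factor)


# The bound depends only on the load factor c, not on the number of elements.
for c in (0.25, 0.5, 0.75, 0.9):
    print(f"load factor {c:.2f} -> about {expected_probes(c):.1f} probes")
```

As long as resizing keeps the load factor below some fixed `c`, the bound stays the same constant however large the set grows.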

I think it’s a common mistake: `set` lookup (or hash table lookup, for that matter) is not O(1).
From Wikipedia:

> In the simplest model, the hash function is completely unspecified and the table does not resize. For the best possible choice of hash function, a table of size n with open addressing has no collisions and holds up to n elements, with a single comparison for successful lookup, and a table of size n with chaining and k keys has the minimum max(0, k-n) collisions and O(1 + k/n) comparisons for lookup. For the worst choice of hash function, every insertion causes a collision, and hash tables degenerate to linear search, with Ω(k) amortized comparisons per insertion and up to k comparisons for a successful lookup.

Related: Is a Java hashmap really O(1)?

We all have easy access to the source, where the comment preceding `set_lookkey()` says:

To emphasize the difference between sets and dicts a little more, here is an excerpt from the comment section of `setobject.c` that clarifies the main difference between the two.

> Use cases for sets differ considerably from dictionaries where looked-up
> keys are more likely to be present. In contrast, sets are primarily
> about membership testing where the presence of an element is not known in
> advance. Accordingly, the set implementation needs to optimize for both
> the found and not-found case.

Source on GitHub

Sets in Python use a hash table internally, so let us first talk about hash tables.

Suppose you want to store some elements in a hash table that has 31 slots. Let the elements be: 2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31. To use a hash table, you first determine the index at which each element will be stored. The modulus function is a popular way of computing these indices, so let us take one element at a time, multiply it by 100, and take the result modulo 31. Ideally each element maps to a unique index, because a slot in the table can hold only one element unless chaining is allowed. Each element is then stored at the index obtained from the modulo operation.

To search for an element in a set that stores its elements in this hash table, you recompute the index with the same modulo operation, which takes constant time, and look in that slot. That is why the search is O(1).

To illustrate the modulo operation, here is some code:

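```python
piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]

def hash_function(x):
    # Scale by 100, then reduce modulo the table size (31) to get a slot index.
    return int(x*100 % 31)

# Slot index for every element, in the same order as `piles`.
print([hash_function(pile) for pile in piles])
```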

Output: [4, 17, 8, 0, 16, 11, 10, 20, 21, 18]
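Building on that hash function, a toy table can be sketched as follows. `ToyHashSet` is a hypothetical class written for this answer. It resolves collisions by chaining (a small list per slot), which is a simplification and not how CPython lays out its sets. It also assumes non-negative inputs like the ones above.

```python
TABLE_SIZE = 31  # same number of slots as in the example above


def hash_function(x):
    # Same scheme as before: scale by 100, then take the result modulo 31.
    return int(x * 100 % TABLE_SIZE)


class ToyHashSet:
    """Hypothetical chained hash set, for illustration only."""

    def __init__(self):
        self.buckets = [[] for _ in range(TABLE_SIZE)]

    def add(self, x):
        bucket = self.buckets[hash_function(x)]
        if x not in bucket:  # avoid storing duplicates
            bucket.append(x)

    def __contains__(self, x):
        # Only one bucket is searched, however many elements are stored.
        return x in self.buckets[hash_function(x)]


piles = [2.83, 8.23, 9.38, 10.23, 25.58, 0.42, 5.37, 28.10, 32.14, 7.31]
toy = ToyHashSet()
for pile in piles:
    toy.add(pile)

print(8.23 in toy)  # a stored element
print(8.24 in toy)  # an element that was never added
```

A lookup computes one index and inspects one short bucket, so its cost stays roughly constant as long as the elements are spread evenly over the slots.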