Time and again, we run into a need to generate unique identifiers for items (for lack of a better word). These items could be entries in a log file, events that are emitted by a system, etc. No matter what the source of such items are, in order to uniquely identify an item, we will have to assign a unique identifier for each item (unless each such item is unique by itself, and the hash of the entire item will do to identify its uniqueness and is redundant to have an extra identifier – which is rare in practical scenarios).
To make it clear, let us outline the design goals of such an identifier:
- Should be unique for each item generated on a single node
- Should be unique for each item generated across many nodes
The simplest approach that works for a single-node is to start with an initial default value such as 0 or 1 and increment the identifier for each item. The first design goal could be achieved by having some sort of synchronization in code to make sure updates and reads to the identifier are consistent.
This simple approach breaks down when you need to ensure uniqueness across multiple nodes since more than one node could easily generate the same identifier if they all go in sequence. A naive approach is to centralize the identifier generator, which brings in all the problems of having a single point of failure. Another approach would be to have a perfectly synchronized cluster but that would be too expensive (in terms of latency, and not to mention the complexity it brings in – do we really need Paxos here?) since that would need each update to be propagated to every other node.
An alternative, yet powerful technique, is to use probabilistic methods.
Each node could generate a (probabilistically unique) identifier by picking a uniformly random sample from a commonly known universe. This approach has the advantage that it does not need any synchronization between the nodes (which is a big plus for performance). There’s no free lunch however. The performance comes with a cost – the chance of a collision happening, for both identifiers that are generated within a node and also those generated across multiple different nodes.
The probability of a collision may not be trivial esp. if the identifiers accumulate over time, and quickly add up. For instance, if you are using a 64-bit unsigned integer for storing the identifier, then the probability of a collision reaches 50% with about 5 billion items. Depending on the volume of the generated items, this may or may not be a problem for you. The number of items needed for collision is quite surprising until you learn about the birthday paradox.
Another solution is to use a combination of timestamp plus a random part for generating the identifiers (much like UUIDs). Even though the overall universe from which we generate identifiers still remains the same (64-bits), splitting the identifier this way would help diversify the probabilities so as to alleviate the problem of collisions. For instance, if you split the identifier into two parts, first 32-bits for the epoch time (in seconds), plus the second 32-bit random part, then the number of identifiers that need to be generated in the same second, so that the probability of a collision reaches 1 in 100 is 10000. i.e. if you are sure to generate well below 10000 identifiers in any given second (i.e. low burst-rate), then you may get more mileage with the 64-bit space. This may prove useful esp. considering that most data tends to get stale over time, and hence the identifiers might be unique for just enough time period for which they are of business value.
To conclude:
- If it is a greenfield project consider using UUIDs or GUIDs – check for library support in your development ecosystem
If you are constrained not to use UUIDs (e.g. due to downstream dependencies), then:
- If perfect synchronization is already available/implemented between nodes, then look no further, you already have a solution!
- Probabilistic methods may be super effective in some scenarios, like we saw above. Use as big a universe of space as feasible, for storing the identifier, to reduce chances of a collision