This article explores the creation of a data container for game programming. This container is meant to take the place of C++ standard library containers such as std::map and std::unordered_map, with an alternative that stores data contiguously in memory. Data should be contiguous to maximize cache locality when iterating over the data, resulting in faster performance vs. containers that individually allocate elements on the heap. The goal here is not to produce a drop-in replacement for the STL containers. That being said, whenever practical I tried to remain consistent with the conventions and semantics established in the STL. The std::vector is my poor-man's memory allocator of choice.
What sets this container apart from other associative containers like std::map and std::unordered_map, is that we do not specify the key when inserting data. This is why I'm referring to the key as a handle. We receive a handle back from the container upon inserting. Quite often when using associative containers, I find that the values of keys are not actually important. We just end up storing the keys and using them only for lookup purposes anyway. Often we use hash maps purely for the associative semantics of the container, and not because we actually care about the key values. We're still forced into supplying a key, which can lead to convenient but flawed choices like std::string. This is where handles can be a great alternative. We still have to store the handle somewhere, but the burden of type selection and supplying the value is relieved from the programmer. Handles also hold information internally that make this container smarter and safer than bog standard hash maps.
So now that I've given an overview of the container and why you might want to use it, let's outline the goals for the implementation.
Store data in a contiguous block of memory
This is key to our performance goals. We will separate our container into a dense set and a sparse set, where the data itself is kept in the dense set, tightly packed to ensure best cache locality during traversals. Even though this is an associative container, semantically it doesn't make sense to traverse in the order of the handles' values. That would also incur a performance penalty due to the double-indirection of handle lookups (more on that later). More likely, what you'll want to do is either look up a single entry by handle, or traverse all entries by their in-memory order just as you would an array/vector, completely bypassing the use of handles. This is not an ordered container, so traversal of the dense set is not guaranteed to visit elements in the same order each time.Create an associative mapping between a handle and the data
The sparse set will play two important roles. First, it is the handle index and remains stable with insertions and deletions. This allows us to adjust the memory location of elements in the dense set to keep them tightly packed, without invalidating our lookup value. The second role of the sparse set is to internally store a freelist. Gaps in the sparse set resulting from deletions will be filled by subsequent insertions. The sparse set will always fill gaps before growing, and thus will itself try to remain dense.O(1) lookups
Constant time lookups are achieved because handles contain information about their location in the sparse set. This makes lookups into the sparse set nothing more than an array lookup. The information contained in the sparse set knows the item's exact location in the dense set, so again just an array lookup. One consideration for lookups by handle vs. a direct pointer dereference is the cost of an extra indirection, so traversal "by handle" (over the sparse set) is not recommended. Therefore, iterators to the dense set are provided by the handle_map for traversals.O(1) insertions
Constant time insertions are done by pushing the new item to the back of the dense set, and using the first available slot in the sparse set's freelist for the index.O(1) removals
Constant time removal is made possible by using the "swap and pop" trick in the dense set. That is, we swap the item being removed with the last item in the array, and then pop_back to remove it. We also have to update the index of the last item referenced in the sparse set. We have to discover the location in the sparse set of the corresponding inner id for the (former) last item that we are swapping with. The index field must be changed to reference its new location in the dense set. One option would be to linearly search through the sparse set until we find the match, but to achieve O(1), we instead introduce a third storage set (in addition to the dense set and sparse set) which we'll term the meta set. The meta set is maintained in parallel with the dense set, and contains an index back to the sparse set, giving us a two-way link. We do not combine the meta information directly into the dense set because it's currently only used for deletion, and there is no sense in harming cache locality and alignment of the main object store for infrequently used data.
Let's begin with an explanation of the handle structure itself. The handle is a 64-bit value, so there is no additional overhead compared to a pointer on 64-bit platforms.
/** * @struct Id_T * @var free 0 if active, 1 if slot is part of freelist, only applicable to inner ids * @var typeId relates to m_itemTypeId parameter of handle_map * @var generation incrementing generation of data at the index, for tracking accesses to old data * @var index When used as a handle (outer id, given to the client): * free==0, index of id in the sparseIds array * When used as an inner id (stored in sparseIds array): * free==0, index of the item in the dense items array * free==1, index of next free slot, forming an embedded linked list * @var value unioned with the above four vars, used for direct comparison of ids */ struct Id_T { union { /** * the order of this bitfield is important for sorting prioritized by free, then typeId, * then generation, then index */ struct { uint32_t index; uint16_t generation; uint16_t typeId : 15; uint16_t free : 1; }; uint64_t value; }; }; typedef std::vector<Id_t> IdSet_T; #define NullId_T Id_T{}
It is worth mentioning the two ways that the handle structure is used, which I've termed "outer id" and "inner id." The outer id is an Id_T value that's given to the user of the container as a handle. The inner id is the related Id_T value that is stored in the sparse set. The two values differ in how the index property is used. The outer id stores an index into the sparse set, while the inner id stores an index into the dense set.
The typeId field is optional extra metadata, set in the container's constructor, that can introduce an element of type safety into your outer id handles. For example, say you have two handle_maps, one storing Apples and the other storing Oranges. The typeId would differentiate handles between the two, that would otherwise be equivalent. The programmer simply needs a mechanism to assign a unique 15-bit integer for each type contained within a handle_map. In many cases, you'll just bypass this and set zero. However, in cases where many types are tracked in an array of handle_maps, like components in an entity component system (ECS), or fields in a SoA inversion, the typeId can come in very handy. In addition to type safety, the typeId can be used as an additional O(1) lookup to first get you into the correct handle_map (within an array of handle_maps), before getting you to the final value. This is exactly how automatic storage for all component types is done in the griffin engine's ECS implementation.
The generation field is used as a safety mechanism to detect when a stale handle is trying to access data that has since been overwritten in the corresponding slot. Every time a slot in the sparse set is removed, the generation increments. Handle lookups assert that the outer id and inner id generations match. If debug asserts aren't safe enough for your purposes, an exception could be thrown and handled at runtime, but I view stale lookups as a bug and prefer to just crash hard and fix the issue.
The free field is a 1-bit flag to indicate if the sparse set entry represents a value, or if it is actually empty and part of the internal freelist. When free==1, the inner id's index field no longer stores a lookup index into the dense set, and instead stores the index of the next free slot in the internal freelist. This gives us an embedded singly linked list within the sparse set array, and as long as we store the front index of the freelist separately, it is an O(1) operation to find the next available slot and maintain the singly linked list.
The following free functions are used to directly compare Id_T types.
// struct Id_T comparison functions inline bool operator==(const Id_T& a, const Id_T& b) { return (a.value == b.value); } inline bool operator!=(const Id_T& a, const Id_T& b) { return (a.value != b.value); } inline bool operator< (const Id_T& a, const Id_T& b) { return (a.value < b.value); } inline bool operator> (const Id_T& a, const Id_T& b) { return (a.value > b.value); }
Let's jump into the handle_map itself. I've removed comments for brevity, so we can focus on the important parts. Links to the full implementation are provided at the bottom of the article.
template <typename T> class handle_map { public: struct Meta_T { uint32_t denseToSparse; //!< index into m_sparseIds array stored in m_meta }; typedef std::vector<T> DenseSet_T; typedef std::vector<Meta_T> MetaSet_T; // Functions T& at(Id_T handle); const T& at(Id_T handle) const; T& operator[](Id_T handle) { return at(handle); } const T& operator[](Id_T handle) const { return at(handle); } template <typename... Params> Id_T emplace(Params... args) { return insert(T{ args... }); } template <typename... Params> IdSet_T emplaceItems(int n, Params... args); typename DenseSet_T::iterator begin() { return m_items.begin(); } typename DenseSet_T::const_iterator cbegin() const { return m_items.cbegin(); } typename DenseSet_T::iterator end() { return m_items.end(); } typename DenseSet_T::const_iterator cend() const { return m_items.cend(); } size_t erase(Id_T handle); size_t eraseItems(const IdSet_T& handles); Id_T insert(T&& i); Id_T insert(const T& i); void clear() _NOEXCEPT; void reset() _NOEXCEPT; bool isValid(Id_T handle) const; size_t size() const _NOEXCEPT { return m_items.size(); } size_t capacity() const _NOEXCEPT { return m_items.capacity(); } template <typename Compare> size_t defragment(Compare comp, size_t maxSwaps = 0); DenseSet_T& getItems() { return m_items; } const DenseSet_T& getItems() const { return m_items; } MetaSet_T& getMeta() { return m_meta; } const MetaSet_T& getMeta() const { return m_meta; } IdSet_T& getIds() { return m_sparseIds; } const IdSet_T& getIds() const { return m_sparseIds; } uint32_t getFreeListFront() const { return m_freeListFront; } uint32_t getFreeListBack() const { return m_freeListBack; } uint16_t getItemTypeId() const { return m_itemTypeId; } uint32_t getInnerIndex(Id_T handle) const; explicit handle_map(uint16_t itemTypeId, size_t reserveCount) : m_itemTypeId(itemTypeId) { m_sparseIds.reserve(reserveCount); m_items.reserve(reserveCount); m_meta.reserve(reserveCount); } private: bool freeListEmpty() const { return (m_freeListFront == 0xFFFFFFFF); } // Variables uint32_t m_freeListFront = 0xFFFFFFFF; //!< start index in the embedded ComponentId freelist uint32_t m_freeListBack = 0xFFFFFFFF; //!< last index in the freelist uint16_t m_itemTypeId; //!< the Id_T::typeId to use for ids produced by this handle_map<T> uint8_t m_fragmented = 0; //<! set to 1 if modified by insert or erase since last complete defragment IdSet_T m_sparseIds; //!< stores a set of Id_Ts, these are "inner" ids indexing into m_items DenseSet_T m_items; //!< stores items of type T MetaSet_T m_meta; //!< stores Meta_T type for each item };
The insert and erase functions do most of the heavy lifting to keep things consistent.
template <typename T> Id_T handle_map<T>::insert(T&& i) { Id_T handle = { 0 }; m_fragmented = 1; if (freeListEmpty()) { Id_T innerId = { (uint32_t)m_items.size(), 1, m_itemTypeId, 0 }; handle = innerId; handle.index = (uint32_t)m_sparseIds.size(); m_sparseIds.push_back(innerId); } else { uint32_t outerIndex = m_freeListFront; Id_T &innerId = m_sparseIds.at(outerIndex); m_freeListFront = innerId.index; // the index of a free slot refers to the next free slot if (freeListEmpty()) { m_freeListBack = m_freeListFront; } // convert the index from freelist to inner index innerId.free = 0; innerId.index = (uint32_t)m_items.size(); handle = innerId; handle.index = outerIndex; } m_items.push_back(std::forward<t>(i)); m_meta.push_back({ handle.index }); return handle; } template <typename T> size_t handle_map<T>::erase(Id_T handle) { if (!isValid(handle)) { return 0; } m_fragmented = 1; Id_T innerId = m_sparseIds[handle.index]; uint32_t innerIndex = innerId.index; // push this slot to the back of the freelist innerId.free = 1; ++innerId.generation; // increment generation so remaining outer ids go stale innerId.index = 0xFFFFFFFF; // max numeric value represents the end of the freelist m_sparseIds[handle.index] = innerId; // write outer id changes back to the array if (freeListEmpty()) { // if the freelist was empty, it now starts (and ends) at this index m_freeListFront = handle.index; m_freeListBack = m_freeListFront; } else { m_sparseIds[m_freeListBack].index = handle.index; // previous back of the freelist points to new back m_freeListBack = handle.index; // new freelist back is stored } // remove the component by swapping with the last element, then pop_back if (m_items.size() > 1) { std::swap(m_items.at(innerIndex), m_items.back()); std::swap(m_meta.at(innerIndex), m_meta.back()); // fix the ComponentId index of the swapped component m_sparseIds[m_meta.at(innerIndex).denseToSparse].index = innerIndex; } m_items.pop_back(); m_meta.pop_back(); return 1; }
In the spirit of the data-oriented programming mantra "where there is one, there are many" I have provided a couple of methods to emplace and erase items in batch. We also get to have some fun with C++11 variadic templates, which enable us to pass any constructor arguments needed for the items.
template <typename T> template <typename... Params> IdSet_T handle_map<T>::emplaceItems(int n, Params... args) { IdSet_T handles(n); assert(n > 0 && "emplaceItems called with n = 0"); m_fragmented = 1; m_items.reserve(m_items.size() + n); // reserve the space we need (if not already there) m_meta.reserve(m_meta.size() + n); std::generate_n(handles.begin(), n, [&, this](){ return emplace(args...); }); return handles; // efficient to return vector by value, copy elided with NVRO (or with C++11 move semantics) } template <typename T> size_t handle_map<T>::eraseItems(const IdSet_T& handles) { size_t count = 0; for (auto h : handles) { count += erase(h); } return count; }
The at function shows our O(1) lookup by handle, plus typeId and generation asserts for added safety.
template <typename T> inline T& handle_map<T>::at(Id_T handle) { assert(handle.index < m_sparseIds.size() && "outer index out of range"); Id_T innerId = m_sparseIds[handle.index]; assert(handle.typeId == m_itemTypeId && "typeId mismatch"); assert(handle.generation == innerId.generation && "at called with old generation"); assert(innerId.index < m_items.size() && "inner index out of range"); return m_items[innerId.index]; }
Example:
handle_map<int> testMap(0, 1); Id_T handle = testMap.insert(1); int i = testMap[handle];
There are several areas in a game engine where handle_map can be employed. This container is not a silver bullet, but given the performance characteristics, I often prefer to use it where a plain array or vector are unsuitable.
Some examples include:
Entity component systems -The components of an ECS could be stored in a handle_map, and use handles externally to store references to the entity or its components.
Scene graphs and embedded linked lists -An embedded list or tree structure is straightforward to represent by storing handles instead of direct pointers. Combined with a defragment sort, the flat representation can also be optimized for cache locality of adjacent nodes.
Resource caches -It is common practice to reference resources by handle rather than directly by pointer or shared_pointer, but the handle_map isn't ideal for directly storing large blobs since entries are constantly being swapped around. Consider this pattern instead: cache the resource in a memory location that doesn't change, and store a shared_ptr to it in the handle_map cache. There is more to it, but I am planning a future article covering my resource cache implementation in full detail.
I did some profiling to compare handle_map's performance to a vector of heap allocated unique_ptr's, and unordered_map. This should give a rough idea of the performance advantages over those two alternatives.
Creating 100,000 items:
handle_map: create 100000 items, time = 4.749955 ms unique_ptr: create 100000 items, time = 111.625944 ms unordered_map: create 100000 items, time = 89.358049 ms
Iterating over all items, summing each item into total:
handle_map: iteration over dense set, total = 100000, time = 0.263713 ms unique_ptr: iteration, total = 100000, time = 0.524748 ms unordered_map: iteration by iterator, total = 100000, time = 3.466641 ms
Iterating by lookup:
handle_map: iteration by handle, total = 100000, time = 0.653704 ms unordered_map: iteration by lookup, total = 100000, time = 15.921830 ms
Clearing all items:
handle_map: clear 100000 items, time = 0.377051 ms unique_ptr: clear 100000 items, time = 10161.480165 ms unordered_map: clear 100000 items, time = 7615.724425 ms
Not lightweight
Under the hood, handle_map is implemented with three vectors plus (at least) 11 bytes of additional overhead.Extra indirection on lookups
Lookups by handle require indexing into two arrays, which can add an extra cache miss over just dereferencing a pointer. This isn't something you should worry about too much, just make sure to iterate over the dense set using getItems() whenever possible. If handle_map is used to store components for an ECS, the "system" implementation should be able to bypass handles and operate directly on the dense set.What to do when an obsolete key is used for lookup
My preference is to use this as a debug-time assert. The assumption is that I don't ever expect it to happen, and in a release build we'd have undefined behavior and possibly a crash. It would be trivial to instead throw an exception, and try to recover from the situation in much the same way you would recover from failing to lock a dangling weak_ptr.Thread safety
This container is not thread-safe. Other than concurrent_queue, I'm not a big fan of implementing thread safety at the container level. I find that doing so often leads to two subtle problems. First, over-locking: where each function call is protected by a mutex, and calls to the container's functions become internally serialized. Adjacent function calls lose the ability to reduce contention by sharing a lock. Second, under-locking: where race conditions can still exist because locks are not made at a higher "transactional" level. I prefer instead to do multithreading with tasks.clear() vs. reset()
There are two ways to purge all data from the container. The slower but safer method is to use clear(), which maintains the integrity of the sparse set, increments the generation of all entries, and can detect future lookups with a stale handle. The faster method is to use reset() which destroys the sparse set and empties the freelist.Fragmentation of entries
Over time, items will be added out of order, and removed using the "swap and pop" method. This leads to a loss of adjacency and cache locality for items that were originally (or should be) added next to each other. I have provided a defragment function that allows the client to define a sort order for the dense set.If the implementation looks menacing, it's really not all that bad. You're basically looking at a standard insertion sort, with a small optimization for trivially_copyable types. In my profiling, I found the optimized version to be about 8-10% faster. Additionally, the m_fragmented flag is used to eliminate all work when no changes to the container have occurred since the last defragment completion (which should be the statistically common case).
I chose insertion sort due to its properties of being stable and adaptive, with very little overhead. The assumption I'm making is that initially, the data set may be small but randomly ordered. Over time (with occasional defragmentation), the data set may grow but remain in mostly sorted order. The data set is also likely to have "few unique" items where they just need to be grouped next to each other. Using my new favorite web page, I found that insertion sort looked like a good fit.
template <typename T> template <typename Compare> size_t handle_map<T>::defragment(Compare comp, size_t maxSwaps) { if (m_fragmented == 0) { return 0; } size_t swaps = 0; int i = 1; for (; i < m_items.size() && (maxSwaps == 0 || swaps < maxSwaps); ++i) { T tmp = m_items[i]; Meta_T tmpMeta = m_meta[i]; int j = i - 1; int j1 = j + 1; // trivially copyable implementation if (std::is_trivially_copyable<t>::value) { while (j >= 0 && comp(m_items[j], tmp)) { m_sparseIds[m_meta[j].denseToSparse].index = j1; --j; --j1; } if (j1 != i) { memmove(&m_items[j1+1], &m_items[j1], sizeof(T) * (i - j1)); memmove(&m_meta[j1+1], &m_meta[j1], sizeof(Meta_T) * (i - j1)); ++swaps; m_items[j1] = tmp; m_meta[j1] = tmpMeta; m_sparseIds[m_meta[j1].denseToSparse].index = j1; } } // standard implementation else { while (j >= 0 && (maxSwaps == 0 || swaps < maxSwaps) && comp(m_items[j], tmp)) { m_items[j1] = std::move(m_items[j]); m_meta[j1] = std::move(m_meta[j]); m_sparseIds[m_meta[j1].denseToSparse].index = j1; --j; --j1; ++swaps; } if (j1 != i) { m_items[j1] = tmp; m_meta[j1] = tmpMeta; m_sparseIds[m_meta[j1].denseToSparse].index = j1; } } } if (i == m_items.size()) { m_fragmented = 0; } return swaps; }
Example:
const int N = 100000; struct Test { int val, sort; }; handle_map<Test> testMap(0, N); for (int i = 0; i < N; ++i) { testMap.emplace(1, i); } std::random_shuffle(testMap.begin(), testMap.end()); timer.start(); size_t swaps = testMap.defragment([](const Test& a, const Test& b) { return a.sort > b.sort; }); timer.stop();
As you can see, a near worst case scenario sort (profiled above) can be quite heavy on performance.
handle_map: defragment swaps = 99987, total = 100000, time = 32538.735446 ms
For that reason, the operation is reentrant and can be approached iteratively. Passing maxSwaps will effectively set a ballpark time budget, and calling it periodically will cause the container to always approach the ideal order. Keep in mind, this does not change the fact that this is semantically still an unordered container.
The handle_map is an associative container implemented with the principles of data-oriented design in mind. The container is semantically similar to std::map or std::unordered_map, and exhibits constant time complexity for inserts, lookups, and removals. Handles are used as keys for lookups and contain "smart" internal safety features such as typeId and generation checks. A defragment sort operation can be defined by the user to incrementally approach the ideal in-memory order of the contiguous data set.
- Create a way to specify a custom allocator?
- Turn Meta_T into template parameter to enable additional metadata?
- Anything else?
Github repo: https://github.com/y2kiah/griffin-containers
Online documentation available at: https://y2kiah.github.io/griffin-containers/
7 Sept 2016: Initial release