The Least Frequently Used (LFU) Cache problem brings together a number of Computer Science techniques and principles to solve a very “Computer Science’y” problem.
In summary, the problem asks you to build a cache that mimics the behavior found in hardware caches (e.g. CPU caches) as well as software-based caches (e.g. Redis, memcached, etc.). The cache is composed of key:value pair objects.
The problem is essentially structured into 3 areas:
- Initialising a cache of a given size
- Supporting the addition of items to the cache, such that if the cache is full, the least frequently accessed item is removed (to make space) before the new item is added
  - If the item already exists in the cache, its value is updated and its access frequency is incremented by 1
- Supporting the retrieval of items from the cache, such that each retrieval increments the retrieved item’s access frequency and records when it was last accessed (importantly, relative to other items accessed at the same frequency)
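Put another way, the cache must satisfy an interface along these lines (a Python sketch; the method names `get` and `put` follow the common formulation of this problem and are assumptions on my part, not necessarily those of the original solution):

```python
class LFUCache:
    def __init__(self, capacity: int):
        # Initialise a cache that holds at most `capacity` key:value pairs.
        self.capacity = capacity

    def put(self, key: int, value: int) -> None:
        # Add key:value. If the key exists, update its value and bump its
        # access frequency. If the cache is full, first evict the least
        # frequently used key (ties broken by least recently used).
        raise NotImplementedError

    def get(self, key: int) -> int:
        # Return the value for `key` (bumping its frequency and recency),
        # or -1 if the key is absent.
        raise NotImplementedError
```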
The problem description mentions that a solution of complexity O(1) is desirable – meaning the retrieval or addition of an item takes the same number of steps regardless of the item or the size of the cache (from a computational perspective – obviously larger software caches may still perform more slowly once hardware caching, virtual memory and paging come into play).
Ignoring the O(1) requirement, I identified two possible solutions:
- Iterative approach (not O(1))
- Doubly Linked-List approach (O(1))
The initial approach that comes to mind is to split the items into buckets based on frequency, with the buckets containing an array of items. The item at the top of the array is the most recently accessed item for that frequency. This is demonstrated in the diagram below:
The issue with this solution is that in order to determine whether an item is in the bucket, you must loop through potentially all the items in the bucket. This prevents the solution from achieving O(1); it is closer to O(n).
Other issues I encountered with this approach were around maintaining index pointers into the array, and the complexity this introduced when items were removed in particular. Essentially, I found a loop was always required (although I’d love to see an O(1) solution for this approach if it exists!).
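To make the issue concrete, here is a minimal Python sketch of this bucket approach (the class and method names are mine, not those of the original solution). Note how both operations depend on `_find`, which must scan the buckets:

```python
class IterativeLFUCache:
    """Naive LFU cache: one list of (key, value) pairs per frequency.
    Most recently accessed items sit at the end of each list."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buckets = {}  # frequency -> list of (key, value) pairs
        self.size = 0

    def _find(self, key):
        # The O(n) culprit: potentially scans every item in every bucket.
        for freq, items in self.buckets.items():
            for i, (k, _) in enumerate(items):
                if k == key:
                    return freq, i
        return None

    def _promote(self, freq, i, value):
        # Move an item from its current bucket to the next frequency up.
        key, _ = self.buckets[freq].pop(i)
        if not self.buckets[freq]:
            del self.buckets[freq]
        self.buckets.setdefault(freq + 1, []).append((key, value))

    def get(self, key):
        found = self._find(key)
        if found is None:
            return -1
        freq, i = found
        value = self.buckets[freq][i][1]
        self._promote(freq, i, value)
        return value

    def put(self, key, value):
        found = self._find(key)
        if found is not None:
            self._promote(*found, value)
            return
        if self.size >= self.capacity:
            min_freq = min(self.buckets)   # yet another scan
            self.buckets[min_freq].pop(0)  # least recent sits at the front
            if not self.buckets[min_freq]:
                del self.buckets[min_freq]
            self.size -= 1
        self.buckets.setdefault(1, []).append((key, value))
        self.size += 1
```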
Doubly Linked-List Approach
In order to achieve O(1), a solution must be designed such that, in simple terms, there are no loops or recursion (both are valid approaches to solving the problem, but they rule out O(1)). Whereas the previous solution relied on buckets and loops to look through all the items, an O(1) solution must therefore utilise different data structures.
Those data structures are:
- The doubly linked-list
- The dictionary (hash map)
The solution is highlighted in the diagram below:
Two doubly linked-lists are used:
- The first doubly linked-list creates a set of buckets that contain items based on the frequency at which they’re accessed
- The second doubly linked-list (the root of which is the ‘data’ segment of the first doubly linked-list – the frequency bucket) stores the items in order of last accessed
  - This is particularly useful when wanting to delete the least frequently accessed item
This data structure on its own does not meet our O(1) requirement; to determine whether an item exists, we’d have to traverse each bucket and the items within it (essentially O(n)). The solution is simple: maintain a dictionary that points to the items within the doubly linked-list structure.
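As a minimal Python sketch of this design (all names are my own – see the GitHub repository for the actual solution): the `buckets` mapping stands in for the first linked-list of frequency buckets, each bucket is an `OrderedDict` (itself backed by a doubly linked-list) standing in for the hand-rolled second list, and `key_freq` is the dictionary that provides O(1) item lookup:

```python
from collections import defaultdict, OrderedDict

class LFUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.key_freq = {}                       # key -> access frequency
        self.buckets = defaultdict(OrderedDict)  # freq -> {key: value}, LRU first
        self.min_freq = 0                        # lowest non-empty frequency

    def _bump(self, key):
        # Move a key from its current frequency bucket to the next one up.
        freq = self.key_freq[key]
        value = self.buckets[freq].pop(key)
        if not self.buckets[freq]:
            del self.buckets[freq]
            if self.min_freq == freq:
                self.min_freq += 1
        self.key_freq[key] = freq + 1
        self.buckets[freq + 1][key] = value
        return value

    def get(self, key):
        if key not in self.key_freq:
            return -1
        return self._bump(key)

    def put(self, key, value):
        if self.capacity <= 0:
            return
        if key in self.key_freq:
            self._bump(key)
            self.buckets[self.key_freq[key]][key] = value  # update value
            return
        if len(self.key_freq) >= self.capacity:
            # Evict the least recently used key in the lowest-frequency bucket.
            evict_key, _ = self.buckets[self.min_freq].popitem(last=False)
            del self.key_freq[evict_key]
            if not self.buckets[self.min_freq]:
                del self.buckets[self.min_freq]
        self.key_freq[key] = 1
        self.buckets[1][key] = value
        self.min_freq = 1
```

With this structure, `get` and `put` each perform a constant number of dictionary and linked-list operations – no loops over the cache contents are required.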
There are additional minor details which won’t be expanded upon in this post, but I invite you to critique the solution on GitHub.
This problem was definitely less of a thought experiment and more of a problem you may come across during a career as a Software Engineer. My takeaways are as follows:
- Understand when a solution will benefit from a more rigid approach – apart from scribbling a partial solution on paper, I started development straight away. The doubly linked-list solution developed naturally, and therefore I hadn’t written it to be reusable or built it on a reusable library. This resulted in some code duplication and generally increased the difficulty of debugging the solution. Obviously, if I were to productionise a solution such as this, I would refactor it to use a well-tested library implementation of a doubly linked-list.
- Having a basic understanding of computational complexity is important for all software engineers – regardless of whether you’re developing for embedded systems or consumer websites. Problems can be solved in any number of ways, but it’s often not until a solution is scaled out (not least in the enterprise) that the way the problem was solved becomes important. By understanding the impact of your solution, you’ll save yourself some pain further down the road.
- All software boils down to data structures and algorithms. You may have been taught about them at University and now subconsciously use them in your job; or you may have never studied them but as a Software Engineer find yourself using them naturally. Having an understanding of the common data structures (linked lists, stacks, dictionaries, etc.) and how to operate upon them can increase the speed at which you develop reliable, quality code (especially if using libraries). Data Structures and Algorithms are not just topics to be studied for exams – keep them at the front of your mind.
I’ve solved many problems on other ‘coding’ challenge sites such as Project Euler – these problems can often seem too theoretical to take any practical lessons from. It’s been enjoyable to solve a problem that is Computer Science focused – whilst I didn’t learn as much as I might from a Project Euler problem, it reinforced some Computer Science theory and its importance in your day-to-day job as a Software Engineer.