I have now started on another crucial component of the project, i.e. a caching mechanism. So far I have been able to simulate mounting remote filesystems using SSHFS, which provides a convenient way to interact with files in your remote storage just as you would normally interact with files in your local Linux filesystem. Behind the scenes, however, there is an underlying network protocol for talking to the remote servers, which in the case of SSHFS is SSH. Any reads or writes to the remote filesystem go over the network.

That is all great, but the connection between two nodes across a network is not always stable, and there may be unforeseen issues on either side. Thus, there is always some form of latency in the communication between those two nodes. This is where caching comes into play. A cache is basically a temporary data store, which means that queries and data retrieval can be served from the cache, reducing the need to go all the way to a remote server for the same purpose. The use of a cache helps to address potentially high network latency between our server application and a user's cloud storage service. Since the cache is only meant to hold data temporarily, it needs to expire its entries based on some mechanism, which I will touch on later.

Apart from that, the cache could also help the application unmount the user's remote storage automatically on expiration. At the moment, the unmounting only happens when a DELETE request is sent to the REST API cache endpoint, but that is not guaranteed to always succeed, so the remote filesystem could potentially remain mounted for a very long time!

Fortunately, a previous intern already figured out how to implement the cache using the Poco C++ libraries. All I have to do is understand how it works and then make the necessary modifications for our current purposes. Poco's Cache framework is essentially an std::map with the capabilities a cache requires, such as a size limit and strategies for expiring entries. It supports two common mechanisms for expiring cache entries, i.e. Least Recently Used (LRU) and time-based expiration. The LRU strategy enforces a size limit: if the cache is already at full capacity when a new entry arrives, the LRU algorithm replaces the entry that has not been accessed for the longest time with the new one. Time-based expiration, on the other hand, simply removes entries after a specified time interval. We will use both for our server application. The application's cache would actually only hold the metadata for the user's files, such as the file path on the local machine, which points to a designated directory where the user's project and database file would be.
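To make the LRU idea concrete, here is a minimal standard-library sketch of the eviction policy that an LRU cache applies. This is not Poco's actual implementation; all names here (TinyLRUCache, etc.) are illustrative, and Poco's LRUCache wraps the same idea behind its own interface.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative sketch of the LRU policy: a bounded cache that, when full,
// evicts the entry that has not been accessed for the longest time.
template <typename K, typename V>
class TinyLRUCache {
public:
    explicit TinyLRUCache(std::size_t capacity) : capacity_(capacity) {}

    void add(const K& key, const V& value) {
        auto it = index_.find(key);
        if (it != index_.end()) {
            // Replace the existing entry and mark it most recently used.
            it->second->second = value;
            order_.splice(order_.begin(), order_, it->second);
            return;
        }
        if (index_.size() >= capacity_) {
            // Evict the least recently used entry (back of the list).
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(key, value);
        index_[key] = order_.begin();
    }

    std::optional<V> get(const K& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return std::nullopt;
        // A read counts as an access: move the entry to the front.
        order_.splice(order_.begin(), order_, it->second);
        return it->second->second;
    }

    std::size_t size() const { return index_.size(); }

private:
    std::size_t capacity_;
    std::list<std::pair<K, V>> order_;  // front = most recently used
    std::unordered_map<K, typename std::list<std::pair<K, V>>::iterator> index_;
};
```

With a capacity of two, adding a third entry evicts whichever of the first two was touched least recently.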

However, in our case, we would not want entries to merely expire. On expiration, we would want to write those files back to the user's storage; this is how write-back schemes work. Poco's Cache framework lets us define our own strategy to do that. There is an issue, though: a strategy cannot access the actual values in the cache, and even reading a value internally via its key causes unintended modifications to the cache. The previous intern solved the problem by keeping another std::map that mirrors the underlying map structure within the framework. His implementation at the time interfaced with Dropbox, but this is where I have to make changes so that it works with all of our supported protocols.
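The mirror-map trick can be sketched roughly as follows. This is a standard-C++ illustration of the idea rather than the actual Poco strategy code: the WriteBackMirror name and its onAdd/onRemove hooks are my own placeholders for the callbacks a custom strategy would receive, and the flush function stands in for the real upload to the user's storage.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Sketch of the write-back idea: keep a mirror map of the cached values so
// that, when an entry expires or is evicted, we still have the data needed
// to flush it back to remote storage.
class WriteBackMirror {
public:
    using FlushFn = std::function<void(const std::string& key,
                                       const std::string& value)>;
    explicit WriteBackMirror(FlushFn flush) : flush_(std::move(flush)) {}

    // Called whenever the cache adds or updates an entry.
    void onAdd(const std::string& key, const std::string& value) {
        mirror_[key] = value;
    }

    // Called when the cache evicts or expires an entry: write it back first.
    void onRemove(const std::string& key) {
        auto it = mirror_.find(key);
        if (it != mirror_.end()) {
            flush_(it->first, it->second);  // e.g. upload to the user's storage
            mirror_.erase(it);
        }
    }

private:
    FlushFn flush_;
    std::map<std::string, std::string> mirror_;  // mirrors the cache's map
};
```

Because the mirror holds its own copy of each value, the strategy never has to read back through the cache itself, sidestepping the unintended-modification problem.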

This week, I modified the CREATE part of the cache endpoint so that it not only mounts the remote filesystem on login, but is also used to add entries to the cache, i.e. when the user creates a new project. As the cache adds new entries, it internally needs to keep track of its current size. Dr Shawn informed me that if our application becomes widely used, the cache could quickly be filled to the brim. When it is at full capacity and new entries keep coming, that is where LRU kicks in, but in our case that would make the server application mount and unmount continuously, which is not what we want. So the application should inform us when the cache is almost full.

I had to find out how the application could potentially do this. For example, if I were to remove the LRU strategy, would it throw an exception when adding a new entry to a cache that is already at full capacity? Our application could then just handle that and inform us accordingly. However, with the LRU strategy removed, the cache would act just like an std::map, so entries could be added indefinitely. What about resizing the cache dynamically, then? Unfortunately, the framework does not allow that either. At least for now, I can keep track of the size as new entries get added to the cache in our self-defined strategy, and then log a message when the cache is already at 90% of its capacity.
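The near-capacity check itself is simple; a sketch of what the self-defined strategy could evaluate on each add might look like this. The function name and threshold parameter are illustrative, with 0.9 matching the 90% figure above.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of the near-capacity check: since the framework cannot resize or
// signal fullness itself, our own strategy can track the entry count and
// warn (e.g. log a message) once the cache crosses a threshold.
bool nearCapacity(std::size_t current, std::size_t capacity,
                  double threshold = 0.9) {
    return capacity > 0 &&
           static_cast<double>(current) >=
               threshold * static_cast<double>(capacity);
}
```

The strategy would call this after every insertion and log a warning the first time it returns true, well before LRU eviction would start churning mounts.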

Other than that, I also managed to implement the manual flushing of the cache on DELETE. Next, I will need to change the READ and UPDATE operations of the endpoint as well. On READ, if the queried file is still in the cache, then that file is used directly; otherwise, the remote storage is mounted if needed and the file is copied into the cache. The same would be done on UPDATE, except that the new changes would also be applied to that particular file. Adios for now.
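The planned READ path can be sketched as a simple read-through lookup. Everything here is a placeholder, not the endpoint's real code: the ReadThroughCache name is mine, and the fetch callback stands in for the actual mount-and-copy step via SSHFS.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Sketch of the READ behaviour: serve from the cache on a hit; on a miss,
// invoke the fetch callback (standing in for mounting the remote storage
// and copying the file in), then cache the result.
class ReadThroughCache {
public:
    using FetchFn = std::function<std::string(const std::string& path)>;
    explicit ReadThroughCache(FetchFn fetchFromRemote)
        : fetch_(std::move(fetchFromRemote)) {}

    std::string read(const std::string& path) {
        auto it = entries_.find(path);
        if (it != entries_.end()) return it->second;  // cache hit
        std::string value = fetch_(path);             // mount + copy on miss
        entries_[path] = value;
        return value;
    }

private:
    FetchFn fetch_;
    std::map<std::string, std::string> entries_;
};
```

UPDATE would follow the same hit-or-fetch path, then additionally apply the new changes to the cached copy before it is eventually written back.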

