This article was written by Farhaan Bukhsh, senior open source developer at OpenCraft.
We at OpenCraft, while maintaining Open edX instances for our clients, often run into unique problems with interesting solutions. Sometimes the journey to these solutions teaches us a lot about software engineering itself. In this post, we want to share one such journey, which began when one of our clients told us, “It is taking too much time to load a course”. To get to the bottom of the problem, we asked some questions, which led us to the core issue: the Course Outline page was taking too long to load.
The first thing to do in situations like these is to find out what exactly is taking so much time and where the execution flow is stuck, and then figure out a solution. We needed a profiler to pinpoint which part of the system needed our attention. We used django-silk, a handy profiling and inspection tool for the Django framework.
While profiling the endpoint, we noticed that the code responsible for fetching and transforming the course blocks took the largest share of the time. We found this by walking through the call graph generated during profiling.
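For readers who have not used django-silk, a minimal sketch of enabling it in a Django settings module might look like the following. The base INSTALLED_APPS and MIDDLEWARE lists are placeholders, not the actual Open edX settings; the `silk` app, `SilkyMiddleware`, and `SILKY_PYTHON_PROFILER` names come from django-silk itself.

```python
# Hypothetical settings fragment: the base lists stand in for a real project's settings.
INSTALLED_APPS = ["django.contrib.auth", "django.contrib.contenttypes"]
MIDDLEWARE = ["django.middleware.common.CommonMiddleware"]

# Register the silk app and its middleware so that requests are recorded.
INSTALLED_APPS.append("silk")
MIDDLEWARE.append("silk.middleware.SilkyMiddleware")

# Ask silk to also run the Python profiler on each recorded request.
SILKY_PYTHON_PROFILER = True
```

The silk UI also needs its URLs included in the project's URL configuration (silk ships them as `silk.urls`). How this is wired into an Open edX deployment is not covered by this sketch.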
Course Block Alteration
Once we figured out that the operation on the course blocks was the culprit, we started looking for a way to reduce the course load time. The first thing we tried was decreasing the depth of the course block tree, so that parsing and transforming the course blocks would take less time.
Having changed the nav_depth from 3 to 2, we saw a considerable drop in the load time of the request. However, this turned out to be a false positive since the first request to load the course was still taking a lot more time than subsequent requests.
This tickled our spider senses. We figured we were still unable to see the whole picture. After fiddling with and tracing through the Open edX codebase, we found that the course blocks are cached once they are generated for the first time, which is why the first request takes so much longer than subsequent ones. Now the puzzle was unraveling itself one piece at a time. Each time a course is visited, the platform requests the course blocks. It first checks the cache, and if it's a “miss”, it goes on to generate the course blocks and populate the cache.
This cache is stored in the Memcache cluster, and its entries are valid for 24 hours. Upon further investigation, we found that cache entries were being evicted for multiple courses. The two possible reasons for Memcache to evict a cache entry are:
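The check-then-populate flow described above is commonly called the cache-aside pattern. Here is a minimal sketch of it, assuming a plain dictionary in place of Memcache and a hypothetical `generate_course_blocks` function in place of the real fetch-and-transform code:

```python
_cache = {}            # stand-in for the Memcache cluster
generation_calls = []  # records each expensive generation, for illustration only


def generate_course_blocks(course_id):
    """Hypothetical placeholder for the expensive fetch-and-transform step."""
    generation_calls.append(course_id)
    return {"course_id": course_id, "blocks": ["chapter", "sequential", "vertical"]}


def get_course_blocks(course_id):
    """Return cached blocks, generating and caching them on a miss."""
    blocks = _cache.get(course_id)
    if blocks is None:  # cache miss: do the slow work once
        blocks = generate_course_blocks(course_id)
        _cache[course_id] = blocks
    return blocks


first = get_course_blocks("course-v1:demo")   # miss: generates and caches
second = get_course_blocks("course-v1:demo")  # hit: served from the cache
```

Only the first call pays the generation cost, which matches the slow first request and fast follow-up requests we observed.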
- The timeout for the cache expires
- There is no memory left in the Memcache cluster
Hence, our immediate reaction was first to disable the cache timeout and second to vertically scale our Memcache cluster. Even though these steps did help preserve the cache for the smaller courses, there was a particularly large course for which the course blocks were still being generated on every request.
As it turns out, although the solution above looks great in theory, we hadn't taken into account the way Memcache stores data internally. Memcache allocates memory in 1MB slab pages, and by default any item larger than 1MB is rejected rather than stored, which explains why the blocks for our largest course were never served from the cache.
This left us baffled and racking our brains for a solution. We were simultaneously discussing the problem on Discourse. We found out that the changes described above give an illusion of permanent storage but don't really solve the problem; they might even create bigger ones when a site starts getting more traffic.
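To make the two eviction causes concrete, here is a toy cache with an optional time-to-live and a fixed capacity. It is only an illustration: `TinyCache` is a hypothetical class, it counts entries rather than bytes, and real Memcache eviction works per slab class.

```python
import time
from collections import OrderedDict


class TinyCache:
    """Toy LRU cache with an optional TTL (illustrative, not Memcache's algorithm)."""

    def __init__(self, max_items, ttl_seconds=None):
        self.max_items = max_items
        self.ttl_seconds = ttl_seconds  # None means entries never time out
        self._data = OrderedDict()

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        expires = None if self.ttl_seconds is None else now + self.ttl_seconds
        self._data[key] = (value, expires)
        self._data.move_to_end(key)
        while len(self._data) > self.max_items:
            self._data.popitem(last=False)  # out of room: drop the least recently used

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if expires is not None and now >= expires:
            del self._data[key]  # timeout reached: the entry is gone
            return None
        self._data.move_to_end(key)
        return value
```

Passing `ttl_seconds=None` corresponds to disabling the timeout, and raising `max_items` corresponds to scaling the cluster up. Neither helps with a single entry that is too large to store at all.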
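A quick way to reason about the size limit is to measure a value's serialized size before caching it. The sketch below assumes pickle serialization and the default 1MB limit. The helper name and the sample payloads are made up for illustration, and a real client may compress data or add per-item overhead.

```python
import pickle

DEFAULT_ITEM_LIMIT = 1024 * 1024  # default maximum item size of 1MB


def fits_in_cache(value, limit=DEFAULT_ITEM_LIMIT):
    """Return True if the pickled value is no larger than the item size limit."""
    return len(pickle.dumps(value)) <= limit


# Hypothetical payloads standing in for serialized course blocks.
small_course = {"blocks": bytes(10 * 1024)}        # about 10KB
large_course = {"blocks": bytes(2 * 1024 * 1024)}  # about 2MB

small_ok = fits_in_cache(small_course)
large_ok = fits_in_cache(large_course)
```

Here `small_ok` is true and `large_ok` is false: the large payload cannot be stored no matter how much memory the cluster has.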
Caching and Tiering
Open edX already has various built-in strategies to handle such scenarios; we just needed to find a way to leverage them. Dave pointed out that we can use EDXAPP_BLOCK_STRUCTURE_SETTING to optimize the load time and have a permanent solution in place. This setting introduces a tiering solution, with the generated course blocks being stored permanently in an S3 bucket. Now when a user requests a course, the platform first checks Memcache for the course blocks and, in case of a miss, checks the S3 bucket and fetches the course blocks from there.
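The tiered lookup can be sketched as follows. This is not the Open edX implementation: `TieredBlockStore` is a hypothetical class, both tiers are plain dictionaries standing in for Memcache and the S3 bucket, and the behaviour on a miss in both tiers (generate and store) is an assumption made for the example.

```python
class TieredBlockStore:
    """Illustrative two-tier lookup: fast cache first, durable storage second."""

    def __init__(self, generate):
        self.fast_cache = {}     # stand-in for Memcache
        self.durable_store = {}  # stand-in for the S3 bucket
        self.generate = generate # expensive fallback, assumed for this sketch

    def get(self, course_id):
        """Return (blocks, source), where source names the tier that answered."""
        blocks = self.fast_cache.get(course_id)
        if blocks is not None:
            return blocks, "cache"

        blocks = self.durable_store.get(course_id)
        if blocks is not None:
            self.fast_cache[course_id] = blocks  # warm the fast tier again
            return blocks, "storage"

        blocks = self.generate(course_id)
        self.durable_store[course_id] = blocks
        self.fast_cache[course_id] = blocks
        return blocks, "generated"


store = TieredBlockStore(lambda course_id: {"course_id": course_id})
sources = [store.get("course-v1:demo")[1], store.get("course-v1:demo")[1]]
store.fast_cache.clear()  # simulate the cache entry being evicted
sources.append(store.get("course-v1:demo")[1])
```

After the simulated eviction, the lookup is answered from durable storage instead of regenerating the blocks, which is the property that made this approach attractive for us.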
This cache is refreshed/regenerated when a course is edited or when the admin command given below is executed.
We are using an S3 bucket as the storage backend.
After introducing these settings, we had to regenerate the cache by running the following command:
./manage.py lms generate_course_blocks --all_courses --with_storage
There were also a few switches we needed to activate:
Then we introduced a BlockStructureConfiguration with version 1 and cache expiration set to None.
This arrangement improved the page load time and saved us a lot of processing power.
We ran load tests on a few of the servers, and the configuration had indeed improved the platform's performance: a request that previously took ~10 seconds to load now took only ~3 seconds.