This article was written by team member Aayush Agrawal.

Every industry has its best practices, and we follow them for a simple reason: to walk the path already forged and successfully tested by others. In this blog post, I will describe some of the practices we employ to reduce maintenance and ensure maximum reliability in our hosting of Open edX instances.

Closely follow Open edX releases (and get rid of code drift)

Closely following Open edX named releases, which have release cycles of about six months, is an excellent goal. These named releases also receive upgrades in the form of point releases, which have a simpler upgrade path due to fewer radical changes.

These releases are a good target because they are fully supported by Open edX. The features and security of the latest named release are always improving, while older versions are not supported and don’t receive bug fixes or vulnerability patches. This is why operators that don’t want to upgrade their instances always have to backport code from newer releases.

One difficulty that might arise when upgrading an Open edX instance is merge conflicts. Any developer who has tried to merge a large number of changes knows how unwieldy this becomes as the number of changes grows. This alone is a good reason to get rid of code drift.

Closely following the upstream code and minimizing code drift makes the upgrade path much easier when a new named release or point release lands. However, if introducing code drift is unavoidable, it’s better to contribute the change upstream and get it included in a new release. We also think one of the best reasons to do so is to give back to the community built around the platform.
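One rough way to keep an eye on code drift is to count how many commits your fork carries that upstream does not have. The sketch below is only an illustration, not a description of our tooling: it wraps the standard `git rev-list --left-right --count A...B` command, and the function names, the ref names in the usage comment, and the repository path are all assumptions.

```python
import subprocess


def parse_left_right_count(output):
    """Parse the 'N<TAB>M' line printed by `git rev-list --left-right --count A...B`.

    N counts commits reachable only from A (here: upstream), and
    M counts commits reachable only from B (here: your fork).
    """
    behind, ahead = (int(part) for part in output.split())
    return {"behind": behind, "ahead": ahead}


def measure_drift(upstream_ref, local_ref, repo_dir="."):
    """Return how far local_ref has drifted from upstream_ref in repo_dir."""
    result = subprocess.run(
        ["git", "rev-list", "--left-right", "--count",
         f"{upstream_ref}...{local_ref}"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. when a ref does not exist
    )
    return parse_left_right_count(result.stdout)


# Hypothetical usage; both ref names are assumptions:
#   measure_drift("upstream/master", "my-fork/master", repo_dir="edx-platform")
```

The "ahead" number is the drift you have to carry through every upgrade; the smaller it stays, the fewer conflicts you should expect.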

At OpenCraft, we closely monitor Open edX releases and upgrade our clients’ instances as soon as possible, with planning starting as soon as the new version is announced. This is not possible without getting rid of code drift, which is why we aim to upstream all our code changes to Open edX.

Not only does this procedure benefit the community, but it also directly benefits our company, by reducing the upgrade burden, and our clients, by offering them faster and more affordable upgrades.

Identify issues before your learners (and keep monitoring)

Nothing is as frustrating as going through your online course just to stumble into an issue in your LMS.

Yet maybe even more frustrating is being an Open edX provider and getting a message from one of your learners trying to describe an issue. The easiest way to prevent this situation is to find issues before your learners do. Here are three strategies for that.

Having a Staging Environment

We strongly recommend deploying a staging environment, which is a safe place to test changes and let them break. A staging instance representative of production is a good place to catch issues that might have been missed during development.

One gotcha here is having a “reduced” representation of the production environment. For example, you might not set up SSO or MongoDB failover on the staging instance. This can be dangerous, as you might deploy a breaking change into production fully believing everything will work as intended. We therefore recommend having a staging environment that is as close to production as possible.
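To make the "as close as possible" goal checkable, you could keep a simple inventory of what each environment has enabled and compare the two. This is a minimal sketch under my own assumptions: the inventories, their keys, and the `parity_gaps` helper are hypothetical and do not come from any Open edX tooling.

```python
# Hypothetical inventories; the component names are illustrative only.
PRODUCTION = {"sso": True, "mongodb_failover": True, "load_balancer": True}
STAGING = {"sso": False, "mongodb_failover": False, "load_balancer": True}


def parity_gaps(production, staging):
    """Return the sorted names of components that are enabled in
    production but disabled or absent in staging."""
    return sorted(
        name
        for name, enabled in production.items()
        if enabled and not staging.get(name, False)
    )


# Report what the staging environment cannot currently exercise.
for name in parity_gaps(PRODUCTION, STAGING):
    print(f"staging is missing: {name}")
```

Anything this reports is a part of production that a staging test run never touches.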

Testing Changes Yourself

After deploying changes to the platform, ensure that you test the whole instance. While we might think we can test only the parts of the platform touched by a code change, correctly working out what a change touches is actually very hard in a platform the size of the Open edX LMS. To help with testing, OpenCraft developed a Manual Instance Checklist.

Monitoring Strategies

Monitoring is an automated way to identify issues. Strategies for this abound, and many services can be used: hardware monitoring, VM monitoring, process monitoring, performance monitoring, endpoint monitoring, etc. This topic will be more deeply investigated in a future blog post; stay tuned!

One gotcha is having “too much noise” in your monitoring. A monitoring plan that is too strict and fires alerts too frequently can desensitize the people receiving them. This is dangerous, because real issues then go unchecked, and an alert that doesn’t get checked is a waste of time and effort.
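One common way to cut alert noise is to alert only after several consecutive failed checks rather than on every blip. The sketch below shows that idea in isolation; the `AlertDebouncer` class and its default threshold of three are assumptions made for illustration, not part of any particular monitoring service.

```python
class AlertDebouncer:
    """Fire an alert only when a check has failed `threshold` times in a row."""

    def __init__(self, threshold=3):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self.failures = 0  # length of the current streak of failed checks

    def record(self, healthy):
        """Record one check result; return True if an alert should fire now."""
        if healthy:
            self.failures = 0  # any success ends the streak
            return False
        self.failures += 1
        # Fire exactly once, at the moment the streak reaches the threshold,
        # so a long outage does not produce a new alert on every check.
        return self.failures == self.threshold


# One passing check, four failures in a row, a recovery, then a single blip.
debouncer = AlertDebouncer(threshold=3)
checks = [True, False, False, False, False, True, False]
fired = [debouncer.record(result) for result in checks]
```

Here only the third consecutive failure triggers an alert; the isolated failure at the end does not.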

Redundancy, redundancy, redundancy (and more)

Anything can fail without warning on the Internet, including servers managed by others. One important way of making your Open edX instance more resilient is increasing its redundancy by ensuring that critical system components have another identical component with the same data that can take over in case of failure.

I won’t get into the details of MySQL and MongoDB replicas and failover as you’ll find some great writing on the subject. For example, check this article.

Here, I specifically want to bring attention to the importance of getting rid of single points of failure: single database server, single load balancer, single application server, etc.

Removing single points of failure is also interesting from a performance perspective, since the same redundant components can be used for horizontal scalability.
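The basic idea behind failover can be sketched in a few lines: keep an ordered list of equivalent components and use the first one that passes a health check. This is a deliberately simplified illustration. Real MySQL, MongoDB, and load balancer failover is handled by those systems themselves, and the `first_healthy` helper and the server names below are hypothetical.

```python
def first_healthy(backends, is_healthy):
    """Return the first backend whose health check passes, or None.

    `backends` is ordered by preference (for example, primary first).
    `is_healthy` is any callable that takes a backend and returns a
    truthy value when that backend can serve traffic.
    """
    for backend in backends:
        try:
            if is_healthy(backend):
                return backend
        except Exception:
            # A health check that raises counts as a failed check.
            continue
    return None


# Hypothetical example: the primary is down and a replica takes over.
status = {"db-primary": False, "db-replica-1": True, "db-replica-2": True}
active = first_healthy(["db-primary", "db-replica-1", "db-replica-2"], status.get)
```

With a single entry in the list, any failure returns `None`, which is exactly what a single point of failure looks like.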

Encryption (and security practices)

You can have all the patches in the world, but without encryption, an attack on your infrastructure can reveal everything anyway. This is why encryption is necessary if you are hosting anything, especially the Open edX platform, as it stores Personally Identifiable Information.

You can find more information on encrypting your Amazon S3 Object Storage, for example, in our previous blog post.

Besides encryption, following web application security practices is very important to prevent headaches in the future. This is another vast topic that has already been explored in articles around the internet, and each environment will have its own requirements.

What you should look for here is reducing your data and application servers’ exposure to a minimum, while hardening the servers that are required to be exposed, such as bastion hosts and load balancers.

At OpenCraft, we host and maintain Open edX instances for many organizations, including Harvard University, Arizona State University (ASU), Cloudera, and many more. If you’d like to avoid the headache and have your instance professionally managed, reach out to us at https://opencraft.com/contact-us/

Cover Photo by Markus Spiske on Unsplash