The main goal of the multi-region failover feature is to provide a disaster recovery option for Sitefinity Cloud. This add-on ensures the highest possible uptime of your website and the business continuity of your project.
PREREQUISITES: The failover option is available per environment. You must purchase this option for each environment that you want to provide failover for. This can be your production environment only, but if you have a content staging environment, you can purchase a failover option for it, as well.
To ensure continuous service, Sitefinity Cloud provides the failover option, which is based on two regions – Region 1 and Region 2. Region 1 is where your Sitefinity Cloud normally resides. Region 2 contains a copy of your production environment and database, and the database is continuously replicated from Region 1. While Region 1 is up and running, Region 2 is in a stopped state and has only one instance.
If a critical incident happens in Region 1, Region 2 is prepared for failover. The preparation phase breaks the database replication between the regions, warms up the services in Region 2, starts the app service, and spins up an additional instance in Region 2.
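The effect of the preparation phase on Region 2 can be pictured as a simple state change. The following is a minimal sketch only, not the actual pipeline: the `Region2State` type, its field names, and the `prepare_for_failover` function are hypothetical and exist purely to illustrate the four steps described above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Region2State:
    """Hypothetical model of Region 2; field names are illustrative."""
    replication_active: bool   # database replication from Region 1
    warmed_up: bool            # services warmed up
    app_service_running: bool  # app service started
    instance_count: int        # number of instances


# Standby state while Region 1 is healthy: stopped, one instance,
# database still replicated from Region 1.
STANDBY = Region2State(
    replication_active=True,
    warmed_up=False,
    app_service_running=False,
    instance_count=1,
)


def prepare_for_failover(state: Region2State) -> Region2State:
    """Apply the four preparation steps in the order the text lists them."""
    state = replace(state, replication_active=False)   # break replication
    state = replace(state, warmed_up=True)             # warm up services
    state = replace(state, app_service_running=True)   # start app service
    # Spin up one additional instance.
    state = replace(state, instance_count=state.instance_count + 1)
    return state


prepared = prepare_for_failover(STANDBY)
```

In this model, `prepared` describes a running Region 2 with two instances and no active replication, which matches the description of a region that is ready to take traffic.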
The failover option works at the region level. It does not work at the service level – a service in Region 1 cannot work with a resource in Region 2, and vice versa.
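The region-level constraint can be expressed as a simple check. The sketch below is illustrative only: the `cross_region_bindings` function and the service and resource names are hypothetical and do not correspond to any Sitefinity Cloud API.

```python
from typing import List, Tuple

# A binding pairs (service name, service region) with
# (resource name, resource region). All names are hypothetical.
Endpoint = Tuple[str, str]
Binding = Tuple[Endpoint, Endpoint]


def cross_region_bindings(bindings: List[Binding]) -> List[Tuple[str, str]]:
    """Return the (service, resource) pairs that span two regions."""
    return [
        (service, resource)
        for (service, service_region), (resource, resource_region) in bindings
        if service_region != resource_region
    ]


bindings = [
    (("web-app", "Region 1"), ("database", "Region 1")),  # valid
    (("web-app", "Region 2"), ("database", "Region 2")),  # valid
    (("web-app", "Region 1"), ("database", "Region 2")),  # not supported
]

invalid = cross_region_bindings(bindings)
```

Here `invalid` contains only the third pair, because it is the only one where the service and the resource sit in different regions.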
The following chart demonstrates how Region 2 is kept up-to-date with Region 1:
The process is as follows:
- A developer uploads a code change via the Management portal.
- The change is uploaded to the Staging environment of Region 1.
- When tests are complete, the change is promoted to the Production environment.
- At the same time, the change is deployed to the Production environment of Region 2.
- The Production database is backed up and restored to the Production database of Region 2.
- The Production database of Region 1 is continuously synced with the database of Region 2.
- The Azure Search index of Region 1 is synced with the search index of Region 2. This operation is performed every two hours.
There are two separate instances of Redis cache that are not synchronized.
NOTE: Continuous delivery code deployments to Region 2 are not permitted while Region 2 is the active region. This means that, in case of an incident in Region 1, while Region 2 is serving the traffic, you cannot download it for development and later upload code changes back to Region 2.
The failover mechanism is not automatic. The on-duty Sitefinity Cloud engineer must trigger the process manually. This prevents unintentional failovers. For example, if Region 1 restarts or has a minor incident, an automatic process would instantly trigger a failover, which is undesired.
The goal is to minimize the time required for troubleshooting and to restore the service in 30 minutes or less.
After the failover is complete and the service is running from Region 2, switching back to Region 1, once it is up and running again, is also a manual process.
The following chart demonstrates the failover process:
The process is as follows:
- At time zero, an incident in Region 1 occurs.
- The system sends a notification to the on-duty Sitefinity Cloud engineer.
- If the incident is not acknowledged, the system sends a notification to the backup on-duty engineer.
- When the incident is acknowledged, the engineer immediately triggers the Prepare for failover pipeline. The preparation phase warms up and starts Region 2.
- While Region 2 is warming up, the engineer troubleshoots Region 1 with the goal of getting it back up and running.
- If the incident in Region 1 is not resolved in 15 minutes, the engineer triggers the Complete failover pipeline.
- This redirects traffic to Region 2, and the failover is complete.
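The decision points in the steps above can be summarized as a small function. This is a sketch under stated assumptions, not the real tooling: the `Action` enum and `next_action` function are hypothetical, and the 15-minute window is assumed to be measured from time zero, which the text does not state explicitly.

```python
from enum import Enum

# From the text: failover is completed if Region 1 is not resolved
# in 15 minutes. Assumption: the window starts at time zero.
RESOLUTION_WINDOW_MINUTES = 15


class Action(Enum):
    """Hypothetical labels for what the engineer does next."""
    STAY_ON_REGION_1 = "Region 1 recovered, no failover needed"
    WAIT_FOR_ACKNOWLEDGEMENT = "notify on-duty, then backup engineer"
    PREPARE_AND_TROUBLESHOOT = "run Prepare for failover, troubleshoot Region 1"
    COMPLETE_FAILOVER = "run Complete failover, redirect traffic to Region 2"


def next_action(minutes_since_incident: float,
                acknowledged: bool,
                region1_recovered: bool) -> Action:
    """Map the current incident status to the next manual step."""
    if region1_recovered:
        return Action.STAY_ON_REGION_1
    if not acknowledged:
        # Nothing is triggered until an engineer acknowledges the incident.
        return Action.WAIT_FOR_ACKNOWLEDGEMENT
    if minutes_since_incident < RESOLUTION_WINDOW_MINUTES:
        return Action.PREPARE_AND_TROUBLESHOOT
    return Action.COMPLETE_FAILOVER


early = next_action(5, acknowledged=True, region1_recovered=False)
late = next_action(15, acknowledged=True, region1_recovered=False)
```

In this sketch, `early` is `Action.PREPARE_AND_TROUBLESHOOT` and `late` is `Action.COMPLETE_FAILOVER`. Every transition remains a manual decision by the engineer; the function only documents the order of the checks.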