Among the major challenges facing SEAT (and any company with significant investments in
information systems), Disaster Recovery is undoubtedly a priority.
Disaster Recovery refers to the set of technical measures and organisational
processes designed to restore the systems, data and infrastructures necessary to resume business
operations following a severe disruption.
The purpose behind Disaster Recovery is to ensure business continuity, i.e. the ability of a
company to continue to operate its business after catastrophic events.
To accomplish this, systems and business data are typically backed up and stored at a
secondary site so that, in the event of a disaster (earthquake, flood, terrorist attack, etc.) that
renders the primary site unusable, operations can be resumed as soon as possible at the secondary
site, with the least amount of data loss possible.
The main metrics for setting service level objectives with regard to resuming business
operations are:
- Recovery Time Objective (RTO) — the amount of time allowed to fully resume operations based on
the maximum acceptable downtime;
- Recovery Point Objective (RPO) — the maximum amount of time that can elapse between the
creation of data and its replication (e.g., through backup), based on the maximum data loss the
system can tolerate due to unexpected failure.
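As an illustration of how the two metrics are applied, the following minimal sketch checks a single
incident against an RPO and an RTO. The function name, the timestamps and the thresholds are all
hypothetical and are not taken from SEAT's systems or service levels.

```python
from datetime import datetime, timedelta


def check_objectives(last_copy_at, failure_at, restored_at, rpo, rto):
    """Return (rpo_met, rto_met) for a single incident (illustrative only)."""
    # Data created after the last replicated copy is lost in the failure.
    data_loss_window = failure_at - last_copy_at
    # Downtime runs from the failure until operations resume.
    downtime = restored_at - failure_at
    return data_loss_window <= rpo, downtime <= rto


# Hypothetical incident: last copy at 02:00, failure at 09:30, service back at 21:00.
result = check_objectives(
    last_copy_at=datetime(2024, 1, 10, 2, 0),
    failure_at=datetime(2024, 1, 10, 9, 30),
    restored_at=datetime(2024, 1, 10, 21, 0),
    rpo=timedelta(hours=4),   # assumed objective, for illustration
    rto=timedelta(hours=24),  # assumed objective, for illustration
)
print(result)  # (False, True)
```

In this made-up case 7.5 hours of data would be lost, which exceeds the assumed 4-hour RPO, while
the 11.5 hours of downtime stays within the assumed 24-hour RTO.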
There are two distinct methods of data replication:
- synchronous replication guarantees that an exact copy of the data is present at two sites. In
the event of a disaster at the primary site, operations can begin almost immediately at the
disaster recovery site (low RTO and RPO near zero);
- asynchronous replication copies data to the secondary site with a delay. It is used to overcome
the distance limitations of synchronous replication and to reduce the risks associated with
disasters with large-scale repercussions (e.g., earthquakes that could affect both sites), at the
cost of an RPO greater than zero.
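The difference between the two methods can be pictured with a toy model, sketched here under
simplifying assumptions: each site is a plain in-memory list, and the class and method names are
invented for illustration. It does not describe any real storage product or SEAT's own setup.

```python
class ReplicatedStore:
    """Toy model of a primary site and a secondary site (illustrative only)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []
        self.secondary = []
        self.pending = []  # written at the primary, not yet copied (async only)

    def write(self, record):
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the write completes only once both sites hold it.
            self.secondary.append(record)
        else:
            # Asynchronous: the copy to the secondary site happens later.
            self.pending.append(record)

    def ship_pending(self):
        """Copy all outstanding writes to the secondary site."""
        self.secondary.extend(self.pending)
        self.pending.clear()

    def records_lost_if_primary_fails(self):
        return len(self.primary) - len(self.secondary)


sync_store = ReplicatedStore(synchronous=True)
for r in ("a", "b", "c"):
    sync_store.write(r)

async_store = ReplicatedStore(synchronous=False)
for r in ("a", "b", "c"):
    async_store.write(r)
async_store.ship_pending()
for r in ("d", "e"):
    async_store.write(r)

print(sync_store.records_lost_if_primary_fails())   # 0
print(async_store.records_lost_if_primary_fails())  # 2
```

The synchronous store never loses a record in this model, which corresponds to an RPO near zero.
The asynchronous store loses whatever was written since the last shipment.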
In recent years, SEAT has devoted a great deal of attention to these matters and has adopted the
architecture shown in the attached diagram.
In 2005, a Machine Room was set up at the second site to house redundant systems. These systems are
configured to handle the replication of critical business data and to take over primary data
management so that operations can be resumed.
Synchronous replication guarantees that an exact copy of the data is present at the two sites.
In the event of a disaster involving the primary storage unit located at the primary site, the
servers can be redirected to the critical business data residing on the storage unit installed at
the disaster recovery site. The time needed to complete this switchover depends on the size of the
database, but normally does not exceed 24 hours.