Our methodology is based on several critical measures that help clients express their availability requirements. These measures include the recovery point objective (RPO), the recovery time objective (RTO), and the completion of processing objective (CPO). They play a major role in the first two steps of the design process: creating a risk profile and quantifying the financial impact of the risks.
RPO is the length of time between the last available backup (checkpoint) and a possible disruption (outage). The time of the last checkpoint corresponds to the state to which IT systems and data must be restored following an outage. The RPO determines the amount of data that may be lost due to an outage, and its value should be constrained to avoid serious consequences for the business. Its value varies from seconds (for a stock-trading application) to days (for some manufacturing environments). The RPO value determines the type of backup and data replication technology to be used.
RTO is the interval of time following an outage within which IT systems and data must be restored to the last checkpoint. Beyond this interval, a non-functioning computer system or business application is considered intolerable for the enterprise. Its value can vary from seconds to days, depending on how critical the application is to the organisation. The RTO value determines the type of load-balancing and recovery technology to be used.
CPO is the interval of time following an outage within which all processing backlog must be caught up. This is the amount of time after an outage until normal operations are restored. This value is critical for enterprises that have regulatory reporting requirements. Its value varies from days to weeks. The CPO value is used to size the backup processing facilities. It is also a factor in determining the RTO value.
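The three measures can be read as limits on three intervals in an outage timeline. The following is a minimal sketch of that reading, not part of the methodology itself: the class names, the `check` function, the target values and the dates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rpo: timedelta  # longest tolerable gap between last checkpoint and outage
    rto: timedelta  # longest tolerable time from outage to restored systems
    cpo: timedelta  # longest tolerable time from outage to cleared backlog


@dataclass
class Outage:
    last_checkpoint: datetime
    outage_start: datetime
    systems_restored: datetime
    backlog_cleared: datetime


def check(outage: Outage, target: Objectives) -> dict:
    """Report, per measure, whether the observed interval met its objective."""
    return {
        "rpo_met": outage.outage_start - outage.last_checkpoint <= target.rpo,
        "rto_met": outage.systems_restored - outage.outage_start <= target.rto,
        "cpo_met": outage.backlog_cleared - outage.outage_start <= target.cpo,
    }


# Illustrative values only (assumptions, not figures from the article).
target = Objectives(
    rpo=timedelta(minutes=15), rto=timedelta(hours=4), cpo=timedelta(days=2)
)
outage = Outage(
    last_checkpoint=datetime(2024, 1, 8, 9, 50),
    outage_start=datetime(2024, 1, 8, 10, 0),
    systems_restored=datetime(2024, 1, 8, 13, 0),
    backlog_cleared=datetime(2024, 1, 11, 10, 0),
)
print(check(outage, target))
```

In this made-up timeline the checkpoint gap and the restoration time fall inside their objectives, while the backlog takes three days to clear against a two-day CPO.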
How many data centres are required?
The number of data centres depends on the risks identified in a company’s risk profile and the availability requirements of its mission-critical business processes. A single data centre is known to be vulnerable to many types of risks, even with advanced high-availability features, such as dual power supply systems.
If all mission-critical system components were placed within a single data centre, the data centre itself would become a single point of failure. It is highly unlikely that the availability target could be met over the life of such a business system, given the risks associated with a single-data-centre configuration. Therefore, a minimal configuration of two active data centres provides the basic geographical dispersion essential to achieving long-term availability targets.
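A rough calculation shows why a second site helps. The sketch below assumes that site failures are statistically independent and that failover is instantaneous; both are simplifications, and shared risks such as a common power grid break the independence assumption. The function name and the 99.9% per-site figure are illustrative assumptions, not values from the methodology.

```python
def combined_availability(site_availability: float, sites: int) -> float:
    """Availability of a service that stays up while at least one site is up.

    Assumes independent site failures and ignores failover time.
    """
    return 1.0 - (1.0 - site_availability) ** sites


HOURS_PER_YEAR = 24 * 365

for n in (1, 2):
    a = combined_availability(0.999, n)  # 0.999 is an illustrative figure
    downtime_hours = (1.0 - a) * HOURS_PER_YEAR
    print(f"{n} site(s): availability {a:.6f}, about {downtime_hours:.2f} h/year down")
```

Under these assumptions the expected unavailability of two sites is the square of the single-site figure, which is why the degree of independence between sites matters so much.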
How far apart should data centres be placed?
Data loss tolerance and data communications network latency play a major role in determining data centre separation distance. A requirement of zero data loss implies that replication of data to a remote data centre (secondary site) must be synchronous. In that case, if immediate confirmation of a successful replication is not received, the application thread initiates corrective action and data loss is prevented.
A requirement of minimal data loss implies that a small delay is acceptable between a data update at the primary site and its replication at the secondary site. When the primary site fails, a small number of database updates at the primary site may be lost in-flight (incomplete or failed replication). Data communications network latency increases with distance.
To maintain acceptable transaction response times for the application in zero-data-loss business systems, network latency must be bounded. This places a limit on the separation distance between data centres. The limit is technology-dependent and in most cases does not exceed a fibre-cable distance of 100 km. It can be greater for minimal-data-loss business systems, for which asynchronous data updates are used.
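The link between distance and latency can be estimated from propagation delay alone. The sketch below assumes that light travels roughly 200 km per millisecond in optical fibre (about two-thirds of its speed in a vacuum) and ignores switching, protocol and storage overhead, so real figures will be higher. The function names and the `round_trips_per_write` parameter are hypothetical.

```python
# Assumption: light covers roughly 200 km per millisecond in optical fibre.
FIBRE_KM_PER_MS = 200.0


def round_trip_ms(fibre_km: float) -> float:
    """Propagation delay for one request/acknowledgement exchange."""
    return 2.0 * fibre_km / FIBRE_KM_PER_MS


def sync_write_penalty_ms(fibre_km: float, round_trips_per_write: int = 1) -> float:
    """Extra delay added to each synchronous write by the remote site."""
    return round_trips_per_write * round_trip_ms(fibre_km)


for km in (10, 100, 1000):
    print(f"{km} km of fibre: at least {sync_write_penalty_ms(km):.2f} ms per write")
```

Because a synchronous write waits for the remote acknowledgement, this delay is added to every update, which is why the separation distance has to stay short when zero data loss is required.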
Therefore, when adopting a particular data centre topology, one has to decide how business data replication will be synchronised across two or more sites. In all replication schemes, data updates are transmitted over a data communications link between two sites. A general understanding of data synchronisation technologies is helpful for understanding the constraints placed on data centre topologies. Selection of a specific technology also requires an understanding of how the database is used, the update frequency and other details that are beyond the scope of this paper.
The following are the four most common data replication approaches:
- Synchronous disk replication: a hardware technology in which data written to disk storage (usually redundant arrays of independent disks, or RAID) at the primary site is written simultaneously to a disk storage array at the secondary site. The communication links must have high bandwidth and low latency, so the distances they cover are rather short. A write operation is not considered complete until acknowledgements are received from both the local and remote disks, which ensures zero data loss.
- Asynchronous disk replication: a hardware technology in which data writes made to the disk storage at the primary site are also queued to be sent to the secondary site. The write operation is considered complete when acknowledgement is received from the local storage manager that handles data replication. The communication links can have lower bandwidth, so the distances they cover can be greater. Should the primary site fail, data still in the queue may be lost before it is replicated at the secondary site.
- Asynchronous database replication: a software technology in which data updates made to the database at the primary site are also queued to be sent to the secondary site. The communication links can have lower bandwidth, so the distances they cover can be greater. Should the primary site fail, data still in the queue may be lost before it is replicated at the secondary site.
- Replication by the application software: data replication is managed by the application, and its behaviour is highly application-dependent. Some applications require that a basic replication operation complete in milliseconds, while others allow completion to take days.
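The difference between synchronous and asynchronous replication comes down to when a write is acknowledged. The toy model below illustrates this with in-memory lists. It is a sketch under simplifying assumptions (no network, no failures during a write), and every class and function name is hypothetical rather than taken from any replication product.

```python
from collections import deque


class Site:
    """A toy storage site that remembers the writes it has applied."""

    def __init__(self):
        self.applied = []

    def write(self, record):
        self.applied.append(record)
        return True  # acknowledgement


class SyncReplicator:
    """A write completes only after both sites have acknowledged it."""

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary

    def write(self, record):
        local_ack = self.primary.write(record)
        remote_ack = self.secondary.write(record)
        return local_ack and remote_ack


class AsyncReplicator:
    """A write completes after the local acknowledgement; shipping is queued."""

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary
        self.queue = deque()

    def write(self, record):
        ack = self.primary.write(record)
        self.queue.append(record)
        return ack

    def drain(self, max_records):
        """Ship up to max_records queued writes to the secondary site."""
        for _ in range(min(max_records, len(self.queue))):
            self.secondary.write(self.queue.popleft())


def lost_on_primary_failure(primary, secondary):
    """Records acknowledged at the primary but absent from the secondary."""
    return [r for r in primary.applied if r not in secondary.applied]


sync_primary, sync_secondary = Site(), Site()
sync = SyncReplicator(sync_primary, sync_secondary)
async_primary, async_secondary = Site(), Site()
asyn = AsyncReplicator(async_primary, async_secondary)

for record in ("a", "b", "c"):
    sync.write(record)
    asyn.write(record)
asyn.drain(1)  # only one queued write is shipped before the primary fails

print(lost_on_primary_failure(sync_primary, sync_secondary))
print(lost_on_primary_failure(async_primary, async_secondary))
```

In the asynchronous case, whatever remains in the queue at the moment of failure is the data at risk, which corresponds to the minimal-data-loss requirement described earlier.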
Business constraints may also influence data centre site separation
Regulatory bodies prescribe best practices intended to mitigate IT operational risk for certain industries and expect compliance. An example may be drawn from a document intended to increase the resilience of the U.S. financial system, published by two U.S. government agencies in response to the September 11, 2001, terrorist attacks.
The paper recommends that “backup arrangements should be as far away from the primary site as necessary to avoid being subject to the same set of risks as the primary location” without specifying the minimum acceptable distance.
A similar document, authored by a forum of the European banking industry, states: “First, an organisation should take care that its alternate site is sufficiently remote from its primary business location and, where possible, does not depend on the same physical infrastructure components. This minimises the risk that both could be affected by the same event. For example, the alternate site should ideally be on a different power grid and central telecommunication circuit from the primary business location.”
More specific wording is used in a regulation issued by the European Central Bank. It states: “When both primary and backup sites depend on the same labour pool or infrastructure components (transportation, telecommunications, water supply, and electric power), large-scale events could render both sites inaccessible or inoperable. This emphasises how important it is for systems to ensure an appropriate geographic separation between the primary and the secondary site. Therefore, the dependence of the second processing site on the same critical infrastructure components used by the primary site (telecommunications, water supply, and electric power) should be the minimum compatible with the stated recovery objectives.
Furthermore, geographic separation may not be sufficient, especially in scenarios involving terrorist attacks. Indeed, terrorism means that sites can be targeted regardless of their location.”
Part 3 of the series (coming Friday) will explore two-, three- and four-site topologies.
Authors: Richard Cocchiara, Distinguished Engineer and the Chief Technology Officer for Business Continuity and Resiliency Services at IBM
Dr. Hugh Davis, Lead Architect in IBM’s Global Business Resilience Consulting Practice
Doug Kinnaird, Executive IT Architect in IT Strategy and Architecture Practice at IBM