What is data replication?
Data replication is the intentional storage of the same data at more than one site or server. Companies replicate data for several reasons. Replication keeps data available during server downtime or heavy traffic, and it gives users consistent access without interfering with or slowing down other users’ access. For cloud applications, data replication lets you query a copy of the data in a local database, with much higher performance than accessing the data through the cloud application’s API, which is especially useful for analytics and data science. Data replication can also help you avoid the API transaction limits and throttling that some cloud applications impose.
Data replication is not the same as a normal backup and does more. The server where the data originates is called the publisher, and the one to which the data is replicated is called the subscriber. In data replication, transactions performed on the publisher are synced to the subscriber automatically, so any data change on the publisher is reflected on the subscriber as well.
Replication can be full or partial. In full replication, all of the source data is stored at every replica site. In partial replication, only the frequently accessed data is copied, while the rest remains at the source.
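The difference between the two can be sketched in a few lines of Python. This is a minimal illustration, not a real replication tool: the dictionary-based "databases", the `access_log` list, and the `top_n` cutoff for what counts as frequently accessed are all assumptions made for the example.

```python
from collections import Counter


def full_replica(source: dict) -> dict:
    """Full replication: copy every record from the source."""
    return dict(source)


def partial_replica(source: dict, access_log: list, top_n: int) -> dict:
    """Partial replication: copy only the top_n most frequently accessed keys."""
    hot_keys = [key for key, _ in Counter(access_log).most_common(top_n)]
    return {key: source[key] for key in hot_keys if key in source}


# Hypothetical source data and a log of which keys were read.
source = {"a": 1, "b": 2, "c": 3, "d": 4}
access_log = ["a", "b", "a", "c", "a", "b"]

full = full_replica(source)
partial = partial_replica(source, access_log, top_n=2)

print(full)     # every record
print(partial)  # only the two most frequently read records
```

With full replication every site can answer every query locally, at the cost of storage. With partial replication a request for a record that was not copied still has to go back to the source.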
Types of data replication
There are three main types of data replication.
Transactional replication
In transactional replication, frequent data changes are distributed between servers automatically. Changes are replicated from the publisher to the subscriber in near real-time. Transactional replication does not simply copy the final result of a transaction; it records every step of the transaction and the order in which the changes occur.
For example, in the case of ATM transactions, the replication from the publisher to the subscriber is not just the end balance recorded but all the individual transactions made in between. Another key feature of transactional replication is that while the data changes at the publisher are replicated at the subscriber, it doesn’t work the other way around. Data changes do not occur at the subscriber level by default.
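The ATM example can be sketched as an ordered transaction log that the subscriber replays. This is a simplified model under stated assumptions: the `Publisher`, `Subscriber`, and `Transaction` classes and the sequence-number scheme are hypothetical and do not represent any particular database product.

```python
from dataclasses import dataclass, field


@dataclass
class Transaction:
    seq: int       # position in the publisher's log
    account: str
    amount: int    # positive = deposit, negative = withdrawal


@dataclass
class Publisher:
    balances: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def apply(self, account: str, amount: int) -> Transaction:
        """Apply a change locally and record it in the ordered log."""
        txn = Transaction(len(self.log) + 1, account, amount)
        self.balances[account] = self.balances.get(account, 0) + amount
        self.log.append(txn)
        return txn


@dataclass
class Subscriber:
    balances: dict = field(default_factory=dict)
    history: list = field(default_factory=list)
    last_seq: int = 0

    def replay(self, log: list) -> None:
        """Replay, in log order, every transaction not yet seen."""
        for txn in log:
            if txn.seq <= self.last_seq:
                continue  # already replicated
            self.balances[txn.account] = self.balances.get(txn.account, 0) + txn.amount
            self.history.append(txn)
            self.last_seq = txn.seq


publisher = Publisher()
subscriber = Subscriber()

publisher.apply("acct-1", 100)   # deposit
publisher.apply("acct-1", -30)   # withdrawal
subscriber.replay(publisher.log)

publisher.apply("acct-1", -20)   # another withdrawal
subscriber.replay(publisher.log)

print(subscriber.balances)                       # same end balance as the publisher
print([t.amount for t in subscriber.history])    # every individual transaction, in order
```

Note that changes flow in one direction only: the subscriber never writes back to the publisher's log.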
Snapshot replication
Snapshot replication synchronizes data between the publisher and subscriber as of a given point in time. It moves chunks of data from the publisher to the subscriber in a single transaction. Updates are less frequent than in transactional replication. A snapshot can be taken before transactional replication starts, to establish a common base state on the two servers. Snapshot replication does not track every individual transaction or the order in which the data changed.
This process is used to synchronize data that changes over a period of time. For example, many companies replicate data such as accounts, contacts, and opportunities from a cloud CRM to a local database for reporting purposes. This may be done every 15 minutes, once an hour, or once a day, depending on how frequently the data changes. For efficiency, the replication process may detect the data that has changed in the publisher and replicate only the changes instead of taking a full snapshot at every replication interval.
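A rough sketch of a full snapshot followed by change-only syncs is shown below. It assumes each row carries a `modified_at` value that increases whenever the row changes; that field, the function names, and the integer timestamps are illustrative assumptions. Deleted rows are not handled in this sketch.

```python
import copy


def full_snapshot(publisher_rows: dict) -> dict:
    """Copy the publisher's entire state at one point in time."""
    return copy.deepcopy(publisher_rows)


def incremental_sync(publisher_rows: dict, subscriber_rows: dict, last_sync: int) -> int:
    """Copy only rows modified after last_sync; return how many were copied."""
    copied = 0
    for row_id, row in publisher_rows.items():
        if row["modified_at"] > last_sync:
            subscriber_rows[row_id] = copy.deepcopy(row)
            copied += 1
    return copied


# Hypothetical CRM accounts keyed by id; modified_at is a simple counter.
publisher = {
    1: {"name": "Acme", "modified_at": 10},
    2: {"name": "Globex", "modified_at": 12},
}

subscriber = full_snapshot(publisher)   # initial snapshot, taken at time 15

# Changes made on the publisher after the snapshot.
publisher[2] = {"name": "Globex Corp", "modified_at": 20}
publisher[3] = {"name": "Initech", "modified_at": 22}

copied = incremental_sync(publisher, subscriber, last_sync=15)
print(copied)                   # number of rows transferred at this interval
print(subscriber == publisher)  # whether the two copies now match
```

Only the changed and new rows cross the network at each interval, which is the efficiency gain described above.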
Merge replication
Merge replication is a slightly more complex form of replication. The initial synchronization from the publisher is a snapshot replication. However, in this form, it allows data changes to occur at both the publisher and subscriber level. The updated data is then sent to a merge agent, which is installed on all servers. The merge agent uses conflict resolution algorithms to update and distribute the data.
For example, if an employee edits a document stored on a cloud server (publisher) from their laptop or phone (subscriber) while online, that is transactional replication, since the document is saved in near real-time. However, if the document is downloaded from the cloud server and updated offline on the laptop or phone, there is a conflict, since the data was updated at the subscriber end. Once the device is back online, the change goes through a merge agent, which uses a conflict resolution system to update the document at the publisher by comparing the two files.
Merge replication is used in scenarios where a user does not have direct access to the publisher at all times, as with mobile users who may go offline while the data is being updated. It is also used where multiple subscribers access and update the same data at various times and sync it with the publisher or with other subscribers, and where multiple subscribers update different parts of the same publisher data at the same time.
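One way to picture what a merge agent does is a field-by-field three-way merge against the last synchronized version. The sketch below is an assumption-laden illustration: the `merge_records` function and the "publisher wins" default are hypothetical choices, and real products use their own conflict resolution rules.

```python
def merge_records(base: dict, publisher: dict, subscriber: dict, prefer: str = "publisher"):
    """Three-way merge of one record; returns (merged, conflicting_fields).

    base is the last version both sides agreed on. A field changed on only one
    side is taken from that side. A field changed differently on both sides is
    a conflict, resolved in favor of `prefer`. A value of None is treated as
    "field absent".
    """
    merged, conflicts = {}, []
    for key in sorted(set(base) | set(publisher) | set(subscriber)):
        b, p, s = base.get(key), publisher.get(key), subscriber.get(key)
        if p == s:
            value = p                  # both sides agree
        elif p == b:
            value = s                  # only the subscriber changed it
        elif s == b:
            value = p                  # only the publisher changed it
        else:
            conflicts.append(key)      # both changed it differently
            value = p if prefer == "publisher" else s
        if value is not None:
            merged[key] = value
    return merged, conflicts


base = {"title": "Q3 report", "body": "draft", "owner": "ana"}
publisher = {"title": "Q3 report", "body": "draft v2", "owner": "ana"}
subscriber = {"title": "Q3 Report (final)", "body": "offline edits", "owner": "ana"}

merged, conflicts = merge_records(base, publisher, subscriber)
print(merged)     # title comes from the subscriber, body from the publisher
print(conflicts)  # fields that both sides edited
```

Here the offline title change merges cleanly because the publisher never touched it, while the body was edited on both sides and needs a resolution rule.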
Important components of data replication network
Besides the publisher and the subscriber, a replication network needs a few other key components to work.
Distributor
To set up replication, the distributor needs to be configured first. The distributor is a server in the replication network that controls the distribution database and stores the metadata and history of all replication. It also stores the transactions and snapshots that are waiting to be replicated. The distributor can run on the same server as the publisher (local distributor) or on a separate server (remote distributor), depending on the type of replication.
A remote distributor is generally used when a single distributor serves multiple publishers, or when distribution processing must run on a separate computer so that it does not affect data movement on the publisher. This is typical in transactional replication, where updates are far more frequent and distribution work could slow down the publisher if both ran on the same server. Merge replication can be done with a local distributor, since replication and updates are less frequent than in transactional replication. It also helps that the merged data from the subscriber has to be synced with the publisher eventually.
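The distributor's role as a store-and-forward hub can be sketched as a small class. Everything here is a hypothetical model for illustration: the `Distributor` class, its method names, and the database names are assumptions, not the interface of any real replication system.

```python
from collections import defaultdict, deque


class Distributor:
    """Toy distributor: queues changes per subscriber and keeps a delivery history."""

    def __init__(self):
        self.subscriptions = defaultdict(list)  # publisher name -> subscriber names
        self.pending = defaultdict(deque)       # subscriber name -> queued changes
        self.history = []                       # (publisher, subscriber, change) records

    def subscribe(self, publisher: str, subscriber: str) -> None:
        self.subscriptions[publisher].append(subscriber)

    def publish(self, publisher: str, change: str) -> None:
        """Store a change from a publisher for each of its subscribers."""
        for subscriber in self.subscriptions[publisher]:
            self.pending[subscriber].append((publisher, change))

    def deliver(self, subscriber: str) -> list:
        """Hand over all queued changes for a subscriber, in arrival order."""
        delivered = []
        while self.pending[subscriber]:
            publisher, change = self.pending[subscriber].popleft()
            delivered.append(change)
            self.history.append((publisher, subscriber, change))
        return delivered


# One distributor serving two publishers, as a remote distributor might.
distributor = Distributor()
distributor.subscribe("sales_db", "reporting_db")
distributor.subscribe("hr_db", "reporting_db")

distributor.publish("sales_db", "INSERT order 1")
distributor.publish("hr_db", "UPDATE employee 7")

delivered = distributor.deliver("reporting_db")
print(delivered)
print(len(distributor.history))
```

Because the queue and history live in the distributor rather than in the publishers, moving it to its own machine takes that bookkeeping load off the publisher servers.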
Replication agents
Replication agents are at the heart of replication and vary depending on the type of replication installed. They are programs that carry out tasks such as detecting changes, updating the publisher and subscriber databases, creating copies, and resolving conflicts. Replication agents are generally stored in the distributor. Agents that run from the distributor include:
- Snapshot agent
- Distribution agent
- Merge agent
- Log reader agent
- Queue reader agent
Advantages of data replication
Data replication is a great way to provide consistent access to data. It also lets more users access the same data at the same time. Merging databases and bringing replica databases that hold incomplete data up to date reduces inconsistencies between copies. And because users can read from a local or less heavily loaded copy, access to data is faster.
Disadvantages of data replication
Data replication requires a large amount of storage space and the infrastructure to maintain it. Replication is expensive, and keeping data consistent across copies requires complex measures. It also exposes more copies of the data, and more parts of the system, to privacy and security breaches.
Best practices for replication
Once the replication network is configured, it is important to follow some effective administration practices:
- There should be a strategy in place to back up a database on a regular basis. There should also be regular tests run to restore those backups.
- As a part of disaster recovery, it is essential to script all replication components and repetitive tasks since scripts can be stored and backed up easily. In case of a change in policies, the components can be easily re-scripted.
- It is necessary to identify the factors that affect replication performance, such as hardware, database design, network configuration, server configuration, and agent parameters. These need to be tuned and monitored for the application’s workload. The following five parameters should be monitored for efficiency:
  - Latency: the time required for a data change to be replicated
  - Throughput: the amount of replication activity sustained over a period of time
  - Concurrency: the number of replication activities that can happen at the same time
  - Duration of synchronization
  - Resource consumption for replication
- To avoid disasters, it is important to establish performance thresholds so that, when they are reached, warnings are generated and alerts are sent to the administrators. Alerts can also be set up for the actions of replication agents or replication processes.
- Monitor the replication topology.
- For transactional replication and merge replication, periodically validate the data at the publisher and the subscriber to confirm that replication is still working correctly.
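Two of the practices above, threshold alerts and periodic validation, can be sketched as follows. The metric names, threshold values, and the row-count-plus-checksum comparison are assumptions chosen for illustration; a production setup would normally rely on the monitoring and validation tools that ship with the database.

```python
import hashlib


def table_checksum(rows: list) -> str:
    """Order-independent checksum over a list of row tuples."""
    digest = hashlib.sha256()
    for row in sorted(rows, key=repr):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def validate(publisher_rows: list, subscriber_rows: list) -> bool:
    """Compare row counts first, then checksums."""
    if len(publisher_rows) != len(subscriber_rows):
        return False
    return table_checksum(publisher_rows) == table_checksum(subscriber_rows)


def check_thresholds(metrics: dict, thresholds: dict) -> list:
    """Return one alert message per metric that exceeds its threshold."""
    return [
        f"{name} = {value} exceeds threshold {thresholds[name]}"
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    ]


# Hypothetical data: one subscriber in sync, one that has drifted.
publisher_rows = [(1, "Acme"), (2, "Globex")]
in_sync_rows = [(2, "Globex"), (1, "Acme")]
drifted_rows = [(1, "Acme"), (2, "Globex Corp")]

print(validate(publisher_rows, in_sync_rows))
print(validate(publisher_rows, drifted_rows))

# Hypothetical metrics and limits, in seconds.
alerts = check_thresholds(
    {"latency_seconds": 45, "sync_duration_seconds": 20},
    {"latency_seconds": 30, "sync_duration_seconds": 60},
)
for alert in alerts:
    print("ALERT:", alert)
```

Checking the row count before the checksum is a cheap first filter; the checksum then catches rows that exist on both sides but differ in content.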
Data replication goes beyond creating a data backup. While it is one of the key components of data management, it can also be an expensive and complex process if not managed properly. The key lies in choosing the right replication process for your needs.