This is my first post. I am somewhat new to Informix, but hopefully I can get this right.
We are evaluating the HDR feature of Informix as a disaster-recovery (DR) method for a warehouse distribution application.
We have 19 distribution centres - each DC runs in its own database, on its own Informix instance. They all run the same 3rd-party application software.
We have 4 production p550s; each runs 4-5 Informix instances. We also have 4 standby p550s for DR. Each p550 has 16 GB of memory and 4 CPUs.
We have 2 datacentres, located approximately 1200 miles apart. Each datacentre has 2 of the 4 production AIX servers, and each also has 2 of the standby servers for DR. The standby servers are configured identically to the production servers, including each of the Informix instances.
We have a telco MPLS WAN circuit between the two locations, with 100Mbps capacity.
Here's what we have done so far to test this:
1. We reviewed all 19 production instances to determine which distribution centre had the highest log usage, and when. We identified a weekly application purge process that runs on the weekend and consumes logical logs at about 3.3 GB per hour (70 logs/hr, with LOGSIZE=50000 and LOGFILES=64). The purge generates logical-log transactions at this rate for about 1 hour. The other peak log consumption during the week maps to a specific transaction process that runs at about 1/4 the rate of the weekly purge, or about 17 logs/hr. All other transaction activity within the Informix instance averages about 4 logs/hr.
2. We decided to benchmark the weekly purge process, primarily to see whether the MPLS WAN circuit could keep up with the amount of HDR traffic generated while the purge was running. We set up the Informix instance on a development p550 in one datacentre and activated HDR to another Informix instance on a development p550 in the other datacentre (DRINTERVAL=30, DRTIMEOUT=30). The development instance contained a copy of the production instance's data from just prior to the weekly purge running in production.
3. We then ran the weekly purge in development, with HDR activated to the Informix instance in the other datacentre. The application process ran for approximately 2-1/2 hours. While it was running, we monitored the log usage on the primary and secondary servers at 1-minute intervals (onstat -l, then noting the current log and page position).
4. When the application purge finished, we then deactivated HDR, restored the database, and re-ran the same weekly purge process, without HDR.
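For what it's worth, the per-minute sampling in step 3 can be scripted. A rough sketch - the hostnames are placeholders, and the awk field positions are an assumption about the onstat -l per-log layout (address, number, flags, uniqid, begin, size, used, %used, with the current log carrying a "C" flag) - so check it against your own output first:

```shell
# Print the current logical log's uniqid and pages-used from
# "onstat -l" output. Assumption: per-log lines look like
#   address  number  flags  uniqid  begin  size  used  %used
# and the current log's flags contain a "C" (e.g. "U---C-L").
current_log_position() {
    awk '$3 ~ /C/ && NF >= 7 {print $4, $7}'
}

# Sampling loop - query the secondary first, then the primary, so any
# measured lag errs on the pessimistic side:
# while :; do
#     sec=$(ssh sec_host  "onstat -l" | current_log_position)
#     pri=$(ssh prim_host "onstat -l" | current_log_position)
#     echo "$(date '+%H:%M:%S') secondary=[$sec] primary=[$pri]"
#     sleep 60
# done
```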
Here is what we have found so far:
1. Network statistics revealed that the transmission rate between the two servers was about 3.8 Mbps during our test run of the purge process with HDR enabled.
2. Our sampling of the logical-log usage on the primary and secondary servers showed that they were, for the most part, within 200 pages of each other (we ran the onstat -l command on the secondary server first, then on the primary server - both within a second of each other). There was only one timeframe where the primary appeared to be 30-45 seconds ahead of the secondary, and a sample 1 minute later showed that the secondary had caught up.
The above findings are very promising, as it appears that our network would be able to support HDR on our largest Informix instance, even during peak application activity. Previously, we thought that the network would not be able to keep up.
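To put some arithmetic behind that (my own back-of-envelope numbers, assuming LOGSIZE is in KB so each log is about 50 MB): 70 logs/hr is about 3.3 GB/hr, or roughly 7.6 Mbps of raw log data if pushed in real time - well under the 100 Mbps circuit, and in the same ballpark as the 3.8 Mbps we measured once the volume is spread over the longer 2-1/2 hour run:

```shell
# Back-of-envelope check of the peak log rate against the circuit.
# Assumptions: LOGSIZE is in KB (so each log is ~50 MB), and HDR
# protocol overhead is ignored.
awk -v logsize_kb=50000 -v logs_per_hr=70 'BEGIN {
    gb_per_hr = logsize_kb * logs_per_hr / 1024 / 1024
    mbps      = gb_per_hr * 8 * 1024 / 3600
    printf "log generation: %.1f GB/hr = %.1f Mbps sustained\n", gb_per_hr, mbps
}'
```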
Here are some other findings:
3. The AIX server showed 6-8% higher CPU usage while the weekly purge was running with HDR enabled than without. With HDR enabled, average CPU usage was about 68% of total CPU; without HDR, roughly 60%.
We believe that we have sufficient capacity in the production environment to support this additional CPU usage.
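For anyone wanting to reproduce the CPU comparison, here is the kind of vmstat sampling that yields an average busy% over a run. The field positions are an assumption about the AIX vmstat layout (last two data columns are id and wa), so verify against your own output:

```shell
# Average CPU busy% from vmstat output. Assumption: the last two
# columns of each data row are "id" and "wa" (the AIX vmstat layout),
# so idle is $(NF-1). Note the first data row on AIX is a since-boot
# average; bump the NR guard if you want steady-state samples only.
avg_busy() {
    awk 'NR > 3 && $(NF-1) ~ /^[0-9]+$/ {
             busy += 100 - $(NF-1); n++
         }
         END { if (n) printf "avg busy: %.1f%% over %d samples\n", busy / n, n }'
}

# One purge run's worth of samples (~2-1/2 hours at 60s intervals):
# vmstat 60 150 | avg_busy
```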
4. The elapsed run time of the weekly purge was approximately 50% longer with HDR activated than without.
I am at a loss to explain this. I ran the test several times and could not get the HDR-enabled run to come even close to the non-HDR run. I thought that checkpoint duration might have played a role, but the log showed 0-1 seconds for the checkpoints, which was the same in both types of tests.
5. One night, I left HDR running overnight. In the morning, reviewing the online log, I found that the primary server had reported 5 occurrences of a network error - ping, send, or receive - after which it turned HDR off. In all 5 cases, HDR was re-established automatically anywhere from 10 seconds to 2 minutes later.
What I also noticed was that in some of these instances, the checkpoint duration increased by as much as 202 seconds. I discovered that the checkpoint could not complete until the HDR re-sync had finished and the checkpoint had been applied on the secondary.
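The way to spot this correlation is to pull the HDR ("DR:") messages and the checkpoint messages out of the message log together, in time order. A sketch - the grep patterns are approximate, so check them against the wording in your own online.log:

```shell
# List HDR state changes and checkpoint completions from the message
# log, in the order they were written. The patterns are approximate;
# adjust them to match your online.log wording.
hdr_events() {
    grep -Ei 'DR:|checkpoint' "$1"
}

# Usage (path is whatever MSGPATH in your onconfig points at):
# hdr_events /ifmxlogs/online.log | tail -50
```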
Now (after all that pre-amble), here are my questions:
1. Does this look like a valid test scenario for HDR? Is comparing onstat -l output on the two Informix instances a good measure of "logs applied"?
2. What can I look at to try to figure out why the application took longer to run with HDR? Is there something tunable in the database that may be affecting it, or is it just the additional overhead of maintaining HDR?
3. Is there anything I can do to mitigate the effect of network "hiccups"? The application runs 7x24, and I anticipate that a network interruption of even a few seconds could impact response time - especially if a checkpoint needs to be taken while HDR is being re-established.
4. Is the amount of additional CPU that HDR uses dependent on transaction volume? That is, if I use 6-8% more CPU when running the most logical-log-intensive process, would I use less CPU during less active times?
Any help would be greatly appreciated!