Operations

Daily Operations

Quick Health Check

    # Check all service statuses (quote the glob so the shell does not expand it)
    sudo systemctl status 'backup-*'

    # Check disk space
    df -h /var/lib/backup/repo

    # View recent snapshots
    xreplicator snapshots --server localhost:50051

Monitor these daily:

  • Service status for all components
  • Logs for errors or warnings
  • Disk space on the backup server
  • Confirmation that backups are completing
  • DR source health for protected Azure workloads
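The log check in this list can be partly scripted. The following is a minimal sketch, assuming the backup services write to the systemd journal. `count_log_problems` is a hypothetical helper name, and the `error|warn` pattern is an assumption rather than XReplicator's documented log format.

```shell
# Count lines that look like errors or warnings in log text read from stdin.
# The pattern is an assumption; adjust it to the real log format.
count_log_problems() {
    # grep -c exits non-zero when nothing matches, so normalise the exit code.
    grep -ciE 'error|warn' || true
}

# Example (assumes the backup-* units log to the systemd journal):
#   journalctl -u 'backup-*' --since yesterday --no-pager | count_log_problems
```

A non-zero count is only a prompt to read the matching lines; it does not replace reviewing the logs themselves.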

Backup Strategies

Full Backup Frequency

| Frequency | Recommended For |
| --- | --- |
| Daily | Critical systems with high change rates |
| Weekly | Most production systems (recommended) |
| Monthly | Archival or low-change systems |

Incremental Backup Frequency

| Frequency | Recommended For |
| --- | --- |
| Hourly | Critical databases, high-change systems |
| Every 2-4 hours | Typical production systems |
| Daily | Low-change systems |

Retention Policy Guidelines

  • Full backups: Keep 4-12 (1-3 months of history at the recommended weekly frequency)
  • Incremental backups: Keep 24-168 (1-7 days of hourly snapshots)
  • Adjust based on storage capacity and recovery time objectives
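The history window behind these counts is simple arithmetic: number of backups kept, times the interval between them. The sketch below works it through; `history_days` is a hypothetical helper, and it uses integer division, so partial days are dropped.

```shell
# Days of history covered by keeping `count` backups taken every `interval_hours`.
history_days() {
    count=$1
    interval_hours=$2
    echo $(( count * interval_hours / 24 ))
}

history_days 24 1     # 24 hourly incrementals  -> 1
history_days 168 1    # 168 hourly incrementals -> 7
history_days 4 168    # 4 weekly fulls          -> 28
history_days 12 168   # 12 weekly fulls         -> 84
```

Running the same calculation with a 2-4 hour incremental interval shows how the same retention count stretches to cover more days.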

Maintenance Schedule

Weekly

  • Review backup logs for errors
  • Check repository disk space
  • Verify cloud sync status (if configured)
  • Test a restore from a recent snapshot
  • Review Azure DR sources for pending, degraded, or stale sync states

Monthly

  • Review and adjust retention policies
  • Run compaction (if not automated)
  • Verify license expiry date
  • Review and update configurations
  • Run at least one non-production DR precheck for critical Azure blueprints

Quarterly

  • Perform a full disaster recovery test
  • Perform a controlled failback test for workloads with Azure DR enabled
  • Review and update backup strategies
  • Audit access and permissions
  • Update software packages

Azure-to-Azure DR Operations

XReplicator v1.3.1 supports Azure-to-Azure DR operations from the web UI:

  1. Run the backup server and web UI in the Azure DR landing zone.
  2. Enable DR for each protected source disk.
  3. Prepare newly mapped DR target disks by wiping (zeroing) them, or explicitly choose to continue without wiping.
  4. Wait until every source shows a healthy DR status and the applied snapshot matches the desired snapshot.
  5. Create a blueprint that maps source disks to the target Azure region, resource group, VNet, subnet, VM size, and disk strategy.
  6. Run strict precheck before every drill or production failover.
  7. Trigger failover from the web UI and validate VM boot, disk attachment, networking, and application health.
  8. Use the failback workflow to prepare mappings, check primary disks, sync data back, and complete return to primary.

In measured Azure-to-Azure drills on a VM with a 30 GB OS disk and a 4 GB data disk, failover completed in under 40 seconds with attach-as-is disks and in around 70 seconds when creating disks from snapshots, once staging disks were healthy and synced. Treat these figures as results measured in a specific environment, not universal guarantees.

Choosing a Failover Strategy

| Strategy | When to use | Operational impact |
| --- | --- | --- |
| Attach as-is | Fastest recovery from already-synced DR disks | Lowest measured RTO; recovered VM uses the staged disks directly. |
| Create from snapshot | Safer isolated copy for drills or controlled recovery | Adds snapshot and managed disk creation time before VM creation. |

Drill Checklist

  • Confirm all selected sources are healthy.
  • Run precheck and resolve blockers.
  • Record failover start, VM-created time, OS-ready time, and application-ready time.
  • Validate disk mounts, application health, network access, DNS, and security rules.
  • Review operation history and row-level logs.
  • Prepare failback in a controlled window and run primary disk checks before syncing back.
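The timing item in this checklist is easier to compare across drills when each milestone is captured as an epoch timestamp (for example with `date +%s`) and converted to elapsed seconds. The sketch below assumes that approach; `elapsed_seconds` is a hypothetical helper and the timestamps are placeholders, not measured results.

```shell
# Seconds elapsed between two epoch timestamps (as produced by `date +%s`).
elapsed_seconds() {
    echo $(( $2 - $1 ))
}

# Placeholder timestamps for illustration only; record real ones during the drill.
failover_start=1700000000
vm_created=1700000035
os_ready=1700000080
app_ready=1700000125

echo "VM created after: $(elapsed_seconds "$failover_start" "$vm_created")s"
echo "OS ready after: $(elapsed_seconds "$failover_start" "$os_ready")s"
echo "Application ready after: $(elapsed_seconds "$failover_start" "$app_ready")s"
```

Measuring every milestone from the failover start keeps the attach-as-is and snapshot-based results directly comparable.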

For setup details, see Azure-to-Azure DR.


Key Metrics to Monitor

| Metric | Alert Threshold |
| --- | --- |
| Backup success/failure rate | Any failure |
| Repository disk usage | 80% full |
| License expiry | 30 days before expiry |
| Agent connectivity | On disconnection |
| Cloud sync status | On failure (if configured) |
| DR source status | Pending/degraded beyond expected sync window |
| DR precheck | Any blocker before a planned drill |
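The repository disk usage threshold can be checked from a scheduled job. The sketch below is one way to do it, assuming the repository lives at `/var/lib/backup/repo` as in the quick health check. `parse_use_pct` and `at_or_over` are hypothetical helper names, and the alert is a plain `echo` that you would replace with your own notification path.

```shell
# Extract the use percentage (without "%") from `df -P` output on stdin.
# With -P, the data row is always the second line and capacity is column 5.
parse_use_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when the value has reached the threshold.
at_or_over() {
    [ "$1" -ge "$2" ]
}

# Example:
#   usage=$(df -P /var/lib/backup/repo | parse_use_pct)
#   if at_or_over "$usage" 80; then echo "ALERT: repository ${usage}% full"; fi
```

Reading `df -P` rather than `df -h` keeps the output in the fixed POSIX layout, which is safer to parse.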

Best Practices

Configuration

  • Use a consistent fixed_block_size_mb across all agents and the server
  • Keep chunk_size_avg_kb consistent to preserve deduplication
  • Enable TLS for all production gRPC connections
  • Store cloud credentials securely; avoid hardcoding in config files
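Consistency of `fixed_block_size_mb` and `chunk_size_avg_kb` can be spot-checked when copies of the agent and server configuration files are collected in one place. The sketch below assumes each setting appears on its own line as `key: value` or `key = value`; that layout is an assumption about the config format, and `distinct_values` is a hypothetical helper.

```shell
# Print the distinct values of one setting across the given config files.
# Assumes "key: value" or "key = value" lines; trailing "# comments" are stripped.
distinct_values() {
    key=$1
    shift
    grep -hE "^[[:space:]]*${key}[[:space:]]*[:=]" "$@" |
        sed -E 's/^[^:=]*[:=][[:space:]]*//; s/[[:space:]]*(#.*)?$//' |
        sort -u
}

# Example (hypothetical file names):
#   distinct_values fixed_block_size_mb server.yaml agent-a.yaml agent-b.yaml
```

A single line of output means every file agrees; more than one line means at least one agent or the server uses a different value.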

Security

  • Restrict network access to the backup server (port 50051)
  • Use TLS encryption for gRPC communication
  • Use strong, unique credentials for cloud storage and Azure service principals
  • Rotate credentials on a regular schedule
  • Scope Azure DR permissions to the required landing-zone resource groups

Performance

  • Match pipeline settings (workers, batch_size, max_pipeline_memory_mb) to available resources
  • Use compression for network-backed storage
  • Enable eBPF change tracking for faster Linux incremental backups
  • Monitor and tune batch sizes based on actual network throughput
  • Measure attach-as-is and snapshot-based RTO separately for each protected workload

Reliability

  • Test restores regularly; a backup you have not restored is an untested backup
  • Maintain multiple full backups
  • Use cloud sync for offsite/disaster recovery backups
  • Keep DR target disks raw, dedicated, and unmounted outside XReplicator
  • Document recovery and failback procedures before the incident

Documentation

  • Maintain a list of all backed-up systems and their schedules
  • Document the restore procedure for each system type
  • Document Azure DR blueprints, target resource groups, network dependencies, and failback owners
  • Keep copies of configuration files in version control
  • Record any configuration changes with the reason and date