Operations & Best Practices

Daily Operations

Quick Health Check


# Check all service statuses (the quotes stop the shell from expanding the glob)
sudo systemctl status 'backup-*'

# Check disk space
df -h /var/lib/backup/repo

# View recent snapshots
xreplicator snapshots --server localhost:50051

Monitor these daily:

Service status for all components
Logs for errors or warnings
Disk space on the backup server
Confirmation that backups are completing
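
The service and log items in the list above could be checked with a small helper like the one below. It is a sketch only: it assumes the services are systemd units matching backup-* (as in the status command earlier) and that they log to the systemd journal, which this guide does not state. The function names are illustrative and are not part of XReplicator.

```shell
#!/bin/sh
# Sketch of a daily check. Assumptions: units match 'backup-*' and log to journald.

# Count the non-empty lines on stdin (one line per listed unit).
count_lines() {
    grep -c . || true
}

# Print the number of backup-* units currently in the failed state.
failed_unit_count() {
    systemctl list-units --state=failed --no-legend --plain 'backup-*' | count_lines
}

# Show warnings and errors logged by backup-* units since yesterday.
recent_problems() {
    journalctl -u 'backup-*' --since yesterday -p warning --no-pager
}
```

Calling failed_unit_count prints 0 when no matching unit has failed, and recent_problems prints nothing when the journal holds no entries at warning priority or above for that period.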

Backup Strategies

Full Backup Frequency

Frequency	Recommended For
Daily	Critical systems with high change rates
Weekly	Most production systems (recommended)
Monthly	Archival or low-change systems

Incremental Backup Frequency

Frequency	Recommended For
Hourly	Critical databases, high-change systems
Every 2–4 hours	Typical production systems
Daily	Low-change systems

Retention Policy Guidelines

Full backups: Keep 4–12 (1–3 months of history at a weekly cadence)
Incremental backups: Keep 24–168 (1–7 days of history at an hourly cadence)
Adjust based on storage capacity and recovery time objectives
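
The counts above follow from simple arithmetic: the number of snapshots to keep is the history window divided by the backup interval. The helper below is a minimal sketch of that calculation; the function name is illustrative, and integer division rounds down when the window is not a whole multiple of the interval.

```shell
# snapshots_to_keep INTERVAL_HOURS HISTORY_DAYS
# Prints how many snapshots cover HISTORY_DAYS at one snapshot per INTERVAL_HOURS.
snapshots_to_keep() {
    echo $(( $2 * 24 / $1 ))
}

snapshots_to_keep 1 1      # hourly incrementals, 1 day of history  -> 24
snapshots_to_keep 1 7      # hourly incrementals, 7 days of history -> 168
snapshots_to_keep 168 84   # weekly fulls, 12 weeks of history      -> 12
```

For other cadences, pass the interval in hours. For example, incrementals every 4 hours kept for 7 days gives 42 snapshots.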

Maintenance Schedule

Weekly

Review backup logs for errors
Check repository disk space
Verify cloud sync status (if configured)
Test a restore from a recent snapshot

Monthly

Review and adjust retention policies
Run compaction (if not automated)
Verify license expiry date
Review and update configurations

Quarterly

Perform a full disaster recovery test
Review and update backup strategies
Audit access and permissions
Update software packages

Azure-to-Azure DR Operations

XReplicator v1.3.0 can set up disaster recovery (DR) for Azure workloads that recover into Azure. The recommended operational model is:

Run the backup server and web UI in the Azure DR landing zone.
Enable DR for each protected source disk.
Wait until every source shows a healthy DR status and the applied snapshot matches the desired snapshot.
Create a blueprint that maps source disks to the target Azure region, resource group, VNet, subnet, VM size, and disk strategy.
Run strict precheck before every drill or production failover.
Trigger failover from the web UI and validate VM boot, disk attachment, networking, and application health.
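
The readiness condition in the steps above (healthy DR status, applied snapshot equal to desired snapshot) can be expressed as a simple filter. The sketch below assumes a hypothetical text export with one line per source in the form SOURCE STATUS APPLIED DESIRED. That layout is invented for illustration and is not XReplicator output; the web UI remains the place to read the real status.

```shell
# not_ready reads lines of the form: SOURCE STATUS APPLIED DESIRED
# (a hypothetical layout, not an XReplicator format) and prints every source
# that is not yet safe to fail over. Exit status is 0 only if all are ready.
not_ready() {
    awk '$2 != "healthy" || $3 != $4 { print $1; bad = 1 } END { exit bad + 0 }'
}
```

A source is reported if its status is anything other than healthy or if its applied snapshot differs from the desired one, so an empty result with exit status 0 means every source meets both conditions.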

In observed Azure-to-Azure failover drills, the recovered server started in under 2 minutes when the staging disks were already healthy and synced. Treat this as a measurement from one environment, not a universal guarantee: RTO depends on Azure API latency, VM size, OS boot time, network dependencies, and application startup.

For setup details, see Azure-to-Azure DR.

Key Metrics to Monitor

Metric	Alert Threshold
Backup success/failure rate	Any failure
Repository disk usage	At or above 80% full
License expiry	30 days before expiry
Agent connectivity	On disconnection
Cloud sync status	On failure (if configured)
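
Two of the thresholds in the table, disk usage and license expiry, are easy to evaluate from a script. The sketch below is one possible approach under stated assumptions: the license expiry date is supplied by the operator as YYYY-MM-DD (this guide does not say how to query it), GNU date is available, and the function names are illustrative.

```shell
# disk_usage_pct PATH: print the used percentage of the filesystem holding PATH.
disk_usage_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# days_until DATE [TODAY]: whole days from TODAY (default: now) until DATE.
# Dates are YYYY-MM-DD. Requires GNU date for the -d option.
days_until() (
    target=$(date -u -d "$1" +%s)
    today=$(date -u -d "${2:-today}" +%s)
    echo $(( (target - today) / 86400 ))
)

# check_thresholds REPO_PATH LICENSE_EXPIRY_DATE
# Prints an alert line for each threshold from the table that is crossed.
check_thresholds() {
    [ "$(disk_usage_pct "$1")" -lt 80 ] || echo "ALERT: repository disk usage at or above 80%"
    [ "$(days_until "$2")" -gt 30 ] || echo "ALERT: license expires within 30 days"
}
```

A call such as check_thresholds /var/lib/backup/repo followed by the expiry date prints nothing when both values are within limits, which makes it suitable for a scheduled job that only reports on problems.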

Best Practices

Configuration

Use a consistent fixed_block_size_mb across all agents and the server
Keep chunk_size_avg_kb consistent to preserve deduplication
Enable TLS for all production gRPC connections
Store cloud credentials securely; avoid hardcoding them in config files
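
Consistency of a setting such as fixed_block_size_mb can be checked by comparing copies of the configuration files collected from the server and each agent. The sketch below assumes each setting sits on its own line as "key: value" or "key = value". The actual XReplicator config layout is not described here, so treat the pattern and the function names as hypothetical.

```shell
# distinct_values KEY FILE...: print each distinct value of KEY found in the files.
# Assumes one "key: value" or "key = value" setting per line (hypothetical layout).
distinct_values() (
    key=$1
    shift
    sed -n "s/^[[:space:]]*$key[[:space:]]*[:=][[:space:]]*\([^[:space:]#]*\).*/\1/p" "$@" | sort -u
)

# is_consistent KEY FILE...: succeed only if the files agree on exactly one value.
is_consistent() {
    [ "$(distinct_values "$@" | grep -c .)" -eq 1 ]
}
```

When is_consistent fails, running distinct_values with the same arguments lists the conflicting values, which is usually enough to find the host that drifted.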

Security

Restrict network access to the backup server (port 50051)
Use TLS encryption for gRPC communication
Use strong, unique credentials for cloud storage
Rotate credentials on a regular schedule

Performance

Match pipeline settings (workers, batch_size, max_pipeline_memory_mb) to available resources
Use compression for network-backed storage
Enable eBPF change tracking for faster incremental backups
Monitor and tune batch sizes based on actual network throughput

Reliability

Test restores regularly; a backup you haven’t restored is an untested backup
Maintain multiple full backups
Use cloud sync for offsite/disaster recovery backups
Document your recovery procedures and keep them up to date

Documentation

Maintain a list of all backed-up systems and their schedules
Document the restore procedure for each system type
Keep copies of configuration files in version control
Record any configuration changes with the reason and date