Stay ahead of incidents
Monitoring and alerting strategy
Visibility is baked into the VaultScope control panel. Combine the built-in graphs with external exporters so you know about performance regressions before your players do.
Baseline metrics in the panel
- The console header shows real-time CPU, RAM and disk usage averaged across the last 15 seconds. Click any metric to expand historical charts for the previous hour, day or week.
- Use the Activity tab to audit restarts, backups and schedule executions with timestamps.
- Export raw console output when debugging—the download icon saves a text file for offline analysis.
Send metrics to your observability stack
VaultScope supports sidecar exporters and agent-based collectors. Popular setups include Prometheus node_exporter, InfluxData's Telegraf, or a lightweight StatsD daemon; a minimal custom-exporter sketch follows the list below.
- Deploy exporters via the file manager or an automation schedule so they restart after reboots.
- Use custom ports for exporters and expose them via the Network tab if your monitoring stack lives elsewhere.
- Secure exporters with basic auth or mTLS. Never expose an unauthenticated, plaintext metrics endpoint to the internet.
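If the stock exporters don't cover a metric you care about, a small custom exporter is an option. The sketch below is a minimal example using Python's prometheus_client; the port (9150), metric names and poll_status() stub are placeholders for your own game server query, and authentication still belongs in front of it (reverse proxy or scrape-side TLS), per the last bullet.

```python
# Minimal custom Prometheus exporter sketch.
# Port 9150 and the poll_status() data source are illustrative assumptions;
# point poll_status() at your game server's query/RCON interface.
import random
import time

from prometheus_client import Gauge, start_http_server

TPS = Gauge("gameserver_tps", "Server ticks per second")
PLAYERS = Gauge("gameserver_players_online", "Players currently connected")

def poll_status() -> dict:
    # Placeholder: replace with a real query (RCON, query protocol, status file).
    return {"tps": 20.0 - random.random(), "players": random.randint(0, 50)}

if __name__ == "__main__":
    start_http_server(9150)  # exposes /metrics on a custom port (see the Network tab)
    while True:
        status = poll_status()
        TPS.set(status["tps"])
        PLAYERS.set(status["players"])
        time.sleep(15)
```

Run it under an automation schedule so it comes back after reboots, then add the custom port as a scrape target in your Prometheus configuration.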
Alert routing
Alerts should page the right person at the right time. VaultScope teams typically route alerts using a mix of Discord, PagerDuty and email.
- Define alert thresholds for CPU, memory, TPS and ping, and alert on scheduled task failures.
- Send low-severity alerts to a Discord channel. Reserve PagerDuty or SMS paging for SLO-impacting incidents (see the routing sketch after this list).
- Document escalation paths and expected response times in your runbook.
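As a rough illustration of that split, the sketch below assumes a Discord webhook for low-severity notices and a PagerDuty Events API v2 integration for paging; the webhook URL, routing key and severity names are placeholders for your own setup.

```python
# Severity-based alert routing sketch. URLs and keys are placeholders.
import json
import urllib.request

DISCORD_WEBHOOK = "https://discord.com/api/webhooks/<id>/<token>"  # placeholder
PAGERDUTY_ROUTING_KEY = "<integration-routing-key>"                # placeholder

def post_json(url: str, payload: dict) -> None:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

def route_alert(summary: str, severity: str) -> None:
    if severity in ("info", "warning"):
        # Low severity: drop a message in the ops Discord channel.
        post_json(DISCORD_WEBHOOK, {"content": f"[{severity}] {summary}"})
    else:
        # SLO-impacting: page on-call via the PagerDuty Events API v2.
        post_json("https://events.pagerduty.com/v2/enqueue", {
            "routing_key": PAGERDUTY_ROUTING_KEY,
            "event_action": "trigger",
            "payload": {"summary": summary, "source": "vaultscope", "severity": "critical"},
        })

route_alert("TPS below 15 for 5 minutes on mc-eu-01", "critical")
```

Keep the routing rules in version control alongside the thresholds so escalation changes are reviewed like any other change.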
Logging strategy
Treat console output as ephemeral. Ship logs off the server so you can search them during and after incidents.
- Configure log shippers (Vector, Fluent Bit) to push data to services like Loki, Elastic or Datadog.
- Rotate files nightly with logrotate or a scheduled archive task to avoid disk exhaustion.
- Tag logs with the server ID and environment to filter quickly during investigations (see the shipping sketch after this list).
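A minimal shipping sketch, assuming a Loki push endpoint and a logs/latest.log path (both placeholders); if you ship through Vector or Fluent Bit instead, carry the same server_id and environment labels in their configuration.

```python
# Sketch: push a console log file to Loki with server ID and environment labels.
# The Loki URL, log path and label values are assumptions; match them to your stack.
import json
import time
import urllib.request

LOKI_PUSH_URL = "http://loki.internal:3100/loki/api/v1/push"  # placeholder host
LABELS = {"job": "gameserver", "server_id": "mc-eu-01", "environment": "production"}

def push_lines(lines: list[str]) -> None:
    now_ns = str(time.time_ns())  # Loki expects nanosecond timestamps as strings
    payload = {"streams": [{"stream": LABELS,
                            "values": [[now_ns, line] for line in lines]}]}
    req = urllib.request.Request(
        LOKI_PUSH_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

with open("logs/latest.log", encoding="utf-8", errors="replace") as handle:
    push_lines([line.rstrip("\n") for line in handle if line.strip()])
```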
Runbooks and incident drills
Combine metrics and logs with actionable runbooks. If a graph spikes or a process crashes, the team should already know the first three steps to take.
- Link alerts to specific troubleshooting sections—start with the troubleshooting guide.
- Run quarterly game days where you intentionally break staging servers and follow your response checklist.
- Track mean time to acknowledge (MTTA) and mean time to resolve (MTTR) so improvements are measurable; a quick calculation sketch follows this list.
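A small sketch of that calculation, assuming incident records exported with created, acknowledged and resolved timestamps (the field names and sample records are illustrative, not from any particular tool):

```python
# Compute MTTA and MTTR from exported incident records (illustrative data).
from datetime import datetime
from statistics import mean

incidents = [
    {"created": "2024-05-01T10:00:00", "acknowledged": "2024-05-01T10:04:00", "resolved": "2024-05-01T10:40:00"},
    {"created": "2024-05-09T22:15:00", "acknowledged": "2024-05-09T22:18:00", "resolved": "2024-05-09T23:05:00"},
]

def minutes_between(start: str, end: str) -> float:
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds() / 60

mtta = mean(minutes_between(i["created"], i["acknowledged"]) for i in incidents)
mttr = mean(minutes_between(i["created"], i["resolved"]) for i in incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```

Review the numbers after each game day so you can see whether drills are actually shortening response times.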
Compliance and retention
- Retain monitoring data for at least 30 days and security logs for 90 days to support incident forensics.
- Restrict access to dashboards and alert configs to trusted operators with two-factor authentication.
- Back up your observability configuration just like any other critical system—store Terraform or Helm charts in version control.
Questions about integrating a specific monitoring stack? Reach the reliability team at sre@vaultscope.dev.