Monitoring Setup
Cloud OS includes built-in monitoring for system and application metrics. For fleet deployments, it also supports exporting metrics to external tools.
Built-in Monitoring
The Cloud OS dashboard provides real-time views of:
- System metrics — CPU, memory, disk, and network usage
- App metrics — per-container resource consumption
- Disk health — S.M.A.R.T. status and partition usage
- Audit logs — administrative actions with timestamps
- Alert history — triggered and resolved alerts
For single-instance deployments, the built-in monitoring is often sufficient.
Alert Channels
Configure notification channels to receive alerts when metrics cross thresholds. Supported channel types:
| Type | Description |
|---|---|
| Email | SMTP-based notifications |
| Slack | Channel notifications via webhook |
| Telegram | Bot notifications |
| Webhook | Custom HTTP endpoint |
| PagerDuty | Incident management |
Configure alert channels from the Settings > Alerts section of the dashboard.
Configure at least two alert channels for redundancy (e.g., Slack and email). Test channels after setup to verify delivery.
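To test a Webhook channel end to end, it can help to point it at a small endpoint you control. The following is a minimal sketch of such a receiver using only the Python standard library. The payload fields `severity` and `message`, the port, and the function names are assumptions made for illustration; the actual alert payload sent by Cloud OS is not described here and may differ.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_alert(body: bytes) -> str:
    """Turn a JSON alert payload into a one-line summary.

    The "severity" and "message" keys are hypothetical field names.
    """
    payload = json.loads(body)
    severity = payload.get("severity", "unknown")
    message = payload.get("message", "(no message)")
    return f"[{severity}] {message}"


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes the client announced.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            print(summarize_alert(body))
            self.send_response(204)  # accepted, nothing to return
        except (ValueError, AttributeError):
            # Body was not valid JSON, or not a JSON object.
            self.send_response(400)
        self.end_headers()


def serve(host: str = "127.0.0.1", port: int = 8080) -> None:
    """Run the test receiver until interrupted (port is an arbitrary choice)."""
    HTTPServer((host, port), AlertHandler).serve_forever()
```

Calling `serve()` starts the receiver; sending a test alert from the dashboard should then print one summary line per delivery.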
Common Alert Rules
| Condition | Severity |
|---|---|
| CPU usage above 90% for 5 minutes | Warning |
| Memory usage above 90% for 5 minutes | Warning |
| Disk usage above 85% | Warning |
| Disk usage above 95% | Critical |
| App container crashed | Critical |
| Backup overdue (over 12 hours) | Warning |
| Security score below 70 | Warning |
Create and manage alert rules from the dashboard.
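Rules such as "above 90% for 5 minutes" depend on a sustained breach, not a single spike. The sketch below shows one way such a condition could be evaluated over timestamped samples. It is illustrative only: `ThresholdRule` and `is_firing` are hypothetical names, and this is not how Cloud OS itself is known to evaluate rules.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ThresholdRule:
    name: str
    threshold: float     # percent
    hold_seconds: float  # how long the value must stay above the threshold
    severity: str


def is_firing(rule: ThresholdRule,
              samples: List[Tuple[float, float]],
              now: float) -> bool:
    """Return True if the metric has stayed above the threshold long enough.

    `samples` is a list of (timestamp_seconds, value) pairs in ascending
    time order. Any sample at or below the threshold resets the breach.
    """
    breach_start = None
    for ts, value in samples:
        if value > rule.threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None
    return breach_start is not None and now - breach_start >= rule.hold_seconds


cpu_rule = ThresholdRule("CPU usage", 90.0, 300.0, "Warning")
sustained = [(0, 95.0), (60, 96.0), (180, 93.0), (300, 97.0)]
with_dip = [(0, 95.0), (120, 80.0), (180, 95.0), (300, 95.0)]
print(is_firing(cpu_rule, sustained, now=300))  # True
print(is_firing(cpu_rule, with_dip, now=300))   # False
```

A rule without a duration, such as "Disk usage above 85%", corresponds to `hold_seconds=0` in this sketch.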
Disk Health
Cloud OS monitors disk health using S.M.A.R.T. data to provide early warning of hardware failures. Alerts are generated automatically for bad sectors, overheating, and SSD wear.
External Integrations
For fleet deployments or advanced monitoring, Cloud OS can expose a Prometheus-compatible metrics endpoint. Add the Cloud OS instance as a Prometheus scrape target and use Grafana for dashboards.
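A Prometheus-compatible endpoint serves metrics in the plain-text exposition format, one sample per line. As a rough illustration of what a scraper reads, the sketch below parses a small sample of that format. The metric names and labels in `SAMPLE` are made up for the example and are not actual Cloud OS metric names; the parser is deliberately simplified and leaves escape sequences in label values as-is.

```python
import re
from typing import Dict, List, Tuple

# name, optional {labels}, value, optional integer timestamp
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Tuple[str, Dict[str, str], float]]:
    """Parse exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if match is None:
            continue  # skip lines this simplified parser cannot read
        name, raw_labels, raw_value = match.groups()
        try:
            value = float(raw_value)
        except ValueError:
            continue
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, value))
    return samples


# Hypothetical metric names, for illustration only.
SAMPLE = """\
# TYPE cpu_usage_percent gauge
cpu_usage_percent{host="node-1"} 42.5
disk_usage_percent{host="node-1",mount="/"} 71
uptime_seconds 3600
"""

for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

In practice Prometheus does this parsing itself; the sketch is mainly useful for spot-checking the endpoint's output before adding it as a scrape target.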
Recommended Stack
| Component | Tool |
|---|---|
| Metrics | Prometheus + Grafana |
| Logs | Loki + Grafana |
| Alerts | Grafana Alerting or PagerDuty |
| Uptime | UptimeRobot or similar |
Metrics Retention
Built-in metrics are stored with decreasing resolution over time:
| Resolution | Retention |
|---|---|
| Full resolution | 24 hours |
| 1-minute averages | 7 days |
| 5-minute averages | 30 days |
| 1-hour averages | 1 year |
Retention periods can be adjusted in the configuration file. For longer retention, export metrics to an external time-series database.
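The default tiers in the table can be read as a lookup from data age to the finest resolution still available. The sketch below encodes that reading. It assumes that each retention period is measured from the time the sample was recorded, that boundaries are inclusive, and that one year means 365 days; `RETENTION_TIERS` and `resolution_for_age` are hypothetical names, not Cloud OS configuration keys.

```python
from datetime import timedelta
from typing import Optional

# (maximum age, resolution available up to that age), finest first.
RETENTION_TIERS = [
    (timedelta(hours=24), "full resolution"),
    (timedelta(days=7), "1-minute averages"),
    (timedelta(days=30), "5-minute averages"),
    (timedelta(days=365), "1-hour averages"),
]


def resolution_for_age(age: timedelta) -> Optional[str]:
    """Return the finest resolution retained for data of the given age.

    Returns None when the data is older than every tier.
    """
    for max_age, resolution in RETENTION_TIERS:
        if age <= max_age:
            return resolution
    return None


print(resolution_for_age(timedelta(hours=3)))   # full resolution
print(resolution_for_age(timedelta(days=20)))   # 5-minute averages
print(resolution_for_age(timedelta(days=400)))  # None
```

Under these assumptions, anything older than a year is gone from built-in storage, which is the case where an external time-series database is needed.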
Tips
- Use structured JSON logging for integration with log aggregation tools.
- Re-test alert channels periodically, not only after setup, to confirm that delivery still works.
- Monitor backup success alongside system metrics.
- Export to Prometheus for fleet-wide dashboards.
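For the structured JSON logging tip, an application running on Cloud OS could emit one JSON object per log line so that aggregation tools such as Loki can parse fields without custom patterns. The following is a minimal sketch using Python's standard `logging` module. The field names (`time`, `level`, `logger`, `message`) are one reasonable choice, not a format required by Cloud OS.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback as a string when an exception is logged.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def make_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `make_logger("backup").warning("backup overdue by %d hours", 13)` then produces a single JSON line with the rendered message in the `message` field.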