Monitoring Setup

Cloud OS includes built-in monitoring for system and application metrics. For fleet deployments, it also supports exporting metrics to external tools.

Built-in Monitoring

The Cloud OS dashboard provides real-time views of:

System metrics — CPU, memory, disk, and network usage
App metrics — per-container resource consumption
Disk health — S.M.A.R.T. status and partition usage
Audit logs — administrative actions with timestamps
Alert history — triggered and resolved alerts

For single-instance deployments, the built-in monitoring is often sufficient.

Alert Channels

Configure notification channels to receive alerts when metrics cross thresholds. Supported channel types:

Type	Description
Email	SMTP-based notifications
Slack	Channel notifications via webhook
Telegram	Bot notifications
Webhook	Custom HTTP endpoint
PagerDuty	Incident management

Configure alert channels from the Settings > Alerts section of the dashboard.

Set up at least two alert channels of different types (e.g., Slack and email) so that an outage in one does not silence alerts. After setup, send a test notification through each channel to verify delivery.
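A webhook-based channel can also be checked by hand, independently of the dashboard. The sketch below assumes a Slack incoming webhook, which accepts a JSON body with a `text` field. The function names and the message wording are illustrative and are not part of Cloud OS.

```python
import json
import urllib.request


def build_test_payload(channel_name: str) -> bytes:
    """Build a Slack incoming-webhook body announcing a delivery test."""
    message = {"text": f"Test alert for channel '{channel_name}': delivery check"}
    return json.dumps(message).encode("utf-8")


def send_test_alert(webhook_url: str, channel_name: str, timeout: float = 10.0) -> int:
    """POST the test payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        webhook_url,
        data=build_test_payload(channel_name),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so a
    # returned status means the endpoint accepted the request.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_test_alert` with your own webhook URL should return 200 and produce a message in the target channel. If it does not, the problem is in the webhook itself rather than in the Cloud OS alert configuration.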

Common Alert Rules

Condition	Severity
CPU usage above 90% for 5 minutes	Warning
Memory usage above 90% for 5 minutes	Warning
Disk usage above 85%	Warning
Disk usage above 95%	Critical
App container crashed	Critical
Backup overdue (over 12 hours)	Warning
Security score below 70	Warning

Create and manage alert rules from the dashboard.
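To make the "above 90% for 5 minutes" style of condition concrete, here is a minimal sketch of how such a rule could be evaluated. It is an illustration only: the `ThresholdRule` type, the `is_firing` function, and the requirement that samples cover the whole window are assumptions, not a description of how Cloud OS evaluates rules internally.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class ThresholdRule:
    metric: str        # hypothetical metric name, e.g. "cpu_percent"
    threshold: float   # value the metric must stay above
    duration_s: int    # how long the condition must hold, in seconds
    severity: str      # e.g. "warning" or "critical"


def is_firing(rule: ThresholdRule, samples: Sequence[Tuple[int, float]], now: int) -> bool:
    """Return True if every sample in the last `duration_s` seconds exceeds the threshold.

    `samples` holds (timestamp_s, value) pairs. The rule does not fire unless
    the data reaches back to the start of the window, so a freshly started
    instance cannot trigger a sustained-usage alert.
    """
    if not samples:
        return False
    window_start = now - rule.duration_s
    if min(ts for ts, _ in samples) > window_start:
        return False  # not enough history to cover the full window
    window = [value for ts, value in samples if window_start <= ts <= now]
    return bool(window) and all(value > rule.threshold for value in window)


cpu_rule = ThresholdRule("cpu_percent", 90.0, 300, "warning")
sustained = [(t, 95.0) for t in range(0, 301, 60)]
print(is_firing(cpu_rule, sustained, now=300))
```

Requiring every sample in the window to exceed the threshold means a single dip below 90% resets the condition, which keeps short spikes from raising alerts.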

Disk Health

Cloud OS reads S.M.A.R.T. data from each disk to give early warning of hardware failure. It raises alerts automatically for bad sectors, overheating, and SSD wear.

External Integrations

For fleet deployments or advanced monitoring, Cloud OS can expose a Prometheus-compatible metrics endpoint. Add the Cloud OS instance as a Prometheus scrape target and use Grafana for dashboards.
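A Prometheus-compatible endpoint returns plain text in the Prometheus exposition format, one sample per line. The sketch below parses such text, which can help when checking by hand what an instance exports before adding it as a scrape target. It is a simplified parser, not a full implementation of the format, and the metric names in the sample are hypothetical: this document does not list the names Cloud OS actually exports.

```python
import re
from typing import Dict, Tuple

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE_RE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?P<labels>\{.*\})?"
    r"\s+(?P<value>\S+)"
    r"(?:\s+(?P<timestamp>-?\d+))?$"
)


def parse_metrics(text: str) -> Dict[Tuple[str, str], float]:
    """Map (metric name, raw label string) to the sample value.

    Comment lines (# HELP, # TYPE) and lines that do not parse are skipped.
    """
    samples: Dict[Tuple[str, str], float] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue
        try:
            value = float(match.group("value"))
        except ValueError:
            continue
        samples[(match.group("name"), match.group("labels") or "")] = value
    return samples


# Hypothetical sample output; real metric names may differ.
example = """\
# TYPE cloudos_cpu_usage_percent gauge
cloudos_cpu_usage_percent 42.5
cloudos_disk_usage_percent{mount="/"} 71
"""
for key, value in parse_metrics(example).items():
    print(key, value)
```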

Recommended Stack

Component	Tool
Metrics	Prometheus + Grafana
Logs	Loki + Grafana
Alerts	Grafana Alerting or PagerDuty
Uptime	UptimeRobot or similar

Metrics Retention

Built-in metrics are stored with decreasing resolution over time:

Resolution	Retention
Full resolution	24 hours
1-minute averages	7 days
5-minute averages	30 days
1-hour averages	1 year

Retention periods can be adjusted in the configuration file. For longer retention, export metrics to an external time-series database.
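The idea behind the tiers above is that older samples are rolled up into averages over wider time buckets. As a rough illustration, assuming a simple mean per bucket (this document does not specify how Cloud OS computes its averages), the rollup can be sketched as follows. The function name `downsample` is illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def downsample(samples: Iterable[Tuple[int, float]], bucket_s: int) -> List[Tuple[int, float]]:
    """Average (timestamp_s, value) samples into fixed-width time buckets.

    Returns (bucket_start_s, mean_value) pairs sorted by bucket start.
    """
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_s].append(value)
    return [(start, sum(values) / len(values)) for start, values in sorted(buckets.items())]


# Four samples at 30-second spacing rolled up into 1-minute averages.
raw = [(0, 10.0), (30, 20.0), (60, 40.0), (90, 60.0)]
print(downsample(raw, 60))  # [(0, 15.0), (60, 50.0)]
```

Averaging smooths out short spikes, so peak values seen at full resolution will not be visible in the 1-hour tier. That is one reason to export to an external time-series database when long-term detail matters.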

Tips

Use structured JSON logging so that log aggregation tools such as Loki can parse fields without custom rules.
Test alert channels regularly, not only after setup, to confirm that they still deliver.
Monitor backup success alongside system metrics.
Export to Prometheus for fleet-wide dashboards.
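As an example of the structured JSON logging tip, the sketch below shows one way an application running on Cloud OS could emit one JSON object per log line using Python's standard `logging` module. The field names (`timestamp`, `level`, `logger`, `message`) are an assumption chosen for illustration. Cloud OS does not prescribe them in this document.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the traceback as a string so it stays on one JSON line.
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def make_json_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to standard error."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


log = make_json_logger("backup")
log.warning("disk at %d%%", 91)
```

Keeping each record on a single line lets a log shipper treat every line as a complete event.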