Monitoring
Cloud OS includes a built-in time-series monitoring engine that collects system and per-app container metrics, stores historical data with automatic rollup compaction, and provides Recharts-based visualizations. No external monitoring stack is required.
System Metrics
The monitoring page displays real-time and historical charts for:
- CPU — utilization percentage and load average
- RAM — used, available, and total memory
- Disk I/O — read and write throughput per disk
- Network I/O — bandwidth per interface
- Temperature — CPU and system temperature (where hardware supports it)
- Uptime — server uptime tracking
Use the time range selector to view data over 1 hour, 6 hours, 24 hours, 7 days, or 30 days.
Per-App Container Metrics
Every installed app has its own resource detail view accessible from the Monitoring page or the app detail page. Per-app metrics include:
| Metric | Description |
|---|---|
| Container CPU | CPU percentage used by the app's containers |
| Container RAM | Memory consumption per container |
| Container Network | Inbound and outbound traffic |
| Restart Count | Number of container restarts since install |
Data Storage and Rollup
A background goroutine compacts older data automatically. Metrics are stored in SQLite at three levels of granularity:
| Resolution | Retention | Purpose |
|---|---|---|
| 1 minute | 24 hours | Real-time dashboard and recent charts |
| 1 hour | 30 days | Weekly and monthly trends |
| 1 day | 1 year | Long-term capacity planning |
The collection interval is 10 seconds for real-time data pushed via WebSocket. The 1-minute resolution data points are aggregated from these raw readings. Older data is compacted into hourly and daily rollups on a schedule.
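The aggregation step can be sketched as a simple bucketing pass. This is a minimal illustration, assuming each data point is the average of the raw readings in its window; the actual engine may store other aggregates (min, max, last value), and the `rollup` function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def rollup(samples, bucket_seconds=60):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Averaging is an assumption made for this sketch; it is not confirmed
    to be the aggregate Cloud OS stores.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of the bucket it falls in.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Twelve 10-second CPU readings covering two minutes.
readings = [10, 20, 30, 40, 50, 60, 5, 5, 5, 5, 5, 5]
raw = [(t, float(v)) for t, v in zip(range(0, 120, 10), readings)]

one_minute = rollup(raw)                    # two 1-minute points
one_hour = rollup(one_minute, 3600)         # the same pass yields the hourly tier
```

The same function applied with a larger bucket width produces the hourly and daily tiers from the tier below it.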
Querying Historical Data
The monitoring page lets you select a time range to query historical metrics. Depending on the range selected, the appropriate resolution tier is used:
| Time Range | Resolution Used |
|---|---|
| 1 hour | 1-minute data |
| 6 hours | 1-minute data |
| 24 hours | 1-minute data |
| 7 days | 1-hour data |
| 30 days | 1-hour data |
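As a rough guide to how many data points each range yields, the mapping above can be expressed as a lookup. The dictionary and function names here are illustrative and not part of Cloud OS.

```python
# Length of each selectable time range, in seconds.
RANGE_SECONDS = {
    "1h": 3600,
    "6h": 6 * 3600,
    "24h": 24 * 3600,
    "7d": 7 * 86400,
    "30d": 30 * 86400,
}

# Resolution tier used for each range, in seconds per data point
# (taken from the table above).
RESOLUTION_SECONDS = {"1h": 60, "6h": 60, "24h": 60, "7d": 3600, "30d": 3600}


def expected_points(range_key):
    """Return the approximate number of data points a range query covers."""
    if range_key not in RANGE_SECONDS:
        raise ValueError(f"unsupported range: {range_key!r}")
    return RANGE_SECONDS[range_key] // RESOLUTION_SECONDS[range_key]


points_24h = expected_points("24h")
points_7d = expected_points("7d")
points_30d = expected_points("30d")
```

A 24-hour chart therefore covers about 1440 one-minute points, while a 30-day chart covers about 720 hourly points.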
For programmatic access, use the monitoring API endpoints.
Monitoring API
Current System Metrics
GET /api/system/metrics/current
Returns the latest system metrics snapshot, including CPU, RAM, disk, and network readings.
Historical System Metrics
GET /api/system/metrics
Query parameters:
| Parameter | Description | Example |
|---|---|---|
| range | Time range to query | 1h, 6h, 24h, 7d, 30d |
| metric | Specific metric type | cpu, ram, disk, network |
Returns an array of time-series data points at the appropriate resolution for the requested range.
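A request for historical data might be built as follows. This is a sketch only: the base URL `https://cloudos.example` and the token value are placeholders, and sending the JWT as an `Authorization: Bearer` header is an assumption rather than documented behavior.

```python
from urllib.parse import urlencode
from urllib.request import Request

VALID_RANGES = {"1h", "6h", "24h", "7d", "30d"}


def build_metrics_request(base_url, token, range_key, metric=None):
    """Build a GET request for /api/system/metrics without sending it."""
    if range_key not in VALID_RANGES:
        raise ValueError(f"range must be one of {sorted(VALID_RANGES)}")
    params = {"range": range_key}
    if metric is not None:
        params["metric"] = metric
    url = f"{base_url}/api/system/metrics?{urlencode(params)}"
    # Assumption: the JWT is passed as a bearer token.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


req = build_metrics_request("https://cloudos.example", "TOKEN", "24h", "cpu")
# To send it: urllib.request.urlopen(req), then parse the JSON body.
```

Validating `range` on the client side avoids a round trip for the most common cause of empty results.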
Per-App Metrics
GET /api/system/apps/metrics
Query parameters:
| Parameter | Description | Example |
|---|---|---|
| app_id | Filter by specific app | nextcloud |
| range | Time range to query | 1h, 6h, 24h, 7d, 30d |
Returns container-level metrics (CPU, RAM, network) for the specified app, or for all apps when app_id is omitted.
Visualizations
All charts on the monitoring page are rendered with Recharts. The charts support:
- Hover tooltips with exact values and timestamps
- Responsive resizing
- Automatic axis scaling
- Multiple series overlay (for example, CPU usage across multiple containers)
External Integrations
While Cloud OS has built-in monitoring, you can forward metrics to external systems by installing them from the App Store:
- Prometheus — install from the App Store and point it at the Cloud OS metrics endpoint
- Grafana — install from the App Store and connect to Prometheus or query Cloud OS directly
Anomaly Detection
Cloud OS can detect unusual metric behavior automatically using statistical analysis, without requiring manual threshold configuration.
How It Works
The system maintains rolling baselines for each metric, broken down by hour of day and day of week. When a new metric sample arrives, it computes a Z-score against the baseline:
| Z-Score | Severity | Meaning |
|---|---|---|
| > 3.0 | Critical | Extreme deviation from normal |
| > 2.0 | Warning | Notable deviation from normal |
| ≤ 2.0 | Normal | Within expected range |
Baselines are updated continuously using an exponential moving average, so the system adapts to gradual changes in your workload patterns.
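The idea can be illustrated with a small sketch. It assumes one baseline slot per (day of week, hour of day), an exponentially weighted mean and variance, and a Z-score computed as the absolute deviation divided by the standard deviation. The class name, the smoothing factor `alpha`, and the use of the absolute value are assumptions, not the Cloud OS implementation.

```python
import math
from datetime import datetime, timedelta


class SlotBaseline:
    """Exponentially weighted mean and variance per (weekday, hour) slot."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # hypothetical smoothing factor
        self.slots = {}     # (weekday, hour) -> (mean, variance)

    def score_and_update(self, when, value):
        """Return the Z-score of value against its slot, then update the slot."""
        key = (when.weekday(), when.hour)
        if key not in self.slots:
            self.slots[key] = (value, 0.0)
            return 0.0
        mean, var = self.slots[key]
        std = math.sqrt(var)
        z = abs(value - mean) / std if std > 0 else 0.0
        # Standard incremental update for an exponentially weighted mean/variance.
        delta = value - mean
        mean += self.alpha * delta
        var = (1 - self.alpha) * (var + self.alpha * delta * delta)
        self.slots[key] = (mean, var)
        return z


def severity(z):
    """Map a Z-score to the severity levels in the table above."""
    if z > 3.0:
        return "critical"
    if z > 2.0:
        return "warning"
    return "normal"


baseline = SlotBaseline()
monday_9am = datetime(2024, 1, 1, 9, 0)  # 2024-01-01 was a Monday
for week, cpu in enumerate([50, 52, 48, 50, 51, 49]):
    baseline.score_and_update(monday_9am + timedelta(weeks=week), cpu)

z = baseline.score_and_update(monday_9am + timedelta(weeks=6), 90)
level = severity(z)
```

Because each slot only sees samples from the same hour and weekday, a nightly backup spike is compared with previous nightly spikes rather than with quiet daytime readings.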
Predictive Alerting
The predictor uses linear regression on the last 24 hours of data to extrapolate metric values 1 hour ahead. If the predicted value crosses an alert threshold, a predictive alert fires before the problem actually occurs.
Predictive alerting requires the predictive_alerting license feature (Pro+ plan). Anomaly detection data builds over time — allow at least 7 days for accurate baselines.
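The extrapolation can be sketched with an ordinary least-squares fit. This is an illustration under stated assumptions: the function name, the hourly sampling, and the 83.5% threshold are hypothetical, and the real predictor may weight or filter samples differently.

```python
def predict_ahead(points, horizon_seconds=3600):
    """Fit a least-squares line to (timestamp, value) points and
    evaluate it horizon_seconds after the last timestamp."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("all timestamps are identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    return slope * (points[-1][0] + horizon_seconds) + intercept


# Disk usage rising one percentage point per hour over the last 24 hours.
history = [(hour * 3600, 60.0 + hour) for hour in range(24)]
predicted = predict_ahead(history)

THRESHOLD = 83.5  # hypothetical alert threshold, in percent
fires_early = history[-1][1] < THRESHOLD <= predicted
```

In this example the latest reading (83%) is still below the threshold, but the value projected one hour ahead (about 84%) is not, so a predictive alert would fire first.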
Anomaly Detection API
| Endpoint | Method | Description |
|---|---|---|
| /api/alerting/anomalies | GET | List recent anomaly events |
| /api/alerting/baselines | GET | View baselines for a metric. Use the metric query parameter |
| /api/alerting/predictions | GET | Get the prediction for a metric. Use the metric query parameter |
| /api/alerting/rules | POST | Create an anomaly alert rule (set type to anomaly) |
Configuring Anomaly Rules
Create an anomaly rule by posting to the alerting rules endpoint with type: "anomaly":
- sensitivity: Z-score threshold (default: 2.0)
- metric: which metric to monitor (e.g., cpu_percent, mem_used)
- cooldown: minimum time between alerts for the same metric
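A request body for such a rule might look like the sketch below. The field names come from the list above, but the JSON encoding, the `"15m"` cooldown format, and the `anomaly_rule` helper are assumptions; check the value format your installation expects before relying on it.

```python
import json


def anomaly_rule(metric, sensitivity=2.0, cooldown=None):
    """Assemble the body for a POST to /api/alerting/rules."""
    if sensitivity <= 0:
        raise ValueError("sensitivity must be a positive Z-score threshold")
    rule = {"type": "anomaly", "metric": metric, "sensitivity": sensitivity}
    if cooldown is not None:
        # Assumption: the accepted cooldown format is not documented here.
        rule["cooldown"] = cooldown
    return rule


payload = json.dumps(anomaly_rule("cpu_percent", sensitivity=2.5, cooldown="15m"))
```

Raising `sensitivity` above the default of 2.0 reduces alert volume at the cost of missing smaller deviations.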
Cost Optimization
The cost optimization engine analyzes resource usage patterns and provides actionable recommendations to reduce infrastructure costs.
Cost optimization requires the cost_optimization license feature (Pro+ plan).
Cost Model
Cloud OS estimates per-app costs based on actual resource consumption:
| Resource | Rate Unit | Configurable |
|---|---|---|
| CPU | $/core-hour | Yes |
| Memory | $/GB-hour | Yes |
| Disk | $/GB-hour | Yes |
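The estimate amounts to multiplying each resource's consumption by its rate and summing the results. The sketch below uses made-up rates purely for the arithmetic; they are not Cloud OS defaults, and the `Rates` and `estimate_cost` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    cpu_core_hour: float   # $ per core-hour
    memory_gb_hour: float  # $ per GB-hour of RAM
    disk_gb_hour: float    # $ per GB-hour of disk


def estimate_cost(rates, cpu_core_hours, memory_gb_hours, disk_gb_hours):
    """Sum the cost of each resource: consumption multiplied by its rate."""
    return (
        cpu_core_hours * rates.cpu_core_hour
        + memory_gb_hours * rates.memory_gb_hour
        + disk_gb_hours * rates.disk_gb_hour
    )


# Hypothetical rates, for illustration only.
rates = Rates(cpu_core_hour=0.02, memory_gb_hour=0.005, disk_gb_hour=0.0001)

# An app averaging 0.5 cores, 2 GB RAM, and 10 GB disk over 720 hours (30 days).
hours = 720
monthly_cost = estimate_cost(rates, 0.5 * hours, 2 * hours, 10 * hours)
```

With these example rates, CPU and memory each contribute $7.20 and disk contributes $0.72, for a total of about $15.12.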
Recommendations
The engine generates three types of recommendations:
- Idle Apps — Applications using less than 1% CPU for over 24 hours. Consider stopping or removing them.
- Right-Sizing — Applications where allocated resources significantly exceed actual usage. Suggests adjusted resource limits.
- Scale-to-Zero — Applications with no traffic outside business hours that could benefit from scheduling.
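The idle-app rule can be sketched as a check over a trailing window of CPU samples. The 1% threshold and 24-hour window come from the list above; the function name, the sample layout, and the requirement that every sample in the window be below the threshold are assumptions.

```python
def is_idle(samples, threshold=1.0, window_seconds=24 * 3600):
    """Return True if CPU stayed below threshold for the whole trailing window.

    samples is a time-ordered list of (timestamp_seconds, cpu_percent).
    """
    if not samples:
        return False
    end = samples[-1][0]
    # Without at least a full window of history, no conclusion is drawn.
    if end - samples[0][0] < window_seconds:
        return False
    recent = [cpu for ts, cpu in samples if ts >= end - window_seconds]
    return all(cpu < threshold for cpu in recent)


# Hourly samples: 26 hours at 0.2% CPU.
quiet = [(hour * 3600, 0.2) for hour in range(26)]
# Same history, but with a burst of activity in the latest sample.
busy = quiet[:-1] + [(25 * 3600, 40.0)]

quiet_is_idle = is_idle(quiet)
busy_is_idle = is_idle(busy)
```

Requiring a full window of history keeps newly installed apps from being flagged before they have had a chance to receive traffic.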
Cost API
| Endpoint | Method | Description |
|---|---|---|
| /api/cost/summary | GET | Total estimated cost and top-spending apps |
| /api/cost/recommendations | GET | Idle apps, right-sizing, and scheduling suggestions |
| /api/cost/trends | GET | Cost over time. Use the period parameter (7d or 30d) |
| /api/cost/settings | PUT | Configure cost rates |
Event Timeline
The dashboard includes a chronological timeline of all system events, providing a unified view of everything happening on your server.
Event Categories
| Category | Events Tracked |
|---|---|
| app | Deploy, stop, crash, restart |
| backup | Start, complete, fail |
| alert | Fired, resolved |
| update | Installed, rolled back |
| auth | Login, logout |
| system | Start, stop, configuration change |
Events are color-coded by severity: green (info), yellow (warning), red (critical).
Timeline API
| Endpoint | Method | Description |
|---|---|---|
| /api/timeline | GET | Paginated event list. Filter by category, from, to, limit |
| /api/timeline/stats | GET | Event counts grouped by category |
The timeline on the dashboard auto-refreshes to show new events in real time.
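A filtered timeline query could be assembled as shown below. The parameter names come from the table above; the timestamp format (RFC 3339 here) and the `timeline_query` helper are assumptions.

```python
from urllib.parse import urlencode


def timeline_query(category=None, start=None, end=None, limit=None):
    """Build the path and query string for GET /api/timeline."""
    # "from" is a reserved word in Python, so the filters are collected
    # in a dict instead of being passed as keyword arguments.
    filters = {"category": category, "from": start, "to": end, "limit": limit}
    query = urlencode({k: v for k, v in filters.items() if v is not None})
    return "/api/timeline" + ("?" + query if query else "")


path = timeline_query(category="backup", start="2024-01-01T00:00:00Z", limit=50)
```

Omitted filters are left out of the query string entirely, so calling the helper with no arguments requests the unfiltered event list.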
Troubleshooting
Charts show no data
Verify that the Cloud OS backend is running and collecting metrics. Check the application logs for errors related to the metrics collection goroutine. If the server was recently installed, allow a few minutes for data to accumulate.
Historical data is missing
Check that the SQLite database is writable and that the rollup job has not encountered errors. The rollup compacts 1-minute data into 1-hour data after 24 hours, and 1-hour data into 1-day data after 30 days. Data outside these retention windows is permanently removed.
Metrics API returns empty results
Verify the range parameter is valid (one of 1h, 6h, 24h, 7d, 30d). Ensure you are authenticated with a valid JWT token. Check that the app_id parameter matches an installed app name if filtering by app.