Monitoring

Cloud OS includes a built-in time-series monitoring engine that collects system and per-app container metrics, stores historical data with automatic rollup compaction, and provides Recharts-based visualizations. No external monitoring stack is required.

System Metrics

The monitoring page displays real-time and historical charts for:

CPU — utilization percentage and load average
RAM — used, available, and total memory
Disk I/O — read and write throughput per disk
Network I/O — bandwidth per interface
Temperature — CPU and system temperature (where hardware supports it)
Uptime — server uptime tracking

Use the time range selector to view data over 1 hour, 6 hours, 24 hours, 7 days, or 30 days.

Per-App Container Metrics

Every installed app has its own resource detail view accessible from the Monitoring page or the app detail page. Per-app metrics include:

Metric	Description
Container CPU	CPU percentage used by the app's containers
Container RAM	Memory consumption per container
Container Network	Inbound and outbound traffic
Restart Count	Number of container restarts since install

Data Storage and Rollup

A background goroutine compacts metrics automatically, and SQLite stores them at three resolutions:

Resolution	Retention	Purpose
1 minute	24 hours	Real-time dashboard and recent charts
1 hour	30 days	Weekly and monthly trends
1 day	1 year	Long-term capacity planning

Real-time data is collected every 10 seconds and pushed to the dashboard over WebSocket. Each 1-minute data point is aggregated from these raw readings. A scheduled job then compacts older data into hourly and daily rollups.
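The aggregation step can be sketched as follows. This is a minimal illustration, not the Cloud OS implementation: it assumes each rolled-up point is the mean of the raw readings in its bucket (the actual aggregate function is not stated here), and `rollup` is a hypothetical name.

```python
from collections import defaultdict
from statistics import fmean


def rollup(samples, bucket_seconds=60):
    """Average (timestamp_seconds, value) samples into fixed-width buckets.

    Assumption: the mean is used as the aggregate. A real rollup might
    also keep min/max or the last value per bucket.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)  # align to bucket boundary
        buckets[bucket_start].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


# Six 10-second CPU readings in the first minute, one in the second.
raw = [(0, 10.0), (10, 20.0), (20, 30.0), (30, 40.0),
       (40, 50.0), (50, 60.0), (60, 70.0)]
minute_points = rollup(raw)
```

The same function with `bucket_seconds=3600` or `86400` illustrates the hourly and daily tiers, with the caveat that averaging already-averaged points weights each bucket equally rather than each raw reading.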

Querying Historical Data

The monitoring page lets you select a time range to query historical metrics. Depending on the range selected, the appropriate resolution tier is used:

Time Range	Resolution Used
1 hour	1-minute data
6 hours	1-minute data
24 hours	1-minute data
7 days	1-hour data
30 days	1-hour data
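The mapping in this table can be expressed as a small lookup. The sketch below is illustrative only; the tier labels `"1m"` and `"1h"` and the function names are assumptions, not identifiers from Cloud OS.

```python
# Supported time ranges, in seconds.
RANGE_SECONDS = {
    "1h": 3600,
    "6h": 6 * 3600,
    "24h": 24 * 3600,
    "7d": 7 * 86400,
    "30d": 30 * 86400,
}

# Step size of each resolution tier, in seconds (labels are hypothetical).
TIER_STEP = {"1m": 60, "1h": 3600}


def resolution_for(range_key):
    """Return the tier used for a range: 1-minute data up to 24 hours,
    1-hour data beyond that. Raises KeyError for an unsupported range."""
    seconds = RANGE_SECONDS[range_key]
    return "1m" if seconds <= 24 * 3600 else "1h"


def max_points(range_key):
    """Upper bound on data points per series for a range."""
    return RANGE_SECONDS[range_key] // TIER_STEP[resolution_for(range_key)]
```

Under these assumptions a 24-hour chart holds at most 1440 points per series and a 30-day chart at most 720, which is one reason to switch tiers as the range grows.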

For programmatic access, use the monitoring API endpoints.

Monitoring API

Current System Metrics


GET /api/system/metrics/current

Returns the latest system metrics snapshot including CPU, RAM, disk, and network readings.

Historical System Metrics


GET /api/system/metrics

Query parameters:

Parameter	Description	Example
`range`	Time range to query	`1h`, `6h`, `24h`, `7d`, `30d`
`metric`	Specific metric type	`cpu`, `ram`, `disk`, `network`

Returns an array of time-series data points at the appropriate resolution for the requested range.
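A request to this endpoint might be built as shown below. This is a hedged sketch: the host `https://cloudos.example` is a placeholder, passing the JWT as a `Bearer` token in the `Authorization` header is an assumption, and `build_metrics_request` is a hypothetical helper. The response schema is not documented here, so the sketch stops at constructing the request.

```python
from urllib.parse import urlencode
from urllib.request import Request

VALID_RANGES = {"1h", "6h", "24h", "7d", "30d"}
VALID_METRICS = {"cpu", "ram", "disk", "network"}


def build_metrics_request(base_url, token, range_key="24h", metric="cpu"):
    """Build a GET request for /api/system/metrics.

    Validates the parameters locally so that typos fail fast instead of
    producing an empty result from the server.
    """
    if range_key not in VALID_RANGES:
        raise ValueError(f"unsupported range: {range_key!r}")
    if metric not in VALID_METRICS:
        raise ValueError(f"unsupported metric: {metric!r}")
    query = urlencode({"range": range_key, "metric": metric})
    url = f"{base_url.rstrip('/')}/api/system/metrics?{query}"
    # Assumption: the JWT is sent as a Bearer token.
    return Request(url, headers={"Authorization": f"Bearer {token}"})


req = build_metrics_request("https://cloudos.example", "<jwt>", "7d", "cpu")
```

Passing `req` to `urllib.request.urlopen` would send it; the body could then be decoded with `json.load`, assuming the endpoint returns JSON.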

Per-App Metrics


GET /api/system/apps/metrics

Query parameters:

Parameter	Description	Example
`app_id`	Filter by specific app	`nextcloud`
`range`	Time range to query	`1h`, `6h`, `24h`, `7d`, `30d`

Returns container-level metrics (CPU, RAM, network) for the specified app, or for all apps when `app_id` is omitted.

Visualizations

All charts on the monitoring page are rendered with Recharts. The charts support:

Hover tooltips with exact values and timestamps
Responsive resizing
Automatic axis scaling
Multiple series overlay (for example, CPU usage across multiple containers)

External Integrations

Although Cloud OS has built-in monitoring, you can also send metrics to external tools installed from the App Store:

Prometheus — install from the App Store and point it at the Cloud OS metrics endpoint
Grafana — install from the App Store and connect to Prometheus or query Cloud OS directly

Anomaly Detection

Cloud OS can detect unusual metric behavior automatically using statistical analysis, without requiring manual threshold configuration.

How It Works

The system maintains rolling baselines for each metric, broken down by hour of day and day of week. When a new metric sample arrives, it computes a Z-score (the number of standard deviations the sample lies from the baseline mean) and maps it to a severity:

Z-Score	Severity	Meaning
> 3.0	Critical	Extreme deviation from normal
> 2.0 and ≤ 3.0	Warning	Notable deviation from normal
≤ 2.0	Normal	Within expected range

Baselines are updated continuously using an exponential moving average, so the system adapts to gradual changes in your workload patterns.
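The scoring and baseline update described above can be sketched as follows. This is an illustration under stated assumptions, not the Cloud OS code: the smoothing factor `alpha`, the use of the absolute Z-score, the exponentially weighted variance formula, and the names `Baseline` and `severity` are all assumptions.

```python
import math
from datetime import datetime


class Baseline:
    """Per-slot baseline keyed by (day of week, hour of day)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # assumed smoothing factor
        self.stats = {}     # (weekday, hour) -> (mean, variance)

    def score(self, ts, value):
        """Z-score of value against the slot baseline; 0.0 if no baseline yet."""
        slot = (ts.weekday(), ts.hour)
        if slot not in self.stats:
            return 0.0
        mean, var = self.stats[slot]
        std = math.sqrt(var)
        return 0.0 if std == 0 else (value - mean) / std

    def update(self, ts, value):
        """Fold a sample into the slot's exponential moving mean and variance."""
        slot = (ts.weekday(), ts.hour)
        if slot not in self.stats:
            self.stats[slot] = (value, 0.0)
            return
        mean, var = self.stats[slot]
        delta = value - mean
        mean += self.alpha * delta
        var = (1 - self.alpha) * (var + self.alpha * delta * delta)
        self.stats[slot] = (mean, var)


def severity(z):
    """Map a Z-score to a severity, treating deviations in both directions alike."""
    magnitude = abs(z)
    if magnitude > 3.0:
        return "critical"
    if magnitude > 2.0:
        return "warning"
    return "normal"


baseline = Baseline()
monday_9am = datetime(2024, 1, 1, 9, 0)  # 2024-01-01 is a Monday
for cpu in [40.0, 42.0, 38.0, 41.0, 39.0]:
    baseline.update(monday_9am, cpu)
spike = severity(baseline.score(monday_9am, 90.0))
typical = severity(baseline.score(monday_9am, 40.0))
```

Keying the baseline by weekday and hour means a nightly backup that saturates the disk every Sunday at 02:00 is compared only against other Sunday 02:00 samples, so it does not register as an anomaly once the baseline has seen it.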

Predictive Alerting

The predictor uses linear regression on the last 24 hours of data to extrapolate metric values 1 hour ahead. If the predicted value crosses an alert threshold, a predictive alert fires before the problem actually occurs.
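As a rough sketch of this extrapolation, the function below fits an ordinary least-squares line to `(timestamp_seconds, value)` points and evaluates it one hour past the last sample. The function name, the input shape, and the example threshold are assumptions for illustration.

```python
def predict_ahead(points, horizon_seconds=3600):
    """Fit value = slope * t + intercept by least squares and
    return the fitted value horizon_seconds after the last sample."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in points)
    if sxx == 0:
        raise ValueError("all timestamps are identical")
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in points)
    slope = sxy / sxx
    intercept = mean_v - slope * mean_t
    last_t = max(t for t, _ in points)
    return slope * (last_t + horizon_seconds) + intercept


# 24 hourly disk-usage readings rising one percentage point per hour.
history = [(h * 3600, 70.0 + h) for h in range(24)]
predicted = predict_ahead(history)

THRESHOLD = 93.5  # hypothetical alert threshold, in percent
fires = predicted >= THRESHOLD
```

Here the last observed value is 93%, still under the hypothetical threshold, but the fitted line reaches about 94% an hour later, so a predictive alert would fire ahead of the breach.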

Predictive alerting requires the predictive_alerting license feature (Pro+ plan). Anomaly detection baselines build over time; allow at least 7 days of data before relying on them.

Anomaly Detection API

Endpoint	Method	Description
`/api/alerting/anomalies`	GET	List recent anomaly events
`/api/alerting/baselines`	GET	View baselines for a metric. Use `metric` query parameter
`/api/alerting/predictions`	GET	Get prediction for a metric. Use `metric` query parameter
`/api/alerting/rules`	POST	Create an anomaly alert rule (set `type` to `anomaly`)

Configuring Anomaly Rules

Create an anomaly rule by sending a POST request to /api/alerting/rules with type set to "anomaly". The rule accepts these fields:

sensitivity — Z-score threshold (default: 2.0)
metric — which metric to monitor (e.g., cpu_percent, mem_used)
cooldown — minimum time between alerts for the same metric
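A rule-creation request might look like the sketch below. The field names come from the list above; everything else is an assumption, including the placeholder host, the `Bearer` authorization header, the JSON encoding of the body, and the `"15m"` format for `cooldown`, whose accepted format is not specified here.

```python
import json
from urllib.request import Request


def build_anomaly_rule_request(base_url, token, metric,
                               sensitivity=2.0, cooldown="15m"):
    """Build a POST request that creates an anomaly alert rule."""
    body = {
        "type": "anomaly",
        "metric": metric,            # e.g. "cpu_percent" or "mem_used"
        "sensitivity": sensitivity,  # Z-score threshold
        "cooldown": cooldown,        # assumed duration format
    }
    return Request(
        f"{base_url.rstrip('/')}/api/alerting/rules",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


rule_req = build_anomaly_rule_request(
    "https://cloudos.example", "<jwt>", "cpu_percent", sensitivity=3.0
)
```

Raising `sensitivity` to 3.0, as in this example, would limit alerts to deviations in the critical band.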

Cost Optimization

The cost optimization engine analyzes resource usage patterns and provides actionable recommendations to reduce infrastructure costs.

Cost optimization requires the cost_optimization license feature (Pro+ plan).

Cost Model

Cloud OS estimates per-app costs based on actual resource consumption:

Resource	Rate Unit	Configurable
CPU	$/core-hour	Yes
Memory	$/GB-hour	Yes
Disk	$/GB-hour	Yes
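The arithmetic behind a per-app estimate can be sketched as a sum of usage multiplied by rate. The rate values below are placeholders chosen only to make the example concrete, not Cloud OS defaults, and the names `Rates` and `app_cost` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rates:
    cpu_core_hour: float   # $ per core-hour
    memory_gb_hour: float  # $ per GB-hour of RAM
    disk_gb_hour: float    # $ per GB-hour of disk


def app_cost(rates, avg_cores, avg_memory_gb, disk_gb, hours):
    """Estimated cost of one app over a period, from average consumption."""
    hourly = (
        avg_cores * rates.cpu_core_hour
        + avg_memory_gb * rates.memory_gb_hour
        + disk_gb * rates.disk_gb_hour
    )
    return hourly * hours


# Placeholder rates for illustration only.
rates = Rates(cpu_core_hour=0.02, memory_gb_hour=0.005, disk_gb_hour=0.0001)

# An app averaging half a core, 2 GB of RAM, and 50 GB of disk over 30 days.
monthly = app_cost(rates, avg_cores=0.5, avg_memory_gb=2.0, disk_gb=50.0,
                   hours=30 * 24)
```

With these placeholder rates the hourly cost is $0.025, which comes to roughly $18 over 720 hours.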

Recommendations

The engine generates three types of recommendations:

Idle Apps — Applications using less than 1% CPU for over 24 hours. Consider stopping or removing them.
Right-Sizing — Applications where allocated resources significantly exceed actual usage. Suggests adjusted resource limits.
Scale-to-Zero — Applications with no traffic outside business hours that could benefit from scheduling.
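The idle-app rule is the most concrete of the three and can be sketched directly. The function below assumes CPU history arrives as `(timestamp_seconds, cpu_percent)` points and treats an app as idle only when the history covers the full window; both the input shape and the name `is_idle` are assumptions.

```python
def is_idle(cpu_points, now, window_seconds=24 * 3600, threshold_pct=1.0):
    """True if every CPU sample in the trailing window is below the threshold.

    Returns False when the history does not reach back to the start of the
    window, so a freshly installed app is not flagged prematurely.
    """
    if not cpu_points:
        return False
    window_start = now - window_seconds
    if min(t for t, _ in cpu_points) > window_start:
        return False  # not enough history yet
    recent = [v for t, v in cpu_points if t >= window_start]
    return bool(recent) and all(v < threshold_pct for v in recent)


# 26 hourly samples at 0.3% CPU.
quiet = [(h * 3600, 0.3) for h in range(26)]
# The same history with one burst to 5% three hours before "now".
bursty = [(t, 5.0 if t == 22 * 3600 else v) for t, v in quiet]

now = 25 * 3600
quiet_idle = is_idle(quiet, now)
bursty_idle = is_idle(bursty, now)
```

Requiring every sample to stay under the threshold is a strict reading of the rule; an implementation could instead compare the window average, which would tolerate short bursts.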

Cost API

Endpoint	Method	Description
`/api/cost/summary`	GET	Total estimated cost and top-spending apps
`/api/cost/recommendations`	GET	Idle apps, right-sizing and scheduling suggestions
`/api/cost/trends`	GET	Cost over time. Use `period` parameter (`7d` or `30d`)
`/api/cost/settings`	PUT	Configure cost rates

Event Timeline

The dashboard includes a chronological timeline of all system events, providing a unified view of everything happening on your server.

Event Categories

Category	Events Tracked
app	Deploy, stop, crash, restart
backup	Start, complete, fail
alert	Fired, resolved
update	Installed, rolled back
auth	Login, logout
system	Start, stop, configuration change

Events are color-coded by severity: green (info), yellow (warning), red (critical).

Timeline API

Endpoint	Method	Description
`/api/timeline`	GET	Paginated event list. Filter by `category`, `from`, and `to`; cap the number of results with `limit`
`/api/timeline/stats`	GET	Event counts grouped by category
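A filtered timeline query could be assembled as in the sketch below. The parameter names come from the table; the placeholder host, the ISO 8601 timestamp format for `from` and `to`, and the helper name `timeline_url` are assumptions.

```python
from urllib.parse import urlencode

CATEGORIES = {"app", "backup", "alert", "update", "auth", "system"}


def timeline_url(base_url, category=None, start=None, end=None, limit=None):
    """Build a /api/timeline URL, including only the filters that are set.

    `start` and `end` map to the `from` and `to` query parameters
    (`from` is a reserved word in Python).
    """
    params = {}
    if category is not None:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        params["category"] = category
    if start is not None:
        params["from"] = start
    if end is not None:
        params["to"] = end
    if limit is not None:
        params["limit"] = str(limit)
    path = f"{base_url.rstrip('/')}/api/timeline"
    query = urlencode(params)
    return f"{path}?{query}" if query else path


# Backup events since the start of May, at most 50 results.
url = timeline_url(
    "https://cloudos.example",
    category="backup",
    start="2024-05-01T00:00:00Z",  # assumed timestamp format
    limit=50,
)
```

Omitting every argument yields the bare `/api/timeline` path, which leaves filtering and page size to the server's defaults.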

The timeline on the dashboard auto-refreshes to show new events in real time.

Troubleshooting

Charts show no data

Verify that the Cloud OS backend is running and collecting metrics. Check the application logs for errors related to the metrics collection goroutine. If the server was recently installed, allow a few minutes for data to accumulate.

Historical data is missing

Check that the SQLite database is writable and that the rollup job has not logged errors. The rollup compacts 1-minute data into 1-hour data after 24 hours and 1-hour data into 1-day data after 30 days; 1-day data is kept for 1 year. Data outside these retention windows is permanently removed, so older periods are available only at the coarser resolutions.

Metrics API returns empty results

Verify that the range parameter is valid (one of 1h, 6h, 24h, 7d, 30d) and that the request is authenticated with a valid JWT. If you are filtering by app, check that the app_id parameter matches the name of an installed app.