Monitoring

Cloud OS includes a built-in time-series monitoring engine that collects system and per-app container metrics, stores historical data with automatic rollup compaction, and provides Recharts-based visualizations. No external monitoring stack is required.

System Metrics

The monitoring page displays real-time and historical charts for:

  • CPU — utilization percentage and load average
  • RAM — used, available, and total memory
  • Disk I/O — read and write throughput per disk
  • Network I/O — bandwidth per interface
  • Temperature — CPU and system temperature (where hardware supports it)
  • Uptime — server uptime tracking

Use the time range selector to view data over 1 hour, 6 hours, 24 hours, 7 days, or 30 days.

Per-App Container Metrics

Every installed app has its own resource detail view accessible from the Monitoring page or the app detail page. Per-app metrics include:

| Metric | Description |
| --- | --- |
| Container CPU | CPU percentage used by the app containers |
| Container RAM | Memory consumption per container |
| Container Network | Inbound and outbound traffic |
| Restart Count | Number of container restarts since install |

Data Storage and Rollup

Metrics are stored in SQLite at three levels of granularity. A background goroutine automatically compacts older data into the coarser tiers:

| Resolution | Retention | Purpose |
| --- | --- | --- |
| 1 minute | 24 hours | Real-time dashboard and recent charts |
| 1 hour | 30 days | Weekly and monthly trends |
| 1 day | 1 year | Long-term capacity planning |

Real-time data is collected every 10 seconds and pushed to the dashboard via WebSocket. The 1-minute data points are aggregated from these raw readings. Once 1-minute data is older than 24 hours it is compacted into hourly rollups, and hourly data older than 30 days is compacted into daily rollups.
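As an illustration of how 10-second readings could fold into 1-minute points, the sketch below buckets raw samples by time and averages each bucket. The averaging choice, the `rollup` function name, and the `(timestamp, value)` sample shape are assumptions made for this example; this page does not say which aggregate Cloud OS uses internally.

```python
from collections import defaultdict
from statistics import mean

def rollup(samples, bucket_seconds=60):
    """Average (unix_ts, value) samples into fixed-width time buckets.

    Assumption: buckets are aligned to multiples of bucket_seconds and
    aggregated with a plain mean; Cloud OS may use a different aggregate.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]

# Six 10-second CPU readings in one minute, then one reading in the next.
raw = [(0, 10.0), (10, 20.0), (20, 30.0), (30, 20.0), (40, 10.0), (50, 30.0), (60, 50.0)]
minute_points = rollup(raw)
hour_points = rollup(minute_points, bucket_seconds=3600)
```

The same function is reused for the hourly tier by changing the bucket width. Note that averaging already-averaged points weights every minute equally, regardless of how many raw readings it contained.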

Querying Historical Data

The monitoring page lets you select a time range to query historical metrics. Depending on the range selected, the appropriate resolution tier is used:

| Time Range | Resolution Used |
| --- | --- |
| 1 hour | 1-minute data |
| 6 hours | 1-minute data |
| 24 hours | 1-minute data |
| 7 days | 1-hour data |
| 30 days | 1-hour data |
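The mapping in the table above amounts to a simple lookup. The sketch below uses the range keys accepted by the API (`1h` through `30d`); the tier labels `"1m"` and `"1h"` and the function name are hypothetical and chosen only for this example.

```python
# Mapping copied from the table above: time range -> resolution tier.
RESOLUTION_BY_RANGE = {
    "1h": "1m",
    "6h": "1m",
    "24h": "1m",
    "7d": "1h",
    "30d": "1h",
}

def resolution_for(range_key):
    """Return the resolution tier for a range, or raise on unknown ranges."""
    try:
        return RESOLUTION_BY_RANGE[range_key]
    except KeyError:
        raise ValueError(f"unsupported range: {range_key!r}") from None
```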

For programmatic access, use the monitoring API endpoints.

Monitoring API

Current System Metrics

GET /api/system/metrics/current

Returns the latest system metrics snapshot including CPU, RAM, disk, and network readings.

Historical System Metrics

GET /api/system/metrics

Query parameters:

| Parameter | Description | Example |
| --- | --- | --- |
| range | Time range to query | 1h, 6h, 24h, 7d, 30d |
| metric | Specific metric type | cpu, ram, disk, network |

Returns an array of time-series data points at the appropriate resolution for the requested range.

Per-App Metrics

GET /api/system/apps/metrics

Query parameters:

| Parameter | Description | Example |
| --- | --- | --- |
| app_id | Filter by specific app | nextcloud |
| range | Time range to query | 1h, 6h, 24h, 7d, 30d |

Returns container-level metrics (CPU, RAM, network) for the specified app or all apps.
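A minimal client sketch for the three endpoints above, using only the Python standard library. The host name is a placeholder, and sending the JWT as a `Bearer` token in the `Authorization` header is an assumption, as is the JSON response body; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

VALID_RANGES = {"1h", "6h", "24h", "7d", "30d"}

def build_url(base_url, path, **params):
    """Build an endpoint URL, dropping parameters that are None."""
    if params.get("range") is not None and params["range"] not in VALID_RANGES:
        raise ValueError(f"invalid range: {params['range']!r}")
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = base_url.rstrip("/") + path
    return f"{url}?{query}" if query else url

def fetch_json(url, token):
    """GET a URL and decode the JSON body.

    Assumption: the JWT is sent as a Bearer token in the Authorization header.
    """
    request = Request(url, headers={"Authorization": f"Bearer {token}"})
    with urlopen(request, timeout=10) as response:
        return json.load(response)

BASE = "https://cloudos.example.com"  # hypothetical host
current_url = build_url(BASE, "/api/system/metrics/current")
history_url = build_url(BASE, "/api/system/metrics", range="24h", metric="cpu")
app_url = build_url(BASE, "/api/system/apps/metrics", app_id="nextcloud", range="7d")
```

Calling `fetch_json(history_url, token)` would then issue the request. Validating `range` on the client side catches typos before they reach the server.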

Visualizations

All charts on the monitoring page are rendered with Recharts. The charts support:

  • Hover tooltips with exact values and timestamps
  • Responsive resizing
  • Automatic axis scaling
  • Multiple series overlay (for example, CPU usage across multiple containers)

External Integrations

Cloud OS has built-in monitoring, but you can also send metrics to external tools installed from the App Store:

  • Prometheus — install from the App Store and point it at the Cloud OS metrics endpoint
  • Grafana — install from the App Store and connect to Prometheus or query Cloud OS directly

Anomaly Detection

Cloud OS can detect unusual metric behavior automatically using statistical analysis, without requiring manual threshold configuration.

How It Works

The system maintains rolling baselines for each metric, broken down by hour of day and day of week. When a new metric sample arrives, it computes a Z-score against the baseline:

| Z-Score | Severity | Meaning |
| --- | --- | --- |
| > 3.0 | Critical | Extreme deviation from normal |
| > 2.0 and ≤ 3.0 | Warning | Notable deviation from normal |
| ≤ 2.0 | Normal | Within expected range |

Baselines are updated continuously using an exponential moving average, so the system adapts to gradual changes in your workload patterns.
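The sketch below shows one way the scoring described above could be put together: a baseline per (day of week, hour of day) slot, updated with an exponentially weighted mean and variance, and a Z-score mapped to a severity. The smoothing factor `alpha`, the use of UTC for slots, the use of the Z-score's magnitude, and all names are assumptions for illustration, not the Cloud OS implementation.

```python
import math
from datetime import datetime, timezone

class Baseline:
    """Exponentially weighted mean and variance for one time slot."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # smoothing factor (assumed value)
        self.mean = None
        self.variance = 0.0

    def z_score(self, value):
        # No usable baseline yet, or no observed spread: report 0.
        if self.mean is None or self.variance == 0.0:
            return 0.0
        return (value - self.mean) / math.sqrt(self.variance)

    def update(self, value):
        if self.mean is None:
            self.mean = value
            return
        diff = value - self.mean
        incr = self.alpha * diff
        self.mean += incr
        self.variance = (1 - self.alpha) * (self.variance + diff * incr)

def slot_for(ts):
    """Key baselines by (day of week, hour of day) in UTC."""
    moment = datetime.fromtimestamp(ts, tz=timezone.utc)
    return (moment.weekday(), moment.hour)

def classify(z):
    """Map a Z-score to a severity using the 2.0 and 3.0 cut-offs."""
    magnitude = abs(z)  # assumption: drops count as much as spikes
    if magnitude > 3.0:
        return "critical"
    if magnitude > 2.0:
        return "warning"
    return "normal"

baselines = {}

def observe(ts, value):
    """Score a sample against its slot's baseline, then fold it in."""
    baseline = baselines.setdefault(slot_for(ts), Baseline())
    severity = classify(baseline.z_score(value))
    baseline.update(value)
    return severity
```

Scoring before updating keeps an outlier from softening its own Z-score. A slot with little or no history reports "normal", which is consistent with baselines needing time to become useful.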

Predictive Alerting

The predictor uses linear regression on the last 24 hours of data to extrapolate metric values 1 hour ahead. If the predicted value crosses an alert threshold, a predictive alert fires before the problem actually occurs.
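The extrapolation step can be sketched with an ordinary least-squares line fit. The function names, the `(unix_ts, value)` sample shape, and treating the threshold as an upper bound are assumptions for this example; the real predictor may differ in detail.

```python
def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    spread = sum((t - mean_t) ** 2 for t, _ in points)
    if spread == 0:
        raise ValueError("all timestamps are identical")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / spread
    return slope, mean_v - slope * mean_t

def predict_ahead(points, horizon_seconds=3600):
    """Extrapolate the fitted line horizon_seconds past the last sample."""
    slope, intercept = fit_line(points)
    last_t = max(t for t, _ in points)
    return slope * (last_t + horizon_seconds) + intercept

def will_breach(points, threshold, horizon_seconds=3600):
    """True if the predicted value reaches the threshold (assumed upper bound)."""
    return predict_ahead(points, horizon_seconds) >= threshold

# 24 hourly disk-usage readings growing 2 points per hour from 40%.
history = [(hour * 3600, 40.0 + 2.0 * hour) for hour in range(24)]
predicted = predict_ahead(history)
```

With this perfectly linear history the prediction one hour past the last sample is 88%, so a threshold of 85% would trigger a predictive alert while a threshold of 95% would not.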

Predictive alerting requires the predictive_alerting license feature (Pro+ plan). Anomaly detection baselines build up over time; allow at least 7 days of data for them to become accurate.

Anomaly Detection API

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/alerting/anomalies | GET | List recent anomaly events |
| /api/alerting/baselines | GET | View baselines for a metric. Use the metric query parameter |
| /api/alerting/predictions | GET | Get a prediction for a metric. Use the metric query parameter |
| /api/alerting/rules | POST | Create an anomaly alert rule (set type to anomaly) |

Configuring Anomaly Rules

Create an anomaly rule by posting to the alerting rules endpoint with type: "anomaly":

  • sensitivity — Z-score threshold (default: 2.0)
  • metric — which metric to monitor (e.g., cpu_percent, mem_used)
  • cooldown — minimum time between alerts for the same metric
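A request body for such a rule might be assembled as follows. The field names come from the list above; the flat JSON layout and the `"15m"` duration format for `cooldown` are assumptions, so check the actual schema before relying on them.

```python
import json

def anomaly_rule(metric, sensitivity=2.0, cooldown="15m"):
    """Build a request body for POST /api/alerting/rules.

    Assumptions: the body is a flat JSON object, and cooldown is a duration
    string such as "15m"; the real schema may differ.
    """
    if sensitivity <= 0:
        raise ValueError("sensitivity must be a positive Z-score threshold")
    return {
        "type": "anomaly",
        "metric": metric,
        "sensitivity": sensitivity,
        "cooldown": cooldown,
    }

body = json.dumps(anomaly_rule("cpu_percent", sensitivity=3.0))
```

Raising `sensitivity` to 3.0 here would limit alerts to deviations in the Critical band.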

Cost Optimization

The cost optimization engine analyzes resource usage patterns and provides actionable recommendations to reduce infrastructure costs.

Cost optimization requires the cost_optimization license feature (Pro+ plan).

Cost Model

Cloud OS estimates per-app costs based on actual resource consumption:

| Resource | Default Rate | Configurable |
| --- | --- | --- |
| CPU | $/core-hour | Yes |
| Memory | $/GB-hour | Yes |
| Disk | $/GB-hour | Yes |
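To make the units concrete, the sketch below multiplies usage by a rate for each resource and sums the result. The rate values are placeholders invented for this example and are not Cloud OS defaults; the `Rates` and `estimate_cost` names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rates:
    # Placeholder values for illustration only, not Cloud OS defaults.
    cpu_core_hour: float = 0.02
    memory_gb_hour: float = 0.005
    disk_gb_hour: float = 0.0001

def estimate_cost(cpu_core_hours, memory_gb_hours, disk_gb_hours, rates=Rates()):
    """Sum usage multiplied by the rate for each resource."""
    return (
        cpu_core_hours * rates.cpu_core_hour
        + memory_gb_hours * rates.memory_gb_hour
        + disk_gb_hours * rates.disk_gb_hour
    )

# An app averaging 0.5 cores, 2 GB RAM, and 10 GB disk over a 720-hour month.
hours = 720
monthly = estimate_cost(0.5 * hours, 2 * hours, 10 * hours)
```

With these placeholder rates the example app comes to 7.20 for CPU, 7.20 for memory, and 0.72 for disk, about 15.12 in total.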

Recommendations

The engine generates three types of recommendations:

  • Idle Apps — Applications using less than 1% CPU for over 24 hours. Consider stopping or removing them.
  • Right-Sizing — Applications where allocated resources significantly exceed actual usage. Suggests adjusted resource limits.
  • Scale-to-Zero — Applications with no traffic outside business hours that could benefit from scheduling.
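The idle-app rule above can be expressed as a check over recent CPU samples. This is a sketch under stated assumptions: samples are `(unix_ts, cpu_percent)` pairs, and an app is only flagged once its history covers the full 24-hour window. The function name is hypothetical.

```python
def is_idle(cpu_samples, now, window_seconds=24 * 3600, threshold=1.0):
    """Return True if every CPU sample in the window is below the threshold.

    cpu_samples is a list of (unix_ts, cpu_percent). Assumption: the history
    must span the whole window, so a freshly installed app is not flagged.
    """
    if not cpu_samples:
        return False
    oldest = min(ts for ts, _ in cpu_samples)
    if oldest > now - window_seconds:
        return False  # not enough history yet
    recent = [cpu for ts, cpu in cpu_samples if ts >= now - window_seconds]
    return bool(recent) and all(cpu < threshold for cpu in recent)
```

Requiring full-window coverage is a design choice made for the sketch; without it, an app installed an hour ago with no traffic would immediately be recommended for removal.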

Cost API

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/cost/summary | GET | Total estimated cost and top-spending apps |
| /api/cost/recommendations | GET | Idle apps, right-sizing, and scheduling suggestions |
| /api/cost/trends | GET | Cost over time. Use the period parameter (7d or 30d) |
| /api/cost/settings | PUT | Configure cost rates |

Event Timeline

The dashboard includes a chronological timeline of all system events, providing a unified view of everything happening on your server.

Event Categories

| Category | Events Tracked |
| --- | --- |
| app | Deploy, stop, crash, restart |
| backup | Start, complete, fail |
| alert | Fired, resolved |
| update | Installed, rolled back |
| auth | Login, logout |
| system | Start, stop, configuration change |

Events are color-coded by severity: green (info), yellow (warning), red (critical).

Timeline API

| Endpoint | Method | Description |
| --- | --- | --- |
| /api/timeline | GET | Paginated event list. Filter by category, from, to, limit |
| /api/timeline/stats | GET | Event counts grouped by category |

The timeline on the dashboard auto-refreshes to show new events in real time.

Troubleshooting

Charts show no data

Verify that the Cloud OS backend is running and collecting metrics. Check the application logs for errors related to the metrics collection goroutine. If the server was recently installed, allow a few minutes for data to accumulate.

Historical data is missing

Check that the SQLite database is writable and that the rollup job has not encountered errors. The rollup compacts 1-minute data into 1-hour data after 24 hours, and 1-hour data into 1-day data after 30 days. Data outside these retention windows is permanently removed.

Metrics API returns empty results

Verify that the range parameter is valid (one of 1h, 6h, 24h, 7d, 30d). Ensure you are authenticated with a valid JWT. If you are filtering by app, check that the app_id parameter matches an installed app.