Aether-Guard is a distributed infrastructure monitoring and migration system that combines a high-performance C++ agent, a .NET Core control plane, a Python AI engine, and a Next.js dashboard for real-time telemetry and risk-aware recovery.
This README documents the current implementation in this repo (v1.x) and the v2.2 reference architecture with a concrete implementation guide.
- Stage: MVP implementation in repo; v2.2 is a reference architecture target
- License: MIT
- Authors: Qi Junyi, Xiao Erdong (2026)
This project targets a product-grade release, not a demo. The following standards are required for delivery.
- Time to first value (TTFV): first telemetry visible in under 15 minutes.
- One-command deploy, one-command fire drill, one-command rollback.
- Self-check and guidance: CLI/scripts validate dependencies, ports, CRIU/eBPF, and permissions with actionable errors.
- Guided first-run: register agent, receive data, trigger drill, observe migration.
- Explainability: AI risk reason, migration decision, and failures visible in the UI.
- Recovery help: diagnostics bundle export with logs, snapshots, and config.
- Docs-as-product: README, Quickstart, Troubleshooting, FAQ, deploy/upgrade/rollback.
- Security and trust chain: auth for Agent/Telemetry/Artifact/Command, mTLS with rotation, audit logs, SBOM/SLSA/signing.
- Reliability and resilience: idempotency, retries with backoff, rate limits, circuit breakers, MQ backpressure and DLQ.
- Observability: OpenTelemetry traces/metrics/logs, consistent trace_id, health and readiness probes.
- Deployment and operations: Helm + Compose, config validation, backup/restore, runbooks.
- Data governance: schema registry and compatibility, retention/cleanup, snapshot lifecycle, migrations.
- Compatibility and evolution: API versioning, capability negotiation, deprecation policy.
- Performance and scale: streaming uploads/downloads, capacity baselines, horizontal scaling strategy.
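The "retries with backoff" requirement above can be sketched with a small helper; this is an illustration of the pattern (exponential backoff with full jitter), not code from the repo, and the function and parameter names are hypothetical:

```python
import random
import time

def retry_with_backoff(op, attempts=5, base=0.1, cap=2.0, sleep=time.sleep):
    """Call op(); on failure, retry with exponentially growing, jittered delays."""
    for i in range(attempts):
        try:
            return op()
        except Exception:
            if i == attempts - 1:
                raise  # budget exhausted: surface the last error
            delay = min(cap, base * (2 ** i)) * random.random()  # full jitter
            sleep(delay)

# Demo: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(retry_with_backoff(flaky, sleep=lambda _: None))  # prints "ok"
```

Production code would also pair this with a circuit breaker so a persistently failing dependency stops consuming the retry budget.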
- Agent (C++): REST/JSON telemetry; CRIU checkpointing with automatic simulation fallback.
- Core API (.NET 8): REST controllers, RabbitMQ ingestion worker, migration orchestration, PostgreSQL storage.
- AI Engine (FastAPI): volatility and trend rules; Core currently sends empty spotPriceHistory (see Risk Logic).
- Dashboard (Next.js): telemetry and command visibility with NextAuth credentials.
- Storage: snapshots stored on local filesystem (Docker volume in compose).
- Security: API key for command endpoints; no mTLS, OpenTelemetry, or schema registry yet.
- No self-check tooling, guided onboarding, or diagnostics bundle.
- No end-to-end auth on telemetry or artifacts; no mTLS.
- No OpenTelemetry tracing or standardized structured logging.
- No schema registry or compatibility policy for MQ events.
- No object storage or retention policy for snapshots.
- Core API enables gRPC + JSON transcoding so internal traffic uses Protobuf and external clients keep REST/JSON.
- W3C trace context must propagate across HTTP and RabbitMQ by injecting traceparent and tracestate headers.
- Agent performs capability probe (kernel, CRIU, eBPF, feature flags) and reports a Capabilities payload at registration.
- Core responds with AgentConfig to enable or disable features based on compatibility and policy.
- RabbitMQ messages use explicit schemas (Avro/Protobuf) with schema_id + payload, plus upcaster logic for old events.
- Snapshots move to object storage (MinIO/S3) with hot, warm, and cold retention policies.
- Workload identity uses SPIFFE/SPIRE with short-lived SVIDs; mTLS replaces static certs.
- Supply chain uses SLSA provenance, SBOM, and signed images.
- RabbitMQ uses QoS prefetch and explicit ack; failures route to DLQ.
- Idempotency keys are required for critical commands; agents cache recent request_ids to avoid re-execution.
- Policy plugins run in Wasmtime with fuel and memory limits.
- Runbook automation triggers scripts and attaches artifacts to alerts.
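The trace-context requirement above can be sketched as a helper that stamps W3C `traceparent`/`tracestate` onto outgoing AMQP message headers; with pika the resulting dict would be passed as `pika.BasicProperties(headers=...)`. The header-building logic here is an illustration under the W3C format, not the Core's actual implementation:

```python
import secrets

def make_traceparent(trace_id=None, span_id=None):
    """Build a W3C traceparent value: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes  -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"          # flags=01: sampled

def amqp_headers_with_trace(existing=None):
    """Return AMQP headers carrying W3C trace context, preserving existing keys."""
    headers = dict(existing or {})
    headers.setdefault("traceparent", make_traceparent())
    headers.setdefault("tracestate", "")  # vendor-specific entries, empty here
    return headers

print(amqp_headers_with_trace()["traceparent"])
```

A consumer then extracts these headers and continues the same trace, which is what keeps `trace_id` consistent across HTTP and RabbitMQ hops.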
- Add self-check scripts (agent/core dependencies, ports, permissions).
- Add first-run guide in the dashboard.
- Add explainability fields and failure reasons in UI.
- Add diagnostics bundle export.
- Expand docs: Quickstart, Troubleshooting, FAQ, upgrade/rollback.
- Enable gRPC JSON transcoding in Core.
- Define Protobuf contracts for Agent/Core APIs.
- Inject W3C trace context into RabbitMQ headers.
- Add DetectCapabilities() in the Agent boot sequence.
- Extend /register to accept Capabilities and return AgentConfig.
- Introduce SPIRE or cert-manager based certificate rotation.
- Deploy MinIO (S3 compatible) for snapshots.
- Update ArtifactController to stream to S3 SDK.
- Add SLSA provenance generation in CI.
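The capability-negotiation items above (DetectCapabilities, extended /register) could look roughly like the sketch below. All field names are hypothetical, not the wire contract; the real Agent probes CRIU/eBPF natively in C++:

```python
import json
import platform

def detect_capabilities():
    """Illustrative stand-in for the Agent's DetectCapabilities() probe."""
    return {
        "kernel": platform.release(),
        "criu": False,       # a real agent would run e.g. `criu check`
        "ebpf": False,       # a real agent would probe bpf() availability
        "featureFlags": ["simulation-fallback"],
    }

def build_agent_config(caps):
    """Sketch of the Core side: map a Capabilities payload to an AgentConfig."""
    return {
        "checkpointMode": "criu" if caps["criu"] else "simulation",
        "ebpfTelemetry": bool(caps["ebpf"]),
    }

print(json.dumps(build_agent_config(detect_capabilities())))
```

The key design point is that the Agent reports facts and the Core makes the policy decision, so feature rollout stays centrally controlled.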
- Agent (C++) -> Core API (.NET) -> AI Engine (FastAPI) -> Core API -> PostgreSQL -> Dashboard (Next.js)
- agent-service: C++ telemetry agent with CRIU-based checkpointing (auto-falls back to simulation mode when CRIU is unavailable).
- core-service: ASP.NET Core API for ingestion, analysis, migration orchestration, and data access.
- ai-service: FastAPI service for volatility-based risk scoring.
- web-service: Next.js dashboard with authentication and visualization.
- db: PostgreSQL for persistence.
- rabbitmq: message broker for telemetry ingestion.
- redis: dedup cache for telemetry ingestion.
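The redis dedup cache works on a set-if-absent-with-TTL pattern. The sketch below uses an in-memory stand-in so it runs anywhere; in production the same check would be `redis.set(key, 1, nx=True, ex=ttl)` with redis-py. Key layout and TTL are assumptions:

```python
import time

class InMemoryDedup:
    """Stand-in for Redis SET key NX EX ttl; swap for redis-py in production."""
    def __init__(self):
        self._seen = {}

    def set_nx_ex(self, key, ttl_seconds):
        now = time.monotonic()
        expires = self._seen.get(key)
        if expires is not None and expires > now:
            return False  # key already set and not yet expired: duplicate
        self._seen[key] = now + ttl_seconds
        return True

def should_process(cache, agent_id, message_id, ttl=300):
    """Process a telemetry message only the first time its id is seen."""
    return cache.set_nx_ex(f"telemetry:{agent_id}:{message_id}", ttl)

cache = InMemoryDedup()
print(should_process(cache, "agent-1", "msg-42"))  # True: first sighting
print(should_process(cache, "agent-1", "msg-42"))  # False: duplicate dropped
```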
- Core API: http://localhost:5000
- Dashboard: http://localhost:3000
- AI Engine: http://localhost:8000
- PostgreSQL: localhost:5432
- RabbitMQ Management: http://localhost:15672
```bash
docker compose up --build -d
```

Open the dashboard at http://localhost:3000.
If you want to simulate migrations, start at least two agents:

```bash
docker compose up -d --scale agent-service=2 agent-service
```

For a guided first run:

```bash
docker compose up --build -d
```

- Open http://localhost:3000 and log in.
- Start at least two agents (see command above).
- Run the fire drill:

  ```bash
  python scripts/fire_drill.py start
  ```

- Confirm the dashboard shows risk state changes and migration activity.
Trigger a market crash simulation:

```bash
python scripts/fire_drill.py start
```

Reset back to stable:

```bash
python scripts/fire_drill.py stop
```

These scripts validate demo flows and can be reused as product readiness checks:

- verify_blueprint_v1.py
- verify_phase2.py
- verify_phase3.py
- Username: admin
- Password: admin123
Override via environment variables:
- DASHBOARD_ADMIN_USER
- DASHBOARD_ADMIN_PASSWORD
Core API database connection (docker-compose.yml):
- ConnectionStrings__DefaultConnection=Host=db;Database=AetherGuardDb;Username=postgres;Password=password
Core API artifact base URL (docker-compose.yml):
- ArtifactBaseUrl=http://core-service:8080
Dashboard auth (docker-compose.yml):
- AUTH_SECRET=super-secret-key
- AUTH_TRUST_HOST=true
For production, set a strong AUTH_SECRET and use a secret manager.
Core API:
- POST /api/v1/ingestion - receive telemetry from agent
- GET /api/v1/dashboard/latest - latest telemetry + AI analysis
- GET /api/v1/dashboard/history - last 20 telemetry records (chronological)
- POST /api/v1/market/signal - update market signal file
- POST /api/v1/artifacts/upload/{workloadId} - upload snapshot
- GET /api/v1/artifacts/download/{workloadId} - download latest snapshot
AI Engine:
- POST /analyze - classify telemetry with spotPriceHistory, rebalanceSignal, capacityScore
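A request body for `POST /analyze` can be assembled as below. Field names follow this README (spotPriceHistory, rebalanceSignal, capacityScore); the exact value types are assumptions, not the verified contract, and the commented `requests` call shows how you might hit a locally running service:

```python
import json

def analyze_payload(spot_prices, rebalance=False, capacity=1.0):
    """Build an /analyze request body with the fields listed above."""
    return {
        "spotPriceHistory": list(spot_prices),
        "rebalanceSignal": rebalance,
        "capacityScore": capacity,
    }

body = analyze_payload([0.10, 0.11, 0.35], rebalance=False)
print(json.dumps(body))
# Against a running AI Engine:
#   requests.post("http://localhost:8000/analyze", json=body, timeout=5)
```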
The demo uses file-based signals that are mounted into containers via docker-compose:
- Core signal: src/services/core-dotnet/AetherGuard.Core/Data/market_signal.json
- AI prices: src/services/ai-engine/Data/spot_prices.json
The fire drill script writes these files and creates the directories if missing.
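What the fire drill does to the Core signal file can be mimicked with a few lines. The single `rebalanceSignal` field is inferred from the risk rules below and is an assumption; the real file may carry additional fields:

```python
import json
import tempfile
from pathlib import Path

def write_market_signal(path, crash):
    """Write a market_signal.json-style file, creating directories if missing."""
    p = Path(path)
    p.parent.mkdir(parents=True, exist_ok=True)  # mirror the script's dir creation
    p.write_text(json.dumps({"rebalanceSignal": crash}))  # assumed schema
    return p

signal = write_market_signal(Path(tempfile.gettempdir()) / "market_signal.json", True)
print(signal.read_text())
```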
Risk scoring uses these rules:
- rebalanceSignal=true: CRITICAL (Cloud Provider Signal)
- Trend > 0.2: CRITICAL (Price Spike Detected)
- Volatility > 5.0: CRITICAL (Market Instability)
- Otherwise: LOW (Stable)
Note: The Core API currently sends an empty spotPriceHistory list; wire that data into Analyze requests to drive volatility decisions.
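The rule set above can be expressed as a small reference function. The trend and volatility formulas here are assumptions (last-step price delta and population standard deviation); the AI Engine's exact math may differ, but the rule precedence matches the list:

```python
from statistics import pstdev

def classify(spot_prices, rebalance_signal):
    """Apply the risk rules in order: provider signal, trend, volatility, stable."""
    if rebalance_signal:
        return "CRITICAL", "Cloud Provider Signal"
    if len(spot_prices) >= 2:
        trend = spot_prices[-1] - spot_prices[-2]   # assumed trend definition
        if trend > 0.2:
            return "CRITICAL", "Price Spike Detected"
        if pstdev(spot_prices) > 5.0:               # assumed volatility measure
            return "CRITICAL", "Market Instability"
    return "LOW", "Stable"

print(classify([0.10, 0.12], rebalance_signal=True))   # provider signal wins
print(classify([0.10, 0.45], rebalance_signal=False))  # trend 0.35 > 0.2
```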
TelemetryRecord persisted to PostgreSQL:
- AgentId
- WorkloadTier
- RebalanceSignal
- DiskAvailable
- CpuUsage (defaults to 0 in the current pipeline)
- MemoryUsage (defaults to 0 in the current pipeline)
- AiStatus
- AiConfidence
- RootCause
- PredictedCpu
- Timestamp (UTC)
For production, add EF Core migrations and a formal upgrade process.
Dashboard:

```bash
cd src/web/dashboard
npm install
npm run dev
```

Core API:
```bash
cd src/services/core-dotnet/AetherGuard.Core
dotnet restore
dotnet run
```

AI Engine:
```bash
cd src/services/ai-engine
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000
```

C++ Agent:
```bash
cd src/services/agent-cpp
cmake -S . -B build
cmake --build build
./build/AetherAgent
```

Note: If CRIU is unavailable (e.g. on Windows/Docker Desktop), the agent runs in simulation mode and still produces a valid snapshot archive for demo flows.
- Authentication uses NextAuth Credentials for the MVP; use an external identity provider for production.
- CORS is limited to http://localhost:3000 in development.
- Secrets and credentials must be rotated for any public deployment.
Please read CONTRIBUTING.md for setup, workflow, and PR guidelines.
MIT License. See LICENSE.
