Remediation Approval - User Guide

A comprehensive guide for using the memory remediation approval gateway to review and apply AI-generated memory adjustments for Haiven container services.


Table of Contents

  1. Getting Started
  2. Understanding the Workflow
  3. Using the Approval UI
  4. Viewing Pending Recommendations
  5. Email Notifications
  6. Safety and Security
  7. Cronicle Job Configuration
  8. Troubleshooting
  9. Viewing Reports
  10. FAQ
  11. Quick Reference

Getting Started

Accessing the Service

Open your browser and navigate to the Approval UI at https://remediation.haiven.site

The service works best with modern browsers (Chrome, Firefox, Safari, Edge).

What This Service Does

The remediation approval service acts as a human-in-the-loop gateway for AI-generated memory remediation recommendations. When containers on the Haiven platform approach memory limits, an automated Cronicle job analyzes the situation using AI (LiteLLM) and generates recommendations. These recommendations are sent to administrators for approval before any changes are applied.

Key capabilities:
- Review AI-analyzed memory recommendations with detailed evidence
- Approve or reject memory limit changes with a single click
- View complete audit trail of all decisions
- Monitor pending recommendations
- Automatic token expiry and rate limiting for safety


Understanding the Workflow

End-to-End Flow

  1. Monitoring (Every 5 minutes)
    - Cronicle runs memory-remediation.py script
    - Script queries Prometheus for container memory metrics
    - Containers above 85% memory threshold are flagged

  2. Analysis
    - For each flagged container, the script gathers its memory metrics from Prometheus and passes them to the LLM (via LiteLLM) for analysis

  3. Recommendation Generation
    - AI generates one of four actions (see table below)
    - Each recommendation includes a risk score (1-10), detailed analysis, and evidence

  4. Email Notification
    - Email sent to admin via smtp-relay
    - Contains summary with three action links: Approve, Reject, Details

  5. Human Approval
    - Admin reviews recommendation in the approval UI
    - Clicks Approve or Reject

  6. Change Application (if approved)
    - Service modifies deploy.resources.limits.memory in docker-compose.yml
    - Runs docker compose up -d to restart container with new limit
    - Records decision in audit log
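
The threshold check in step 1 can be sketched as follows. This is a minimal illustration, not the actual memory-remediation.py logic: the function name flag_containers and the input shape (container name mapped to usage and limit in bytes) are assumptions, and the Prometheus query itself is omitted.

```python
from typing import Dict, List, Tuple

# Default threshold from this guide (REMEDIATION_THRESHOLD, in percent).
DEFAULT_THRESHOLD_PERCENT = 85.0


def flag_containers(
    metrics: Dict[str, Tuple[int, int]],
    threshold_percent: float = DEFAULT_THRESHOLD_PERCENT,
) -> List[str]:
    """Return names of containers whose memory usage is above the threshold.

    `metrics` maps a container name to (usage_bytes, limit_bytes). This input
    shape is hypothetical; the real script reads these values from Prometheus.
    """
    flagged = []
    for name, (usage, limit) in sorted(metrics.items()):
        if limit <= 0:
            # No memory limit set, so a usage percentage is undefined.
            continue
        # Compare usage/limit against the threshold without dividing,
        # so a container at exactly the threshold is not flagged.
        if usage * 100 > threshold_percent * limit:
            flagged.append(name)
    return flagged
```

A container sitting exactly at the threshold is not flagged here, matching the "above 85%" wording; whether the real script uses a strict comparison is not stated in this guide.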

Recommendation Actions

Action          | Meaning                            | What Happens on Approval
INCREASE_LIMIT  | Container needs more memory        | Compose file updated with new limit, container restarts
RESTART         | Memory leak detected               | Informational; admin restarts manually
INVESTIGATE     | Unusual pattern needs human review | No automatic action
NO_ACTION       | Memory usage is normal             | No email sent, logged in report only
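
One way to model the four actions in code is an enum with small helper predicates. This is a sketch based only on the table above: the class and function names are hypothetical, and the assumption that every action other than NO_ACTION triggers an email is inferred from the table rather than confirmed by the service's source.

```python
from enum import Enum


class Action(Enum):
    """The four recommendation actions listed in this guide."""

    INCREASE_LIMIT = "INCREASE_LIMIT"
    RESTART = "RESTART"
    INVESTIGATE = "INVESTIGATE"
    NO_ACTION = "NO_ACTION"


def modifies_compose_file(action: Action) -> bool:
    """Only INCREASE_LIMIT results in a compose file change on approval."""
    return action is Action.INCREASE_LIMIT


def sends_email(action: Action) -> bool:
    """NO_ACTION is logged in the report only (assumed: all others notify)."""
    return action is not Action.NO_ACTION
```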

Using the Approval UI

Reviewing a Recommendation

  1. Click "Details" from the email or pending list
  2. Review the details page showing:
    - Container name and service
    - Recommended action with risk score badge
    - Current memory metrics (usage, limit, 24h average, 24h peak)
    - Compose file location
    - LLM Analysis with AI reasoning
    - Evidence from Prometheus metrics
  3. Check the risk score (1-10):
    - 1-3 (Green) — Low risk, straightforward change
    - 4-6 (Yellow) — Medium risk, some uncertainty
    - 7-10 (Red) — High risk, large change or unusual pattern
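
The score bands above map to a simple lookup. The following is an illustrative sketch; the function name risk_band is hypothetical and not part of the service.

```python
def risk_band(score: int) -> str:
    """Map a risk score (1-10) to the color band used in this guide."""
    if not 1 <= score <= 10:
        raise ValueError(f"risk score must be between 1 and 10, got {score}")
    if score <= 3:
        return "green"   # low risk, straightforward change
    if score <= 6:
        return "yellow"  # medium risk, some uncertainty
    return "red"         # high risk, large change or unusual pattern
```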

Approving a Recommendation

  1. Click "Approve" button
  2. Review the confirmation page showing before/after memory limits
  3. Click "Confirm & Apply"

What happens:
- Compose file's deploy.resources.limits.memory is updated
- docker compose up -d restarts the container (brief downtime)
- Audit entry is created with full details
- Token is consumed and cannot be reused

Rejecting a Recommendation

  1. Click "Reject" from the details page
  2. No changes are made to compose files or containers
  3. Decision is logged in audit trail

Viewing Pending Recommendations

Navigate to "Pending" in the UI to see all recommendations awaiting action.

Each entry shows:
- Container name and recommended action (color-coded badge)
- New memory limit (if applicable) and risk score
- Created timestamp
- Status: Active (can approve) or EXPIRED (>24h, wait for next cycle)
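
The Active/EXPIRED status follows from the 24-hour token lifetime. A minimal sketch, assuming the creation timestamp is available as a timezone-aware datetime (the function name pending_status is hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Token lifetime described in this guide.
TOKEN_LIFETIME = timedelta(hours=24)


def pending_status(created_at: datetime, now: datetime) -> str:
    """Return "Active" while the token can still be approved, else "EXPIRED"."""
    if now - created_at > TOKEN_LIFETIME:
        return "EXPIRED"
    return "Active"
```

For example, calling pending_status(created_at, datetime.now(timezone.utc)) classifies a recommendation at the current time.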


Email Notifications

Email Contents

When a recommendation is generated, you receive an HTML email containing a summary of the recommendation and three action links: Approve, Reject, and Details.

Token Expiry

Approval tokens expire 24 hours after the recommendation is created. An expired recommendation is shown as EXPIRED in the pending list and can no longer be approved. If the issue persists, a new recommendation is generated on a later monitoring cycle.


Safety and Security

HMAC Token Verification

Every approval link contains an HMAC-SHA256 signature:
- Tokens are cryptographically signed with REMEDIATION_SECRET
- Tampered tokens are rejected with 403 Forbidden
- Tokens include container name, action, new limit, and timestamp
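
The signing and verification steps can be sketched with Python's standard hmac module. This is an assumption-laden illustration, not the service's actual token format: the payload field names, the JSON serialization, and the function names are hypothetical. Only the use of HMAC-SHA256 with REMEDIATION_SECRET comes from this guide.

```python
import hashlib
import hmac
import json


def sign_payload(payload: dict, secret: str) -> str:
    """Return a hex HMAC-SHA256 signature over a canonical JSON encoding."""
    # Sorted keys and fixed separators give a stable byte string to sign.
    message = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hmac.new(
        secret.encode("utf-8"), message.encode("utf-8"), hashlib.sha256
    ).hexdigest()


def verify_payload(payload: dict, signature: str, secret: str) -> bool:
    """Check a signature in constant time; False means tampered or wrong key."""
    expected = sign_payload(payload, secret)
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

A service following this pattern would return 403 Forbidden whenever verify_payload is False. hmac.compare_digest is used instead of == to avoid leaking timing information during comparison.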

Rate Limiting

What Changes Are Made

The service has very limited scope:
- Only modifies: deploy.resources.limits.memory in docker-compose.yml
- Never touches: Any other compose fields, environment variables, or volumes
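
The narrow scope described above can be illustrated with a function that touches only that one key. This sketch operates on an already-parsed compose document (a plain dict) and leaves YAML loading and writing out; the function name set_memory_limit is hypothetical and the real service may edit the file differently.

```python
import copy


def set_memory_limit(compose: dict, service: str, new_limit: str) -> dict:
    """Return a copy of `compose` with only the service's memory limit changed.

    Sets services.<service>.deploy.resources.limits.memory and creates the
    intermediate mappings if they are missing. All other fields are untouched.
    """
    updated = copy.deepcopy(compose)
    services = updated.get("services", {})
    if service not in services:
        raise KeyError(f"service {service!r} not found in compose file")
    limits = (
        services[service]
        .setdefault("deploy", {})
        .setdefault("resources", {})
        .setdefault("limits", {})
    )
    limits["memory"] = new_limit
    return updated
```

Working on a deep copy keeps the original document available for the before/after comparison shown on the confirmation page.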

Audit Trail

Every action is logged with:
- Timestamp, token, container name
- Action (APPROVED/REJECTED)
- Old and new memory limits
- Result status (success/error/restart_failed/rejected)

Audit log location: /mnt/storage/remediation-approval/data/audit.json
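
An audit record with the fields listed above might be built as follows. The JSON key names and the function name build_audit_entry are assumptions for illustration; the actual schema of audit.json is not documented in this guide.

```python
import json
from datetime import datetime, timezone
from typing import Optional

VALID_ACTIONS = {"APPROVED", "REJECTED"}
VALID_RESULTS = {"success", "error", "restart_failed", "rejected"}


def build_audit_entry(
    token: str,
    container: str,
    action: str,
    old_limit: Optional[str],
    new_limit: Optional[str],
    result: str,
    now: Optional[datetime] = None,
) -> dict:
    """Build one JSON-serializable audit record (hypothetical key names)."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action}")
    if result not in VALID_RESULTS:
        raise ValueError(f"unknown result: {result}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "timestamp": timestamp,
        "token": token,
        "container": container,
        "action": action,
        "old_limit": old_limit,
        "new_limit": new_limit,
        "result": result,
    }
```

The returned dict can be passed to json.dumps before being appended to the log.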


Cronicle Job Configuration

Where to Configure

Access Cronicle at: https://scheduler.haiven.site

Navigate to Schedule > Memory Remediation > Edit.

Required Environment Variables

Set in the Cronicle job's Environment tab:

REMEDIATION_SECRET=<same-as-fastapi-env>
LITELLM_MASTER_KEY=<from-litellm/.env>

Optional Overrides

Variable               | Default                         | Description
REMEDIATION_THRESHOLD  | 85                              | Memory usage % threshold
LITELLM_MODEL          | qwen3-30b-a3b                   | LLM model for analysis
APPROVAL_BASE_URL      | https://remediation.haiven.site | Base URL for email links
NOTIFICATION_EMAIL     | elijah@elijahryoung.com         | Admin email address
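
A script could read these overrides from the environment and fall back to the documented defaults, roughly as sketched here. The RemediationConfig class and load_config function are hypothetical names, the real script may parse its settings differently, and NOTIFICATION_EMAIL is left out for brevity.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class RemediationConfig:
    threshold_percent: float
    litellm_model: str
    approval_base_url: str


def load_config(env: Optional[Mapping[str, str]] = None) -> RemediationConfig:
    """Read optional overrides, using the defaults listed in this guide."""
    source = os.environ if env is None else env
    return RemediationConfig(
        threshold_percent=float(source.get("REMEDIATION_THRESHOLD", "85")),
        litellm_model=source.get("LITELLM_MODEL", "qwen3-30b-a3b"),
        approval_base_url=source.get(
            "APPROVAL_BASE_URL", "https://remediation.haiven.site"
        ),
    )
```

Passing a plain dict as env makes the parsing easy to exercise without changing the process environment.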

Dry-Run Mode

Test the system before enabling automatic emails:

  1. Add --dry-run to the job's Arguments field
  2. Run job manually and check Cronicle output
  3. Recommendations are generated but no emails sent
  4. Review reports in /mnt/storage/remediation/reports/

Troubleshooting

Token Expired

Approval links expire after 24 hours. An expired recommendation appears as EXPIRED in the pending list and cannot be approved. Wait for the next monitoring cycle: if the issue persists, a new recommendation with a fresh token is generated.

Email Not Received

# Check smtp-relay status
docker logs smtp-relay

# Check latest report
ls -lt /mnt/storage/remediation/reports/ | head -5

Approval Failed

# Check service logs
docker logs remediation-approval

# Common causes: compose file not found, permission denied, service name mismatch

No Recommendations Generated

Manual Test

docker exec cronicle python3 /mnt/apps/docker/infrastructure/cronicle/scripts/memory-remediation.py --dry-run

Viewing Reports

Location

Reports: /mnt/storage/remediation/reports/memory-remediation-*.json
Tokens: /mnt/storage/remediation/reports/tokens/*.json

View Latest Report

# Use ls -t (not -lt) so head returns a bare filename that xargs can pass to cat
ls -t /mnt/storage/remediation/reports/*.json | head -1 | xargs cat | python3 -m json.tool

Automatic Cleanup


FAQ

Q: How often does the system check?
A: Every 5 minutes (configurable in Cronicle).

Q: What if I don't approve a recommendation?
A: After 24h the token expires. If the issue persists, a new recommendation is generated.

Q: Can I set a different memory limit than recommended?
A: Not through the UI. Reject the recommendation and manually edit the compose file.

Q: What does risk score mean?
A: 1-3 = low risk, 4-6 = medium, 7-10 = high. Higher scores indicate more uncertainty or larger changes.

Q: Where is the REMEDIATION_SECRET?
A: In /mnt/apps/docker/infrastructure/remediation-approval/.env and Cronicle job env (must match).


Quick Reference

Resource            | Location
Approval UI         | https://remediation.haiven.site
External Access     | https://remediation.haiven.site
Cronicle Scheduler  | https://scheduler.haiven.site
Audit Log           | /mnt/storage/remediation-approval/data/audit.json
Reports             | /mnt/storage/remediation/reports/
Service Logs        | docker logs -f remediation-approval

Generated by haiven-service-onboarding plugin