Remediation Approval - User Guide

A comprehensive guide for using the memory remediation approval gateway to review and apply AI-generated memory adjustments for Haiven container services.

Getting Started
Understanding the Workflow
Using the Approval UI
Viewing Pending Recommendations
Email Notifications
Safety and Security
Cronicle Job Configuration
Troubleshooting
Viewing Reports
FAQ
Quick Reference

Getting Started

Accessing the Service

Open your browser and navigate to:

Internal URL: https://remediation.haiven.site
External URL: https://remediation.haiven.site

The service works best with modern browsers (Chrome, Firefox, Safari, Edge).

What This Service Does

The remediation approval service acts as a human-in-the-loop gateway for AI-generated memory remediation recommendations. When containers on the Haiven platform approach memory limits, an automated Cronicle job analyzes the situation using AI (LiteLLM) and generates recommendations. These recommendations are sent to administrators for approval before any changes are applied.

Key capabilities:
- Review AI-analyzed memory recommendations with detailed evidence
- Approve or reject memory limit changes with a single click
- View complete audit trail of all decisions
- Monitor pending recommendations
- Automatic token expiry and rate limiting for safety

Understanding the Workflow

End-to-End Flow

Monitoring (Every 5 minutes)
- Cronicle runs the memory-remediation.py script
- The script queries Prometheus for container memory metrics
- Containers above the 85% memory threshold are flagged
Analysis
- For each flagged container, the script gathers:
  - Current memory usage and limit
  - 24-hour average and peak memory
  - Restart count and runtime context
  - Compose file location
- LiteLLM (using qwen3-30b-a3b by default) analyzes the memory profile
Recommendation Generation
- AI generates one of four actions (see table below)
- Each recommendation includes a risk score (1-10), detailed analysis, and evidence
Email Notification
- Email sent to admin via smtp-relay
- Contains summary with three action links: Approve, Reject, Details
Human Approval
- Admin reviews recommendation in the approval UI
- Clicks Approve or Reject
Change Application (if approved)
- Service modifies deploy.resources.limits.memory in docker-compose.yml
- Runs docker compose up -d to restart container with new limit
- Records decision in audit log
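
The flagging rule in the monitoring step can be illustrated with a short sketch. It assumes memory usage and limit are already available in bytes for each container; the function name flag_containers and the input shape are hypothetical and are not taken from memory-remediation.py.

```python
def flag_containers(metrics, threshold_pct=85.0):
    """Return the names of containers whose memory usage exceeds the threshold.

    metrics maps container name -> (usage_bytes, limit_bytes).
    Containers without a limit (limit <= 0) are skipped, because a
    usage percentage cannot be computed for them.
    """
    flagged = []
    for name, (usage, limit) in sorted(metrics.items()):
        if limit <= 0:
            continue
        if usage / limit * 100.0 > threshold_pct:
            flagged.append(name)
    return flagged


sample = {
    "api": (900 * 1024**2, 1024**3),     # roughly 87.9% of 1 GiB
    "worker": (400 * 1024**2, 1024**3),  # roughly 39.1% of 1 GiB
    "nolimit": (500 * 1024**2, 0),       # no memory limit configured
}
print(flag_containers(sample))
```

With the sample data only "api" is flagged. The container without a limit is skipped, which mirrors the troubleshooting note that containers need an explicit memory limit before recommendations can be generated.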

Recommendation Actions

Action	Meaning	What Happens on Approval
INCREASE_LIMIT	Container needs more memory	Compose file updated with new limit, container restarts
RESTART	Memory leak detected	Informational — admin restarts manually
INVESTIGATE	Unusual pattern needs human review	No automatic action
NO_ACTION	Memory usage is normal	No email sent, logged in report only

Using the Approval UI

Reviewing a Recommendation

Click "Details" from the email or pending list
Review the details page showing:
- Container name and service
- Recommended action with risk score badge
- Current memory metrics (usage, limit, 24h average, 24h peak)
- Compose file location
- LLM Analysis with AI reasoning
- Evidence from Prometheus metrics
Check the risk score (1-10):
- 1-3 (Green) — Low risk, straightforward change
- 4-6 (Yellow) — Medium risk, some uncertainty
- 7-10 (Red) — High risk, large change or unusual pattern
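
The score bands above amount to a simple lookup. The following is a minimal sketch of that mapping; the function name risk_band is hypothetical, and the real UI may compute its badge color differently.

```python
def risk_band(score):
    """Map a 1-10 risk score to the color band described in this guide."""
    if not 1 <= score <= 10:
        raise ValueError(f"risk score must be between 1 and 10, got {score}")
    if score <= 3:
        return "green"   # low risk, straightforward change
    if score <= 6:
        return "yellow"  # medium risk, some uncertainty
    return "red"         # high risk, large change or unusual pattern
```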

Approving a Recommendation

Click "Approve" button
Review the confirmation page showing before/after memory limits
Click "Confirm & Apply"

What happens:
- Compose file's deploy.resources.limits.memory is updated
- docker compose up -d restarts the container (brief downtime)
- Audit entry is created with full details
- Token is consumed and cannot be reused

Rejecting a Recommendation

Click "Reject" from the details page
No changes are made to compose files or containers
Decision is logged in audit trail

Viewing Pending Recommendations

Navigate to "Pending" in the UI to see all recommendations awaiting action.

Each entry shows:
- Container name and recommended action (color-coded badge)
- New memory limit (if applicable) and risk score
- Created timestamp
- Status: Active (can be approved) or EXPIRED (older than 24 hours; wait for the next cycle)

Email Notifications

Email Contents

When a recommendation is generated, you receive an HTML email with:

Subject: [Haiven] Memory Remediation: [Action] — [Container]
Summary table: Container name, current usage, limit, 24h average/peak, restart count
Recommendation: Action type, new limit, risk score, LLM analysis, evidence
Action buttons: Approve (green), Reject (red), View Details (gray)
Footer: Token expiry notice (24 hours)

Token Expiry

All approval tokens expire 24 hours after generation
Expired links show an error message
If the issue persists, a new recommendation is generated in the next Cronicle cycle
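
The 24-hour expiry rule can be sketched as a timestamp comparison. This is an illustration only: the function name is_expired is hypothetical, and treating a token as expired at exactly 24 hours is an assumption, since the guide does not specify the boundary.

```python
from datetime import datetime, timedelta, timezone

TOKEN_TTL = timedelta(hours=24)  # lifetime stated in this guide


def is_expired(created_at, now=None):
    """Return True once the token is 24 hours old or older.

    Both datetimes are expected to be timezone-aware (UTC).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - created_at >= TOKEN_TTL
```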

Safety and Security

HMAC Token Verification

Every approval link contains an HMAC-SHA256 signature:
- Tokens are cryptographically signed with REMEDIATION_SECRET
- Tampered tokens are rejected with 403 Forbidden
- Tokens include container name, action, new limit, and timestamp
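
To make the signing scheme concrete, here is a minimal sketch of HMAC-SHA256 signing and verification using Python's standard library. The token layout (fields joined with "|" followed by a hex signature) and the function names are assumptions made for illustration; the service's actual token format is not documented in this guide.

```python
import hashlib
import hmac


def sign_token(secret, container, action, new_limit, timestamp):
    """Build a signed token from the fields the guide lists (hypothetical layout)."""
    payload = f"{container}|{action}|{new_limit}|{timestamp}"
    signature = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"


def verify_token(secret, token):
    """Return True only if the signature matches the payload."""
    payload, _, signature = token.rpartition("|")
    expected = hmac.new(secret.encode(), payload.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(signature.encode(), expected.encode())
```

A token whose payload is altered after signing (for example, a larger memory limit) no longer verifies, which is the case where the service would answer with 403 Forbidden.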

Rate Limiting

Maximum 1 approval per container per hour
Prevents cascading failures from rapid memory adjustments
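
The one-approval-per-container-per-hour rule can be sketched as follows. This is an in-memory illustration with hypothetical names (ApprovalRateLimiter, try_approve); the real service may persist this state or derive it from the audit log.

```python
import time


class ApprovalRateLimiter:
    """Allow at most one approval per container within a time window."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self._last_approval = {}  # container name -> time of last approval

    def try_approve(self, container, now=None):
        """Record and allow the approval, or return False if it is too soon."""
        now = time.time() if now is None else now
        last = self._last_approval.get(container)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_approval[container] = now
        return True
```

Limits are tracked per container, so an approval for one container does not block an approval for another.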

What Changes Are Made

The service has a very limited scope:
- Only modifies: deploy.resources.limits.memory in docker-compose.yml
- Never touches: Any other compose fields, environment variables, or volumes
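
The narrow scope can be illustrated with a sketch that changes a single key in an already-parsed compose mapping. It assumes the compose file has been loaded into nested dictionaries (for example by a YAML parser); the function name set_memory_limit is hypothetical, and reading or writing the file is deliberately left out.

```python
import copy


def set_memory_limit(compose, service, new_limit):
    """Return a copy of the compose mapping with only the memory limit changed."""
    if service not in compose.get("services", {}):
        raise KeyError(f"service {service!r} not found in compose file")
    updated = copy.deepcopy(compose)  # leave the caller's mapping untouched
    limits = (
        updated["services"][service]
        .setdefault("deploy", {})
        .setdefault("resources", {})
        .setdefault("limits", {})
    )
    limits["memory"] = new_limit
    return updated
```

Every other key, including environment variables, volumes, and other resource limits, is carried over unchanged.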

Audit Trail

Every action is logged with:
- Timestamp, token, container name
- Action (APPROVED/REJECTED)
- Old and new memory limits
- Result status (success/error/restart_failed/rejected)

Audit log location: /mnt/storage/remediation-approval/data/audit.json
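
As an illustration of what one audit record might carry, the sketch below models the fields listed above. The field names, the AuditEntry class, and the sample values are hypothetical; the actual schema of audit.json is not documented in this guide.

```python
import json
from dataclasses import asdict, dataclass

ACTIONS = {"APPROVED", "REJECTED"}
RESULTS = {"success", "error", "restart_failed", "rejected"}


@dataclass
class AuditEntry:
    timestamp: str
    token: str
    container: str
    action: str
    old_limit: str
    new_limit: str
    result: str

    def __post_init__(self):
        # Reject values outside the sets this guide describes
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action}")
        if self.result not in RESULTS:
            raise ValueError(f"unknown result: {self.result}")


entry = AuditEntry(
    timestamp="2024-01-01T12:00:00Z",
    token="example-token",
    container="api",
    action="APPROVED",
    old_limit="1g",
    new_limit="2g",
    result="success",
)
print(json.dumps(asdict(entry), indent=2))
```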

Cronicle Job Configuration

Where to Configure

Access Cronicle at: https://scheduler.haiven.site

Navigate to Schedule > Memory Remediation > Edit.

Required Environment Variables

Set in the Cronicle job's Environment tab:

REMEDIATION_SECRET=<same-as-fastapi-env>
LITELLM_MASTER_KEY=<from-litellm/.env>

Optional Overrides

Variable	Default	Description
`REMEDIATION_THRESHOLD`	85	Memory usage % threshold
`LITELLM_MODEL`	qwen3-30b-a3b	LLM model for analysis
`APPROVAL_BASE_URL`	https://remediation.haiven.site	Base URL for email links
`NOTIFICATION_EMAIL`	elijah@elijahryoung.com	Admin email address
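
A script could read these variables with their defaults roughly as follows. This is a sketch under assumptions: the function name load_config and the returned key names are hypothetical, and memory-remediation.py may parse its environment differently.

```python
import os

REQUIRED = ("REMEDIATION_SECRET", "LITELLM_MASTER_KEY")


def load_config(env=None):
    """Read required and optional settings, applying the documented defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing required variables: " + ", ".join(missing))
    return {
        "secret": env["REMEDIATION_SECRET"],
        "litellm_key": env["LITELLM_MASTER_KEY"],
        "threshold_pct": float(env.get("REMEDIATION_THRESHOLD", "85")),
        "model": env.get("LITELLM_MODEL", "qwen3-30b-a3b"),
        "approval_base_url": env.get("APPROVAL_BASE_URL", "https://remediation.haiven.site"),
        "notification_email": env.get("NOTIFICATION_EMAIL", "elijah@elijahryoung.com"),
    }
```

Failing fast on a missing secret or key keeps a misconfigured job from generating approval links that the approval service could never verify.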

Dry-Run Mode

Test the system before enabling automatic emails:

Add --dry-run to the job's Arguments field
Run the job manually and check the Cronicle output
Recommendations are generated, but no emails are sent
Review the reports in /mnt/storage/remediation/reports/

Troubleshooting

Token Expired

Tokens expire after 24 hours
If issue persists, Cronicle generates a new recommendation in the next cycle
No action needed — wait for fresh email

Email Not Received

# Check smtp-relay logs for delivery errors
docker logs smtp-relay

# Check latest report
ls -lt /mnt/storage/remediation/reports/ | head -5

Approval Failed

# Check service logs
docker logs remediation-approval

# Common causes: compose file not found, permission denied, service name mismatch

No Recommendations Generated

Containers may lack an explicit memory limit: add deploy.resources.limits.memory to their compose files
The threshold may be too high: lower REMEDIATION_THRESHOLD in the Cronicle job environment
Prometheus may not be scraping container metrics: verify with curl 'http://prometheus:9090/api/v1/query?query=container_memory_usage_bytes'

Manual Test

docker exec cronicle python3 /mnt/apps/docker/infrastructure/cronicle/scripts/memory-remediation.py --dry-run

Viewing Reports

Location

Reports: /mnt/storage/remediation/reports/memory-remediation-*.json
Tokens: /mnt/storage/remediation/reports/tokens/*.json

View Latest Report

ls -t /mnt/storage/remediation/reports/*.json | head -1 | xargs cat | python3 -m json.tool
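
The same lookup can be sketched in Python, which is handy when the report needs further processing. The function name latest_report is hypothetical, and picking the newest file by modification time is an assumption; the structure of the report JSON is not described in this guide, so the sketch returns it as parsed.

```python
import json
from pathlib import Path


def latest_report(reports_dir):
    """Return the parsed contents of the newest report, or None if there is none."""
    candidates = list(Path(reports_dir).glob("memory-remediation-*.json"))
    if not candidates:
        return None
    newest = max(candidates, key=lambda path: path.stat().st_mtime)
    with newest.open() as handle:
        return json.load(handle)
```

For example, latest_report("/mnt/storage/remediation/reports") reads the most recent report. The glob is not recursive, so files under the tokens/ subdirectory are not considered.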

Automatic Cleanup

Reports older than 30 days are automatically deleted
Cleanup runs at the start of each Cronicle cycle
Audit log is never automatically deleted
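
The retention rule can be sketched as a sweep over the reports directory. This is an illustration under assumptions: the function name cleanup_reports is hypothetical, and measuring age by file modification time is a guess at how the job decides what is older than 30 days.

```python
import time
from pathlib import Path


def cleanup_reports(reports_dir, max_age_days=30, now=None):
    """Delete report files older than max_age_days and return their names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    removed = []
    for path in Path(reports_dir).glob("memory-remediation-*.json"):
        if path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Only files matching the report pattern are touched, so an audit log stored elsewhere is never affected by this sweep.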

FAQ

Q: How often does the system check?
A: Every 5 minutes (configurable in Cronicle).

Q: What if I don't approve a recommendation?
A: After 24h the token expires. If the issue persists, a new recommendation is generated.

Q: Can I set a different memory limit than recommended?
A: Not through the UI. Reject the recommendation and manually edit the compose file.

Q: What does risk score mean?
A: 1-3 = low risk, 4-6 = medium, 7-10 = high. Higher scores indicate more uncertainty or larger changes.

Q: Where is the REMEDIATION_SECRET?
A: In /mnt/apps/docker/infrastructure/remediation-approval/.env and in the Cronicle job's environment. The two values must match.

Quick Reference

Resource	Location
Approval UI	https://remediation.haiven.site
External Access	https://remediation.haiven.site
Cronicle Scheduler	https://scheduler.haiven.site
Audit Log	`/mnt/storage/remediation-approval/data/audit.json`
Reports	`/mnt/storage/remediation/reports/`
Service Logs	`docker logs -f remediation-approval`

Generated by haiven-service-onboarding plugin