Python, Django, PostgreSQL, TypeScript, AI
Building the integrations platform and platform-wide reliability for a multi-tenant agentic security product - a SOC-investigation agent that triages alerts end-to-end.
- Agent tool surface - the integrations platform is the agent's hands. Integrated third-party security vendors (SIEM, ticketing, comms) for alert/incident ingestion, enrichment, response actions, and custom playbook hooks. Shipped 20+ integrations and 50+ security tools on this surface.
- Agent tool-use, debugged and extended - when the SOC agent drifts on a real alert (wrong tool picked, wrong params), I dig into the trace and fix it at the tool-definition, argument-schema, or prompt level. Also package new vendor actions as gRPC/MCP tools for the agent.
- Backend API & data model - owned the core Django ORM models, REST surface, and auth/permissions around vendor connection lifecycle, action invocation, and audit + retry state. Turned ambiguous customer asks into platform features within days.
- Unified integration observability - normalised API error logs that were scattered across vendor-specific paths into a single structured event shape. Built the cross-integration dashboard for p95/p99 latency, error rate, and throughput, plus failure-rate alerting and a single health view across all integrations.
- Database reliability - cut production Postgres CPU from 100% to 60% and reduced p95 API latency ~20x by fixing an incorrectly ordered composite index, found through query-plan analysis in prod; diagnosed connection-pool exhaustion under burst traffic and validated the PgBouncer migration.
- User-in-the-loop comms - built the Microsoft Teams service that drives investigation lifecycles for end users: notifications, structured input prompts, case reminders, escalations, and context gathering inline in chat.