
Building AI Agents That Actually Work: Lessons from Shipping Resumine

The Demo Problem

Every week, someone posts a video of an AI agent doing something remarkable. Booking flights. Writing code. Filling out forms. The demos are stunning.

Then you try to ship one to real users, and you discover that the gap between "works in a demo" and "works in production" is roughly the same as the gap between a paper airplane and a Boeing 787. Both fly. One of them you'd trust with your career.

I'm building Resumine — a system of autonomous AI agents that discover job opportunities, filter them by relevance, personalize applications, and submit them 24/7 on behalf of job seekers. It's been one of the most humbling engineering challenges of my career. Here's what I've learned.

What Resumine Actually Does

Resumine operates as a multi-stage pipeline, with each stage handled by a specialized agent:

  1. Discovery Agent: Crawls 50+ job boards, aggregating fresh postings that match a candidate's profile using semantic search — not keyword matching.
  2. Scoring Agent: Evaluates each opportunity against the candidate's experience, skills, and stated preferences. Outputs a confidence score with reasoning.
  3. Personalization Agent: Tailors the resume and generates a cover letter for each high-scoring opportunity. This isn't template filling — it restructures emphasis, reorders experience, and highlights relevant projects.
  4. Application Agent: Navigates application forms, fills fields, uploads documents, and submits. This is the hardest part by far.
  5. Monitoring Agent: Tracks application status, follows up when appropriate, and reports results back to the candidate.

Each agent operates autonomously but within strict guardrails. The system processes hundreds of applications per day across all users.
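The Discovery Agent's semantic matching boils down to comparing embeddings rather than keywords. Here's a minimal sketch of the ranking step, assuming job postings and the candidate profile have already been embedded by some model (the `rank_jobs` helper and the plain-list vectors are illustrative, not Resumine's actual code):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_jobs(profile_vec: list[float],
              jobs: list[tuple[str, list[float]]],
              top_k: int = 5) -> list[tuple[str, float]]:
    """Rank (job_id, embedding) pairs by similarity to the profile.
    Semantic search: a posting can rank highly with zero keyword
    overlap, as long as its embedding sits near the profile's."""
    scored = [(job_id, cosine(profile_vec, vec)) for job_id, vec in jobs]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]
```

In practice the embeddings would come from an embedding model and the candidates from a vector index, but the ranking logic is the same.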

The Architecture That Survived Production

Early on, I tried the approach that every tutorial suggests: give the LLM a set of tools and let it figure out the workflow. This works beautifully in demos and catastrophically in production.

The architecture that actually survived is what I call deterministic scaffolding with LLM decision points:

pipeline.py
class ApplicationPipeline:
    """
    Deterministic workflow with LLM decision points.
    The code controls HOW. The model controls WHAT.
    """
    def run(self, candidate: Candidate, job: Job) -> Result:
        # Step 1: Score relevance (LLM decision)
        score = self.scoring_agent.evaluate(candidate, job)
        if score.confidence < 0.8:
            return Result(status="skipped", reason=score.reasoning)

        # Step 2: Personalize materials (LLM generation)
        materials = self.personalization_agent.tailor(
            candidate, job, score.key_matches
        )

        # Step 3: Validate output (deterministic + LLM check)
        if not self.validate_materials(candidate, materials):
            return Result(status="validation_failed")

        # Step 4: Submit application (deterministic execution)
        try:
            submission = self.application_agent.submit(job, materials)
            return Result(status="submitted", details=submission)
        except ApplicationError as e:
            return self.handle_failure(e, candidate, job, materials)

The code handles the how — validation, error handling, retries, state management. The LLM handles the what — deciding which jobs are relevant, how to personalize content, what to write in a cover letter.

This separation is critical. LLMs are unreliable executors but excellent decision-makers. Your architecture should reflect that.

The Reliability Ceiling

Here's an uncomfortable truth about AI agents in 2026: most architectures plateau around 85–90% task completion on non-trivial workflows. That means roughly 1 in 10 attempted tasks fails.

For Resumine, the failure modes break down like this:

  • ~5% — Form navigation failures. Non-standard application forms with unusual field types, CAPTCHAs, or multi-page flows that break the agent's expectations.
  • ~3% — Personalization hallucinations. The LLM occasionally invents experience the candidate doesn't have or misattributes skills. This is the most dangerous failure because it looks correct.
  • ~2% — Infrastructure failures. Timeouts, rate limits, site changes, blocked IPs.

The most dangerous failures are the ones that appear successful. A confidently submitted application with fabricated experience is worse than a visible error. This is why validation at every stage isn't optional — it's the entire game.

Memory Management Is the Hardest Problem

An agent processing 50 applications in a session accumulates enormous context. Each job posting is 500–1,000 tokens. Each application generates another 500–1,000 tokens of state. Naively dumping everything into the context window hits token limits within minutes and degrades quality well before that.

The solution I landed on uses three memory tiers:

Working Memory (in-context window): Only the current job, the current application state, and the candidate's core profile. ~2,000 tokens max.

Session Memory (structured scratchpad): A compressed log of what's been done in this session — which jobs were scored, which were applied to, which failed. Injected as a brief summary when needed.

Long-term Memory (database): Full history of all applications, outcomes, and learned preferences. Queried selectively — for example, "don't apply to companies that previously rejected this candidate."

Treat memory like a cache hierarchy, not a log file. Hot data in context, warm data summarized, cold data in storage.
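The three tiers above can be sketched as a single manager object. This is a simplified illustration, not Resumine's implementation: `estimate_tokens` is a crude chars-divided-by-four stand-in, and the cold tier is a plain dict standing in for a real database.

```python
class ThreeTierMemory:
    """Cache-hierarchy memory: hot in context, warm summarized, cold in storage."""

    WORKING_BUDGET = 2000  # token cap for in-context data, per the article

    def __init__(self, long_term_store: dict):
        self.long_term = long_term_store      # cold tier: full history
        self.session_events: list[str] = []   # warm tier: scratchpad

    @staticmethod
    def estimate_tokens(text: str) -> int:
        # Crude stand-in for a real tokenizer.
        return len(text) // 4

    def log_event(self, event: str) -> None:
        self.session_events.append(event)

    def session_summary(self, max_events: int = 5) -> str:
        # Warm data: compressed to the most recent events only.
        return "; ".join(self.session_events[-max_events:])

    def build_working_context(self, profile: str, job: str, state: str) -> str:
        # Hot data: only what the current step needs, under a hard budget.
        ctx = "\n".join([profile, job, state])
        if self.estimate_tokens(ctx) > self.WORKING_BUDGET:
            raise ValueError("working memory over budget; summarize inputs first")
        return ctx

    def recall(self, key: str):
        # Cold data: queried selectively, never bulk-loaded into context.
        return self.long_term.get(key)
```

The key property is that nothing flows upward automatically: cold data only enters the context via an explicit, targeted `recall`.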

Graceful Degradation Over Retry Loops

Early versions of Resumine would retry failed steps with the same context. This is the cardinal sin of agent engineering. If the LLM failed to parse a form on the first try, sending it the same prompt again almost never works. You're just burning tokens and time.

Instead, I implemented cascading fallback strategies:

fallback.py
class FallbackChain:
    async def execute_with_fallback(self, job, materials):
        strategies = [
            self.full_automation,      # Try complete AI submission
            self.simplified_fields,    # Reduce to required fields only
            self.direct_email,         # Email materials to recruiter
            self.flag_for_review,      # Queue for human review
        ]

        for i, strategy in enumerate(strategies):
            try:
                result = await strategy(job, materials)
                if result.success:
                    log.info(f"Succeeded with strategy {i}: {strategy.__name__}")
                    return result
            except Exception as e:
                log.warning(f"Strategy {strategy.__name__} failed: {e}")
                continue

        return Result(
            status="skipped",
            reason="All strategies exhausted; skipped with explanation to candidate"
        )

The principle: each fallback should be simpler than the last. The final fallback is always a human — either flagging the application for manual review or skipping it entirely with a clear explanation to the candidate.

Validation Is Not Optional

Every output from every LLM call goes through at least two validation layers:

Schema validation: Does the output match the expected structure? This catches ~60% of malformed outputs. A resume personalization that returns JSON with missing fields gets caught immediately.

Semantic validation: Does the output make sense? I use a smaller, cheaper model to verify the primary model's work. "Does this cover letter reference experience that actually exists in the candidate's profile?" This catches the hallucination problem.

The cost of running a validation model on every output is roughly 10% of the primary inference cost. The cost of submitting a hallucinated application is a candidate's reputation. The math is obvious.
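The two layers described above can be sketched like this. The schema layer is real JSON parsing plus a field check; the semantic layer here is a deterministic proxy (in Resumine it's a cheaper verifier model), and the field names are illustrative:

```python
import json

REQUIRED_FIELDS = {"summary", "experience", "skills"}  # illustrative schema

def schema_validate(raw: str):
    """Layer 1: structural check. Catches malformed JSON and missing
    fields before anything semantic runs; returns None on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not REQUIRED_FIELDS <= data.keys():
        return None
    return data

def semantic_validate(materials: dict, profile_skills: set) -> bool:
    """Layer 2 stand-in. In production this would be a small verifier
    model asking "does this reference real experience?"; here, a
    deterministic proxy: reject output claiming any skill absent from
    the candidate's actual profile."""
    return set(materials["skills"]) <= profile_skills
```

A failed layer-1 check triggers a regeneration; a failed layer-2 check is treated as a hallucination and the output is discarded rather than repaired.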

The Ethics of Scale

Resumine raises a question I think about constantly: what happens when applying to jobs has zero friction?

Historically, the effort of writing a cover letter and filling out a form served as a signal of genuine interest. When an AI agent eliminates that friction, every application becomes cheap. This could flood recruiters with applications, degrading the signal for everyone — including the candidates using the tool.

My approach is deliberate constraint. Resumine doesn't spray applications indiscriminately. The scoring agent is intentionally strict — it only passes through opportunities where the confidence score exceeds 0.8. Quality over quantity. Each application should be one the candidate would be proud to send manually.

This is a design choice, not a technical limitation. The system could apply to 1,000 jobs a day. But it shouldn't.
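That constraint can live directly in code as a gate every application must pass. A minimal sketch: the 0.8 confidence floor comes from the article, but `DAILY_CAP` is a hypothetical number I've chosen for illustration, not Resumine's actual limit.

```python
class ApplicationGate:
    """Deliberate constraint as an engineering artifact, not a policy doc."""

    CONFIDENCE_FLOOR = 0.8  # per the article's scoring threshold
    DAILY_CAP = 25          # hypothetical per-candidate daily limit

    def __init__(self):
        self.sent_today = 0

    def should_apply(self, confidence: float) -> bool:
        if confidence < self.CONFIDENCE_FLOOR:
            return False   # quality over quantity
        if self.sent_today >= self.DAILY_CAP:
            return False   # deliberate throttle, not a technical limit
        self.sent_today += 1
        return True
```

Because the cap is enforced at the gate, raising it requires a conscious code change rather than a quiet drift in behavior.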

What I've Learned

Building Resumine has reshaped how I think about AI engineering. Here are the lessons that generalize beyond job applications:

  1. Architecture > Model. The choice of GPT-4 vs. Claude vs. Gemini matters far less than the structure around it. A mediocre model in a well-designed pipeline outperforms a frontier model in a naive loop.
  2. Start with one agent. Multi-agent frameworks are seductive but premature. I started with a monolithic pipeline and only split into specialized agents when specific bottlenecks demanded it.
  3. Measure completion rate, not capability. Your agent can do impressive things in isolation. The only metric that matters is end-to-end task completion across 1,000 diverse inputs.
  4. Build for the failure case. The happy path is easy. The 10% of cases where things go wrong determine whether your system is a product or a toy.
  5. Ethics aren't an afterthought. If your agent can do something at scale that would be harmful at scale, the architecture must include constraints. "We could but we shouldn't" is an engineering requirement, not a philosophical musing.

Resumine is still in active development. The 85% completion rate needs to reach 95% before I'd consider it truly production-ready. But the lessons from building it have made me a fundamentally better engineer — not just at AI, but at building systems that must work reliably in an unreliable world.

The gap between a demo and a product isn't features. It's the thousand ways things can go wrong, and your plan for each one.