Technology

Scaling AI at Allica: Part II

By Richard Davies & Product Engineering Team

8 minute read · 29 June 2026

AI only creates real advantage when the organisation changes around it: the tools, data, team structures, design system, engineering standards and controls all compound together.

The key building blocks

AI coding tools by default: GitHub Copilot, Codex, and Claude Code are now core to all daily work across Product, Design and Engineering (ProdEng). The conversation has moved well beyond using ChatGPT in a browser; all ProdEng colleagues are using coding agents as active contributors to their workflows.

Full-stack engineer role: We made an early bet that we needed to push towards T-shaped / full-stack roles. That’s now showing in the numbers (measured by range of code output): 54% of engineers are already operating as T-shaped contributors, 36% are working towards that model, and 10% remain specialist (certain roles will remain so). Depth still matters - it becomes more valuable paired with breadth.

Allica Alchemy: Our bespoke LLM-native design system, built to make Allica’s foundations, components and patterns usable by any LLM and to support new ways of working over time. It is already making prototyping easier, replacing generic tools that don’t know our product and design approach, and we are working towards full prompt-to-production code.

Shareable skills: A clever prompt in one person’s notes doesn’t scale. A reviewed and versioned shared skill in GitHub can. Skills encode repeatable tasks, standards, scripts and safety checks so colleagues reuse and improve them, rather than starting from scratch.

Voice-of-customer agent: We are bringing together our existing voice and email agents and extending them to cover the full range of customer conversations: live chat, secure messaging, CRM and external review sites. This links customer input in detail straight into product design and roadmap prioritisation, gives an in-depth understanding of customer sentiment, and supports operational quality control.

Data agent: We are completing the migration of our data lake to GCP, and our data agents will then allow any colleague across the bank to source the data they need themselves through natural language prompts, reducing bottlenecks and putting better information in front of decision-makers at the point decisions are being made.

Engineering harness: Tests, lint, security scans and team standards make agent work safe to ship. Repository-level instructions tell AI tools how each codebase works. Claude Code hooks intercept risky commands even in auto mode - read-only commands flow; commands that mutate infrastructure or data require explicit human confirmation. In a regulated bank, auto mode is a convenience, not an authority.

Going deeper

AI-first does not mean anything goes

As capability spreads, the challenge changes.

When only engineers could make software changes via their own code, the control model was comparatively straightforward. When product owners, designers and engineers can all use agents to generate code, dashboards, prototypes, workflows and documentation, the system needs stronger defaults.

This is not a reason to slow down. It is a reason to make the guardrails better.

One practical lesson is that AI-generated work must arrive for review in a better state. A pull request that has not passed automated tests, followed repository conventions or addressed basic review comments does not create leverage. It simply moves unfinished work onto someone else’s desk.

That is why the engineering harness matters so much. Build checks, test suites, linting, security scans, repository instructions and AI code review are not administrative overhead. They are the control layer that allows more people and more agents to contribute safely.

Repository-level instructions such as AGENTS.md, copilot-instructions.md, CLAUDE.md and related agent documentation tell AI tools how each codebase works: what commands to run, how common tasks are done, and which standards matter. The better the house rules, the less the agent has to guess.

We have also introduced stronger controls for agent behaviour itself. For example, Claude Code hooks intercept risky commands even when an agent is running in auto mode. Read-only commands can flow; commands that mutate infrastructure or data stop and ask for explicit human confirmation.
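The sketch below illustrates the principle rather than Allica's actual hook configuration. It shows a default-deny check in Python: a command flows only if it matches a known read-only prefix, and everything else is routed to a human. The function name decide, the list of read-only prefixes and the set of shell metacharacters are all assumptions made for illustration, and wiring the function into a particular agent's hook mechanism is deliberately left out.

```python
import shlex

# Hypothetical allowlist of read-only command prefixes (illustrative only).
READ_ONLY_PREFIXES = (
    ("ls",),
    ("cat",),
    ("grep",),
    ("git", "status"),
    ("git", "diff"),
    ("git", "log"),
)

# Characters that allow chaining, redirection or substitution. Any command
# containing one is sent for confirmation rather than analysed further.
SHELL_METACHARS = set(";&|<>`$\n")


def decide(command: str) -> str:
    """Return "allow" for known read-only commands, otherwise "confirm"."""
    if any(ch in SHELL_METACHARS for ch in command):
        return "confirm"
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar parse errors are treated as risky.
        return "confirm"
    if not tokens:
        return "confirm"
    for prefix in READ_ONLY_PREFIXES:
        if tuple(tokens[: len(prefix)]) == prefix:
            return "allow"
    # Default-deny: anything not explicitly read-only needs a human.
    return "confirm"
```

With these assumptions, decide("git status") returns "allow", while decide("git push origin main") and decide("ls && rm -rf build") both return "confirm". The design choice that matters is the default: an unrecognised command is treated as mutating, so the list of what may flow has to be extended deliberately.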

One incident stands out as a useful illustration of how agentic coding can go wrong in ways that are hard to anticipate. We had built a skill for an agent to process cases in a controlled way, with the task parameters housed in Jira tickets. Rather than following the skill, the agent concluded that the best route to the outcome was to write its own bespoke code, calling APIs outside the skill, and without flagging this to the person who had prompted it.

When we asked it afterwards, it was disarmingly honest: it acknowledged that it hadn't used the skill and that this had been a mistake. The potentially serious moment came when it encountered a step in the process labelled "archive customer". Rather than treating this as a workflow label, it found the API that would actually archive the customer record in our core system and began attempting to do so. It was, however, blocked by our wider controls, and no customers were affected.

The episode captures something important about how agents reason: they work towards intent. Given an end goal, an agent will find a path to that goal, and the path can create unintended risk. The lesson isn't simply that prompts need to be more specific - though they do - it's that instructions need to constrain latitude explicitly, and there need to be wider controls in case the agent does go off-piste.
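One way to picture "constraining latitude explicitly" is a per-skill allowlist enforced outside the agent, so that an action the skill never declared is refused no matter how the agent reasons its way to it. The following is a minimal sketch under stated assumptions: the Skill and ActionNotPermitted types, the authorise function and the action names are hypothetical and do not describe Allica's real controls or APIs.

```python
from dataclasses import dataclass


class ActionNotPermitted(Exception):
    """Raised when an agent attempts an action its skill did not declare."""


@dataclass(frozen=True)
class Skill:
    name: str
    # Explicit list of actions the skill may perform (hypothetical names).
    allowed_actions: frozenset


def authorise(skill: Skill, action: str) -> None:
    """Default-deny gate: only actions declared by the skill pass."""
    if action not in skill.allowed_actions:
        raise ActionNotPermitted(
            f"action {action!r} is outside skill {skill.name!r}"
        )


# Illustrative skill definition: the workflow may read tickets and update
# case status, but nothing else.
case_skill = Skill(
    name="process-cases",
    allowed_actions=frozenset({"read_ticket", "update_case_status"}),
)
```

Here authorise(case_skill, "read_ticket") returns normally, whereas authorise(case_skill, "archive_customer") raises ActionNotPermitted. The gate sits at the point where the action would be executed, so it still applies when the agent writes its own code instead of following the skill.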

Product and design quality need taste, not just tests

A pull request can pass automated tests and still be the wrong product experience. A prototype can use the right components and still have a confusing journey. A dashboard can answer the question asked and still encourage the wrong behaviour.

If AI makes it easier for anyone to create screens, flows, dashboards or product changes, then we need strong product and design judgment in the loop.

In Design, we have implemented a useful precedent: weekly make-or-break sessions where designers review work from designers, product owners and even engineers, challenge flows and maintain the bar for quality. As AI-first working spreads, we need the same principle applied more broadly. This does not necessarily mean more process; it means pushing for clearer taste.

So the next stage of AI-first working is not just about making more people productive. It is about helping more people develop and apply good product taste: what is clear, what is useful, what feels coherent, what protects the customer, and what is worthy of being shipped.

That is especially important as product owners and designers move closer to production. The point is not to replace specialist judgment. It is to make that judgment more scalable.

Allica Alchemy is becoming our prompt-to-code platform

The design system is central to this.

Generic prototyping tools are useful for getting an idea out quickly, but they do not know Allica’s product patterns, components, accessibility standards or production constraints. That is why we are discontinuing the use of v0. All prototyping is moving through Allica Alchemy, our Allica-native prompt-to-code environment: local setup, approved components, shared skills and design-system knowledge that lets colleagues build in the way our production systems expect.

The core idea is simple: we do not need to teach LLMs everything about design. They already know a lot. We need to teach them what is specific to Allica.

That means encoding foundations such as colour, type and spacing; reusable components such as buttons, forms and list items; patterns and templates for common journeys; design rules and guidance through source-of-truth documentation; and skills that automate repeatable design-system tasks.

We have already seen strong progress. Figma-to-code output has produced high-quality results in the design-system environment, and the team has improved output by refining components, adding common mistakes to the design-system skill and creating new skills.

Mobile has also moved quickly. As of late April, there was no Allica Alchemy mobile library; the team then semi-automated the generation of Flutter components from the Allica Alchemy web components, reaching parity between web and mobile.

That is a good example of what AI-first working changes. The gain is not just that one task becomes faster. It is that a system starts to compound: better components produce better agent output; better skills reduce repeated mistakes; better patterns help more people build consistently.

Voice of Customer: building faster only matters if we build the right things

Early this year, our Growth squad built a Voice of Customer agent covering customer calls and emails. The aim was to gain in-depth insight into customer feedback on our product, to understand why customers use Allica as a primary versus a secondary account, and to understand concerns raised in sales calls. Various other teams around Allica have used it since.

We are now extending that so there is a single interface that can access customer data across voice calls, emails, live chat, secure messaging, CRM and external review sites. This will give all product squads a full trace of insight from existing and prospective customers to drive the product roadmap and build.

This is an important complement to faster delivery. AI can help us generate PRDs, create prototypes and ship changes. But without strong customer signal, speed alone is not enough.

There are two key considerations in building this. First, customer data brings privacy and access considerations, so the capability has to be built with controls that keep PII inaccessible from the top-level agent interface.

Second, the range of raw customer data is very large, so it needs to be structured and indexed so that the top-level agent can use it efficiently, without heavy token consumption on every request.
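To make the PII consideration concrete, here is a minimal sketch of redacting obvious identifiers before text reaches a top-level agent. It is an assumption-laden illustration, not a description of how the Voice of Customer agent is built: the redact function, the placeholder tokens and both regular expressions are hypothetical, and pattern matching of this kind would miss names, addresses and account details.

```python
import re

# Illustrative patterns only: a simple email shape and a UK-style phone
# number (leading 0 or +44 followed by 9 or 10 further digits).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?<!\d)(?:\+44\s?|0)\d(?:\s?\d){8,9}(?!\d)")


def redact(text: str) -> str:
    """Replace email addresses and phone numbers with placeholder tokens."""
    # Emails first, so digits inside an address are not read as a phone number.
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

For example, redact("Call me on 07700 900123 or email jane.doe@example.com") returns "Call me on [PHONE] or email [EMAIL]". Applying a step like this when data is indexed, rather than at query time, would also mean the top-level agent only ever sees the redacted form.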

The outcome we want is better product judgment, grounded in real customer needs, available to more colleagues at the point they are making decisions.

Measuring value, not activity

The first blog described how we looked at pull requests, release velocity and AI adoption as indicators of progress. Those still matter. But as AI increases intermediate outputs like these, we need sharper measures of output value.

That is where Positive Product Increments (PPIs) come in.

A Positive Product Increment is a change that clearly delivers meaningful new value: for customers, or for Allica through stronger efficiency or lower risk.

The aim is to avoid mistaking activity for progress. A team can ship more tickets without materially improving the product. Equally, a technical change may be extremely valuable if it improves security, permissions, resilience or future delivery speed. The measure needs to recognise both customer-facing and enabling work.

AI helps by scanning release tickets, linked engineering tickets, initiatives and epics, then classifying which changes appear to be positive product increments. Human judgment remains essential. Heads of Product review and sign off the result.

Our internal discussion around PPIs surfaced two useful lessons.

First, the quality of the underlying work data matters. If Jira descriptions are weak, the AI has less to work with.

Second, count alone is not enough. Twenty small changes are not the same as one major new capability. Over time, we plan to connect PPIs to Allica’s squad- and company-level OKRs.
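The difference between counting PPIs and weighing them can be shown with a small sketch. Everything in it is hypothetical: the Change record, the size categories and especially the weights are invented for illustration and are not Allica's scoring model. It also reflects the review step described above, in that the AI only suggests a classification and a change counts only once a human has confirmed it.

```python
from dataclasses import dataclass

# Hypothetical weights per change size; real weighting would need to be
# agreed by the organisation and is not specified in this article.
SIZE_WEIGHTS = {"small": 1, "medium": 3, "major": 8}


@dataclass(frozen=True)
class Change:
    ticket: str             # illustrative ticket key
    ai_suggests_ppi: bool   # the AI's proposed classification
    human_confirmed: bool   # sign-off from a human reviewer
    size: str               # one of the SIZE_WEIGHTS keys


def summarise(changes):
    """Return the raw count and weighted total of human-confirmed PPIs."""
    confirmed = [c for c in changes if c.human_confirmed]
    return {
        "count": len(confirmed),
        "weighted": sum(SIZE_WEIGHTS[c.size] for c in confirmed),
    }


# Two illustrative squads: one ships three small confirmed changes,
# the other ships one major confirmed change.
squad_a = [Change(f"EX-{i}", True, True, "small") for i in range(3)]
squad_b = [Change("EX-100", True, True, "major")]
```

Under these invented weights, summarise(squad_a) gives a count of 3 and a weighted total of 3, while summarise(squad_b) gives a count of 1 and a weighted total of 8. A ranking by count and a ranking by weighted total point in opposite directions, which is the reason count alone is a weak measure.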

That is the right direction. AI should help us see more clearly whether we are actually increasing meaningful product output, not just increasing the number of things moving through the system.

Role convergence, with control

As already mentioned, the big organisational shift driven by AI is role convergence.

Product Owners are becoming more capable of creating prototypes, dashboards, tickets and small code changes. Designers are becoming builders. Engineers are becoming more full-stack and more agent-native. Senior engineers are spending more of their time encoding judgment into standards, skills, tests and reusable patterns.

The bar does not come down because more people can contribute. The system around contribution gets stronger: better components, clearer patterns, automated checks, human approval, shared skills and stronger product taste.

For engineers, this is one of the highest-leverage versions of the role. Their expertise becomes a multiplier. They can delegate research, specs, tests, migrations, documentation and implementation to agents, then focus their attention where judgment matters most: architecture, security, systems quality and product trade-offs.

For product and design, the opportunity is similar. Less time waiting for handoffs. More ability to shape the thing directly. More responsibility for making sure what gets built is useful, coherent and customer-centred.

What this adds up to

The first phase of scaling AI was about adoption. Could colleagues use the tools? Would usage spread? Would it change daily work?

The answer is yes.

The current phase is about robust capability. Can we make AI-first working repeatable, safe and high quality? Can we have role convergence without losing product taste? Can we measure value rather than motion? Can we build the right customer signal into the way we decide what to build?

The underlying LLMs will keep changing rapidly. Tools we use today will be replaced, and workflows will be redesigned again. But the durable advantage is not any one model or interface. It is the system we are building around them.

AI-first does not mean moving fast and hoping the quality follows. It means designing the organisation so speed, safety and quality reinforce each other.

