
Over the past few weeks, we’ve been talking a lot about LLMs, cost, and real-world practicality.


From the hidden costs of foundation model providers to the diminishing returns of scaling large language models, the message has been clear: raw inference power is only one dimension to consider. And it certainly isn’t cheap.


But there’s good news. Thoughtful design choices can dramatically reduce the cost of deploying GenAI at scale.


In this three-part series, we’ll explore practical ways to build GenAI agents that deliver results without draining your budget, including:


  1. Matching the right model to the right task

  2. Training smaller models with data generated by larger ones

  3. Using model caching to avoid redundant inference



Part 1: Right Model, Right Task


If you’re building anything agentic — copilots, assistants, planners, explainers — one of the best ways to control costs is by not using your most powerful model for everything.


Let’s walk through a real example.


At Fuse, we’re building an agentic data analyst that allows users to query their data warehouse using natural language. The system needs to understand vague or ambiguous user questions and turn them into a set of SQL queries that can be executed in parallel.


Take a question like:

“How’s the business doing out west?”

This is vague. There’s no metric. No time frame. The system needs to:


  1. Identify missing context using the metadata in the warehouse (e.g. what "west" might mean geographically)

  2. Break the question into several precise sub-questions (e.g. sales, margin, customer churn)

  3. Generate optimized SQL for each sub-question

  4. Run the queries in parallel

  5. Synthesize the results into a single narrative that answers the original prompt


To do this, we use a tool-calling LLM to orchestrate the process. That orchestration step requires advanced reasoning, so we use Gemini Pro.


But synthesis? That’s a different job.


Once we have the answers, the task becomes: "write a coherent summary that references these values and aligns to the original intent."


That’s a great use case for a smaller model. In our case, we use Gemini Flash, which:


  • Has much lower latency

  • Costs significantly less per 1M tokens

  • Performs very well on summarization and synthesis


So we get the best of both worlds:


  • High reasoning accuracy when it matters most (Gemini Pro)

  • Faster, cheaper inference when the task is well-defined (Gemini Flash)
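In code, the routing itself is simple. Here’s a minimal sketch of the pattern, assuming the google-generativeai Python SDK; the model names, prompts, and the run_queries helper are illustrative, not our production configuration.

```python
# Minimal sketch: a larger model for the ambiguous planning step,
# a smaller model for the well-defined synthesis step.
# Model names and the run_queries helper are illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

planner = genai.GenerativeModel("gemini-1.5-pro")        # advanced reasoning
synthesizer = genai.GenerativeModel("gemini-1.5-flash")  # fast, cheap synthesis

def answer(question: str, warehouse_metadata: str) -> str:
    # Steps 1-3: the larger model fills in missing context and breaks the
    # vague question into precise, SQL-answerable sub-questions.
    plan = planner.generate_content(
        f"Warehouse metadata:\n{warehouse_metadata}\n\n"
        f"Break this question into precise sub-questions with SQL for each:\n{question}"
    ).text

    # Step 4: execute the generated queries in parallel (details omitted).
    results = run_queries(plan)  # hypothetical helper

    # Step 5: the smaller model synthesizes the results into one narrative.
    return synthesizer.generate_content(
        f"Original question: {question}\n"
        f"Query results: {results}\n"
        "Write a coherent summary that references these values and answers the question."
    ).text
```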


This pattern shows up everywhere in agent design. Consider:


  1. Using a large model to plan a task, and a smaller one to execute it

  2. Validating inputs with a smaller model before escalating to a larger one (see the sketch after this list)

  3. Reserving premium models for ambiguous or high-impact flows only
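The second pattern, for instance, can be a few lines of routing logic. Here’s a minimal sketch, again assuming the google-generativeai SDK, with illustrative model names and prompts:

```python
# Minimal sketch: triage with a small, cheap model and only escalate
# to the larger model when the request actually needs deep reasoning.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

small = genai.GenerativeModel("gemini-1.5-flash")
large = genai.GenerativeModel("gemini-1.5-pro")

def handle(request: str) -> str:
    # Cheap first pass: classify the request before spending premium tokens.
    triage = small.generate_content(
        "Reply with exactly SIMPLE or COMPLEX. Does the following request "
        f"require multi-step reasoning over ambiguous context?\n\n{request}"
    ).text.strip().upper()

    # Escalate only the requests that need it.
    model = large if "COMPLEX" in triage else small
    return model.generate_content(request).text
```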


In short?


Don’t use a sledgehammer when a screwdriver will do.


Thoughtful model selection is one of the fastest ways to cut GenAI costs without degrading your user experience or accuracy.


In the next post, we’ll look at how to use large models to train smaller ones, so you can build lean, high-performing agents without paying premium inference prices forever.





LLMs are expensive. You probably know that.


But you might be paying a lot more for them than you realize.


Not just once. Not just through OpenAI or Anthropic.


You’re paying for them five times or more, through half the SaaS tools in your stack.


And the bill? It’s growing.



How We Got Here


It’s never been easier to spin up a GenAI feature.


Vendors wrap OpenAI, Anthropic, or Gemini in a slick UI… and call it a productivity tool.


Meanwhile, every SaaS platform is shipping AI features at speed. Features that you didn't ask for and likely don't use or need:

  • Slack messages that summarize themselves

  • CRMs that write outreach emails

  • Docs that auto-generate bullet points


And almost all of them are just passing your data to a foundation model API under the hood.


You don’t control the model. You don’t know how much of your spend is tied to it. And you’re probably paying for the same model multiple times.



Example Stack: Paying Twice (or More)


Let’s make it concrete.


You’re a mid-sized company with a modern SaaS stack:

  • Notion AI for meeting summaries

  • Slack AI for chat and search

  • Glean AI for knowledge base answers

  • HubSpot AI assistant for marketing

  • Grammarly GO for writing help


All five tools claim AI benefits. But guess what?


They all rely on OpenAI or Anthropic under the hood.


So now you're paying marked-up inference costs five times over, with no visibility and no control.



Where the Money’s Going


VCs have taken notice:


The money in AI is not in startups — it's flowing to NVIDIA, OpenAI, and the big cloud providers.

Anthropic just raised funding at an $18.4B valuation.


These are not public utilities. They are some of the most expensive suppliers in your digital supply chain.



This Isn’t Sustainable


Three compounding problems are brewing:


  • Redundant Spend: You're paying for the same inference multiple times, via SaaS middlemen.

  • Black Box Pricing: Most vendors don't expose LLM usage, model choice, or cost drivers.

  • Vendor Lock-in: You're stuck with the models your tools chose — not the ones that suit your needs.


Meanwhile, the costs of compute and energy are rising. At scale, this becomes a real business risk.



What Can You Do?


As a buyer:

  • Ask vendors which foundation models they use

  • Push for LLM usage reporting and controls

  • Audit your stack for redundant model access


As a builder:

  • Explore Small Language Models (SLMs) for edge or embedded tasks

  • Partner directly with model providers to cut out the middle layer

  • Design your own memory and retrieval layer to optimize context use


Final Thought


The model you choose matters. But how many times you're paying for it might matter even more.


Better architecture, smarter design, and conscious procurement can reduce your exposure.

And remember: the model isn’t your only cost.


Design decisions become budget decisions.


It’s time to treat them that way.





LLMs keep getting bigger, and so do the problems.


This year alone, companies will pour over $40 billion into AI infrastructure. McKinsey estimates the total cost of compute scaling could reach $1 trillion by 2030, and $3.3 trillion if everyone tries to build their own stack.


That’s not just unsustainable. It’s unnecessary.


Because while large models are impressive, they’re not always practical. And for most real-world use cases, they might not even be the right tool for the job.



The Limits of Scale


As LLMs scale into the hundreds of billions (or even trillions) of parameters, several hard problems emerge:


Soaring Cost of Compute

Inference and fine-tuning on large models require massive GPU clusters and energy draw, making real-time deployment cost-prohibitive for most organizations.


Latency

Big models are slow. Even with quantization and batching, response time often lags, which makes them less useful in interactive or real-time scenarios.


Data Center Bottlenecks

We’re hitting physical and economic limits in the availability of GPUs, networking, and power. Scaling from here requires massive investment, often with diminishing returns.


Accessibility Gaps

Startups, researchers, and smaller teams get priced out. Innovation becomes concentrated in the hands of those with the deepest pockets.


Environmental Impact

More tokens, more watts. Large-scale training and inference contribute significantly to energy consumption, and those costs will only grow with demand.


If this sounds unsustainable, that’s because it is.


So what’s the alternative?



Smaller Models. Smarter Use.


Small language models (SLMs), typically under 10B parameters, are having a moment. And for good reason.


Benefits of SLMs:


  • Lower cost — Train and run on consumer GPUs or small clusters

  • Faster inference — Suitable for real-time and edge applications

  • Easier to deploy — More portable, less dependent on proprietary infrastructure

  • Fine-tuneable — Adapt to specific domains without retraining giants

  • Lower energy footprint — A meaningful step toward greener AI


And when paired with the right architecture and high-quality data, SLMs can punch well above their weight.
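To make the “easier to deploy” point concrete, here’s a minimal sketch of running a small open model locally, assuming the Hugging Face transformers library; the model id is illustrative, and any similarly sized open model would work:

```python
# Minimal sketch: a ~2B-parameter instruction-tuned model running locally.
# The model id is illustrative; swap in any small open model you prefer.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",  # small enough for a single consumer GPU
    device_map="auto",             # use a GPU if one is available, else CPU
)

prompt = "Summarize in one sentence: Q3 revenue in the western region grew 12% while churn fell to 3%."
out = generator(prompt, max_new_tokens=60)
print(out[0]["generated_text"])
```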



Small Models Are Making Big Moves


The belief that “bigger is better” in AI is being challenged. In 2025, we’ve seen remarkable progress in compact, efficient SLMs: models that are not only cheaper to run but surprisingly powerful.


Here are some of the most promising developments:


  • Microsoft Phi-4 Series

    The Phi family continues to impress, especially the new Phi-4-Mini-Flash, which delivers up to 10× faster inference and excels at reasoning, all in a footprint small enough for local or edge deployment.

  • Google Gemma 3n

    The family has expanded into multiple sizes (1B–27B), with Gemma 3n optimized specifically for devices like laptops and tablets, enabling safer, fine-tunable models at the edge.

  • OLMo-2 (AI2)

    One of the most transparent models to date, fully open, with shared training data, logs, and evaluation tooling. The 32B variant is setting new standards in open research and reproducibility.

  • Mistral Small & Magistral

    These European-built models offer 128K-token context windows, reasoning capabilities, and a strong open-source roadmap, proving that high performance doesn’t have to mean high resource requirements.

  • Energy-Efficient Research

    Studies show architectural changes can reduce energy consumption of small models by up to 90% without compromising performance, making them the obvious choice for sustainable AI.



Final Thought


The age of ever-larger models is giving way to something more practical, more efficient, and more accessible.


It’s not about squeezing a trillion parameters into your stack.


It’s about building fit-for-purpose models that are:

  • Light enough to run on real infrastructure

  • Smart enough to deliver real results

  • Transparent enough to build trust

  • And efficient enough to scale sustainably


In the end, smaller doesn’t mean weaker.


It means focused.


Purposeful.


And ready for the real world.



At Fuse, we believe a great data strategy only matters if it leads to action.


If you’re ready to move from planning to execution — and build solutions your team will actually use — let’s talk.

