
Over the past few weeks, we’ve been talking a lot about LLMs, cost, and real-world practicality.


From the hidden costs of foundation model providers to the diminishing returns of scaling large language models, the message has been clear: raw inference power is only one dimension to consider. And it certainly isn’t cheap.


But there’s good news. Thoughtful design choices can dramatically reduce the cost of deploying GenAI at scale.


In this three-part series, we’ll explore practical ways to build GenAI agents that deliver results without draining your budget, including:


  1. Matching the right model to the right task

  2. Training smaller models with data generated by larger ones

  3. Using model caching to avoid redundant inference



Part 1: Right Model, Right Task


If you’re building anything agentic — copilots, assistants, planners, explainers — one of the best ways to control costs is by not using your most powerful model for everything.


Let’s walk through a real example.


At Fuse, we’re building an agentic data analyst that allows users to query their data warehouse using natural language. The system needs to understand vague or ambiguous user questions and turn them into a set of SQL queries that can be executed in parallel.


Take a question like:

“How’s the business doing out west?”

This is vague. There’s no metric. No time frame. The system needs to:


  1. Identify missing context using the metadata in the warehouse (e.g. what "west" might mean geographically)

  2. Break the question into several precise sub-questions (e.g. sales, margin, customer churn)

  3. Generate optimized SQL for each sub-question

  4. Run the queries in parallel

  5. Synthesize the results into a single narrative that answers the original prompt


To do this, we use a tool-calling LLM to orchestrate the process. That orchestration step requires advanced reasoning, so we use Gemini Pro.


But synthesis? That’s a different job.


Once we have the answers, the task becomes: "write a coherent summary that references these values and aligns to the original intent."


That’s a great use case for a smaller model. In our case, we use Gemini Flash, which:


  • Has much lower latency

  • Costs significantly less per 1M tokens

  • Performs very well on summarization and synthesis


So we get the best of both worlds:


  • High reasoning accuracy when it matters most (Gemini Pro)

  • Faster, cheaper inference when the task is well-defined (Gemini Flash)
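In code, the routing itself is simple. Here’s a minimal sketch of the pattern, assuming the google-generativeai Python SDK; the model names, prompts, and the run_queries helper are illustrative, not our production configuration.

```python
# Minimal sketch: a larger model for the ambiguous planning step,
# a smaller model for the well-defined synthesis step.
# Model names and the run_queries helper are illustrative.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

planner = genai.GenerativeModel("gemini-1.5-pro")        # advanced reasoning
synthesizer = genai.GenerativeModel("gemini-1.5-flash")  # fast, cheap synthesis

def answer(question: str, warehouse_metadata: str) -> str:
    # Steps 1-3: the larger model fills in missing context and breaks the
    # vague question into precise, SQL-answerable sub-questions.
    plan = planner.generate_content(
        f"Warehouse metadata:\n{warehouse_metadata}\n\n"
        f"Break this question into precise sub-questions with SQL for each:\n{question}"
    ).text

    # Step 4: execute the generated queries in parallel (details omitted).
    results = run_queries(plan)  # hypothetical helper

    # Step 5: the smaller model synthesizes the results into one narrative.
    return synthesizer.generate_content(
        f"Original question: {question}\n"
        f"Query results: {results}\n"
        "Write a coherent summary that references these values and answers the question."
    ).text
```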


This pattern shows up everywhere in agent design. Consider:


  1. Using a large model to plan a task, and a smaller one to execute it

  2. Validating inputs with a smaller model before escalating to a larger one (see the sketch after this list)

  3. Reserving premium models for ambiguous or high-impact flows only
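The second pattern, for instance, can be a few lines of routing logic. Here’s a minimal sketch, again assuming the google-generativeai SDK, with illustrative model names and prompts:

```python
# Minimal sketch: triage with a small, cheap model and only escalate
# to the larger model when the request actually needs deep reasoning.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

small = genai.GenerativeModel("gemini-1.5-flash")
large = genai.GenerativeModel("gemini-1.5-pro")

def handle(request: str) -> str:
    # Cheap first pass: classify the request before spending premium tokens.
    triage = small.generate_content(
        "Reply with exactly SIMPLE or COMPLEX. Does the following request "
        f"require multi-step reasoning over ambiguous context?\n\n{request}"
    ).text.strip().upper()

    # Escalate only the requests that need it.
    model = large if "COMPLEX" in triage else small
    return model.generate_content(request).text
```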


In short?


Don’t use a sledgehammer when a screwdriver will do.


Thoughtful model selection is one of the fastest ways to cut GenAI costs without degrading your user experience or accuracy.


In the next post, we’ll look at how to use large models to train smaller ones, so you can build lean, high-performing agents without paying premium inference prices forever.





LLMs are expensive. You probably know that.


But you might be paying a lot more for them than you realize.


Not just once. Not just through OpenAI or Anthropic.


You’re paying for them five times or more, through half the SaaS tools in your stack.


And the bill? It’s growing.



How We Got Here


It’s never been easier to spin up a GenAI feature.


Vendors wrap OpenAI, Anthropic, or Gemini in a slick UI… and call it a productivity tool.


Meanwhile, every SaaS platform is shipping AI features at speed. Features that you didn't ask for and likely don't use or need:

  • Slack messages that summarize themselves

  • CRMs that write outreach emails

  • Docs that auto-generate bullet points


And almost all of them are just passing your data to a foundation model API under the hood.


You don’t control the model. You don’t know how much of your spend is tied to it. And you’re probably paying for the same model multiple times.



Example Stack: Paying Twice (or More)


Let’s make it concrete.


You’re a mid-sized company with a modern SaaS stack:

  • Notion AI for meeting summaries

  • Slack AI for chat and search

  • Glean AI for knowledge base answers

  • HubSpot AI assistant for marketing

  • Grammarly GO for writing help


All five tools claim AI benefits. But guess what?


They all rely on OpenAI or Anthropic under the hood.


So now you're paying marked-up inference costs five times over, with no visibility and no control.



Where the Money’s Going


VCs have taken notice:


The money in AI is not in startups — it's flowing to NVIDIA, OpenAI, and the big cloud providers.

Anthropic just raised funding at an $18.4B valuation.


These are not public utilities. They are some of the most expensive suppliers in your digital supply chain.



This Isn’t Sustainable


Three compounding problems are brewing:


  • Redundant Spend: You're paying for the same inference multiple times, via SaaS middlemen.

  • Black Box Pricing: Most vendors don't expose LLM usage, model choice, or cost drivers.

  • Vendor Lock-in: You're stuck with the models your tools chose — not the ones that suit your needs.


Meanwhile, the costs of compute and energy are rising. At scale, this becomes a real business risk.



What Can You Do?


As a buyer:

  • Ask vendors which foundation models they use

  • Push for LLM usage reporting and controls

  • Audit your stack for redundant model access


As a builder:

  • Explore Small Language Models (SLMs) for edge or embedded tasks

  • Partner directly with model providers to cut out the middle layer

  • Design your own memory and retrieval layer to optimize context use


Final Thought


The model you choose matters. But how many times you're paying for it might matter even more.


Better architecture, smarter design, and conscious procurement can reduce your exposure.

And remember: the model isn’t your only cost.


Design decisions become budget decisions.


It’s time to treat them that way.





LLMs keep getting bigger, and so do the problems.


This year alone, companies will pour over $40 billion into AI infrastructure. McKinsey estimates the total cost of compute scaling could reach $1 trillion by 2030, and $3.3 trillion if everyone tries to build their own stack.


That’s not just unsustainable. It’s unnecessary.


Because while large models are impressive, they’re not always practical. And for most real-world use cases, they might not even be the right tool for the job.



The Limits of Scale


As LLMs scale into the hundreds of billions (or even trillions) of parameters, several hard problems emerge:


Soaring Cost of Compute

Inference and fine-tuning on large models require massive GPU clusters and energy draw, making real-time deployment cost-prohibitive for most organizations.


Latency

Big models are slow. Even with quantization and batching, response time often lags, which makes them less useful in interactive or real-time scenarios.


Data Center Bottlenecks

We’re hitting physical and economic limits in the availability of GPUs, networking, and power. Scaling from here requires massive investment, often with diminishing returns.


Accessibility Gaps

Startups, researchers, and smaller teams get priced out. Innovation becomes concentrated in the hands of those with the deepest pockets.


Environmental Impact

More tokens, more watts. Large-scale training and inference contribute significantly to energy consumption, and those costs will only grow with demand.


If this sounds unsustainable, that’s because it is.


So what’s the alternative?



Smaller Models. Smarter Use.


Small language models (SLMs), typically under 10B parameters, are having a moment. And for good reason.


Benefits of SLMs:


  • Lower cost — Train and run on consumer GPUs or small clusters

  • Faster inference — Suitable for real-time and edge applications

  • Easier to deploy — More portable, less dependent on proprietary infrastructure

  • Fine-tuneable — Adapt to specific domains without retraining giants

  • Lower energy footprint — A meaningful step toward greener AI


And when paired with the right architecture and high-quality data, SLMs can punch well above their weight.
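To make the “easier to deploy” point concrete, here’s a minimal sketch of running a small open model locally, assuming the Hugging Face transformers library; the model id is illustrative, and any similarly sized open model would work:

```python
# Minimal sketch: a ~2B-parameter instruction-tuned model running locally.
# The model id is illustrative; swap in any small open model you prefer.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="google/gemma-2-2b-it",  # small enough for a single consumer GPU
    device_map="auto",             # use a GPU if one is available, else CPU
)

prompt = "Summarize in one sentence: Q3 revenue in the western region grew 12% while churn fell to 3%."
out = generator(prompt, max_new_tokens=60)
print(out[0]["generated_text"])
```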



Small Models Are Making Big Moves


The belief that “bigger is better” in AI is being challenged. In 2025, we’ve seen remarkable progress in compact, efficient SLMs: models that are not only cheaper to run but surprisingly powerful.


Here are some of the most promising developments:


  • Microsoft Phi-4 Series

    The Phi family continues to impress, especially the new Phi-4-Mini-Flash, which delivers up to 10× faster inference and excels at reasoning, all in a footprint small enough for local or edge deployment.

  • Google Gemma 3n

    The family has expanded into multiple sizes (1B–27B), with Gemma 3n optimized specifically for devices like laptops and tablets, enabling safer, fine-tunable models at the edge.

  • OLMo-2 (AI2)

    One of the most transparent models to date, fully open, with shared training data, logs, and evaluation tooling. The 32B variant is setting new standards in open research and reproducibility.

  • Mistral Small & Magistral

    These European-built models offer 128K-token context windows, reasoning capabilities, and a strong open-source roadmap, proving that high performance doesn’t have to mean high resource requirements.

  • Energy-Efficient Research

    Studies show architectural changes can reduce energy consumption of small models by up to 90% without compromising performance, making them the obvious choice for sustainable AI.



Final Thought


The age of ever-larger models is giving way to something more practical, more efficient, and more accessible.


It’s not about squeezing a trillion parameters into your stack.


It’s about building fit-for-purpose models that are:

  • Light enough to run on real infrastructure

  • Smart enough to deliver real results

  • Transparent enough to build trust

  • And efficient enough to scale sustainably


In the end, smaller doesn’t mean weaker.


It means focused.


Purposeful.


And ready for the real world.



At Fuse, we believe a great data strategy only matters if it leads to action.


If you’re ready to move from planning to execution — and build solutions your team will actually use — let’s talk.

