brianletort.ai
AI Architecture · Enterprise AI · AI Strategy · AI Systems · Infrastructure

The Enterprise Model Portfolio

The answer to the token economics problem isn't one model — it's a portfolio of six specialized model types served as internal API services. Near-frontier open models now handle 80–90% of enterprise tasks at a fraction of the cost.

April 12, 2026 · 13 min read

TL;DR

  • The answer to the token economics problem isn't one model — it's a portfolio of six specialized model types served as internal API services
  • Near-frontier open models (Qwen 3.5, Nemotron 3 Super, DeepSeek V3.2) now match frontier models on 80–90% of enterprise tasks at a fraction of the cost
  • Frontier models remain essential for complex reasoning, synthesis, and distillation — but they should be the teacher, not the entire workforce
  • The model portfolio pattern: reasoning models for hard problems, utility models for volume, embedding models for search, rerankers for precision, code models for development, vision models for multimodal
  • Enterprises should think of this as an internal model marketplace — API keys, rate limits, cost tracking, and routing policies

In Part 1, I showed the math: a large enterprise could face $50–100M per year in token costs at API pricing. That number is real. And it will only grow as agentic workflows become standard and personal AI systems become continuous.

But the solution isn't just "host it yourself and save money."

The solution is using the right model for every task.

You wouldn't run every database query on your most expensive compute tier. You wouldn't route every HTTP request through your most powerful server. The same logic applies to AI — and most enterprises haven't internalized it yet.

Here's the paradox that a16z's "LLMflation" analysis makes clear: inference costs are declining roughly 10x per year — a 1,000x reduction over three years. Models that scored 42 on MMLU cost $60 per million tokens in 2021 and just $0.06 today. Yet enterprise AI spending is accelerating. Unit costs drop while total spending climbs, because consumption at Level 3–5 of the maturity continuum grows faster than costs decline.

The answer isn't cheaper tokens. It's a portfolio that matches the right model to every workload.
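The portfolio math can be sketched in a few lines. The volume shares below mirror the tier breakdown later in this post; the per-tier prices (dollars per million tokens) are illustrative assumptions, not quoted rates:

```python
# Blended cost per million tokens under the portfolio mix described in this
# post. Volume shares follow the tier breakdown below; the per-tier prices
# are illustrative assumptions, not quoted rates.
TIERS = {
    #             (share of volume, assumed $/M tokens)
    "reasoning": (0.07, 10.00),
    "code":      (0.12, 1.50),
    "vision":    (0.08, 1.50),
    "utility":   (0.65, 0.30),
    "embedding": (0.08, 0.02),
}

def blended_cost(tiers):
    """Volume-weighted average price per million tokens."""
    return sum(share * price for share, price in tiers.values())

frontier_only = 10.00          # everything routed to the premium tier
portfolio = blended_cost(TIERS)
print(f"portfolio: ${portfolio:.2f}/M vs frontier-only: ${frontier_only:.2f}/M")
```

Even with these rough assumed prices, routing the bulk of the volume to cheap tiers cuts the blended rate by roughly 8x versus sending everything to a premium tier.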

Near-Frontier Models Have Crossed the Threshold

For the first two years of the generative AI era, there was a meaningful gap between frontier models and everything else. If you wanted real quality, you needed GPT-4, Claude, or Gemini. Open-weight alternatives were interesting research projects but not production-ready.

That is no longer true.

The open model ecosystem has matured dramatically in 2025–2026. A new generation of near-frontier models — many using Mixture of Experts (MoE) architectures that activate only a fraction of their parameters per request — now deliver frontier-competitive quality at a fraction of the compute cost. The February 2026 Open-Source LLM Leaderboard tells the story.

Near-Frontier Open Models (February 2026 Leaderboard)

Open-weight models matching or exceeding frontier quality — benchmarks from the Open-Source LLM Leaderboard

| Model | Total | Active | Arch | Key Strength | Cost |
|---|---|---|---|---|---|
| DeepSeek V3.2 | 685B | 37B | MoE | MMLU-Pro 84.1%, SWE-Bench 72.4% | $ |
| Llama 4 Maverick | 400B | 17B | MoE | 1M context, multimodal, Apache 2.0 | $$ |
| Qwen 3.5-122B | 122B | 10B | MoE | MMLU-Pro 86.7% — frontier-class at 10B active | $ |
| Mistral Large 3 | 123B | 123B | Dense | MMLU 84.0%, Apache 2.0, 256K ctx | $$ |
| Qwen 3.5-35B | 35B | 3B | MoE + DeltaNet | MMLU-Pro 85.3%, native vision + video, 262K ctx | $ |
| Nemotron 3 Nano | 30B | 3B | Mamba2 Hybrid | 3.3x faster inference, 86.3% RULER @ 1M ctx | $ |
| Phi-4 | 14B | 14B | Dense | MATH 80.4% (beats GPT-4o), MIT license | $ |
| Gemma 3 | 27B | 27B | Dense | Strong fine-tuning, permissive license | $ |

DeepSeek V3.2 (685B total, 37B active via MoE). Leads the open-source leaderboard with MMLU-Pro 84.1% and SWE-Bench 72.4% — matching or exceeding frontier proprietary models. Features "Thinking in Tool-Use" for coherent multi-step planning. And it disrupted the pricing landscape: $0.55 per million input tokens — 90% below Western competitors for comparable quality. Released under MIT license, making it one of the most permissive frontier-class models available.

Llama 4 Maverick (400B total, 17B active via MoE). Meta's flagship open model brings 1 million token context windows and native multimodal capabilities under Apache 2.0. The largest open-weight ecosystem in the industry — more fine-tunes, more tooling, more deployment guides than any other model family. If your team has experience with Llama, Maverick is the natural step up to frontier-competitive performance.

Qwen 3.5-122B (122B total, only 10B active via MoE). This is the model that should get every enterprise architect's attention. It scores MMLU-Pro 86.7% and GPQA Diamond 86.6% — outperforming the full DeepSeek V3.2 on graduate-level reasoning despite activating 3.7x fewer parameters. It also achieves SWE-Bench 72.0% with native multimodal support. Frontier-class intelligence at a fraction of the compute. This is what the economics of 2026 look like.

Mistral Large 3 (123B dense). Mistral's enterprise workhorse — 84.0% MMLU with 256K token context under Apache 2.0. Dense architecture means simpler deployment than MoE models: no expert routing complexity, predictable memory usage, consistent inference speed. It's become the default choice for organizations that want strong general-purpose performance without operational complexity.

Qwen 3.5-35B-A3B (35B total, only 3B active via MoE + Gated DeltaNet). This model broke the scaling myth. With just 3 billion active parameters, it achieves MMLU-Pro 85.3%, GPQA Diamond 84.2%, and SWE-Bench 69.2% — outperforming models with 22B active parameters on reasoning, coding, vision, and agentic tasks. It combines 256 experts (8 routed per token + 1 shared) with a hybrid Gated DeltaNet/softmax attention architecture across 40 layers. Natively multimodal (text, images, video) with a 262K context window extendable to 1M. Supports 201 languages. At 3B active params, it runs on hardware that most enterprises already have — no H100 cluster required. This is the utility tier's future: frontier-quality intelligence on commodity infrastructure.

Nemotron 3 Nano (30B total, 3B active, Mamba2 hybrid architecture). NVIDIA's throughput-optimized model delivers 3.3x faster inference than comparable models and scores 86.3% on RULER at 1 million tokens — purpose-built for long-context agentic workloads. The Mamba2 hybrid replaces most attention layers with structured state-space layers, so inference cost grows roughly linearly with context length rather than quadratically. For high-throughput enterprise deployments processing millions of requests daily, that architectural difference translates directly to GPU savings.
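The linear-versus-quadratic difference is easy to quantify. A minimal sketch, assuming ideal big-O scaling and ignoring constant factors and mixed-layer effects:

```python
# Relative compute as context grows: softmax attention scales ~O(n^2) with
# sequence length n, while state-space (Mamba-style) layers scale ~O(n).
# Idealized big-O arithmetic only; real kernels have different constants.
def relative_cost(n_tokens, base=8_000, quadratic=True):
    """Cost at n_tokens relative to the cost at `base` tokens."""
    r = n_tokens / base
    return r * r if quadratic else r

# Growing context from 8K to 1M tokens (a 125x increase):
attn = relative_cost(1_000_000)                   # quadratic growth
ssm = relative_cost(1_000_000, quadratic=False)   # linear growth
print(f"attention: {attn:,.0f}x  state-space: {ssm:,.0f}x")
```

At 1M tokens of context the quadratic curve is 15,625x its 8K-token cost while the linear one is 125x, which is the gap the paragraph above is pointing at.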

Phi-4 (14B dense, Microsoft). The best small model in the ecosystem. Scores 80.4% on MATH — higher than GPT-4o's 74.6%. Released under MIT license. Purpose-built for STEM reasoning, structured data analysis, and domain-specific tasks where a 14B model running on a single consumer GPU can match or beat models 10x its size. The economics are transformative: run it on a $2,000 RTX 4090 instead of renting H100 time.

Gemma 3 (27B dense, Google). Strong baseline performance with permissive licensing and excellent fine-tuning characteristics. Google's research pedigree means the architecture is well-documented and the community tooling is mature. A solid choice for organizations that want to fine-tune domain-specific models without licensing headaches.

The critical insight here is not that these models beat frontier on everything. They don't. The insight is that they are good enough for 80–90% of enterprise workloads — the high-volume, routine tasks that account for the vast majority of token consumption.

Forrester predicts that 30% of enterprise application vendors will launch MCP (Model Context Protocol) servers in 2026, enabling cross-platform agentic workflows. That interoperability layer accelerates the multi-model portfolio pattern — every vendor endpoint becomes part of the routing topology.

AT&T's Chief Data Officer, Andy Markus, captured this well: smaller models can be "just about as accurate, if not as accurate, as a large language model on a given domain area."

The frontier models still matter. But they should not be handling every request.

The Six Model Types Every Enterprise Needs

Here's the portfolio I've come to believe every serious enterprise AI deployment needs. Not one model. Six types — each optimized for a different class of work, each running as a distinct service.

The Enterprise Model Portfolio

Six model types organized by token volume and cost per token:

| Model Type | Role | Share of Token Volume | Cost per Token | Example Models |
|---|---|---|---|---|
| Reasoning / Synthesis | The Strategists | 5–10% | Highest | DeepSeek R1, Qwen QwQ, Claude, GPT-5 |
| Code Models | The Engineers | 10–15% | Medium | Qwen3-Coder, DeepSeek Coder, StarCoder |
| Vision / Multimodal | The Observers | 5–10% | Medium | Qwen-VL, LLaVA, InternVL |
| Utility / Instruction | The Workforce | 60–70% | Lowest | Qwen 2.5-7B/32B, Nemotron 3 Super, Phi-4 |
| Embedding + Reranker | Librarians & Curators | High volume | Minimal | Nomic Embed, BGE, sentence-transformers |

1. Reasoning / Synthesis Models — The Strategists

What they do: Deep thinking. Multi-step analysis. Complex document synthesis. Code architecture. Strategic planning. Anything that requires genuine reasoning — connecting ideas, evaluating tradeoffs, producing novel insight.

Examples: DeepSeek R1, Qwen QwQ, frontier models (Claude Opus, GPT-5, Gemini Ultra)

Usage profile: 5–10% of total token volume. Highest cost per token. Highest value per outcome.

These are the generals. You don't send generals to deliver mail.

Reasoning models should handle the tasks that smaller models genuinely cannot: complex multi-document synthesis, nuanced decision analysis, architectural planning, and cases where the cost of a wrong answer exceeds the cost of premium inference.

2. Utility / Instruction Models — The Workforce

What they do: Summarization. Drafting. Classification. Extraction. Translation. Routine Q&A. Data transformation. Template generation. Formatting. All the workload categories that are high-volume and well-defined.

Examples: Qwen 2.5-7B/32B, Nemotron 3 Super, Llama 3.2, Phi-4

Usage profile: 60–70% of total token volume. Lowest cost per token. The economics tier.

This is where the majority of your token volume lives. Get this tier right and the economics work. Get it wrong and no amount of routing or optimization will save you.

The key is that these models don't need to be brilliant. They need to be reliable, fast, and cheap. A 7B parameter model that summarizes emails correctly 95% of the time is more valuable to the enterprise than a frontier model that does it 98% of the time at 20x the cost.
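That tradeoff can be made concrete as cost per correct answer, using the numbers above: 95% accuracy at 1x unit cost versus 98% at 20x (the unit cost of 1.0 is arbitrary):

```python
# Cost per *correct* answer for the email-summarization example in the text:
# a small model right 95% of the time at 1x unit cost vs a frontier model
# right 98% of the time at 20x. The 1.0 unit cost is an arbitrary baseline.
def cost_per_correct(unit_cost, accuracy):
    return unit_cost / accuracy

small = cost_per_correct(1.0, 0.95)      # small-model cost per correct answer
frontier = cost_per_correct(20.0, 0.98)  # frontier cost per correct answer
print(f"small: {small:.2f}  frontier: {frontier:.2f}")
```

The 3-point accuracy gain costs roughly 19x more per correct answer, which is why this tier is the economics tier.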

3. Embedding Models — The Librarians

What they do: Convert text into vector representations for semantic search and retrieval. The foundation of every RAG pipeline.

Examples: Nomic Embed, GTE-large, BGE, sentence-transformers

Usage profile: High volume but low cost. These are small models (typically 100M–1B parameters) that run fast on minimal GPU. Often deployed on CPU-only infrastructure.

If you read my Autonomous Stack series, you know how important the data substrate is. Embedding models are what make that substrate searchable. They convert your documents, emails, meeting transcripts, and knowledge base into a semantic space that agents can navigate.

Every RAG query starts here. Every vector search starts here. The quality of your embedding model directly determines the quality of what your agents retrieve — and bad retrieval means wasted tokens downstream when irrelevant context pollutes the LLM's context window.
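The shape of that pipeline can be sketched in a few lines. A toy bag-of-words vectorizer stands in for a real embedding model here; the retrieval flow, not the vectors, is the point:

```python
import math
from collections import Counter

# Toy semantic search: a bag-of-words vectorizer stands in for a served
# embedding model (Nomic Embed, BGE, etc.). The pipeline shape is identical:
# embed the corpus, embed the query, rank by cosine similarity.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query, docs, top_k=2):
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:top_k]

docs = [
    "quarterly revenue report for the sales team",
    "how to reset your VPN password",
    "sales pipeline forecast and revenue targets",
]
print(search("revenue forecast", docs, top_k=1))
```

A production deployment swaps `embed` for a call to the served embedding endpoint; the surrounding search logic doesn't change.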

4. Reranker Models — The Curators

What they do: Re-score retrieved results for relevance before they enter the context window. The quality gate between retrieval and generation.

Examples: BGE-reranker, Cohere Rerank, cross-encoder models

Usage profile: Moderate volume, low cost. Cross-encoder architecture means they're more compute-intensive per item than embedding models, but they process far fewer items (typically top-20 to top-50 candidates from initial retrieval).

A good reranker pays for itself by keeping irrelevant chunks out of your context window.

This is a point most teams miss. Every irrelevant chunk that enters the context window costs tokens, degrades answer quality, and wastes the LLM's attention. A reranker that filters the top-50 retrieval results down to the top-5 most relevant saves tokens on every downstream inference call — and improves quality at the same time.
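The savings are straightforward arithmetic, using the top-50-to-top-5 figures above and an assumed chunk size of 400 tokens (illustrative, not measured):

```python
# Context-token savings from reranking, using the article's top-50 -> top-5
# filtering. The 400-token chunk size is an illustrative assumption.
CHUNK_TOKENS = 400

def context_tokens(n_chunks, chunk_tokens=CHUNK_TOKENS):
    return n_chunks * chunk_tokens

before = context_tokens(50)    # raw retrieval results in the context window
after = context_tokens(5)      # after the reranker's quality gate
savings = 1 - after / before   # fraction of context tokens eliminated
print(f"{before:,} -> {after:,} tokens ({savings:.0%} saved per call)")
```

That 90% reduction compounds: it's paid on every downstream inference call the pipeline makes.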

In MemoryOS, I use Reciprocal Rank Fusion across keyword search, vector similarity, and temporal boosting. Adding a dedicated reranker on top of that would further compress the context and improve precision. It's one of the highest-ROI additions to any RAG pipeline.
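For reference, Reciprocal Rank Fusion is only a few lines. This sketch fuses three ranked lists, standing in for keyword search, vector similarity, and temporal ranking, with the conventional k=60 constant:

```python
# Reciprocal Rank Fusion: merge multiple ranked lists into one, using only
# rank positions. k=60 is the conventional smoothing constant.
def rrf(rankings, k=60):
    """rankings: list of ranked doc-ID lists. Returns IDs by fused score."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["a", "b", "c"]    # e.g. BM25 results
vector = ["b", "a", "d"]     # e.g. embedding similarity results
temporal = ["b", "c", "a"]   # e.g. recency-boosted results
print(rrf([keyword, vector, temporal]))
```

Because RRF uses only rank positions, it needs no score calibration across heterogeneous retrievers, which is why it's a common default for hybrid search.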

5. Code Models — The Engineers

What they do: Code generation, review, refactoring, test generation, documentation, and debugging. Specialized for programming languages and software development workflows.

Examples: Qwen3-Coder (69.6% on SWE-Bench Verified), DeepSeek Coder, StarCoder

Usage profile: High volume for development teams. Domain-specific fine-tuning pays off here — a code model trained on your internal codebase can outperform a general frontier model on your specific tasks.

Code is one of the highest-value applications of enterprise AI. Developers are expensive. Development velocity directly impacts revenue. And code models are one of the clearest cases where specialized smaller models outperform general-purpose frontier models — because the domain is well-defined and the evaluation criteria are objective.

6. Vision / Multimodal Models — The Observers

What they do: Image understanding, document OCR, diagram interpretation, chart reading, screenshot analysis, video understanding.

Examples: Qwen-VL, LLaVA, InternVL

Usage profile: Growing rapidly as agentic systems increasingly need to interact with visual interfaces — reading dashboards, processing scanned documents, understanding whiteboard photos, and navigating GUIs.

This is the category that's still maturing, but its growth trajectory is clear. As agents move from text-only interactions to full multimodal awareness, vision models become essential infrastructure.

Serving Models as Internal Services

Having the right models is necessary. Having them deployed as governed internal services is what makes it operational.

Models as Internal Services

Each model type as a governed API endpoint with OpenAI-compatible interfaces

Client applications (chat app, agents, code tools, email AI, analytics) call an AI Platform Gateway, an OpenAI-compatible API layer that handles routing, API keys, usage tracking, and governance. The gateway dispatches to six backing model services: Reasoning (DeepSeek R1), Utility (Qwen 2.5-32B), Embedding (Nomic Embed), Reranker (BGE-reranker), Code (Qwen3-Coder), and Vision (Qwen-VL), all running on shared GPU infrastructure (vLLM / NVIDIA NIM / TGI, across on-prem, cloud, and edge).

The pattern that works looks like this:

Standardize on OpenAI-compatible APIs. The major model serving frameworks — NVIDIA NIM, vLLM, TGI, SGLang — all expose OpenAI-compatible endpoints. This means every model, regardless of provider or architecture, presents the same interface to consuming applications. Swap Qwen for Nemotron behind the endpoint and no client code changes.
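A minimal sketch of what that standardization buys you: tier selection reduces to a configuration mapping. The internal URLs and model names below are hypothetical placeholders, not real endpoints:

```python
# Because every serving framework exposes the same OpenAI-compatible
# interface, "which model" reduces to "which base_url + model name".
# All URLs and model names here are hypothetical placeholders.
ENDPOINTS = {
    "reasoning": {"base_url": "http://reasoning.ai.internal/v1", "model": "deepseek-r1"},
    "utility":   {"base_url": "http://utility.ai.internal/v1",   "model": "qwen2.5-32b"},
    "embedding": {"base_url": "http://embed.ai.internal/v1",     "model": "nomic-embed"},
}

def endpoint_for(tier):
    """Client code asks for a tier; the mapping decides the backend."""
    return ENDPOINTS[tier]

# Swapping Qwen for Nemotron is a one-line config change. No client code
# changes, because the interface behind the endpoint is identical.
ENDPOINTS["utility"]["model"] = "nemotron-3-super"
```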

Each model type gets its own service endpoint. Reasoning. Utility. Embedding. Reranker. Code. Vision. Each runs as a distinct service with its own scaling policies, GPU allocation, and performance characteristics.

Each service gets governance. API keys for authentication. Rate limits to prevent runaway consumption. Cost tracking per team, per application, per use case. Usage dashboards that show who's consuming what and at what cost.
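A governance layer of this shape can be sketched as a toy gateway. The API keys and prices are illustrative assumptions, and rate limiting is omitted for brevity:

```python
from collections import defaultdict

# Toy governance gateway: per-key tier authorization plus cost attribution.
# Keys and prices are illustrative assumptions; rate limiting is omitted.
class Gateway:
    def __init__(self, keys, price_per_1k):
        self.keys = keys                 # api_key -> set of authorized tiers
        self.price = price_per_1k        # tier -> assumed $ per 1K tokens
        self.spend = defaultdict(float)  # api_key -> accumulated dollars

    def request(self, api_key, tier, tokens):
        """Authorize the call, then attribute its cost to the key."""
        if tier not in self.keys.get(api_key, set()):
            raise PermissionError(f"{api_key} is not authorized for {tier}")
        cost = tokens / 1000 * self.price[tier]
        self.spend[api_key] += cost
        return cost

gw = Gateway(
    keys={"team-chatbot": {"utility"}, "team-ds": {"utility", "reasoning"}},
    price_per_1k={"utility": 0.0003, "reasoning": 0.01},
)
gw.request("team-ds", "reasoning", 50_000)  # authorized; $0.50 attributed
```

The `spend` map is the raw material for the usage dashboards: cost per team, per application, per tier, straight from the request path.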

Teams consume models like any internal platform service. A product team building a customer-facing chatbot requests access to the utility tier. A data science team doing complex analysis gets access to the reasoning tier. An engineering team gets code model access. Nobody gets unrestricted access to everything — the governance layer ensures consumption matches authorization.

NVIDIA NIM is one production-ready path: it bundles TensorRT-LLM optimization into a single container, delivers 2.6x throughput improvement over vanilla H100 deployments, and supports models from Llama to Nemotron to DeepSeek.

vLLM is the open-source alternative: high-throughput inference with PagedAttention for efficient memory management and continuous batching for maximizing GPU utilization.

The organizational model is straightforward: a central AI Platform team manages the model portfolio — procurement, deployment, optimization, governance. Application teams consume via API. The platform team's KPIs are cost per token, latency per model tier, uptime, and utilization.

Frontier Models as Teachers, Not the Entire Workforce

The model portfolio pattern doesn't eliminate frontier models. It repositions them.

In the new architecture, frontier models serve three critical roles:

Distillation source. Use frontier models to generate high-quality training data and synthetic examples. Then fine-tune smaller utility models on that output. The frontier model creates the capability; the small model operationalizes it.
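A minimal sketch of the packaging step: frontier outputs become supervised fine-tuning records for the smaller model. The chat-style JSONL schema shown is one common convention, not a fixed standard:

```python
import json

# Distillation packaging sketch: frontier ("teacher") outputs become
# supervised fine-tuning records for a smaller model. The chat-style JSONL
# schema is one common convention, not a fixed standard.
def to_sft_record(prompt, teacher_answer):
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": teacher_answer},
    ]}

# Hypothetical teacher outputs, collected from the frontier model:
teacher_outputs = [
    ("Summarize the Q3 incident report.", "Three outages, all network-related."),
]
jsonl = "\n".join(json.dumps(to_sft_record(p, a)) for p, a in teacher_outputs)
print(jsonl)
```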

Evaluation oracle. Use frontier models to evaluate the outputs of smaller models. When the utility tier handles a request, a sampling-based quality check against a frontier model can catch drift and regression before they reach users.

Escalation target. The utility model handles the request by default. If confidence is low, the task is flagged as complex, or the domain is sensitive — escalate to the reasoning tier. This is the pattern AT&T deployed: "super agents" that coordinate and route, "worker agents" that execute with specialized smaller models.
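The escalation logic is simple to express. A sketch with stub models and an assumed confidence threshold (the stubs and the 0.8 cutoff are illustrative, not from any named deployment):

```python
# Escalate-on-low-confidence routing: the utility model answers by default
# and hands off to the reasoning tier on low confidence or sensitive domains.
# The stub models and the 0.8 threshold are illustrative assumptions.
def route(request, utility_model, reasoning_model, threshold=0.8):
    answer, confidence = utility_model(request)
    if confidence >= threshold and not request.get("sensitive"):
        return answer, "utility"
    return reasoning_model(request), "reasoning"

# Stubs: the utility model is "confident" on short requests only.
utility = lambda req: (f"draft: {req['text']}",
                       0.9 if len(req["text"]) < 40 else 0.5)
reasoning = lambda req: f"deep analysis of: {req['text']}"

print(route({"text": "summarize this email"}, utility, reasoning))
```

In production the confidence signal might come from self-reported scores, logprobs, or a lightweight verifier; the routing skeleton stays the same.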

The frontier model remains strategically important. It's the intelligence source, the quality benchmark, the last line of defense for hard problems.

But it no longer owns every inference path.

That's the shift. And it's what makes the economics work.

The Portfolio Is the Answer to the Economics Problem

Part 1 showed the bill: trillions of tokens per month, tens of millions of dollars per year.

The model portfolio is the answer. Not one model stretched across every workload. Six specialized tiers, each optimized for cost, quality, and purpose. Near-frontier open models handling the volume. Frontier models reserved for the work that genuinely demands them.

But having the right models isn't enough. You need the operating system around them — the routing layer that sends each request to the right tier, the caching layer that eliminates redundant computation, the optimization techniques that squeeze more throughput from every GPU.

That's the AI factory pattern. And it's Part 3.