What you need to know
- NVIDIA NIM (NVIDIA Inference Microservices) is NVIDIA's hosted API for open-weight AI models: Llama 3, Mistral, Nemotron and others, running on NVIDIA GPU infrastructure.
- Gibson built an evaluation harness for Sandbox: before any model is committed to a build, we run the task through NIM, Claude (Vertex AI) and GPT and score the outputs.
- Model selection matters for cost, data handling and task fit: not all AI tasks are best served by the same model, and the cost difference at volume can be substantial.
- Open-weight models via NIM can be self-hosted, which gives businesses with data residency requirements a path that closed-model APIs do not.
- The harness is built around the specific task, not the vendor brand. Which model ships in a finished build is a measured outcome, not a preference.
Every AI vendor says their model is best. By the time a prospect has read the benchmarks, the blog posts and the comparison pages, the models are running neck-and-neck on every dimension the vendor chose to publish. That is not useful when you are deciding which model to put at the centre of a real business process.
When Gibson builds an AI agent in Sandbox, one of the first questions is which language model to run it on. We do not answer that question from the marketing literature. We answer it by running the task through an evaluation harness that includes NVIDIA NIM, Claude via Google Cloud Vertex AI, and GPT via the OpenAI API, and scoring the outputs against the criteria that actually matter for that build: accuracy on the specific task, latency, cost per call at the expected volume, and whether the output sounds like a human from this business rather than a generic AI.
What NVIDIA NIM is
NVIDIA NIM stands for NVIDIA Inference Microservices. It is NVIDIA's hosted API service for running open-weight AI models on NVIDIA GPU infrastructure. The practical effect: you can call models like Meta's Llama 3 (8 billion and 70 billion parameter variants), Mistral, Mixtral, and NVIDIA's own Nemotron family through a standard API, without operating the GPU hardware yourself.
The difference from calling Claude or GPT is that these are open-weight models. NVIDIA does not own the weights and you are not licensing access to a proprietary model trained on NVIDIA's data. The weights are published by Meta, Mistral AI and others, and NVIDIA is providing the inference infrastructure to run them at scale with enterprise SLAs and predictable latency. For businesses that want the option to self-host, a NIM container can also be deployed on your own infrastructure, keeping data entirely on-premise.
NVIDIA NIM is essentially: take the best open-weight models from the research community, run them properly on NVIDIA GPUs, and make them available via an API call. For builders, it adds a competitive alternative to the closed-model APIs at a different cost and data-handling profile.
The models in our evaluation harness: specs at a glance
The six models we run regularly across Sandbox builds. Pricing is per million tokens, sourced from official API pricing pages as of May 2026. API prices change; verify current rates at each provider before budgeting a build.
| Model | Provider | Context window | Input / $1M tokens | Output / $1M tokens | Self-host |
|---|---|---|---|---|---|
| Claude 3.5 Haiku | Anthropic / Vertex AI | 200,000 | $0.80 | $4.00 | No |
| Claude 3.5 Sonnet | Anthropic / Vertex AI | 200,000 | $3.00 | $15.00 | No |
| GPT-4o Mini | OpenAI | 128,000 | $0.15 | $0.60 | No |
| GPT-4o | OpenAI | 128,000 | $2.50 | $10.00 | No |
| Llama 3.3 70B | NVIDIA NIM | 128,000 | from $0.59 | from $0.59 | Yes |
| Llama 3.1 8B | NVIDIA NIM | 128,000 | from $0.10 | from $0.10 | Yes |
Sources: Anthropic pricing, OpenAI pricing, NVIDIA NIM catalogue. NIM pricing varies by volume tier and model version. USD pricing; AUD equivalent depends on exchange rate at time of billing. Verify before budgeting.
What published benchmarks actually say (and what they miss)
Academic benchmarks are a starting point, not a verdict. MMLU (Massive Multitask Language Understanding) is the most cited: it tests reasoning across 57 academic subjects from law to maths. Here is how the models in our harness score on MMLU, sourced directly from each model's official model card:
| Model | MMLU score | HumanEval (code) | Source |
|---|---|---|---|
| Claude 3.5 Sonnet | 90.4% | 93.7% | Anthropic model card |
| GPT-4o | 87.2% | 90.2% | OpenAI model card |
| Llama 3.3 70B | 86.0% | 88.4% | Meta model card |
| Claude 3.5 Haiku | 83.5% | 88.1% | Anthropic model card |
| GPT-4o Mini | 82.0% | 87.2% | OpenAI model card |
| Llama 3.1 8B | 73.0% | 72.6% | Meta model card |
MMLU and HumanEval are published academic benchmarks. They measure reasoning across general knowledge and code generation respectively; they do not measure how well a model drafts a follow-up SMS to a cold real estate lead. Use them as a rough capability proxy, not as a hiring decision.
The practical implication of the benchmark data: the gap between the top closed models and Llama 3.3 70B via NIM is 4 to 5 percentage points on MMLU. For classification and extraction tasks, that gap shrinks further in task-specific evaluation because academic reasoning benchmarks do not reflect the narrower domain of a business process. The gap is more visible on HumanEval (code generation), which matters for builders more than for business owners.
Why we built an evaluation harness around it
The standard approach to building an AI agent is to pick a model upfront, usually whichever API the developer is most familiar with, and tune the prompts until it works well enough. The problem with that approach is that "well enough" is relative to the model you started with, not to the best achievable result. You do not know what you are leaving on the table.
The evaluation harness is a testing framework we built into the Sandbox build process. Before any model is committed to a build, we define a set of test cases that represent the real task: sample inputs from the business, the expected outputs, the edge cases that actually matter. We run each model on the test set and score the outputs across four dimensions.
Task accuracy. For each test case, does the model produce the right answer? Classification tasks (is this lead hot or cold) have clear right answers. Generation tasks (draft a follow-up to this call) are scored against rubrics: correct information, appropriate tone, right length, no invented details.
Latency. How long does the model take to respond at the inference endpoint? For a background task that runs overnight, latency is almost irrelevant. For a live customer-facing interaction, a two-second gap is the difference between an agent that feels instant and one that feels broken.
Cost per call at volume. API pricing varies by model and provider. At small volumes the differences are negligible. At 10,000 or 50,000 calls a month, a cost-per-call difference of a few cents compounds into a meaningful number. The harness captures the per-call cost for each model at the token counts the task actually requires, so we can project real monthly costs at the client's expected volume.
Voice and tone fit. AI output needs to sound like it came from the business, not from a generic chatbot. Models have different default registers. Some tend formal, some tend casual, some over-qualify everything. For a Sydney trades business, the right tone is different to a medical practice. The harness includes tone scoring because a technically accurate response that sounds wrong for the brand is still a bad output.
What we have found running NVIDIA NIM in the harness
A few consistent patterns have come out of running NIM alongside the closed-model APIs across different build types.
For structured extraction and classification, open-weight models are competitive. Tasks like pulling structured data from unstructured text (extract the job details from this email inquiry), classifying intent, or routing to a category: Llama 3 70B via NIM produces results close to the closed-model APIs at a different cost profile. The gap that existed a year ago has narrowed significantly.
For nuanced generation, the closed models still lead on tone. Tasks that require drafting a reply that sounds like a specific person or business, or navigating a sensitive customer situation, still tend to be better served by Claude. The instruction-following and context-retention are different. That difference is more visible in generation than in classification.
Cost at volume is where NIM becomes genuinely interesting. If a build is running a straightforward classification or extraction task at high volume, and the accuracy difference between models is small, the cost difference is not. The right choice depends on the specific numbers for that build, and the harness produces those numbers rather than relying on published pricing tables that do not reflect your actual token usage.
Data handling is a separate question from performance. For some Australian businesses, particularly in health, legal, and government-adjacent verticals, the question of where data goes is not just a cost question. An open-weight model that can be self-hosted changes the data handling picture in ways that a closed-model API cannot. NIM's container deployment option is relevant here: the same Llama 3 weights that run on NVIDIA's hosted NIM can also run in your own environment.
Cost at volume: the number that matters for AU SMBs
Per-token pricing looks trivial in isolation. It compounds fast at business scale. Here is a worked example using published pricing for a real business task: lead classification (reading an inbound enquiry and classifying it as hot, warm, or cold).
Scenario: 20,000 lead classification calls per month. Each call sends 800 tokens of input (the enquiry text plus a system prompt describing the classification rules) and receives 60 tokens of output (the classification label plus a one-line reason). Total: 16M input tokens and 1.2M output tokens per month.
| Model | Input cost | Output cost | Monthly total (USD) |
|---|---|---|---|
| Claude 3.5 Sonnet | $12.80 | $18.00 | $30.80 |
| Claude 3.5 Haiku | $12.80 | $4.80 | $17.60 |
| GPT-4o | $40.00 | $12.00 | $52.00 |
| GPT-4o Mini | $2.40 | $0.72 | $3.12 |
| Llama 3.3 70B (NIM) | $9.44 | $0.71 | $10.15 |
| Llama 3.1 8B (NIM) | $1.60 | $0.12 | $1.72 |
Based on published API pricing as of May 2026. NIM pricing uses the published per-token rate for hosted Llama 3.3 70B and Llama 3.1 8B. USD. Verify current rates before budgeting any build. Exchange rate not included.
The range across models for the same task and volume: USD $1.72 to $52.00 per month. For a task where accuracy is comparable across the mid-tier models (and for classification it often is), the question of which model to run is a cost engineering decision, not just a quality decision. At 200,000 calls per month the spread widens by a factor of ten. That is what the harness surfaces: not just which model is most accurate, but which model is most accurate at the cost that fits the build.
Task-fit matrix: what our evaluation shows
No single model wins every task. Here is how the models in our harness tend to perform across the task types that come up most in AU SMB builds. These are our observed outcomes from running the evaluation harness, not vendor claims.
| Task type | Our harness pick | Why |
|---|---|---|
| Lead classification at volume | Llama 3.3 70B (NIM) or GPT-4o Mini | Accuracy is competitive with the premium models on structured output. Cost difference at 20k+ calls/month is material. |
| Customer reply drafting (SMS, email) | Claude 3.5 Haiku | Instruction-following and tone control are ahead of the field. Stays on brand without heavy prompt engineering. |
| Long document summarisation | Claude 3.5 Sonnet | 200,000 token context window handles full transcripts, contracts, and call logs without chunking. Coherence across long inputs is noticeably stronger. |
| Structured data extraction from free text | Llama 3.3 70B (NIM) or GPT-4o Mini | JSON and structured output modes are reliable. Extraction accuracy on defined schemas is close to the premium models. |
| Regulated or sensitive industries (health, legal, govt) | Llama (self-hosted via NIM) | Self-hosted NIM keeps data on your own infrastructure. No data leaves the building. Model performance trades off slightly against the hosted APIs. |
| Real-time customer-facing interaction | GPT-4o Mini or Claude 3.5 Haiku | Consistent sub-second response times via their hosted APIs. Latency on NIM hosted tier is competitive but varies by plan. |
| Code generation and AI-assisted tooling | Claude 3.5 Sonnet | HumanEval 93.7% is the highest in our harness. For Sandbox builds that include generated code or workflow automation, the accuracy gap is measurable. |
These are observations from running our evaluation harness across Sandbox builds. Task-specific results vary. The right model for your build depends on your actual test cases, not this table.
What this means if you are commissioning an AI build
You do not need to know which model is right for your use case before you talk to us. That is exactly what the evaluation harness produces. The model selection is a build output, not a build input.
What is worth understanding as a buyer is that model selection is a real decision with real consequences for performance and cost, and the right answer varies by task. A vendor who recommends the same model for every build is optimising for their workflow, not your outcome. The harness exists because we wanted to be able to tell a client, with evidence, why we made the selection we made.
If your build involves high call volumes, specific data handling requirements, or tasks where cost compounds quickly, the question of which model to run is worth asking explicitly. The evaluation adds time at the start of a build. It saves time later when you are not re-platforming because the initial choice turned out to be wrong for your volume or your industry.
The NVIDIA angle for Australian SMBs
NVIDIA's name in the AI conversation is usually about hardware: the H100 chips that power every major AI lab. The NIM inference API is a different product, targeted at developers who want to run open-weight models without operating GPU clusters. For Australian businesses, it is one of the more interesting additions to the model landscape in the past 12 months, specifically because it makes open-weight models accessible at the same friction level as the closed-model APIs.
The practical implication: the choice of AI model for your business is no longer just "which version of GPT" or "Claude versus GPT." Open-weight models running on NVIDIA infrastructure are a credible option for a range of commercial tasks, with a different cost and data-handling profile, and a path to self-hosting that closed models cannot offer.
We run NIM in the harness because the evaluation should include all credible options, not just the ones with the biggest marketing budgets. Whether it wins for your specific build depends on the task. That is what the harness is for.
If you want to see how it plays out for your use case, brief the Sandbox team or read through what we have already built at Gibson Sandbox.
Frequently asked questions
What is NVIDIA NIM?
NVIDIA NIM (NVIDIA Inference Microservices) is NVIDIA's hosted API for running open-weight AI models on NVIDIA GPU infrastructure. It gives developers access to models like Meta Llama 3, Mistral, and NVIDIA's own Nemotron family via a standard API call, similar to calling OpenAI or Anthropic, but for open-weight models with enterprise SLAs and NVIDIA-grade inference speed.
What is an LLM evaluation harness?
An LLM evaluation harness is a test framework that runs the same task through multiple AI models and scores the outputs against defined criteria: accuracy, latency, cost per call, and fit for the specific use case. Before Gibson commits to a model for any Sandbox build, the task goes through the harness so the selection is based on measured results, not vendor marketing.
Why does model selection matter for a small business?
The performance difference between AI models on a specific task can be significant, and so can the cost. A model that charges per token and handles 50,000 calls a month at a higher rate than an open-weight alternative may cost ten times as much for the same quality outcome. The right model for drafting customer emails is not necessarily the right model for classifying inbound leads. Model selection is a build decision, not a brand preference.
Does Gibson use NVIDIA NIM for client builds?
Gibson's Sandbox evaluation harness includes NVIDIA NIM alongside Claude (via Google Cloud Vertex AI) and GPT (via OpenAI API). Which model ships in a finished build depends on the task, the data handling requirements, and the measured results for that specific use case. We do not have a house model; we have a testing process.
What open-weight models are available via NVIDIA NIM?
The NVIDIA NIM catalogue includes Meta Llama 3 (8B and 70B variants), Mistral and Mixtral, NVIDIA Nemotron, Microsoft Phi, and others. The catalogue expands regularly. For AU SMBs, the practical interest is usually Llama 3 70B (strong general performance at competitive cost) and Mistral for structured extraction tasks.
How does this affect data handling for Australian businesses?
Open-weight models via NVIDIA NIM can be self-hosted, which matters for businesses with strict data residency requirements. A locally-deployed NIM container keeps data on your own infrastructure entirely. Hosted NIM runs on NVIDIA's cloud. Either way, you own the model weights: a different posture to sending data to a closed-model API. For industries with specific obligations (health, legal, government) this distinction is worth raising with your privacy or compliance adviser before choosing a model.



