kovan labs · april 2026
the models eating the world quietly
sanju
thisux · unitedby.ai
how many of you have ChatGPT open right now?
be honest. raise your hand.
now, would you put your company's private data into that same tab?
(fewer hands. usually none.)
"
that gap is what this talk is about.
90 minutes. let's close it.
the trillion-parameter hangover
for two years everyone believed this:
bigger
= better
70B beats 7B. 700B beats 70B. bigger is always best.
that
broke.
here is the proof
Phi-4
3.8B params
Microsoft's tiny model
beats
on math and science benchmarks
GPT-4o
OpenAI's giant
100x bigger model
so what is a small language model?
"can a normal human run this on a machine they already own?"
if yes, it is an SLM.
four reasons SLMs are winning right now
cheap
cloud AI costs pile up fast. local inference costs nothing per query.
private
your data never leaves your machine. not ever.
fast
local gives you 50ms. cloud adds network round-trip on top.
yours
tune it, ship it, own it. the model belongs to you.
"
a sharp junior trained on your exact problem beats a genius who doesn't know your domain.
that junior is an SLM.
the SLM landscape in 2026
four model families worth knowing
Phi-4
3.8B
reasoning king
Gemma 3
2B to 27B
multilingual
Qwen 3
0.6B to 235B
dark horse
Llama 3.2
1B and 3B
best ecosystem
the "train on better data" model
superpower
reasoning, math, code. thinks before it answers.
how it was trained
curated synthetic data, not raw internet scrapes.
use it when
you need the model to actually think through a hard problem.
boring in a good way. it just works.
multilingual
works well across many languages. safety features built in.
Gemma 3n
designed to run on 4GB RAM. built for phones.
use it when
building multilingual products or selling to enterprise.
the dark horse nobody expected
Qwen 3.5 at 0.8B
under one billion params. runs on a phone. still useful for real tasks. wild.
use it when
you need the cheapest inference or strong Asian language support.
the one with the biggest ecosystem
best tool calling
strongest at deciding when and how to call APIs. great for agents.
edge-tuned
1B and 3B built specifically for phones and laptops.
community
most fine-tuned variants and the widest framework support.
also worth knowing
SmolLM3 HuggingFace · 3B
fully open recipe. training data is published. you can see exactly what went into it.
Ministral 3 Mistral · 3B
tiny model with vision built in. runs on a single GPU.
gpt-oss-20b OpenAI
yes, OpenAI shipped an open-weight model. fine-tunable on one GPU.
Granite 4 IBM
enterprise-grade, boring, reliable. perfect for banks and governments.
"
two years ago, running AI on your laptop was a party trick.
today it is production.
the ceiling has not stopped moving.
ask this before anything else
you have four options. try them in this order.
prompt engineering
write better prompts. costs hours. always try this first.
RAG
connect the model to your docs. great for answering questions from your knowledge base.
fine-tuning ← we are here today
permanently change how the model behaves. a weekend and about $10.
train from scratch
just don't. seriously.
fine-tune when you need any of these:
consistent style
"always sound like our brand" needs about 500 to 1000 examples.
exact output format
"always return this exact JSON schema" needs about 500 examples.
domain expertise
"know our product inside out" needs 1000 to 5000 examples.
replace a big model
a fine-tuned 3B often replaces a prompted 70B. huge cost saving.
three times fine-tuning is the wrong choice:
"I want the model to know what happened yesterday"
that is RAG, not fine-tuning. knowledge goes into retrieval.
"my prompt is 2000 tokens, I want it shorter"
try prompt engineering first. you probably haven't exhausted it.
"I have 50 examples and a dream"
not enough data. use few-shot prompting instead.
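few-shot prompting is just pasting your best examples into the prompt. a minimal sketch (task and examples are illustrative, not from a real dataset):

```python
# few-shot prompting: with ~50 examples, put a handful in the prompt
# instead of fine-tuning. task and examples here are illustrative.
examples = [
    ("my refund never arrived", '{"intent": "refund", "urgency": "medium"}'),
    ("site is down, losing sales NOW", '{"intent": "outage", "urgency": "high"}'),
]

prompt = "Classify each support email as JSON.\n\n"
for email, label in examples:
    prompt += f"Email: {email}\nJSON: {label}\n\n"
prompt += "Email: where is my invoice?\nJSON:"

print(prompt)
```

send that string to any model; no weights change, so you can iterate in minutes.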
"
500 great examples beat 5,000 bad ones.
every. single. time.
dataset quality is the whole game.
this is the part that makes everything else click.
a bit is the smallest piece of information a computer can store.
just 0 or 1.
off or on. that is it.
inside every AI model there are billions of numbers.
0.83741
each one is called a weight.
a 7B model has 7 billion of these numbers.
how does a computer store that number? as bits.
more bits = more precise number
more bits = more memory
the whole story of model compression is about trading one for the other.
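you can watch the trade happen in plain python: pack pi into fewer bits and read it back.

```python
import math
import struct

# store pi in 64, 32, and 16 bits, then read it back:
# fewer bits, less precision
for fmt, name in [("d", "fp64"), ("f", "fp32"), ("e", "fp16")]:
    stored = struct.unpack(fmt, struct.pack(fmt, math.pi))[0]
    print(f"{name}: {stored}")
```

fp16 gives you pi as 3.140625. for most model weights, that kind of loss barely matters.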
32 bits per number.
like writing pi as 3.14159265. very precise.
4 bytes
per number
28 GB
for a 7B model
16 bits per number.
like writing pi as 3.14. close enough for most tasks.
2 bytes
per number
14 GB
half of FP32
8 bits per number.
like writing pi as 3. rough but surprisingly still works.
1 byte
per number
7 GB
for a 7B model
4 bits. only 16 possible values.
this is what the QLoRA trick uses under the hood.
0.5 bytes
per number
3.5 GB
for a 7B model
1 bit. just -1 or +1.
two possible values. PrismML made this work at 8B scale.
0.125 bytes
per number
875 MB
for a 7B model
one 7B model. different precisions. wildly different sizes.
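the size math is just multiplication. a quick sketch reproducing the numbers above:

```python
# memory for the same 7 billion weights at each precision
params = 7_000_000_000
bytes_per_weight = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5, "1-bit": 0.125}

for name, b in bytes_per_weight.items():
    print(f"{name:>5}: {params * b / 1e9:.2f} GB")
```

same model, 28 GB down to under 1 GB, purely from how many bits each number gets.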
"
you do not need perfect. you need good enough.
INT4 is good enough for most real tasks.
precision loss is real. impact is usually tiny.
no code. just the mental model.
training a model means nudging billions of numbers until predictions get better.
fine-tuning is the same thing.
but you start from a model that is already good.
show it your examples. the weights shift toward your task.
the problem: fully updating all 7B weights is expensive.
80GB
GPU memory needed
8×
A100 GPUs required
$$$
in cloud costs
then came LoRA.
a Microsoft team noticed something in 2021:
when you fine-tune, the weight updates are low-rank: a few directions explain almost all of the change.
you do not need to update 7 billion weights.
just two small matrices.
the LoRA trick, visually
W (frozen) + A × B (trained)
7M
params trained instead of 7 billion
95 to 100%
quality of a full fine-tune
bonus: the trained part is tiny.
one base model.
hundreds of 20MB adapters.
swap adapters at runtime. one model, infinite personalities.
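the bookkeeping behind "7M instead of 7 billion" is two small matrices per layer. a sketch with illustrative sizes:

```python
# LoRA parameter count for one weight matrix
# (layer size and rank are illustrative, not from a specific model)
d, k, r = 4096, 4096, 8       # hypothetical 4096x4096 layer, rank-8 adapter

full_update = d * k           # a full fine-tune touches every weight
lora_update = d * r + r * k   # LoRA trains only thin A (d x r) and B (r x k)

print(full_update, lora_update, full_update // lora_update)
```

at rank 8 that is 256x fewer trained parameters for this one matrix, which is why the saved adapter is megabytes instead of gigabytes.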
LoRA made fine-tuning cheap.
QLoRA made it basically free.
quantize the base model to INT4, then attach a LoRA adapter on top.
before QLoRA
70B fine-tune
8 A100s. thousands of dollars. days of compute.
with QLoRA
70B fine-tune
one consumer GPU. $20 electricity. a weekend.
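a back-of-envelope for where those memory numbers come from. the assumptions are mine, not from the slides: fp16 weights and gradients, fp32 adam states at 8 bytes per param, and LoRA adapters at roughly 0.1% of base params.

```python
# rough GPU memory budget: full fine-tune vs QLoRA, 7B model
# assumptions: fp16 weights + grads, fp32 adam states, ~0.1% adapter params
params = 7_000_000_000

# full fine-tune: 2 (weights) + 2 (grads) + 8 (optimizer) bytes per param
full_gb = params * (2 + 2 + 8) / 1e9

# QLoRA: 4-bit frozen base; training costs paid only on the tiny adapter
adapter = params // 1000
qlora_gb = (params * 0.5 + adapter * (2 + 2 + 8)) / 1e9

print(f"full: {full_gb:.0f} GB   qlora: {qlora_gb:.1f} GB")
```

roughly 84 GB versus under 4 GB. that is the whole reason a consumer GPU suddenly works.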
the library making this possible
plugs into PyTorch and HuggingFace. almost zero code changes needed.
NF4
smart 4-bit format designed for how neural weights are actually distributed.
double quantization
quantizes the quantization constants too. saves another 0.37 bits per param.
paged optimizers
memory spikes page to CPU RAM, the way an OS pages to disk. no more out-of-memory crashes.
where you actually do this
the GitHub of AI
the hub
every model, every dataset. filter by size, task, and license.
spaces
deploy a Gradio demo in 5 minutes. free tier available.
inference endpoints
your fine-tuned model on a dedicated GPU with a real API.
AutoTrain
no-code fine-tuning. upload dataset, pick model, click run.
real GPU rental prices in 2026, per hour
Vast.ai $0.29 · RunPod $0.34 · AWS $1.26
Vast.ai $0.75 · RunPod $1.39 · AWS $3.67
Vast.ai $1.87 · RunPod $1.99 · AWS $6.88
AWS is 3 to 5 times pricier. avoid it for experiments.
what it actually costs to fine-tune
3B model · 1000 examples · QLoRA
$0.68
less than a coffee
7B model · 5000 examples · QLoRA
$4.50
a coffee and a snack
70B model · 10000 examples · QLoRA
$28
a nice dinner
7B model · full fine-tune · all weights
$56
a pizza party
"
you can fine-tune a frontier-class model for less than a pizza.
this was not true 18 months ago.
the most important cost fact in this entire talk.
the cheat code
Unsloth rewrites the fine-tuning kernels at the lowest level. same result. 2x faster. 70% less memory. free.
honest recommendation if you are just starting:
open an Unsloth notebook for your model, swap in your dataset, hit run all.
that is the whole workflow.
stretch, grab water, come back.
coming up: something that feels like science fiction.
Ternary Bonsai · PrismML · April 2026
we went from 32 bits, to 16, to 8, to 4.
what if we went to 1?
that sounds insane. until a few months ago, it was insane.
every single weight is -1 or +1. nothing in between.
8B
parameters
1.15 GB
total memory
9 to 10x
smaller than FP16
and the part that broke everyone's brains: it worked. shockingly close to a normal 8B model.
what if each weight had THREE possible values?
-1
0
+1
that is 1.58 bits.
log2 of 3 is about 1.58, so each three-valued weight costs 1.58 bits to store.
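the 1.58 is not a marketing number, it falls straight out of information theory:

```python
import math

# three possible values {-1, 0, +1} carry log2(3) bits of information each
print(round(math.log2(3), 2))  # 1.58
```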
benchmark score
75.5
vs 70.5 for 1-bit. a 5-point jump.
memory needed
1.75 GB
vs 16.38 GB for Qwen3 8B at the same score
same model class. same benchmarks.
nine times less memory.
on an M4 Pro MacBook:
82
tokens per second
that is 5 times faster than a normal FP16 8B model.
iPhone 17 Pro Max: 27 tokens per second. locally. on battery.
why it is so fast:
ternary math is basically just addition.
-1
subtract
0
do nothing
+1
add
no floating point math. no multiplications. future chips will be built specifically for this.
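the "-1 subtract, 0 skip, +1 add" rule makes a dot product multiplication-free. a toy sketch:

```python
# a ternary dot product needs no multiplications:
# -1 means subtract, 0 means skip, +1 means add
def ternary_dot(weights, activations):
    total = 0.0
    for w, a in zip(weights, activations):
        if w == 1:
            total += a       # +1 -> add
        elif w == -1:
            total -= a       # -1 -> subtract
        # 0 -> do nothing at all
    return total

print(ternary_dot([1, 0, -1, 1], [0.5, 9.9, 2.0, 1.0]))  # 0.5 - 2.0 + 1.0 = -0.5
```

real kernels pack weights into bit patterns instead of looping, but the core operation stays addition.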
"
the phone in your pocket will run GPT-4-level AI locally in 18 months.
no cloud. no battery drain. no data sharing.
PrismML shipped it two days ago. it runs today.
available today · Apache 2.0 · use commercially
1.7B
4B
8B
hf.co/collections/prism-ml/ternary-bonsai
runs on Apple devices via MLX. WebGPU demo works in-browser right now.
from problem to deployed model
define the problem narrowly
"support email into JSON with intent and urgency" not "make a support bot"
get 500 hand-curated examples
quality over quantity. always.
pick your base model
QLoRA on Unsloth, rent an A100
open notebook, swap dataset, run all, walk away
evaluate: task metric plus regression check
deploy: serverless, self-hosted, or on-device
iterate. this is where most teams quit too early.
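step 2 usually means a jsonl file, one example per line. a sketch of one row in the common "messages" chat format that most fine-tuning tools (including HF TRL and Unsloth) accept; field contents are illustrative:

```python
# one row of a hypothetical support-email dataset in chat jsonl format
import json

row = {
    "messages": [
        {"role": "system",
         "content": "Classify the support email as JSON with intent and urgency."},
        {"role": "user",
         "content": "My invoice is wrong and I need it fixed today."},
        {"role": "assistant",
         "content": '{"intent": "billing", "urgency": "high"}'},
    ]
}

# append one line like this per example to train.jsonl
print(json.dumps(row))
```

500 hand-checked rows in this shape is the entire dataset-prep step.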
step 3: pick your base model
need reasoning?
Phi-4 or Qwen 3
multilingual?
Gemma 3 or Qwen 3
tool calling?
Llama 3.2
smallest possible?
Qwen 3.5 at 0.8B
bleeding edge?
Ternary Bonsai
two years ago
AI meant sending your data to OpenAI and hoping.
today
your model. your hardware. your data. your cost.
the biggest shift in applied AI since transformers. most people have not noticed yet.
watch these three things in the next 12 months
sub-2-bit models become the norm
what PrismML is doing will spread to every major lab.
new chips built for ternary math
Apple is already hinting. ternary math is addition. orders of magnitude faster.
task-specific SLMs replace cloud LLM calls
Gartner says 3x by 2027. I think they are being conservative.
"
you do not need a frontier model.
you need yours.
take these with you
Ollama
run any SLM locally in one command
ollama.com
Hugging Face
find every model and dataset
huggingface.co
Unsloth
fastest way to fine-tune
unsloth.ai
Ternary Bonsai
the bleeding edge, today
hf.co/prism-ml/ternary-bonsai
ask anything. nothing is a dumb question.
sanju
thisux.com · unitedby.ai · sanju.sh