
kovan labs  ·  april 2026

small language
models

the models eating the world quietly

sanju

thisux  ·  unitedby.ai

how many of you have ChatGPT open right now?

be honest. raise your hand.

now, would you put your company's private data into that same tab?

(fewer hands. usually none.)

"

that gap is what this talk is about.

90 minutes. let's close it.

section 01

why small?
why now?

the trillion-parameter hangover

for two years everyone believed this:

bigger
= better

70B beats 7B. 700B beats 70B. bigger is always best.

that
broke.

here is the proof

Phi-4  ·  3.8B params  ·  Microsoft's tiny model

beats

GPT-4o  ·  OpenAI's giant  ·  a 100x bigger model

on math and science benchmarks

so what is a small language model?

"can a normal human run this on a machine they already own?"

if yes, it is an SLM.

four reasons SLMs are winning right now

cheap

cloud AI costs pile up fast. local runs free per query.

private

your data never leaves your machine. not ever.

fast

local gives you 50ms. cloud adds network round-trip on top.

yours

tune it, ship it, own it. the model belongs to you.

"

a sharp junior trained on your exact problem beats a genius who doesn't know your domain.

that junior is an SLM.

section 02

who is actually
good right now?

the SLM landscape in 2026

four model families worth knowing

Phi-4

3.8B

reasoning king

Gemma 3

2B to 27B

multilingual

Qwen 3

0.5B to 235B

dark horse

Llama 3.3

1B to 8B

best ecosystem

Microsoft  ·  3.8B parameters

Phi-4

the "train on better data" model

superpower

reasoning, math, code. thinks before it answers.

how it was trained

curated synthetic data, not raw internet scrapes.

use it when

you need the model to actually think through a hard problem.

Google  ·  2B to 27B

Gemma 3

boring in a good way. it just works.

multilingual

works well across many languages. safety features built in.

Gemma 3n

designed to run on 4GB RAM. built for phones.

use it when

building multilingual products or selling to enterprise.

Alibaba  ·  0.5B to 235B  ·  Apache 2.0

Qwen 3

the dark horse nobody expected

Qwen 3.5 at 0.8B

under one billion params. runs on a phone. still useful for real tasks. wild.

use it when

you need the cheapest inference or strong Asian language support.

Meta  ·  1B, 3B, 8B

Llama 3.3

the one with the biggest ecosystem

best tool calling

strongest at deciding when and how to call APIs. great for agents.

edge-tuned

1B and 3B built specifically for phones and laptops.

community

most fine-tuned variants and the widest framework support.

also worth knowing

SmolLM3  Hugging Face  ·  3B

fully open recipe. training data is published. you can see exactly what went into it.

Ministral 3  Mistral  ·  3B

tiny model with vision built in. runs on a single GPU.

gpt-oss-20b  OpenAI

yes, OpenAI shipped an open-weight model. fine-tunable on one GPU.

Granite 4  IBM

enterprise-grade, boring, reliable. perfect for banks and governments.

"

two years ago, running AI on your laptop was a party trick.

today it is production.

the ceiling has not stopped moving.

section 03

do you actually
need to fine-tune?

ask this before anything else

you have four options. try them in this order.

1

prompt engineering

write better prompts. costs hours. always try this first.

2

RAG

connect the model to your docs. great for answering questions from your knowledge base.

3

fine-tuning  ← we are here today

permanently change how the model behaves. a weekend and about $10.

4

train from scratch

just don't. seriously.

fine-tune when you need any of these:

consistent style

"always sound like our brand" needs about 500 to 1000 examples.

exact output format

"always return this exact JSON schema" needs about 500 examples.

domain expertise

"know our product inside out" needs 1000 to 5000 examples.

replace a big model

a fine-tuned 3B often replaces a prompted 70B. huge cost saving.
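what a training example actually looks like. the chat-message JSONL below is a common convention, not a requirement — field names vary by framework, and the email text is made up:

```python
import json

# one hypothetical training example for the "exact JSON schema" use case:
# a support email in, a fixed intent/urgency JSON out.
example = {
    "messages": [
        {"role": "system", "content": "classify the support email into JSON."},
        {"role": "user", "content": "my invoice is wrong and I need it fixed before Friday."},
        {"role": "assistant", "content": json.dumps({"intent": "billing", "urgency": "high"})},
    ]
}

# a fine-tuning dataset is just a .jsonl file: one example like this per line
line = json.dumps(example)
```

500 to 1000 lines like that, hand-checked, is the dataset.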

three times fine-tuning is the wrong choice:

"I want the model to know what happened yesterday"

that is RAG, not fine-tuning. knowledge goes into retrieval.

"my prompt is 2000 tokens, I want it shorter"

try prompt engineering first. you probably haven't exhausted it.

"I have 50 examples and a dream"

not enough data. use few-shot prompting instead.

"

500 great examples beat 5,000 bad ones.

every. single. time.

dataset quality is the whole game.

section 04

let's talk
about bits.

this is the part that makes everything else click.

a bit is the smallest piece of information a computer can store.

1

just 0 or 1.

off or on. that is it.

inside every AI model there are billions of numbers.

0.83741

each one is called a weight.

a 7B model has 7 billion of these numbers.

how does a computer store that number? as bits.

more bits = more precise number

more bits = more memory

the whole story of model compression is about trading one for the other.

FP32

32 bits per number.

like writing pi as 3.14159265. very precise.

4 bytes

per number

28 GB

for a 7B model

FP16

16 bits per number.

like writing pi as 3.14. close enough for most tasks.

2 bytes

per number

14 GB

half of FP32

INT8

8 bits per number.

like writing pi as 3. rough but surprisingly still works.

1 byte

per number

7 GB

for a 7B model

INT4

4 bits. only 16 possible values.

this is what the QLoRA trick uses under the hood.

0.5 bytes

per number

3.5 GB

for a 7B model

1-bit

1 bit. just -1 or +1.

two possible values. PrismML made this work at 8B scale.

0.125 bytes

per number

875 MB

for a 7B model

one 7B model. different precisions. wildly different sizes.

FP32    32 bits    28 GB
FP16    16 bits    14 GB
INT8     8 bits     7 GB
INT4     4 bits     3.5 GB
1-bit    1 bit      875 MB 🤯  (Bonsai fits 8B in 1.15 GB)
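the table above is just multiplication. a quick sanity check, using decimal GB and ignoring the small overhead real checkpoints add:

```python
# memory = params × bytes-per-weight, in decimal GB (1 GB = 1e9 bytes)
def model_size_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("1-bit", 1)]:
    print(f"{name:>5}: {model_size_gb(7e9, bits):.3f} GB")  # 28, 14, 7, 3.5, 0.875
```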

"

you do not need perfect. you need good enough.

INT4 is good enough for most real tasks.

precision loss is real. impact is usually tiny.

section 05

how fine-tuning
actually works

no code. just the mental model.

training a model means nudging billions of numbers until predictions get better.

fine-tuning is the same thing.

but you start from a model that is already good.

show it your examples. the weights shift toward your task.

the problem: fully updating all 7B weights is expensive.

80GB

GPU memory needed

A100 GPUs required

$$$

in cloud costs

then came LoRA.

a Microsoft team noticed something in 2021:

when you fine-tune, most weight changes look like a few repeating patterns.

you do not need to update 7 billion weights.

just two small matrices.

the LoRA trick, visually

W  ·  frozen  ·  all 7B base weights

+

A × B  ·  trained  ·  two small low-rank matrices

7M

params trained instead of 7 billion

95 to 100%

quality of a full fine-tune
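the picture above, sketched in numpy. the hidden size and rank are illustrative, and the scaling factor real LoRA applies (alpha/r) is left out:

```python
import numpy as np

d, r = 4096, 8                      # hidden size, LoRA rank (illustrative)
W = np.random.randn(d, d)           # frozen pretrained weight
A = np.random.randn(d, r) * 0.01    # trained: d×r
B = np.zeros((r, d))                # trained: r×d, zero-init so W is unchanged at step 0

W_effective = W + A @ B             # what the model actually uses

full = W.size                       # params a full fine-tune would touch
lora = A.size + B.size              # params LoRA actually trains
print(f"full: {full:,}  lora: {lora:,}  ratio: {full / lora:.0f}x fewer")
```

one layer, 256x fewer trained params. across a whole 7B model the ratio lands around the 7M figure above.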

bonus: the trained part is tiny.

one base model.

hundreds of 20MB adapters.

customer support code assistant legal drafting summarization your use case

swap adapters at runtime. one model, infinite personalities.

LoRA made fine-tuning cheap.

QLoRA made it basically free.

quantize the base model to INT4, then attach a LoRA adapter on top.

before QLoRA

70B fine-tune

8 A100s. thousands of dollars. days of compute.

with QLoRA

70B fine-tune

one consumer GPU. $20 electricity. a weekend.

the library making this possible

bitsandbytes

plugs into PyTorch and Hugging Face. almost zero code changes needed.

NF4

smart 4-bit format designed for how neural weights are actually distributed.

double quantization

quantizes the quantization constants too. saves another 0.37 bits per param.

paged optimizers

memory spikes page to CPU like an OS. no more out-of-memory crashes.
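a toy sketch of the NF4 idea — 16 levels packed where normally-distributed weights cluster, plus one scale per block. the real NF4 table is a fixed, slightly different set of constants; this just shows the shape of the trick:

```python
import numpy as np
from statistics import NormalDist

# 16 levels at quantiles of a normal distribution: dense near 0, sparse
# in the tails, matching how trained weights are actually distributed.
nd = NormalDist()
levels = np.array([nd.inv_cdf((i + 0.5) / 16) for i in range(16)])
levels /= np.abs(levels).max()                   # normalize to [-1, 1]

def quantize_block(w):
    scale = np.abs(w).max()                      # one scale per block
    idx = np.abs(w / scale - levels[:, None]).argmin(axis=0)  # nearest level -> 4 bits
    return idx.astype(np.uint8), scale

def dequantize(idx, scale):
    return levels[idx] * scale

w = np.random.randn(64) * 0.02                   # one block of weights
idx, scale = quantize_block(w)
err = np.abs(dequantize(idx, scale) - w).mean()
print(f"stored 4 bits/weight, mean abs error {err:.5f}")
```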

section 06

the toolchain

where you actually do this

the GitHub of AI

Hugging Face

the hub

every model, every dataset. filter by size, task, and license.

spaces

deploy a Gradio demo in 5 minutes. free tier available.

inference endpoints

your fine-tuned model on a dedicated GPU with a real API.

AutoTrain

no-code fine-tuning. upload dataset, pick model, click run.

real GPU rental prices in 2026, per hour

RTX 4090 24GB    Vast.ai $0.29    RunPod $0.34    AWS $1.26

A100 40GB        Vast.ai $0.75    RunPod $1.39    AWS $3.67

H100 80GB        Vast.ai $1.87    RunPod $1.99    AWS $6.88

AWS is 3 to 5 times pricier. avoid it for experiments.

what it actually costs to fine-tune

3B model · 1000 examples · QLoRA

$0.68

less than a coffee

7B model · 5000 examples · QLoRA

$4.50

a coffee and a snack

70B model · 10000 examples · QLoRA

$28

a nice dinner

7B model · full fine-tune · all weights

$56

a pizza party
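those totals are just GPU-hours × hourly rate. the hour counts below are rough assumptions that reproduce the first two numbers, not measurements — your dataset, sequence length, and epoch count will move them a lot:

```python
# cost = hours × rate, nothing fancier
runs = [
    ("3B · 1k examples", 2.0, 0.34),   # ~2h on an RTX 4090 (RunPod rate)
    ("7B · 5k examples", 6.0, 0.75),   # ~6h on an A100 (Vast.ai rate)
]
for name, hours, rate in runs:
    print(f"{name}: ${hours * rate:.2f}")
```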

"

you can fine-tune a frontier-class model for less than a pizza.

this was not true 18 months ago.

the most important cost fact in this entire talk.

the cheat code

Unsloth

rewrites the fine-tuning stack at the lowest level. same result. 2x faster. 70% less memory. free.

honest recommendation if you are just starting:

open an Unsloth notebook for your model, swap in your dataset, hit run all.

that is the whole workflow.

5-minute break

stretch, grab water, come back.

coming up: something that feels like science fiction.

section 07

the bleeding edge

Ternary Bonsai  ·  PrismML  ·  April 2026

we went from 32 bits, to 16, to 8, to 4.

what if we went to 1?

that sounds insane. until a few months ago, it was insane.

PrismML  ·  March 2026

1-bit Bonsai

every single weight is -1 or +1. nothing in between.

8B

parameters

1.15 GB

total memory

14x

smaller than FP16

and the part that broke everyone's brains: it worked. shockingly close to a normal 8B model.

PrismML  ·  April 2026  ·  2 days ago

what if each weight had THREE possible values?

-1

0

+1

that is 1.58 bits.

log2 of 3 equals 1.58. three values need 1.58 bits to represent.

Ternary Bonsai 8B

benchmark score

75.5

vs 70.5 for 1-bit. a 5-point jump.

memory needed

1.75 GB

vs 16.38 GB for Qwen3 8B at the same score

same model class. same benchmarks.

nine times less memory.
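where 1.75 GB roughly comes from. the gap between the ideal size and the shipped size is assumed here to be packing overhead plus a few layers (embeddings and the like) kept at higher precision:

```python
import math

bits = math.log2(3)               # info needed to store 3 values: ~1.585 bits
ideal_gb = 8e9 * bits / 8 / 1e9   # 8B params at 1.58 bits, decimal GB
print(f"{bits:.2f} bits/weight -> {ideal_gb:.2f} GB ideal")
```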

on an M4 Pro MacBook:

82

tokens per second

that is 5 times faster than a normal FP16 8B model.

iPhone 17 Pro Max: 27 tokens per second. locally. on battery.

why it is so fast:

ternary math is basically just addition.

-1

subtract

0

do nothing

+1

add

no floating point math. no multiplications. future chips will be built specifically for this.

"

the phone in your pocket will run GPT-4-level AI locally in 18 months.

no cloud. no battery drain. no data sharing.

PrismML shipped it two days ago. it runs today.

available today  ·  Apache 2.0  ·  use commercially

1.7B

4B

8B

hf.co/collections/prism-ml/ternary-bonsai

runs on Apple devices via MLX. WebGPU demo works in-browser right now.

section 08

the real workflow

from problem to deployed model

1

define the problem narrowly

"support email into JSON with intent and urgency" not "make a support bot"

2

get 500 hand-curated examples

quality over quantity. always.

3

pick your base model

4

QLoRA on Unsloth, rent an A100

open notebook, swap dataset, run all, walk away

5

evaluate: task metric plus regression check

6

deploy: serverless, self-hosted, or on-device

7

iterate. this is where most teams quit too early.

step 3: pick your base model

need reasoning?

Phi-4 or Qwen 3

multilingual?

Gemma 3 or Qwen 3

tool calling?

Llama 3.3

smallest possible?

Qwen 3.5 at 0.8B

bleeding edge?

Ternary Bonsai

two years ago

AI meant sending your data to OpenAI and hoping.

today

your model. your hardware. your data. your cost.

the biggest shift in applied AI since transformers. most people have not noticed yet.

watch these three things in the next 12 months

sub-2-bit models become the norm

what PrismML is doing will spread to every major lab.

new chips built for ternary math

Apple is already hinting. ternary math is addition. orders of magnitude faster.

task-specific SLMs replace cloud LLM calls

Gartner says 3x by 2027. I think they are being conservative.

"

you do not need a frontier model.

you need yours.

take these with you

Ollama

run any SLM locally in one command

ollama.com

Hugging Face

find every model and dataset

huggingface.co

Unsloth

fastest way to fine-tune

unsloth.ai

Ternary Bonsai

the bleeding edge, today

hf.co/prism-ml/ternary-bonsai

Q&A

ask anything. nothing is a dumb question.

sanju

thisux.com  ·  unitedby.ai  ·  sanju.sh

kovan labs  ·  april 2026