How to do simple AI text analysis? LLM, fine-tuning, BERT, or embeddings?

Decades ago, natural-language processing was mostly about specialized models for relatively simple tasks. They were small, cheap to run, and you had to know what you were doing. Today we have large language models that can follow instructions and often do the same thing without training your own model.

Let's use sentiment as the lab task. I take a short social-media post and want to put it into one of three classes: positive, negative, neutral. This is not meant as an academic benchmark. It is the practical question: what should I use, how good will it be, how fast will it be, and how much will it cost?

Introduction to the comparison

For the same classification task I have four practical paths. The next cards walk through each one, but here is a quick preview of what I will be comparing.

I instruct the model to return one class. I try almost-zero-shot and few-shot with 100 or 1000 examples. This is the simplest variant, but cost and latency grow with context length.

Best accuracy

fine-tuned GPT-4.1-mini, 80.0%.

Lowest cost

embeddings + logistic regression, about $0.0005 per 1000 samples.

Lowest latency

BERT, about 0.09 seconds per sample.

Easiest changes

plain LLM, because I change the prompt, not the model.

The accuracy may not look amazing at first glance. Part of the reason is that the boundary between positive and neutral sentiment is often fuzzy. Some posts are simply labeled differently by different people.

Classification quality

Group	Experiment	Model / approach	Strategy	Accuracy	Run cost	Time	Samples/s
LLM	LLM_NANO_ZEROSHOT	GPT-4.1-nano	Zero-shot	67.7%	$0.13	3306.7 s	1.56
LLM	LLM_NANO_FEWSHOT_100	GPT-4.1-nano	Few-shot 100	68.6%	$1.45 (~$0.87 cached)	4215.4 s	1.23
LLM	LLM_NANO_FEWSHOT_1000	GPT-4.1-nano	Few-shot 1000	68.4%	$13.39 (~$8.03 cached)	9160.4 s	0.56
LLM	LLM_MINI_ZEROSHOT	GPT-4.1-mini	Zero-shot	69.3%	$0.55	3919.9 s	1.38
LLM	LLM_MINI_FEWSHOT_100	GPT-4.1-mini	Few-shot 100	69.4%	$5.88 (~$3.53 cached)	5565.6 s	0.94
LLM	LLM_MINI_FEWSHOT_1000	GPT-4.1-mini	Few-shot 1000	68.4%	$55.53 (~$33.32 cached)	16084.1 s	0.33
Fine-tuned LLM	LLM_FT_NANO_ZEROSHOT	GPT-4.1-nano FT	Zero-shot	79.0%	$0.26	1907.0 s	2.71
Fine-tuned LLM	LLM_FT_NANO_FEWSHOT_100	GPT-4.1-nano FT	Few-shot 100	78.4%	$2.90 (~$1.74 cached)	1940.0 s	2.66
Fine-tuned LLM	LLM_FT_MINI_ZEROSHOT	GPT-4.1-mini FT	Zero-shot	80.0%	$1.04	1564.2 s	3.30
Fine-tuned LLM	LLM_FT_MINI_FEWSHOT_100	GPT-4.1-mini FT	Few-shot 100	79.3%	$11.58 (~$6.95 cached)	1918.4 s	2.69
Embeddings + ML	LR_100	OpenAI Embeddings + LR	100 samples	60.9%	$0.0025	963.2 s	5.40
Embeddings + ML	LR_1000	OpenAI Embeddings + LR	1000 samples	67.1%	$0.0025	951.7 s	5.47
Embeddings + ML	LR_ALL	OpenAI Embeddings + LR	full dataset	73.2%	$0.0025	932.1 s	5.58
Transformer encoder	BERT_SENTIMENT	BERT-base FT	full dataset	74.5%	$0.028	465.4 s	11.19

Experiment	Completed	Failed	Total tokens	Input tokens	Output tokens
LLM_NANO_ZEROSHOT	5166 (99.3%)	39 (0.7%)	1,289,946	1,284,780	5166
LLM_NANO_FEWSHOT_100	5165 (99.2%)	40 (0.8%)	14,470,776	14,465,611	5165
LLM_NANO_FEWSHOT_1000	5166 (99.3%)	39 (0.7%)	133,890,834	133,885,668	5166
LLM_MINI_ZEROSHOT	5076 (97.5%)	129 (2.5%)	1,351,565	1,346,144	5421
LLM_MINI_FEWSHOT_100	5140 (98.8%)	65 (1.2%)	14,688,863	14,683,620	5243
LLM_MINI_FEWSHOT_1000	5088 (97.8%)	117 (2.2%)	138,814,519	138,809,163	5356
LLM_FT_NANO_ZEROSHOT	5166 (99.3%)	39 (0.7%)	1,289,946	1,284,780	5166
LLM_FT_NANO_FEWSHOT_100	5166 (99.3%)	39 (0.7%)	14,473,578	14,468,412	5166
LLM_FT_MINI_ZEROSHOT	5166 (99.3%)	39 (0.7%)	1,289,946	1,284,780	5166
LLM_FT_MINI_FEWSHOT_100	5165 (99.2%)	40 (0.8%)	14,470,685	14,465,520	5165
LR_100	5205 (100%)	0	124,123	124,123	0
LR_1000	5205 (100%)	0	124,123	124,123	0
LR_ALL	5205 (100%)	0	124,123	124,123	0
BERT_SENTIMENT	5205 (100%)	0	N/A	N/A	N/A

Cost per 1000 samples

Category	Experiment	Cost per 1000 samples	Rank
Cheapest	LR_100	$0.0005	1
Cheapest	LR_1000	$0.0005	1
Cheapest	LR_ALL	$0.0005	1
Low	LLM_NANO_ZEROSHOT	$0.025	4
Low	BERT_SENTIMENT	$0.028	5
Low	LLM_FT_NANO_ZEROSHOT	$0.050	6
Low	LLM_MINI_ZEROSHOT	$0.106	7
Low	LLM_FT_MINI_ZEROSHOT	$0.202	8
Medium	LLM_NANO_FEWSHOT_100	$0.280 ($0.168 cached)	9
Medium	LLM_FT_NANO_FEWSHOT_100	$0.562 ($0.337 cached)	10
Medium	LLM_MINI_FEWSHOT_100	$1.130 ($0.678 cached)	11
Medium	LLM_FT_MINI_FEWSHOT_100	$2.242 ($1.345 cached)	12
Expensive	LLM_NANO_FEWSHOT_1000	$2.590 ($1.554 cached)	13
Expensive	LLM_MINI_FEWSHOT_1000	$10.670 ($6.402 cached)	14

Latency

Category	Experiment	Time per sample
Fastest	BERT_SENTIMENT	0.09 s
Fast	LR_ALL	0.18 s
Fast	LR_1000	0.18 s
Fast	LR_100	0.19 s
Medium	LLM_FT_MINI_ZEROSHOT	0.30 s
Medium	LLM_FT_MINI_FEWSHOT_100	0.37 s
Medium	LLM_FT_NANO_ZEROSHOT	0.37 s
Medium	LLM_FT_NANO_FEWSHOT_100	0.38 s
Slow	LLM_NANO_ZEROSHOT	0.64 s
Slow	LLM_MINI_ZEROSHOT	0.72 s
Slow	LLM_NANO_FEWSHOT_100	0.81 s
Slow	LLM_MINI_FEWSHOT_100	1.06 s
Outlier	LLM_NANO_FEWSHOT_1000	1.77 s
Outlier	LLM_MINI_FEWSHOT_1000	3.00 s

Four options in depth

An LLM can follow instructions, so we can simply tell it: classify this text and return only one output token. In my case 0, 1, or 2. If I let the model think and explain at length it might be more accurate, but also many times more expensive. I care about practical operation here.

What we saw with the LLM

Click a card for the detail

QualityMini is a bit better than nano, but not dramatically.

Few-shot with 100 examples helps a little. Few-shot with 1000 examples actually confuses small models and stretches context so much that the result stops making sense.

CostInput tokens dominate.

That is why few-shot 1000 is practically nonsense: dramatically more expensive and worse. Input-token caching helps for few-shot scenarios, but it does not change the rules.

LatencyShort context is faster; nano is faster.

Sequential processing comes out at roughly 0.64 to 1.06 seconds per sample. The 1000-example variant is completely off, three seconds per sample with the mini model.

FlexibilityThis is where LLM wins.

No training, no pipeline, no specialized model. I change the prompt and get different behavior. One endpoint, one library, one runtime model.

Here I again take the mini and nano models and show them 25,000 training examples. Then I use them zero-shot and with few-shot 100 examples.

Fine-tuned LLM in practice

Click a card for the detail

QualityBest result of the whole test.

Fine-tuning helped dramatically. GPT-4.1-mini reached 80.0%. Interestingly, few-shot stops helping and actually hurts. My guess is that 100 randomly picked examples are not representative against the 25,000 used in training.

CostInferencing is not the whole story.

Fine-tuning itself cost $40. In Azure OpenAI Service the fine-tuned model uses the same token prices as the regular version, but hosting is billed at $1.7/h. With OpenAI you do not pay hosting, but tokens are more expensive. Sometimes one is cheaper, sometimes the other.

LatencyMarkedly better in my measurement.

I tested in Azure where you pay hosting. This should not be taken as an official guarantee, but I measured 2x to 3x lower latency than the regular variant.

FlexibilityYou can still bend it with prompts.

It is a specialized model, but still generative. Small instruction tweaks or in-context examples work. Bigger changes mean another fine-tuning run.

BERT is older, but for text classification it works very well. Generative models are decoder-only and learn to predict the next token. An encoder-only model does not generate text, but processes the whole input well and outputs a class.

BERT: older approach, still strong

Click a card for the detail

QualityVery good, but fine-tuned LLM is higher.

A simple and briefly trained variant ended up clearly above the plain LLMs and similar to logistic regression on embeddings. I believe BERT could be pushed further.

CostModel is free-ish; operation is not.

We need a GPU machine. I counted two older NVIDIA T4 nodes on Azure VM at $0.56/h. Training ran for 500 seconds, so almost nothing. But inferencing costs compute and that brings BERT into GPT-4.1-nano territory.

LatencyBest in the test.

BERT ran locally on Azure VM, so it did not have the same cloud round-trip as the other variants. Even if 0.09 s became 0.12 s in a more realistic setup, it is still excellent.

FlexibilityThe biggest weakness.

You need code or AutoML; the model is single-purpose and you cannot tune it via a prompt. Plus I started from BERT, which does not speak Czech, so Czech and multilingual scenarios are harder.

This is a combination of modern pre-training and prehistorically simple classification. An embedding model turns text into the latent space, roughly 1500 dimensions, and logistic regression decides the three classes.

Embeddings on their own are not a classifier. They will not say "positive" or "negative". But they are a universal representation of meaning. On top of them I can put a tiny model that trains quickly, runs on CPU, and costs almost nothing. Inferencing is then the cloud call for embeddings plus a local classification computation.

Embeddings + LR

Click a card for the detail

QualitySurprisingly good with the full dataset.

Training on the full dataset puts the result clearly above the plain LLM and close to BERT.

CostLowest by orders of magnitude.

The cost of logistic regression is negligible. What matters is the embedding token price and that is very low.

LatencyVery good.

Embeddings come back from the cloud quickly and the local LR model is instant.

FlexibilitySingle-purpose, but language-friendly.

Like BERT, the final classifier is fixed. Unlike BERT, modern embeddings typically speak Czech and other languages, which is a big practical advantage.

What I would pick

None of the variants is the absolute winner. It depends on the task, volume, language, and how much future flexibility you want.

Scenario	Choice	Why
I am a nerd and a saver	LR + embeddings	Lowest cost, great speed, more complex and single-purpose.
I want something general and tunable over time	LLM	Baseline quality, higher cost and latency, but extreme simplicity.
I want best quality and relatively easy operation	Fine-tuned LLM	Best quality, hosted, but specialized and more expensive.
I am a nerd, no cloud, English only, GPU to spare	BERT fine-tuning	Very good quality, low cost with good utilization, best latency, least flexibility.

My practical shortcut

If the task is not yet stable, I would start with a plain LLM and a prompt.
If you have high volume and a clearly defined class, try embeddings + a simple classifier.
If you chase quality and want hosted operation, reach for fine-tuning an LLM.
If you have a strong ML team and your own compute, an encoder model makes sense.

Of course, this is my example; in yours it may turn out differently. The important thing is not to think only about accuracy, but also about cost, latency, operation, and future changes in the task.