Tomáš Kubica

How to do simple AI text analysis? LLM, fine-tuning, BERT, or embeddings?

Using short-text sentiment, I compare quality, cost, latency, and practicality across four approaches.

Decades ago, natural-language processing was mostly about specialized models for relatively simple tasks. They were small, cheap to run, and you had to know what you were doing. Today we have large language models that can follow instructions and often do the same thing without training your own model.

Let's use sentiment as the lab task. I take a short social-media post and want to put it into one of three classes: positive, negative, neutral. This is not meant as an academic benchmark. It is the practical question: what should I use, how good will it be, how fast will it be, and how much will it cost?

Introduction to the comparison

For the same classification task I have four practical paths. The next cards walk through each one, but here is a quick preview of what I will be comparing.

I instruct the model to return one class. I try almost-zero-shot and few-shot with 100 or 1000 examples. This is the simplest variant, but cost and latency grow with context length.

Best accuracy

fine-tuned GPT-4.1-mini, 80.0%.

Lowest cost

embeddings + logistic regression, about $0.0005 per 1000 samples.

Lowest latency

BERT, about 0.09 seconds per sample.

Easiest changes

plain LLM, because I change the prompt, not the model.

The accuracy may not look amazing at first glance. Part of the reason is that the boundary between positive and neutral sentiment is often fuzzy. Some posts are simply labeled differently by different people.

Classification quality

Group Experiment Model / approach Strategy Accuracy Run cost Time Samples/s
LLM LLM_NANO_ZEROSHOT GPT-4.1-nano Zero-shot 67.7% $0.13 3306.7 s 1.56
LLM LLM_NANO_FEWSHOT_100 GPT-4.1-nano Few-shot 100 68.6% $1.45 (~$0.87 cached) 4215.4 s 1.23
LLM LLM_NANO_FEWSHOT_1000 GPT-4.1-nano Few-shot 1000 68.4% $13.39 (~$8.03 cached) 9160.4 s 0.56
LLM LLM_MINI_ZEROSHOT GPT-4.1-mini Zero-shot 69.3% $0.55 3919.9 s 1.38
LLM LLM_MINI_FEWSHOT_100 GPT-4.1-mini Few-shot 100 69.4% $5.88 (~$3.53 cached) 5565.6 s 0.94
LLM LLM_MINI_FEWSHOT_1000 GPT-4.1-mini Few-shot 1000 68.4% $55.53 (~$33.32 cached) 16084.1 s 0.33
Fine-tuned LLM LLM_FT_NANO_ZEROSHOT GPT-4.1-nano FT Zero-shot 79.0% $0.26 1907.0 s 2.71
Fine-tuned LLM LLM_FT_NANO_FEWSHOT_100 GPT-4.1-nano FT Few-shot 100 78.4% $2.90 (~$1.74 cached) 1940.0 s 2.66
Fine-tuned LLM LLM_FT_MINI_ZEROSHOT GPT-4.1-mini FT Zero-shot 80.0% $1.04 1564.2 s 3.30
Fine-tuned LLM LLM_FT_MINI_FEWSHOT_100 GPT-4.1-mini FT Few-shot 100 79.3% $11.58 (~$6.95 cached) 1918.4 s 2.69
Embeddings + ML LR_100 OpenAI Embeddings + LR 100 samples 60.9% $0.0025 963.2 s 5.40
Embeddings + ML LR_1000 OpenAI Embeddings + LR 1000 samples 67.1% $0.0025 951.7 s 5.47
Embeddings + ML LR_ALL OpenAI Embeddings + LR full dataset 73.2% $0.0025 932.1 s 5.58
Transformer encoder BERT_SENTIMENT BERT-base FT full dataset 74.5% $0.028 465.4 s 11.19
Experiment Completed Failed Total tokens Input tokens Output tokens
LLM_NANO_ZEROSHOT 5166 (99.3%) 39 (0.7%) 1,289,946 1,284,780 5166
LLM_NANO_FEWSHOT_100 5165 (99.2%) 40 (0.8%) 14,470,776 14,465,611 5165
LLM_NANO_FEWSHOT_1000 5166 (99.3%) 39 (0.7%) 133,890,834 133,885,668 5166
LLM_MINI_ZEROSHOT 5076 (97.5%) 129 (2.5%) 1,351,565 1,346,144 5421
LLM_MINI_FEWSHOT_100 5140 (98.8%) 65 (1.2%) 14,688,863 14,683,620 5243
LLM_MINI_FEWSHOT_1000 5088 (97.8%) 117 (2.2%) 138,814,519 138,809,163 5356
LLM_FT_NANO_ZEROSHOT 5166 (99.3%) 39 (0.7%) 1,289,946 1,284,780 5166
LLM_FT_NANO_FEWSHOT_100 5166 (99.3%) 39 (0.7%) 14,473,578 14,468,412 5166
LLM_FT_MINI_ZEROSHOT 5166 (99.3%) 39 (0.7%) 1,289,946 1,284,780 5166
LLM_FT_MINI_FEWSHOT_100 5165 (99.2%) 40 (0.8%) 14,470,685 14,465,520 5165
LR_100 5205 (100%) 0 124,123 124,123 0
LR_1000 5205 (100%) 0 124,123 124,123 0
LR_ALL 5205 (100%) 0 124,123 124,123 0
BERT_SENTIMENT 5205 (100%) 0 N/A N/A N/A

Cost per 1000 samples

Category Experiment Cost per 1000 samples Rank
Cheapest LR_100 $0.0005 1
Cheapest LR_1000 $0.0005 1
Cheapest LR_ALL $0.0005 1
Low LLM_NANO_ZEROSHOT $0.025 4
Low BERT_SENTIMENT $0.028 5
Low LLM_FT_NANO_ZEROSHOT $0.050 6
Low LLM_MINI_ZEROSHOT $0.106 7
Low LLM_FT_MINI_ZEROSHOT $0.202 8
Medium LLM_NANO_FEWSHOT_100 $0.280 ($0.168 cached) 9
Medium LLM_FT_NANO_FEWSHOT_100 $0.562 ($0.337 cached) 10
Medium LLM_MINI_FEWSHOT_100 $1.130 ($0.678 cached) 11
Medium LLM_FT_MINI_FEWSHOT_100 $2.242 ($1.345 cached) 12
Expensive LLM_NANO_FEWSHOT_1000 $2.590 ($1.554 cached) 13
Expensive LLM_MINI_FEWSHOT_1000 $10.670 ($6.402 cached) 14

Latency

Category Experiment Time per sample
Fastest BERT_SENTIMENT 0.09 s
Fast LR_ALL 0.18 s
Fast LR_1000 0.18 s
Fast LR_100 0.19 s
Medium LLM_FT_MINI_ZEROSHOT 0.30 s
Medium LLM_FT_MINI_FEWSHOT_100 0.37 s
Medium LLM_FT_NANO_ZEROSHOT 0.37 s
Medium LLM_FT_NANO_FEWSHOT_100 0.38 s
Slow LLM_NANO_ZEROSHOT 0.64 s
Slow LLM_MINI_ZEROSHOT 0.72 s
Slow LLM_NANO_FEWSHOT_100 0.81 s
Slow LLM_MINI_FEWSHOT_100 1.06 s
Outlier LLM_NANO_FEWSHOT_1000 1.77 s
Outlier LLM_MINI_FEWSHOT_1000 3.00 s

Four options in depth

An LLM can follow instructions, so we can simply tell it: classify this text and return only one output token. In my case 0, 1, or 2. If I let the model think and explain at length it might be more accurate, but also many times more expensive. I care about practical operation here.

What we saw with the LLM

Click a card for the detail

QualityMini is a bit better than nano, but not dramatically.

Few-shot with 100 examples helps a little. Few-shot with 1000 examples actually confuses small models and stretches context so much that the result stops making sense.

CostInput tokens dominate.

That is why few-shot 1000 is practically nonsense: dramatically more expensive and worse. Input-token caching helps for few-shot scenarios, but it does not change the rules.

LatencyShort context is faster; nano is faster.

Sequential processing comes out at roughly 0.64 to 1.06 seconds per sample. The 1000-example variant is completely off, three seconds per sample with the mini model.

FlexibilityThis is where LLM wins.

No training, no pipeline, no specialized model. I change the prompt and get different behavior. One endpoint, one library, one runtime model.

Here I again take the mini and nano models and show them 25,000 training examples. Then I use them zero-shot and with few-shot 100 examples.

Fine-tuned LLM in practice

Click a card for the detail

QualityBest result of the whole test.

Fine-tuning helped dramatically. GPT-4.1-mini reached 80.0%. Interestingly, few-shot stops helping and actually hurts. My guess is that 100 randomly picked examples are not representative against the 25,000 used in training.

CostInferencing is not the whole story.

Fine-tuning itself cost $40. In Azure OpenAI Service the fine-tuned model uses the same token prices as the regular version, but hosting is billed at $1.7/h. With OpenAI you do not pay hosting, but tokens are more expensive. Sometimes one is cheaper, sometimes the other.

LatencyMarkedly better in my measurement.

I tested in Azure where you pay hosting. This should not be taken as an official guarantee, but I measured 2x to 3x lower latency than the regular variant.

FlexibilityYou can still bend it with prompts.

It is a specialized model, but still generative. Small instruction tweaks or in-context examples work. Bigger changes mean another fine-tuning run.

BERT is older, but for text classification it works very well. Generative models are decoder-only and learn to predict the next token. An encoder-only model does not generate text, but processes the whole input well and outputs a class.

BERT: older approach, still strong

Click a card for the detail

QualityVery good, but fine-tuned LLM is higher.

A simple and briefly trained variant ended up clearly above the plain LLMs and similar to logistic regression on embeddings. I believe BERT could be pushed further.

CostModel is free-ish; operation is not.

We need a GPU machine. I counted two older NVIDIA T4 nodes on Azure VM at $0.56/h. Training ran for 500 seconds, so almost nothing. But inferencing costs compute and that brings BERT into GPT-4.1-nano territory.

LatencyBest in the test.

BERT ran locally on Azure VM, so it did not have the same cloud round-trip as the other variants. Even if 0.09 s became 0.12 s in a more realistic setup, it is still excellent.

FlexibilityThe biggest weakness.

You need code or AutoML; the model is single-purpose and you cannot tune it via a prompt. Plus I started from BERT, which does not speak Czech, so Czech and multilingual scenarios are harder.

This is a combination of modern pre-training and prehistorically simple classification. An embedding model turns text into the latent space, roughly 1500 dimensions, and logistic regression decides the three classes.

Embeddings on their own are not a classifier. They will not say "positive" or "negative". But they are a universal representation of meaning. On top of them I can put a tiny model that trains quickly, runs on CPU, and costs almost nothing. Inferencing is then the cloud call for embeddings plus a local classification computation.

Embeddings + LR

Click a card for the detail

QualitySurprisingly good with the full dataset.

Training on the full dataset puts the result clearly above the plain LLM and close to BERT.

CostLowest by orders of magnitude.

The cost of logistic regression is negligible. What matters is the embedding token price and that is very low.

LatencyVery good.

Embeddings come back from the cloud quickly and the local LR model is instant.

FlexibilitySingle-purpose, but language-friendly.

Like BERT, the final classifier is fixed. Unlike BERT, modern embeddings typically speak Czech and other languages, which is a big practical advantage.

What I would pick

None of the variants is the absolute winner. It depends on the task, volume, language, and how much future flexibility you want.

Scenario Choice Why
I am a nerd and a saver LR + embeddings Lowest cost, great speed, more complex and single-purpose.
I want something general and tunable over time LLM Baseline quality, higher cost and latency, but extreme simplicity.
I want best quality and relatively easy operation Fine-tuned LLM Best quality, hosted, but specialized and more expensive.
I am a nerd, no cloud, English only, GPU to spare BERT fine-tuning Very good quality, low cost with good utilization, best latency, least flexibility.

My practical shortcut

  • If the task is not yet stable, I would start with a plain LLM and a prompt.
  • If you have high volume and a clearly defined class, try embeddings + a simple classifier.
  • If you chase quality and want hosted operation, reach for fine-tuning an LLM.
  • If you have a strong ML team and your own compute, an encoder model makes sense.

Of course, this is my example; in yours it may turn out differently. The important thing is not to think only about accuracy, but also about cost, latency, operation, and future changes in the task.