AI Transaction Categorisation 2026 EU — Accuracy and Banks
How AI transaction categorisation works in 2026, why it hits 92-97% accuracy on EU bank feeds, which banks are supported via PSD2 and how it beats rules.
TL;DR
Transaction categorisation is the unglamorous foundation under every personal finance app. Get it wrong and every chart, budget and forecast is wrong too. In 2026, AI-based categorisation hits 92-97% accuracy on European bank feeds out of the box, vs 70-80% for the rule-based engines that dominated 2010-2022.
Key numbers in 2026:
- LLM-based categorisation accuracy: 92-97% (EU feeds, top 100 merchants).
- Rule-based legacy: 70-80% (long tail much worse, often <50%).
- Latency per transaction: 80-300 ms for cached merchants, 500-1500 ms for cold LLM calls.
- Cost per 1000 categorisations: 0.002-0.015 EUR (with prompt caching, batch processing).
- Coverage: all major Polish, German, French, Spanish, Italian, Dutch and Scandinavian banks via PSD2 AISP licences.
Where AI wins: long-tail merchants, ambiguous descriptors ("AMAZON EU SARL"), local idiom (PL: "Żabka Polska sp z o o"), recurring detection across renamed merchants. Where AI still struggles: handwritten cheque images (rare in the EU), cross-border tax-relevant splits (for example, a 100 EUR purchase in Berlin paid from a PLN account that is half business, half personal), and category schemes that the user has heavily customised.
Disclaimer: AI tools augment but don't replace qualified financial/tax advice. Verify all AI outputs.
What is transaction categorisation?
Categorisation is the act of taking a raw bank line such as:
2026-04-12 -47.20 PLN PAYU*BIEDRONKA-2451 WARSZAWA
and mapping it to a human-meaningful category like "Groceries → Discount supermarket". Most apps maintain a tree of 8-12 top-level categories (Housing, Food, Transport, Entertainment, etc.) with 50-200 sub-categories.
This sounds trivial. It isn't. A typical Polish household generates 80-180 transactions per month. Across 12 months, that's 1000-2200 lines. A 75% categorisation accuracy means 250-550 wrong lines per year — enough to make every spending chart misleading.
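The arithmetic behind that estimate can be reproduced in a few lines. This is only a sketch using the figures quoted above; the function name is hypothetical, and the unrounded results land slightly below the rounded ranges in the text.

```python
def wrong_lines_per_year(tx_per_month: int, accuracy: float) -> int:
    """Expected number of mis-categorised lines over 12 months."""
    yearly = tx_per_month * 12
    return round(yearly * (1 - accuracy))


for monthly in (80, 180):
    # Prints yearly line count and expected wrong lines at 75% accuracy.
    print(monthly * 12, wrong_lines_per_year(monthly, 0.75))
```

For 80 and 180 transactions a month this gives 960 and 2160 lines a year, with 240 and 540 of them wrong.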
Why it's hard:
- Merchants rename themselves ("ALLEGRO PAY" → "Allegro Pay sp. z o.o.").
- The same merchant can be different categories (Amazon = groceries, electronics, books, or AWS hosting).
- Polish bank feeds add weird artefacts: payment provider prefixes (PAYU, BLIK, TPAY), encoding errors, mixed case.
- Reimbursements and refunds look like income; transfers between own accounts look like expenses.
- Cash withdrawals and BLIK transfers have no merchant data at all.
How AI categorisation works under the hood
A 2026 AI categorisation pipeline typically looks like this:
- Ingest the raw transaction via PSD2 AISP API (or screen-scraping fallback).
- Normalise: trim noise prefixes (PAYU*, TPAY*, BLIK), uppercase, remove digits, detect currency.
- Merchant lookup: hit a cache of (normalised string → canonical merchant ID). Hit rate is typically 70-85% for an established app.
- For cold misses: send the merchant string + MCC + amount + recent user context to an LLM with a prompt like:
    You categorise a single bank transaction.
    Merchant: ZABKA NANO 3814 WARSZAWA
    MCC: 5411
    Amount: -12.40 PLN
    Date: 2026-04-12 22:47
    User's typical patterns: small late-night purchases at convenience stores = snacks.
    Output: {category: "Groceries", subcategory: "Convenience", confidence: 0.97}
- Cache the result keyed by (merchant_normalised, user_id) so the next time it's free and fast.
- Confidence threshold: if confidence < 0.8, surface to the user with "is this 'Groceries' or 'Eating out'?" and learn from the answer.
- Periodic re-classification: when the model updates or the user changes their category scheme, batch re-run on history.
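The steps above can be sketched roughly as follows. This is a minimal illustration, not any vendor's implementation: `fake_model` is a hypothetical stand-in for the cold LLM call, the prefix list and regexes are assumptions, and caching only confident results is one possible design choice so that a later user answer can take the slot instead.

```python
import re
from dataclasses import dataclass

# Assumed list of noise prefixes; a real system would maintain a longer one.
PROVIDER_PREFIX = re.compile(r"^(PAYU|TPAY|PRZELEWY24|BLIK)\b[\s*]*", re.IGNORECASE)


@dataclass
class Result:
    category: str
    confidence: float
    source: str        # "cache" or "model"
    needs_review: bool  # True when the user should be asked to confirm


def normalise(raw: str) -> str:
    """Strip provider prefix, uppercase, drop digits and punctuation."""
    text = PROVIDER_PREFIX.sub("", raw.strip()).upper()
    return re.sub(r"[\d\W_]+", " ", text).strip()


def fake_model(merchant: str, mcc: str) -> tuple[str, float]:
    """Hypothetical stand-in for the cold LLM call."""
    if mcc == "5411":  # MCC 5411 = grocery stores and supermarkets
        return "Groceries", 0.97
    return "Other", 0.50


class Categoriser:
    def __init__(self, classify=fake_model, threshold: float = 0.8):
        self.classify = classify
        self.threshold = threshold
        self.cache: dict[tuple[str, str], tuple[str, float]] = {}

    def categorise(self, user_id: str, raw: str, mcc: str) -> Result:
        key = (normalise(raw), user_id)
        if key in self.cache:
            category, confidence = self.cache[key]
            return Result(category, confidence, "cache", False)
        category, confidence = self.classify(key[0], mcc)
        confident = confidence >= self.threshold
        if confident:
            self.cache[key] = (category, confidence)
        return Result(category, confidence, "model", not confident)


categoriser = Categoriser()
first = categoriser.categorise("u1", "ZABKA NANO 3814 WARSZAWA", "5411")
second = categoriser.categorise("u1", "ZABKA NANO 2207 WARSZAWA", "5411")
unsure = categoriser.categorise("u1", "XYZ SHOP 17", "5999")
print(first.source, second.source, unsure.needs_review)
```

Because digits are removed during normalisation, the two Żabka stores with different store numbers share one cache entry, so the second call never reaches the model.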
Latency breakdown in 2026:
- Cache hit: 5-20 ms.
- Embedding-based nearest neighbour (in case of slight variation): 30-80 ms.
- Cold LLM call (Claude 3.5 Haiku / GPT-4o mini / Gemini 2.5 Flash): 500-1500 ms.
- Batch LLM (overnight): 50-200 ms per transaction amortised.
Costs (vendor-side) per 1000 categorisations:
- Cache hit: ~free.
- Embedding: ~0.0001 EUR.
- Cold LLM: 0.002-0.015 EUR (depending on model and prompt caching).
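The latency and cost figures above can be combined into a blended estimate once a cache hit rate is assumed. The sketch below uses the 70-85% hit rate quoted earlier and ignores the embedding path for simplicity; the function name is hypothetical.

```python
def blended(hit_rate: float, warm: float, cold: float) -> float:
    """Expected value when a share `hit_rate` takes the warm path and the rest go cold."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * warm + (1.0 - hit_rate) * cold


# Pessimistic end: 70% hit rate, 20 ms cache hit, 1500 ms cold call.
worst_ms = blended(0.70, warm=20, cold=1500)
# Optimistic end: 85% hit rate, 5 ms cache hit, 500 ms cold call.
best_ms = blended(0.85, warm=5, cold=500)
# Cost per 1000 categorisations, treating cache hits as free.
worst_eur = blended(0.70, warm=0.0, cold=0.015)

print(f"{best_ms:.0f}-{worst_ms:.0f} ms per transaction")
print(f"up to {worst_eur:.4f} EUR per 1000 categorisations")
```

Under these assumptions the average transaction takes roughly 80 to 460 ms, and the cold path accounts for nearly all of both latency and cost, which is why the cache hit rate matters so much.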
Privacy:
- Some vendors strip the user identifier and send only the merchant string + MCC.
- Others send the user's category history for context (better accuracy, more data leaves the device).
- On-device categorisation with Phi-4 / Gemma 3 / Llama 3.1 8B is viable on flagship phones in 2026 for cached merchants; cold misses still typically hit the cloud.
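The first two approaches differ only in which fields are allowed to leave the device. A minimal sketch of that difference, with hypothetical field and function names, might look like this:

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    user_id: str
    iban: str
    merchant: str
    mcc: str
    amount: float
    booked_at: str  # ISO 8601 timestamp


def minimal_payload(tx: Transaction) -> dict:
    """First approach: only the merchant string and MCC are sent."""
    return {"merchant": tx.merchant, "mcc": tx.mcc}


def contextual_payload(tx: Transaction, category_history: list[str], max_items: int = 5) -> dict:
    """Second approach: add recent category history, still without identifiers."""
    payload = minimal_payload(tx)
    payload["category_history"] = category_history[-max_items:]
    return payload


tx = Transaction("u1", "PL00 0000", "ZABKA NANO 3814 WARSZAWA", "5411", -12.40, "2026-04-12T22:47")
print(minimal_payload(tx))
print(contextual_payload(tx, ["Groceries", "Transport", "Groceries"]))
```

Building the outgoing payload from an explicit allow-list, rather than deleting fields from the full record, makes it harder for a newly added field to leak by accident.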
State of the art 2026 — what categorisation can and cannot do
Reliable:
- Identifying top-100 merchants in each country (accuracy >97%).
- Recognising recurring subscriptions across renamed providers (Netflix → "NETFLIX EU SARL" → "NETFLIX INTERNATIONAL BV").
- Detecting one-off vs recurring purchase patterns.
- Distinguishing internal transfers from real expenses if both account sides are connected.
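The last point, spotting internal transfers when both accounts are connected, can be approximated by pairing a debit with an equal credit on another account within a short window. The sketch below is one simple heuristic under assumed names and a two-day window, not a description of how any particular app does it.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Line:
    account: str
    booked: date
    amount_minor: int  # grosze or cents; negative means money left the account


def find_internal_transfers(lines, max_gap=timedelta(days=2)):
    """Pair each debit with an unused, equal-sized credit on another connected account."""
    credits = [line for line in lines if line.amount_minor > 0]
    used = set()
    pairs = []
    for debit in (line for line in lines if line.amount_minor < 0):
        for i, credit in enumerate(credits):
            gap = credit.booked - debit.booked
            if (
                i not in used
                and credit.account != debit.account
                and credit.amount_minor == -debit.amount_minor
                and timedelta(0) <= gap <= max_gap
            ):
                used.add(i)
                pairs.append((debit, credit))
                break
    return pairs


lines = [
    Line("current", date(2026, 4, 10), -50000),  # 500.00 PLN moved out
    Line("savings", date(2026, 4, 11), 50000),   # same amount arrives next day
    Line("current", date(2026, 4, 12), -4720),   # ordinary purchase
]
pairs = find_internal_transfers(lines)
print(len(pairs))  # 1
```

Amounts are stored in minor units so that equality checks are exact; matched pairs would then be excluded from both income and spending totals.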
Partially reliable:
- Splitting business vs personal expenses for a freelancer (needs explicit rules or per-merchant tagging).
- Cash withdrawals (no merchant info; AI guesses based on amount, location, day of week).
- Refunds and chargebacks (need to be netted against the original purchase).
Unreliable:
- Tax-relevant splits without explicit user input.
- Categorising obscure local merchants on first sight in low-coverage countries.
- Handling user-customised category trees with 50+ sub-categories (model accuracy drops sharply beyond ~30 categories).
Banks supported via PSD2 in EU 2026
A mature AI categorisation app typically connects to the banks below, often via aggregators (Tink, TrueLayer, Yapily, GoCardless Bank Account Data, Klarna Kosma):
| Country | Major banks supported (non-exhaustive) |
|---|---|
| Poland | PKO BP, mBank, Pekao, Santander, ING Bank Śląski, Millennium, Alior, BNP Paribas, Credit Agricole, Citi Handlowy, Nest, BOŚ |
| Germany | Sparkasse, DKB, ING DE, Commerzbank, Deutsche Bank, N26, Comdirect |
| France | BNP Paribas, Crédit Agricole, Société Générale, La Banque Postale, Boursorama, Revolut FR |
| Spain | Santander, BBVA, CaixaBank, Sabadell, Bankinter |
| Italy | Intesa, UniCredit, BPER, Banca Mediolanum, Fineco |
| Netherlands | ING, ABN AMRO, Rabobank, bunq |
| Sweden / Norway / Denmark | SEB, Swedbank, Nordea, Handelsbanken, DNB, Danske |
Coverage gaps in 2026 are usually short-lived (a few weeks during a bank's API rotation). Polish coverage is essentially complete for retail banking.
Compared to rule-based engines
| Aspect | Rule-based | ML/Hybrid (2018-2022) | LLM-based (2024-2026) |
|---|---|---|---|
| Accuracy on top 100 merchants | 85-90% | 90-94% | 95-98% |
| Accuracy on long tail | 30-55% | 60-75% | 85-93% |
| Handles renamed merchants | No | Sometimes | Yes |
| Handles new merchants out of the box | No (needs rule update) | Slowly | Yes |
| User-correction learning speed | Days | Hours | Minutes |
| Cost per 1000 | ~free | low | 0.002-0.015 EUR |
| Latency cold path | 5-20 ms | 50-100 ms | 500-1500 ms |
| Latency warm path | 5-20 ms | 20-50 ms | 5-20 ms (cached) |
| Setup complexity for vendor | Low | Medium | Medium-high |
The rule-based engines that powered Mint, MoneyDashboard, and early Yolt suffered from a long-tail problem: a country has 50000+ active merchants, and curating rules for each is impossible. LLMs generalise — they can categorise a merchant they've never seen before from the name alone.
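To make the long-tail problem concrete, here is a toy keyword engine of the kind described. The rule table is invented for illustration and is not taken from any of the products named above.

```python
# Hypothetical hand-curated rules: (substring, category).
RULES = [
    ("BIEDRONKA", "Groceries"),
    ("ZABKA", "Groceries"),
    ("NETFLIX", "Entertainment"),
    ("UBER", "Transport"),
]


def rule_based(descriptor: str) -> str:
    """Return the category of the first matching rule, else 'Other'."""
    upper = descriptor.upper()
    for needle, category in RULES:
        if needle in upper:
            return category
    return "Other"


print(rule_based("PAYU*BIEDRONKA-2451 WARSZAWA"))  # Groceries
print(rule_based("PIEKARNIA HALINA"))              # Other
print(rule_based("UBER *EATS 8.40 EUR"))           # Transport
```

The known chain is handled, the small bakery falls through to "Other" until someone writes a rule for it, and the Uber Eats order is filed as transport because the rule cannot see the sub-merchant.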
Real-world example
Consider Anna, a 32-year-old freelancer from Warsaw with irregular income. Here is her first month under each approach:
Rule-based app (legacy Mint-like):
- Total transactions: 142.
- Correctly auto-categorised: 96 (68%).
- Marked "Other" or wrong: 46.
- Manual correction time: ~50 minutes.
- Anomalies detected: 0.
ML-hybrid (Monarch-style):
- Correctly auto-categorised: 121 (85%).
- Manual correction time: ~22 minutes.
- Anomalies detected: 1 (Netflix increase).
LLM-native (Freenance / Cleo / Finch):
- Correctly auto-categorised: 136 (96%).
- Manual correction time: ~6 minutes.
- Anomalies detected: 3 (Netflix +4 EUR, duplicate insurance debit, 12% FX markup on EUR client invoice).
Across a year, the time saved is roughly 30-45 hours — comparable to a working week.
Limitations and risks
- Hallucinated categories — an LLM can invent a category that doesn't exist in your tree. Mitigation: constrain output to a fixed enum.
- Drift — model updates can subtly shift category assignments. Apps need version pinning or human review for big swings.
- Cross-border ambiguity — a charge in EUR from a Berlin merchant for a Polish freelancer could be travel, business expense, or personal. AI can't tell without context.
- Privacy — merchant + amount + time can de-anonymise a user. Minimise context sent to third-party LLMs.
- Cost creep — sending every cold transaction to an LLM costs money. Caching is essential; vendors that don't cache often raise prices or restrict the free tier.
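The mitigation named for hallucinated categories, constraining output to a fixed enum, can be enforced on the application side even when the model itself is not constrained. The sketch below assumes the model replies with a JSON object containing `category` and `confidence`; the enum members are examples only.

```python
import json
from enum import Enum


class Category(str, Enum):
    GROCERIES = "Groceries"
    EATING_OUT = "Eating out"
    TRANSPORT = "Transport"
    OTHER = "Other"


def parse_model_output(raw: str) -> tuple[Category, float]:
    """Accept only categories that exist in the tree; anything else becomes OTHER."""
    try:
        data = json.loads(raw)
        category = Category(data["category"])
        confidence = float(data["confidence"])
    except (ValueError, KeyError, TypeError):
        # Invalid JSON, unknown category, or missing/ill-typed fields.
        return Category.OTHER, 0.0
    # Clamp so a malformed confidence cannot bypass a threshold check.
    return category, min(max(confidence, 0.0), 1.0)


print(parse_model_output('{"category": "Groceries", "confidence": 0.97}'))
print(parse_model_output('{"category": "Crypto snacks", "confidence": 0.9}'))
```

Returning zero confidence for rejected output means the transaction lands in the same review queue as any other low-confidence result.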
Cost vs value
For a typical EU consumer app, AI categorisation costs the vendor 0.10-0.40 EUR per active user per month. Most of that is amortised across cached merchants, with cold calls handled by cheaper models (Haiku, Flash, GPT-4o mini).
Value to the user:
- 30-45 hours saved per year on manual cleanup.
- Better signal-to-noise in spending charts.
- Earlier anomaly detection (often saves 50-300 EUR per year on fee/subscription surprises).
- Better tax position because deductible expenses are correctly tagged.
What to look for when choosing
Checklist:
- Categorisation accuracy reported on EU feeds (not US — different merchant mix).
- Coverage of your country's banks via PSD2 AISP.
- Clear privacy disclosure: what merchant strings, amounts, dates leave the device.
- Choice of model (Claude / GPT / Gemini / on-device) and which goes where.
- Ability to correct categories with quick learning (correction should propagate within minutes, not weeks).
- Custom category scheme support without accuracy collapse.
- Bulk re-categorisation when the user changes their tree.
- Export raw transactions with both original and AI-categorised fields.
Polish reader angle
In Poland, the practical issues are:
- Many merchant strings include the payment provider prefix (PAYU*, TPAY*, PRZELEWY24*) — modern apps strip these before categorisation.
- BLIK transfers carry no merchant data — AI can only categorise based on amount, time and recipient phone number if visible.
- Cash withdrawals from ATMs (Bankomat, Euronet, Planet Cash) need follow-up tagging because the actual spend isn't visible to the bank.
- Polish quirks: utility bills via PUE/ZUS, mandates for ZUS contributions for jednoosobowa działalność, tax payments to mikrorachunek podatkowy. A 2026 AI categoriser should know all these.
KNF doesn't regulate categorisation specifically, but if categorisation feeds into tax estimates or any "advice" surface, it may come under MiFID II rules and KNF scrutiny. UODO requires that any transaction data sent to a third-party LLM has a documented legal basis (typically Art. 6(1)(b) GDPR, performance of contract).
Where Freenance fits
Freenance focuses on EU-native AI categorisation with strong tuning for Polish merchants — Biedronka, Lidl, Żabka, Allegro, Empik, Empik Foto, Stokrotka, Carrefour, Auchan, Rossmann, Hebe, Media Expert, RTV Euro AGD, plus the BLIK and ZUS quirks that US-trained models often miss. The categorisation feeds into Freenance's Financial Freedom Runway metric and AI assistant, so accuracy compounds into trustworthy forecasts.
FAQ
How accurate does it really need to be? Below 90%, users notice errors weekly and lose trust. Above 95%, errors feel like edge cases. The jump from 80% (legacy) to 95% (LLM) is the difference between "I have to babysit this" and "I trust the chart".
Does AI categorisation work for cash purchases? Only if you log them manually. Cash leaves no bank trace. Some apps let you snap a receipt, then run OCR on it and categorise the result.
Will it learn my custom categories? Modern systems learn within a few corrections per category. If you create "Hobby: 3D printing" and tag 3-5 transactions, the model will pick up the pattern.
Can it split a single transaction into multiple categories? Yes, most 2026 apps support splits. AI can sometimes suggest splits (e.g. a Carrefour bill with both groceries and electronics) but for tax-relevant splits you should always confirm manually.
What about transactions in foreign currency? The AI handles them but the FX rate and any fees should be tagged separately so spending vs fees is distinguishable.
Is on-device categorisation good enough? For cached merchants and top-100, yes (2026 small models reach 93-95%). Cold misses on unusual merchants still benefit from a larger cloud model.
Deep dive — how merchant strings actually look in EU bank feeds
To understand why categorisation is hard, look at real-world descriptors that show up in a Polish PSD2 feed:
- "PAYU*BIEDRONKA-2451 WARSZAWA": payment provider prefix, then merchant brand, then internal store number, then city.
- "BLIK 1234567890 PRZELEW NA TELEFON": phone-number transfer, no merchant identity at all.
- "ZUS NA UBEZP. SPOLECZNE/12 2026 IND. NUMER": social security payment with month + individual taxpayer number.
- "mTransfer WPLATA NA RACHUNEK MAKLERSKI": own-account transfer to brokerage, easy to misread as an expense.
- "AMAZON EU SARL LUXEMBOURG LU": same merchant name covers groceries, electronics, books, AWS hosting, Prime subscription.
- "UBER *EATS 8.40 EUR FX 0.23": sub-merchant (Eats vs Rides) with FX fee bundled.
- "PRZELEW WLASNY SPLATA KARTY KREDYTOWEJ": credit-card repayment, must not double-count if both cards and current account are connected.
A rule-based engine handles maybe 30-40% of these elegantly. An LLM trained or fine-tuned on EU patterns handles 90+% with the right prompt and history context. The difference is whether you trust the spending chart at the end of the month.
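Before any model sees a descriptor, a cheap first pass can sort it into a structural kind, so that transfers and ZUS payments never reach the merchant classifier. The patterns below are assumptions derived only from the sample descriptors above and would need extending for real feeds.

```python
import re

# Order matters: the first matching pattern wins.
PATTERNS = [
    ("blik_phone_transfer", re.compile(r"^BLIK\s+\d+\s+PRZELEW NA TELEFON")),
    ("zus_contribution", re.compile(r"^ZUS\b")),
    ("own_transfer", re.compile(r"^(?:PRZELEW WLASNY|MTRANSFER)\b")),
    ("provider_prefixed", re.compile(r"^(?:PAYU|TPAY|PRZELEWY24)\s*\*\s*(?P<rest>.+)$")),
]


def describe(descriptor: str) -> tuple[str, str]:
    """Return (kind, text to pass on), stripping a provider prefix when present."""
    text = descriptor.strip().upper()
    for kind, pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            return kind, match.groupdict().get("rest", text)
    return "merchant", text


for raw in (
    "PAYU*BIEDRONKA-2451 WARSZAWA",
    "BLIK 1234567890 PRZELEW NA TELEFON",
    "mTransfer WPLATA NA RACHUNEK MAKLERSKI",
    "AMAZON EU SARL LUXEMBOURG LU",
):
    print(describe(raw))
```

Only descriptors tagged "merchant" or "provider_prefixed" then need the cache, embedding, or LLM path; the rest can be routed by kind.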
How AI handles the long tail of small Polish merchants
Polish retail includes thousands of small operators that never make a US-trained model's training set: regional bakeries, kioski Ruch, second-hand vinyl shops, weekly farmers' market vendors. A capable 2026 system handles them by:
- Embedding the merchant string and finding the nearest neighbour in a known taxonomy ("PIEKARNIA HALINA" → bakery cluster).
- Combining MCC (5462 = bakery) with the embedding.
- Using user history — if you've shopped at small bakeries 30 times this year, the model anchors confidently.
- Falling back to "Other → Food (small business)" with a prompt to the user only when confidence < 0.7.
Result: the long-tail accuracy in the table above (85-93%) holds even in markets dominated by small operators.
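The first two steps, nearest-neighbour lookup combined with MCC, can be illustrated without any ML library. In the sketch below, character trigram counts stand in for a learned embedding, and the known-merchant table, the MCC bonus of 0.2, and the merchant names are all illustrative assumptions.

```python
import math
from collections import Counter


def trigrams(text: str) -> Counter:
    """Character trigram counts, a crude stand-in for a learned embedding."""
    padded = f"  {text.upper()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical labelled merchants: (name, MCC, category).
KNOWN = [
    ("PIEKARNIA KOWALSKI", "5462", "Food: Bakery"),
    ("CUKIERNIA SOWA", "5462", "Food: Bakery"),
    ("KWIACIARNIA ROZA", "5992", "Gifts: Florist"),
    ("APTEKA ZDROWIE", "5912", "Health: Pharmacy"),
]


def nearest_category(merchant: str, mcc: str, mcc_bonus: float = 0.2, fallback_below: float = 0.7):
    """Score = string similarity plus a bonus when the MCC agrees; fall back below the threshold."""
    query = trigrams(merchant)
    best_score, best_category = 0.0, "Other"
    for name, known_mcc, category in KNOWN:
        score = cosine(query, trigrams(name)) + (mcc_bonus if mcc == known_mcc else 0.0)
        if score > best_score:
            best_score, best_category = score, category
    if best_score < fallback_below:
        return "Other", best_score  # the app would ask the user here
    return best_category, best_score


print(nearest_category("PIEKARNIA HALINA", "5462")[0])  # Food: Bakery
print(nearest_category("PIEKARNIA HALINA", "0000")[0])  # Other
```

The string similarity alone is not enough to clear the 0.7 threshold for the unseen bakery; it is the matching MCC that tips the decision, which mirrors the point about combining both signals.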
Continuous learning loop
Modern AI categorisation isn't static. Each user correction feeds back:
- User opens the transaction list and changes "Other" → "Hobby: ceramics".
- App stores the (merchant_normalised, user_id, category) triplet.
- Within minutes, the next time the same merchant string appears, the model proposes "Hobby: ceramics" with high confidence.
- Weekly or monthly batch retrains an embedding layer or fine-tunes prompt context.
- Anonymised, aggregated learning (only with user consent) improves the shared model for everyone.
The speed of this loop matters. Rule-based systems took days because a human had to write the rule. ML pipelines took hours. LLMs propagate within minutes because the correction goes straight into the per-user lookup or the context supplied with the next prompt, with no retraining step in between.
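The fast part of the loop, storing a correction and applying it the next time the merchant appears, needs nothing more than a keyed lookup. The class below is a hypothetical sketch of that step only; batch retraining and aggregated learning are out of scope, and the merchant name is invented.

```python
class CorrectionStore:
    """Keeps user corrections as (merchant_normalised, user_id) -> category."""

    def __init__(self):
        self._corrections: dict[tuple[str, str], str] = {}

    def record(self, merchant_normalised: str, user_id: str, category: str) -> None:
        self._corrections[(merchant_normalised, user_id)] = category

    def propose(self, merchant_normalised: str, user_id: str, model_guess: tuple[str, float]):
        """A stored correction overrides the model's guess for that user only."""
        corrected = self._corrections.get((merchant_normalised, user_id))
        if corrected is not None:
            return corrected, 1.0
        return model_guess

    def as_prompt_context(self, user_id: str, limit: int = 5) -> list[str]:
        """Most recent corrections formatted as example lines for a prompt."""
        mine = [(m, c) for (m, u), c in self._corrections.items() if u == user_id]
        return [f"{merchant} => {category}" for merchant, category in mine[-limit:]]


store = CorrectionStore()
store.record("PRACOWNIA GLINY", "u1", "Hobby: ceramics")
print(store.propose("PRACOWNIA GLINY", "u1", ("Other", 0.4)))
print(store.propose("PRACOWNIA GLINY", "u2", ("Other", 0.4)))
print(store.as_prompt_context("u1"))
```

Keying by user keeps one person's custom category from leaking into another's results, while `as_prompt_context` shows how the same corrections could be passed to the model as examples for merchants it has not seen yet.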
Sources
Vendor documentation as of 2026: Cleo, Plum, Monarch, Copilot Money, Finch, Magnifi, Freenance, Tink, TrueLayer, Yapily, GoCardless, Klarna Kosma. Regulatory: KNF supervisory communications, UODO guidance on processing financial data, GDPR Articles 5 and 6, PSD2 / Ustawa o usługach płatniczych. Academic: comparative studies on LLM transaction categorisation accuracy (2024-2025), Anthropic and OpenAI model evaluations, Polish national merchant databases.