<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>AI on Tom Burkert</title><link>https://blog.burkert.me/tags/ai/</link><description>Recent content in AI on Tom Burkert</description><image><title>Tom Burkert</title><url>https://blog.burkert.me/assets/</url><link>https://blog.burkert.me/assets/</link></image><generator>Hugo -- 0.148.0</generator><language>en-us</language><lastBuildDate>Mon, 30 Mar 2026 21:39:05 +0200</lastBuildDate><atom:link href="https://blog.burkert.me/tags/ai/index.xml" rel="self" type="application/rss+xml"/><item><title>Multilingual support in LLMs is not a nice-to-have</title><link>https://blog.burkert.me/posts/ai_multilingual_llm_study/</link><pubDate>Mon, 30 Mar 2026 21:39:05 +0200</pubDate><guid>https://blog.burkert.me/posts/ai_multilingual_llm_study/</guid><description>&lt;p>I remember back in November 2022, when ChatGPT was released and everyone was going crazy about how smart and humanlike it was. I sat down and started talking to it in Czech. After reading so many raving reviews and hyperbolic predictions, I was shocked by how quickly I ran into the model&amp;rsquo;s limits in Czech. Now don&amp;rsquo;t get me wrong: I know my native tongue is not a top-tier language, given it only has about 10 million native speakers. I don&amp;rsquo;t expect AI models to be as proficient in Czech as in English or other major languages. But it still made me think: if I am getting a much worse experience from large language models than English speakers are, what is the experience like for people who speak languages with even less digital support?&lt;/p></description><content:encoded><![CDATA[<p>I remember back in November 2022, when ChatGPT was released and everyone was going crazy about how smart and humanlike it was. I sat down and started talking to it in Czech. 
After reading so many raving reviews and hyperbolic predictions, I was shocked by how quickly I ran into the model&rsquo;s limits in Czech. Now don&rsquo;t get me wrong: I know my native tongue is not a top-tier language, given it only has about 10 million native speakers. I don&rsquo;t expect AI models to be as proficient in Czech as in English or other major languages. But it still made me think: if I am getting a much worse experience from large language models than English speakers are, what is the experience like for people who speak languages with even less digital support?</p>
<p>It set me on a path toward working on multilingual LLM capabilities and the challenge (and fun) of measuring them. We were already helping some major labs train AI models in various languages, and <em>some</em> of our clients used to give even longer-tail languages a lot of attention; some still do. But I feel like the &ldquo;race to AGI&rdquo; has completely overshadowed the importance of truly inclusive multilingual systems.</p>
<p>Let&rsquo;s start with OpenAI and its latest model, GPT-5.4: <a href="https://openai.com/index/introducing-gpt-5-4/" target="_blank" rel="noopener">the press release</a>, which is almost a 4,000-word document, mentions the words &ldquo;language&rdquo;, &ldquo;linguistic&rdquo; or &ldquo;multilingual&rdquo; exactly zero times. OpenAI used to give some attention to multilingual performance of its models: it translated the Massive Multitask Language Understanding (MMLU) benchmark into <a href="https://github.com/openai/simple-evals/blob/main/multilingual_mmlu_benchmark_results.md" target="_blank" rel="noopener">14 languages</a> and used to report the scores in its press releases and models&rsquo; system cards. For the newest models, the scores are not mentioned in the press release, and even worse, not even in the <a href="https://deploymentsafety.openai.com/gpt-5-4-thinking" target="_blank" rel="noopener">system card</a>. The only acknowledgment that the model can handle non-English text appears in a footnote. Neither the note nor the linked help article makes clear whether &ldquo;support&rdquo; refers to UI localization or actual model capability, and, if the latter, whether that applies to text, voice, or both.</p>
<p>OpenAI is not alone in this. Anthropic also makes no mention of multilingual capabilities in its <a href="https://www.anthropic.com/news/claude-opus-4-6" target="_blank" rel="noopener">press release for Claude Opus 4.6</a>, unless you count &ldquo;multilingual coding&rdquo;, that is, the knowledge of different programming languages. Anthropic does produce a 213-page <a href="https://www-cdn.anthropic.com/6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf" target="_blank" rel="noopener">system card</a>, and it actually does comment on the multilingual MMLU score! Sadly, the commentary spans two sentences and does not even provide per-language statistics.</p>
<p>When it comes to documenting which specific languages are supported, Anthropic is even worse than OpenAI. Anthropic has a <a href="https://platform.claude.com/docs/en/build-with-claude/multilingual-support" target="_blank" rel="noopener">Multilingual support</a> page, which seems like a good start, until you realize that&rsquo;s just a place to put results for the multilingual MMLU benchmark. At the time of writing, that page does not contain results for Claude Opus 4.5, Opus 4.6, or Sonnet 4.6. Anthropic also does not provide the full list or even a number. Instead, it just says &ldquo;Note that Claude is capable in many languages beyond those benchmarked below.&rdquo; Will it work well in Czech? Or Basque? Who knows! The most we get from Anthropic is that &ldquo;Claude processes input and generates output in most world languages that use standard Unicode characters.&rdquo;</p>
<p>There are AI labs that do better than the two behemoths. Mistral, for example, touts its <a href="https://mistral.ai/news/mistral-3" target="_blank" rel="noopener">Mistral Large 3</a> as &ldquo;The next generation of open multimodal and multilingual AI&rdquo; and claims its model supports more than 40 languages. The <a href="https://docs.mistral.ai/models/mistral-large-3-25-12" target="_blank" rel="noopener">model page</a> and the <a href="https://legal.cms.mistral.ai/assets/1e37fffd-7ea5-469b-822f-05dcfbb43623" target="_blank" rel="noopener">technical documentation</a> provide no additional information (in fact, they do not even repeat the figure), but at least you can get a <em>partial</em> list at <a href="https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512#:~:text=English%2C%20French%2C%20Spanish%2C%20German%2C%20Italian%2C%20Portuguese%2C%20Dutch%2C%20Chinese%2C%20Japanese%2C%20Korean%2C%20Arabic" target="_blank" rel="noopener">Hugging Face</a>. Not from Mistral&rsquo;s own docs, mind you, but from Hugging Face.</p>
<p>Not everyone does this badly, though. Google publishes <a href="https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models#expandable-1" target="_blank" rel="noopener">a full list of supported languages</a>, both generally and by model family. Similarly, Alibaba publishes a <a href="https://qwen.ai/blog?id=qwen3#:~:text=Multilingual%20Support" target="_blank" rel="noopener">full table of supported languages</a> in its model announcements. Both also report multilingual MMLU scores in their model cards; aggregated rather than per-language, but at least present.</p>
<p>Why does this matter? It is easy to argue that English is the lingua franca, that many knowledge workers already use it, and that this is especially true in higher-income markets. But that view is short-sighted. If frontier AI labs expect their products to become truly universal, multilingual support cannot remain an afterthought. Sooner or later, it becomes both a product-quality issue and a growth issue. Once the English-speaking market is saturated, the next wave of users will come from other languages and cultural contexts. And if labs are serious about serving them, they need to start treating multilingual capability as something to measure, document and improve, not something users are left to discover on their own.</p>
<p>To be fair, multilingual performance has genuinely improved. Models handle Czech far better today than they did in 2022, and the gap between English and non-English scores on standardized benchmarks has narrowed. I have some reservations about those benchmarks that I&rsquo;ll address in a separate post, but the trend is clear. The question is how much of that improvement is concentrated effort and how much is just a side effect of more linguistically diverse pre-training data or better model architectures. And I&rsquo;m not claiming labs do no multilingual work internally; I&rsquo;m saying the fact that they don&rsquo;t consider it worth reporting tells you where it sits among their priorities. What isn&rsquo;t publicly reported isn&rsquo;t being held to a standard. More importantly, multilingualism in LLMs is far from a solved problem, despite the lack of attention it receives.</p>
<p>I have spent much of the last several years working on this problem directly. My team and I built a multilingual benchmark that deliberately moves away from automated public benchmarks such as multilingual MMLU. Those benchmarks are useful, but they mostly test multiple-choice comprehension rather than real conversational ability. Our study focuses instead on open-ended language generation and manipulation, evaluated at scale by language professionals. I do not usually write about my day job on this blog, but this is one of the rare cases where it feels directly relevant. It is about as close to a passion project as work gets for me. We call it the &ldquo;Multilingual LLM synthetic data generation study,&rdquo; and the full 80-page report is available <a href="https://www.rws.com/artificial-intelligence/train-ai-data-services/multilingual-llm-synthetic-data-gen-study-2/" target="_blank" rel="noopener">here</a>.</p>
<p>Of course, releasing a report once a year is not enough. We are already working on something that should shed more light on models&rsquo; multilingual performance on a more continuous basis. Perhaps the most important thing, though, is to get people talking about this. While I do not believe LLMs are the road to AGI, and I have been vocal on this very blog about their negative aspects, I also believe there is a great deal of potential in them in their current form. There are underserved communities across the world that could benefit from easier access to language models that can communicate with them well.</p>
<p>Support for Czech has gotten much better since 2022. But if models still trip up in a language with 10 million speakers and a strong online presence, imagine what it&rsquo;s like in Kinyarwanda, Fijian, or Kyrgyz.</p>
]]></content:encoded></item><item><title>An AI taught me to listen to birds</title><link>https://blog.burkert.me/posts/birding/</link><pubDate>Sun, 01 Mar 2026 16:46:25 +0100</pubDate><guid>https://blog.burkert.me/posts/birding/</guid><description>&lt;p>A few days ago, I opened the window and heard a familiar sound I have not heard in a while: it was &lt;a href="https://youtu.be/19iP35PwYIU?t=7" target="_blank" rel="noopener">a fieldfare&amp;rsquo;s call&lt;/a>, probably one of the first who migrated back to my town this year. And just today, I saw my first chaffinch of the year perched on top of a branch. These birds are very common where I live, so this is not too surprising, but what struck me is how much I have grown in terms of knowledge of their calls, appearance and behaviour. A few years ago, I wouldn&amp;rsquo;t have been able to name a single bird by call or name too many of them by looking at them. If you&amp;rsquo;re curious how I got here or want to build some knowledge and appreciation of birds, read on.&lt;/p></description><content:encoded><![CDATA[<p>A few days ago, I opened the window and heard a familiar sound I have not heard in a while: it was <a href="https://youtu.be/19iP35PwYIU?t=7" target="_blank" rel="noopener">a fieldfare&rsquo;s call</a>, probably one of the first who migrated back to my town this year. And just today, I saw my first chaffinch of the year perched on top of a branch. These birds are very common where I live, so this is not too surprising, but what struck me is how much I have grown in terms of knowledge of their calls, appearance and behaviour. A few years ago, I wouldn&rsquo;t have been able to name a single bird by call or name too many of them by looking at them. If you&rsquo;re curious how I got here or want to build some knowledge and appreciation of birds, read on.</p>
<p>Before I go any further, I have to stress that I am not a bird expert and I am not even a super serious bird watcher / birder. But I have built an immense admiration for birds and whenever I go out and especially when I travel, I am on the lookout for them. I have always been fascinated by the outdoors, but my interest in birds actually started when I kept hearing a strange call at a playground in front of our house and wondered what it was. It turned out it was a fieldfare, but I only found out after I downloaded the <a href="https://merlin.allaboutbirds.org/" target="_blank" rel="noopener">Merlin Bird ID app</a> and managed to capture the call. Here it was, clearly matching the already captured calls, with a description and plenty of additional information. Bingo!</p>
<p><img alt="Screenshot of the Merlin Bird ID app" loading="lazy" src="/images/merlin.jpg#center"></p>
<p>The magic of the Merlin Bird ID app is being able to identify a bird in multiple ways: either going through a decision tree about your location, the bird&rsquo;s appearance and behaviour, or through recording its calls, or from a photo. I&rsquo;d argue these are some of the most brilliant ways artificial intelligence has been used. As Ian Campbell put it in his <a href="https://www.inverse.com/tech/merlin-bird-id-app-cornell-university-ai" target="_blank" rel="noopener">Inverse article</a> about the app, it&rsquo;s &ldquo;an AI that helps push you away from your phone and appreciate the world around you,&rdquo; which stands in stark contrast to algorithmic feeds and the personalized hell of social media that try to suck you into your phone&rsquo;s screen.</p>
<p>Most importantly, the Merlin app set me on a path to grow my interest in birds, because it made identifying birds easy and frictionless; it gave me a great head start and the motivation to continue. Soon, I didn&rsquo;t even need to pull out my phone to identify the more common birds in my area just by their call. It may not be the most useful skill in the world, but I was so proud of myself, and felt happiness and warmth for reconnecting with the natural world, something we seem to do less and less.</p>
<p>The app also encourages you to collect &ldquo;lifers&rdquo; (birds you were able to identify), very much like you&rsquo;d collect Pokemon. Gotta catch them all! And I do try to catch them all, whether it&rsquo;s blue jays and great-tailed grackles in Texas, kestrels in the UK or common magpies in Spain. It is genuinely fun to collect them, and it invites you to learn more about the ecosystems of your neighborhood, as well as places you visit. The app is created by the Cornell Lab of Ornithology, is fully free and has so far avoided any enshittification, which I am very glad for.</p>
<p>Perhaps the most interesting bit is that this little app has changed me in a meaningful way: I acquired quite a few books on birds and learned more than what the app itself can provide. I now jump with joy when I hear the first common swifts fly over our house in droves, screaming &ldquo;sreeee&rdquo; over each other. We rescued quite a few birds in distress, and we watched with delight and anticipation when a pair of blackbirds made a nest on our balcony; doubly so when we noticed that it&rsquo;s the same pair we see each year at our bird feeder. Identifying a single bird&rsquo;s call led me on a path that I could not have foreseen. So I urge you: give it a try. You might thank yourself in a year or two!</p>
]]></content:encoded></item><item><title>How many AI PhDs does it take to change a hinge?</title><link>https://blog.burkert.me/posts/ai_phds_changing_a_hinge/</link><pubDate>Mon, 08 Dec 2025 21:35:09 +0100</pubDate><guid>https://blog.burkert.me/posts/ai_phds_changing_a_hinge/</guid><description>&lt;p>It started with a broken hinge in our kitchen cabinet. I am no handyman, but I like to do as much as I can around the house and I dare say I&amp;rsquo;ve gotten pretty okay at it. The one thing that can set me back, though, is not knowing the correct terminology. What type of hinge do I need? Concealed? Inset? Overlay? Clip-On? Soft-close? Sprung? An avalanche of terms that I only barely understand paralyzes me. Luckily, there&amp;rsquo;s a new helper in town: AI!&lt;/p></description><content:encoded><![CDATA[<p>It started with a broken hinge in our kitchen cabinet. I am no handyman, but I like to do as much as I can around the house and I dare say I&rsquo;ve gotten pretty okay at it. The one thing that can set me back, though, is not knowing the correct terminology. What type of hinge do I need? Concealed? Inset? Overlay? Clip-On? Soft-close? Sprung? An avalanche of terms that I only barely understand paralyzes me. Luckily, there&rsquo;s a new helper in town: AI!</p>
<p>I&rsquo;ve used vision-enabled LLMs before to identify types of tools, bike parts, or PC components with mixed results, but the net result was usually positive. At the least, I became armed with some of the terminology and knew what to look up on Google. The recent releases of Gemini 3 Pro (&quot;<a href="https://blog.google/technology/developers/gemini-3-pro-vision/" target="_blank" rel="noopener">The frontier of vision AI</a>&quot;), GPT-5 (&quot;<a href="https://openai.com/index/introducing-gpt-5/" target="_blank" rel="noopener">can reason more accurately over images and other non-text inputs</a>&quot;) and Claude Opus 4.5 (&quot;<a href="https://www.anthropic.com/news/claude-opus-4-5" target="_blank" rel="noopener">better vision, reasoning, [&hellip;] and it is state-of-the-art in many domains</a>&quot;) have made me curious to test the much hyped visual reasoning capabilities. Plus, they are touted to have PhD- or expert-level intelligence. Can an AI PhD help me change a hinge?</p>
<h2 id="mission-impossible">Mission (im)possible</h2>
<p>I set out with a simple request: &ldquo;Identify this type of hinge and find out where it can be bought in Czechia,&rdquo; accompanied by the following picture:<br>
<img alt="Hinge" loading="lazy" src="/images/hinge.jpg#center"></p>
<p>Now, a small aside about hinges — the little I know. I promise it won&rsquo;t take long; it can be explained in 30 seconds.</p>
<p>This is a hinge with a cup:<br>
<img alt="Hinge with a cup" loading="lazy" src="/images/hinge-cup.jpg#center"></p>
<p>And this, my friends, is a type of hinge that does not use a cup, but is mounted directly to the face frame of the cabinet:<br>
<img alt="Hinge without a cup" loading="lazy" src="/images/hinge-face-frame.jpg#center"></p>
<p>You can tell them apart very easily - I&rsquo;d guess 99% of you could now do it without any issues. You can probably also tell that the hinge from my original picture is very clearly the latter one, as there is nowhere in the cabinet face frame to mount a cup.</p>
<p>Our PhD-level expert LLM friends? Not so much.</p>
<h2 id="getting-unhinged">Getting unhinged</h2>
<p>Let&rsquo;s start with Gemini. Even though I prompted it in English, it started thinking and responding in Czech, so I can&rsquo;t quote it verbatim, but here&rsquo;s a faithful translation from Google Translate:</p>
<blockquote>
<p>The specific type in the photo appears to be: A cup hinge with a cup diameter of 35mm (this is the standard size, but it&rsquo;s a good idea to check).</p></blockquote>
<p>OK, that&rsquo;s clearly wrong.</p>
<p>Let&rsquo;s move on to Claude Opus 4.5, with excellent visual reasoning:</p>
<blockquote>
<p>This is a Hettich concealed cabinet hinge (also called a European-style or cup hinge).</p></blockquote>
<p>Not really! But wait, Claude is searching the web, maybe it can recover:</p>
<blockquote>
<p>Based on my search, this is a Hettich Intermat concealed cabinet hinge (also known as a &ldquo;miskový závěs&rdquo; or cup hinge in Czech). It features a 35mm cup diameter and includes a clip-on mounting plate for easy installation.</p></blockquote>
<p>Not quite either.</p>
<p>Maybe ChatGPT with GPT-5 will save the day?</p>
<blockquote>
<p>Based on your photo, this appears to be a “European-style concealed cabinet hinge” — but specifically a variant that is face-frame / surface-mounted rather than the hidden “cup-in-door” type often used in frameless cabinets.</p></blockquote>
<p>Alright, that&rsquo;s much better! It seems that GPT-5 can reason more accurately over images. So what is its recommendation?</p>
<blockquote>
<p>My Recommendation<br>
Given your photo and situation, I’d start by ordering a clip-on concealed hinge for face-frame cabinets with a 35 mm cup — either basic or soft-close depending on whether you care about silent closing. That has the highest chance of matching the old hinge and mounting method.</p></blockquote>
<p>Somehow, in the stream of tips and examples, GPT-5 convinced itself that this was after all a hinge with a cup, and ended up giving me the wrong recommendation.</p>
<p>I honestly do not know why all the models faceplanted so badly. It does seem that my type of hinge is relatively rare, but at least GPT-5 identified it correctly at first.</p>
<h2 id="just-ask-a-real-expert">Just ask a real expert</h2>
<p>Turns out, the AI labs aren&rsquo;t optimizing for mundane DIY realities, and so you&rsquo;d be better off asking literally anyone at your local hardware store or DIY shop. Even the SOTA LLMs are not yet capable of giving you reliable information for this category of problem. They fell short of even identifying my hinge, much less finding a site where I could buy a replacement.</p>
<h2 id="the-smallest-of-wins">The smallest of wins</h2>
<p>There is a silver lining. When I confronted the LLMs with the fact that the hinge in the picture clearly does not have a cup, all three recovered and provided some genuinely useful information. They helped me nail down the terminology, and pointed me toward several e-shops I could browse to look for what I needed.</p>
<p>However, none of them were able to identify a specific product I could actually order. And perhaps more importantly, they only became helpful after I called them out on their mistakes. If I had trusted the initial responses (or if this had been a topic I knew even less about), I could easily have ended up ordering the wrong part.</p>
<p>An assistant you have to constantly fact-check and correct is of questionable value. For now, these expensive statistical behemoths remain more useful as a second opinion than a first resort for this type of problem.</p>
]]></content:encoded></item></channel></rss>