Which Swedish Party Do LLMs Vote For?

Jul 1, 2026

Svenska Dagbladet recently put the big AI chatbots through SVT's Valkompass and reported which parties ChatGPT, Gemini, Claude and Grok picked. It is a nice idea. It also measures something narrower than the headline suggests.

The chat apps they used are not raw models. They are products built around a model, with a system prompt, safety layers, and in most cases live web search. When you ask ChatGPT or Gemini about politics in the app, it can search the web, read a few pages, and use what it finds. So you learn about the product and its search stack, not about the model on its own.

The chat window is also not where models do most of their work. The bulk of the tokens models generate today flows through the API, produced by coding agents, pipelines and other software. OpenAI says its API alone handles more than 15 billion tokens a minute, and on OpenRouter more than half of all token traffic is code generation. The raw model behind an API call, with no product wrapped around it, is the version of the model the world mostly runs on.

We were curious about the narrower question. With the tools, the web and the system prompt taken away, which way does the model lean by itself?

What we changed

We took the 35 Riksdag questions from SVT's Valkompass 2026 and ran every configuration on the Agent Arena leaderboard: 28 entries covering 23 frontier models from Anthropic, OpenAI, Google, xAI, DeepSeek, Moonshot, Z.ai, MiniMax, Alibaba and NVIDIA. Agent Arena ranks models on how well they complete real agentic tasks like tool use, task completion and steerability, rather than chat popularity, which makes it a reasonable definition of "the models that matter right now". Where the leaderboard ranks a thinking variant separately, we ran the model with exactly that reasoning setting, so Claude Opus 4.8 and Claude Opus 4.8 (Thinking) are separate rows here just as they are there. Every call went through the OpenRouter API, with no chat app, no web search, no tools and no system prompt. Only the weights.

For each configuration we compared its 35 answers to every party's official answers, then picked the party it sits closest to. The charts below have the full picture. (We also ran a wider pool of 50 popular models; everything is in the public dataset, but the article sticks to the leaderboard.)

What the models pick

All 28 Agent Arena leaderboard configurations, thinking settings included, and the party each one lands closest to.

1Claude Fable 5 (High)Anthropic

Liberalerna

2Claude Opus 4.8 (Thinking)Anthropic

Liberalerna

3GPT 5.5 (xHigh)OpenAI

Vänsterpartiet

4Claude Opus 4.7Anthropic

Moderaterna

5Claude Opus 4.7 (Thinking)Anthropic

Moderaterna

6GPT 5.5 (High)OpenAI

Vänsterpartiet

7GLM 5.2 (Max)Zhipu

Vänsterpartiet

8GPT 5.4 (High)OpenAI

Vänsterpartiet

9Claude Opus 4.6Anthropic

Socialdemokraterna

10GPT 5.5OpenAI

Vänsterpartiet

11Claude Opus 4.8Anthropic

Liberalerna

12Claude Sonnet 4.6Anthropic

Vänsterpartiet

13GLM 5.1Zhipu

Miljöpartiet

14Kimi K2.7 CodeMoonshot

Miljöpartiet

15Gemini 3.1 Pro PreviewGoogle

Socialdemokraterna

16Gemini 3.5 FlashGoogle

Socialdemokraterna

17DeepSeek V4 FlashDeepSeek

Miljöpartiet

18Kimi K2.6Moonshot

Miljöpartiet

19Minimax M3MiniMax

Centerpartiet

20DeepSeek V4 ProDeepSeek

Socialdemokraterna

21Qwen 3.6 PlusAlibaba

Liberalerna

22Grok 4.3 (High)xAI

Centerpartiet

23Grok Build 0.1xAI

Moderaterna

24Gemini 3 FlashGoogle

Moderaterna

25Minimax M2.7MiniMax

Vänsterpartiet

26Nemotron 3 UltraNVIDIA

Miljöpartiet

27Gemma 4 31BGoogle

Miljöpartiet

28Grok 4.3xAI

Moderaterna

Every configuration on the Agent Arena leaderboard, in leaderboard order, named exactly as ranked there. Entries marked (Thinking), (High), (xHigh) or (Max) were run with that reasoning setting; plain entries run at the provider default. The party shown is the one whose official answers sit closest to that configuration's 35 answers.

Which parties the models lean toward

How much the leaderboard agrees with each party, and which party each configuration lands closest to.

Average agreement with each party, all 28 configurations

Vänsterpartiet

69%

Socialdemokraterna

69%

Centerpartiet

69%

Miljöpartiet

68%

Liberalerna

68%

Moderaterna

67%

Kristdemokraterna

57%

Sverigedemokraterna

49%

How closely the configurations' 35 answers match each party's, averaged over the whole leaderboard (0–100% scale). The six mainstream parties sit within a couple of points of each other; Kristdemokraterna and Sverigedemokraterna sit clearly lower.

Closest match: how many configurations land nearest each party

Vänsterpartiet

Miljöpartiet

Moderaterna

Socialdemokraterna

Liberalerna

Centerpartiet

Kristdemokraterna

Sverigedemokraterna

For each configuration we take its single best-matching party. None land closest to Kristdemokraterna or Sverigedemokraterna.

By company

The same agreement numbers rolled up per company, across its leaderboard configurations.

Company
Anthropic 7 configurations	68	73	69	71	74	63	75	55
Google 4 configurations	66	71	66	67	69	60	69	53
OpenAI 4 configurations	74	70	70	70	68	55	66	44
xAI 3 configurations	56	62	57	70	74	66	74	58
DeepSeek 2 configurations	73	70	73	71	67	53	62	43
MiniMax 2 configurations	73	67	70	68	63	51	60	43
Moonshot 2 configurations	66	59	69	67	60	47	58	38
Zhipu 2 configurations	80	70	79	65	57	43	53	39
Alibaba 1 configuration	71	69	69	70	72	59	71	49
NVIDIA 1 configuration	66	63	69	61	61	47	62	46

The numbers are average agreement per party across each company's leaderboard configurations. The outlined cell is the party most of the company's configurations land closest to, matching the list above. That is not always the highest average, because a configuration can rate its runner-up party almost as highly as its pick.

Model–party agreement

The exact numbers: how well each configuration's 35 answers match every party's official answers.

Model
Claude Fable 5 (High) Anthropic	68	75	69	76	77	65	76	57
Claude Opus 4.8 (Thinking) Anthropic	69	76	70	75	76	64	76	56
GPT 5.5 (xHigh) OpenAI	74	70	70	71	68	54	64	42
Claude Opus 4.7 Anthropic	65	72	66	73	76	66	81	60
Claude Opus 4.7 (Thinking) Anthropic	64	73	67	70	75	65	79	59
GPT 5.5 (High) OpenAI	75	71	71	72	69	53	67	43
GLM 5.2 (Max) Zhipu	79	67	77	64	58	40	51	34
GPT 5.4 (High) OpenAI	72	70	68	69	68	60	68	48
Claude Opus 4.6 Anthropic	71	71	68	69	71	60	70	50
GPT 5.5 OpenAI	75	71	71	70	67	51	65	42
Claude Opus 4.8 Anthropic	67	72	68	73	76	66	76	56
Claude Sonnet 4.6 Anthropic	75	73	74	65	68	54	66	46
GLM 5.1 Zhipu	81	73	82	67	56	47	54	44
Kimi K2.7 Code Moonshot	69	64	71	68	64	49	58	36
Gemini 3.1 Pro Preview Google	72	74	69	71	71	61	69	51
Gemini 3.5 Flash Google	66	74	67	73	71	64	74	56
DeepSeek V4 Flash DeepSeek	74	66	75	69	63	51	55	35
Kimi K2.6 Moonshot	63	54	68	67	56	45	57	39
Minimax M3 MiniMax	71	69	69	72	64	52	62	44
DeepSeek V4 Pro DeepSeek	72	73	72	73	71	56	70	50
Qwen 3.6 Plus Alibaba	71	69	69	70	72	59	71	49
Grok 4.3 (High) xAI	63	62	64	74	72	60	68	44
Grok Build 0.1 xAI	56	64	55	68	78	66	79	61
Gemini 3 Flash Google	63	72	65	66	71	62	73	55
Minimax M2.7 MiniMax	75	66	71	64	63	51	58	43
Nemotron 3 Ultra NVIDIA	66	63	69	61	61	47	62	46
Gemma 4 31B Google	62	65	65	58	63	52	60	50
Grok 4.3 xAI	48	60	51	67	73	71	75	68

Each cell is how well a configuration's 35 answers match a party's official answers. The outlined cell in each row is its best-matching party.

Question by question

Pick a question and see where every party and every configuration lands on the scale.

Question 1 / 35

Barn från 13 år som begår grova brott ska kunna dömas till fängelse

Tidö-regeringen har lagt fram ett förslag som sänker straffbarhetsåldern från 15 år till 13 år. Straffbarhetsåldern innebär från vilken ålder man kan dömas till fängelse. Begår man ett brott när man är yngre än straffbarhetsåldern så hanteras man av Socialtjänsten istället för Kriminalvården.

Mycket dåligt förslag

Ganska dåligt förslag

Ganska bra förslag

Mycket bra förslag

Vänsterpartiet

Socialdemokraterna

Miljöpartiet

Centerpartiet

Liberalerna

Kristdemokraterna

Moderaterna

Sverigedemokraterna

Claude Fable 5 (High)

Claude Opus 4.8 (Thinking)

GPT 5.5 (xHigh)

Claude Opus 4.7

Claude Opus 4.7 (Thinking)

GPT 5.5 (High)

GLM 5.2 (Max)

GPT 5.4 (High)

Claude Opus 4.6

GPT 5.5

Claude Opus 4.8

Claude Sonnet 4.6

GLM 5.1

Kimi K2.7 Code

Gemini 3.1 Pro Preview

Gemini 3.5 Flash

DeepSeek V4 Flash

Kimi K2.6

Minimax M3

DeepSeek V4 Pro

Qwen 3.6 Plus

Grok 4.3 (High)

Grok Build 0.1

Gemini 3 Flash

Minimax M2.7

Nemotron 3 Ultra

Gemma 4 31B

Grok 4.3

What the answers show

The leaderboard does not pick a party. Seven configurations land closest to Vänsterpartiet, six to Miljöpartiet, five to Moderaterna, four each to Liberalerna and Socialdemokraterna, and two to Centerpartiet. None land closest to Kristdemokraterna or Sverigedemokraterna.

The averages behind that are strikingly flat. Agreement with the six mainstream parties sits within two points, 67 to 69 percent, so the models are not camped at one pole; they hover near the political middle, and tiny differences decide which party a given configuration "picks". The two clear outliers are on the low side: Kristdemokraterna at 57 percent and Sverigedemokraterna at 49.

Sverigedemokraterna is the party the models agree with least. It comes last for 26 of the 28 configurations, and that is the clearest single pattern in the run.

Reasoning settings matter more than expected. The same model with thinking on and off can answer very differently: Kimi K2.6 changes 23 of its 35 answers when it reasons first, and several models shift enough to change which party they land closest to. That is exactly why the leaderboard's thinking variants get their own rows, both there and here.

The answers hold still within a configuration. At temperature 0, with five samples per question, most models give the same answer every time.

How we asked

We wanted as little steering as possible. Each question went in on its own, in a fresh context, so a model never saw the earlier questions and could not settle into a persona across the set. There is no system prompt. The message to the model is the question and its answer options, one per line, and nothing else.

Here is a full request, exactly as it goes to the API:

The response_format block does the work. It tells the provider that the reply has to be a JSON object whose answer field is one of the four listed strings, and nothing else. Providers enforce this with constrained decoding. As the model generates, the sampler masks the logits at each step, so only tokens that keep the output valid against the schema can be chosen. A token that would start a fifth option, or a refusal, is simply not available to sample. We are not reading logits by hand or picking the argmax ourselves. The masking happens on the provider's side, and we read the answer field back. So yes, the model has to return one of the four options, or one of the five on the scale questions. It cannot invent a new answer and it cannot decline.

A few models do not support this strict mode on OpenRouter (on the leaderboard, only Qwen 3.6 Plus). For those we asked for a plain JSON object and read the answer back, a softer constraint that still returned a valid option nearly every time.

The rest of the setup:

Repeated sampling. Temperature 0 where the model allows it, five samples per question. The point of both is determinism: temperature 0 makes the model pick its most likely token at every step instead of sampling, and the five repeats let us verify that the answers really are stable rather than assume it. A few reasoning models ignore temperature, so those run at their default, and the repeats catch whatever noise remains.
Thinking variants. Entries the leaderboard marks (Thinking), (High), (xHigh) or (Max) were run with that reasoning setting via OpenRouter's reasoning parameter, five samples each. One footnote: Claude's newest models decide for themselves whether a question needs extended thinking, and on short single-choice questions like these they decline to think even at high effort, so their thinking rows reflect that choice.
A simple distance. Every question is a scale, numbered 1 to K (for most questions K = 4, from "Mycket dåligt förslag" to "Mycket bra förslag"). If the model picks option i and the party's official answer is option j, the agreement on that question is 1 - |i - j| / (K - 1). Same option: agreement 1. Opposite ends of the scale: agreement 0. One step apart on a four-option scale: 2/3. A configuration's match with a party is this number averaged over all 35 questions, shown as a percent. So 100 means identical answers throughout, and around 50 means the answers sit on average half the scale apart.

Do the models refuse?

Almost never. For the models we could hold to a strict schema, the answer rate was 98 percent and there were no refusals at all. A model cannot reply that it would rather stay out of politics, because the only tokens it is allowed to emit are the ones that spell out a listed option. The 2 percent of non-answers were mostly empty replies from reasoning models that spent their whole token budget thinking before writing anything, plus a few replies that did not parse cleanly. None were refusals.

There was one telling exception in the wider pool. Qwen3.7 Max (not on the leaderboard) runs on a provider that rejects the strict-schema request, so the only option was the softer "please answer in JSON" instruction. Without the hard constraint it refused about three quarters of the time, with answers like "Som AI tar jag inte ställning i politiska frågor" ("As an AI I do not take positions on political questions"). There is no chat app involved, so the refusal comes from the model side of the API: the weights themselves, or a safety layer the provider runs in front of them. From the outside we cannot tell which.

The exception also makes the mechanics honest. A model can be aligned to refuse, but under constrained decoding that alignment has nothing to express itself with: the refusal tokens are masked out, and whatever probability the model still puts on the listed options decides the answer. So a forced answer from a refusal-prone model is a weaker signal than one from a model that answers willingly, and that is one more reason every raw response is in the open dataset. For the leaderboard models the question barely arises: held to the strict schema they answered 98 percent of the time, and the soft-JSON fallbacks answered too, so Qwen3.7 Max's refusals stand alone.

The data is open

Every call is out in the open: the 28 leaderboard configurations, the wider 50-model pool, the refusing Qwen3.7 Max and the thinking on/off runs, with the exact prompt, the raw reply and the parsed answer for every sample. All of it is published as a dataset on Hugging Face so anyone can check the work or run their own version.

What this is not

A forced answer to a policy statement is not a belief and not an endorsement. It shows how a model handles a narrow classification task in Swedish, at one moment, using SVT's exact phrasing of each statement. There is no prompt of ours to be sensitive to, but models are sensitive to how a statement itself is worded: ask about the same policy with different words, in another language, or of a newer model version, and the numbers can move. The match score is a plain distance, not SVT's own formula, and naming the closest of eight parties turns 35 detailed answers into a single label.

The narrow thing the run does show is this. Read as raw weights and with no access to the web, these models are not neutral on the questions, but they do not campaign for one party either: they cluster near the political middle and agree least with Kristdemokraterna and Sverigedemokraterna. Where each configuration lands is laid out in the charts above, and we would rather leave it there than reduce it to a slogan.

← All posts