Building Australia's LLM Evaluation Stack: From Imported Scoreboards to AU-Specific Tasks


The views in this post are entirely my own and do not represent my employer or any organisation I am affiliated with.


My recent post on benchmark imports landed harder than I expected. The argument was simple: the flagship LLM benchmarks driving Australian agency procurement (MedQA, LegalBench, MMLU, HELM) measure foreign jurisdictions. The next question is: what should an Australian evaluation stack actually test, and how do you build one?


The experiment

Over a weekend I built a proof-of-concept: fifteen frontier models, five Australian legal questions, and the same metrics across all of them. The results were sharper than the argument deserved. That is not because five questions can rank fifteen models, but because one model scored 4 out of 5 on “correctness” while denying the existence of a section of the Corporations Act 2001 (Cth) that has been on the books for twenty-five years.

Five questions drawn from Crown copyright Australian legislation: two from the Corporations Act, one from the Privacy Act 1988 (Cth), one from the Australian Consumer Law, one from state succession law. Same prompts, same reference answers, same scoring pipeline. The framework was AWS fmeval running F1, recall, and BERTScore against reference answers, plus Bedrock Evaluations for managed LLM-as-Judge correctness. The same methodology runs equally well on EleutherAI’s lm-evaluation-harness, a popular open-source standard. The evaluation framework is not the scarce resource here. The dataset is.

Five questions cannot rank fifteen models, so I am not publishing the per-model scores. The pattern is still informative: different model families topped different metrics, and the winners were not the largest models tested. Scale was not the dominant factor on Australian legal text.

The hallucination that proves the point

The ignorant judging the equally ignorant. “LLM-as-judge” means using one AI to score another AI’s work: the candidate model writes an answer, a second model reads it alongside a reference answer and assigns a score. Popular because human scoring is slow and expensive; judges scale. The canonical paper is Zheng et al.’s MT-Bench (2023).

Here is what happened. On the insider-trading question, a leading frontier model stated that section 1043A of the Corporations Act 2001 (Cth) does not exist. Section 1043A is the insider-trading provision, on the books for twenty-five years, prosecuted regularly, taught in every Australian commercial law course. The candidate model invented a denial. The judge model read that fabricated denial and gave it 4 out of 5 on correctness. This is the precise judge-failure mode documented in the 2024 literature (Ye et al.; Park et al., EMNLP): judges accept fluent-but-fabricated content.

Both models failed in the same direction because both were trained predominantly on foreign legal text. A LegalBench score, which the dataset card says “primarily contains tasks corresponding to American law,” tells you how a model handles American statutes. It tells you nothing about whether the same model will deny that an Australian provision exists, and a foreign-trained judge will not raise the alarm when it does.

Curating the dataset

The process splits into two parts: the dataset and the framework. The dataset is the harder half. Framework code exists; AU-grounded test material does not.

A simple benchmark is often a curated set of question-and-answer pairs. The questions are the test; the reference answers are the truth against which the model’s output is scored. Building an Australian benchmark means writing the questions, sourced from primary materials, and writing the reference answers, with domain experts validating both. It is the marking key for an exam where the candidates are language models.
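To make the marking-key idea concrete, here is a minimal sketch of one benchmark record. The class name, field names, item IDs, and reviewer labels are all hypothetical, chosen for illustration rather than taken from any existing dataset schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkItem:
    """One question-and-answer pair in a hypothetical AU benchmark."""
    item_id: str
    question: str
    reference_answer: str
    source_citation: str                 # primary material the answer rests on
    validated_by: tuple[str, ...] = ()   # experts who signed off (illustrative)

    def is_validated(self) -> bool:
        return len(self.validated_by) > 0


def scoreable(items):
    """Keep only items whose reference answer an expert has validated."""
    return [item for item in items if item.is_validated()]


items = [
    BenchmarkItem(
        item_id="corp-001",
        question="Which provision of the Corporations Act 2001 (Cth) "
                 "is the insider-trading provision?",
        reference_answer="Section 1043A of the Corporations Act 2001 (Cth).",
        source_citation="Corporations Act 2001 (Cth) s 1043A",
        validated_by=("reviewer-a",),
    ),
    BenchmarkItem(
        item_id="priv-001",
        question="(placeholder awaiting expert drafting)",
        reference_answer="(placeholder)",
        source_citation="Privacy Act 1988 (Cth)",
    ),
]

print([item.item_id for item in scoreable(items)])
```

Under this sketch, an item without expert sign-off never reaches the scoring pipeline, which keeps an unvalidated reference answer from acting as truth.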

A procurement-grade dataset starts with primary materials. For legal, that means Crown copyright legislation from the Federal Register of Legislation and AustLII. Isaacus’s Open Australian Legal Corpus is 229,122 texts and 1.4 billion tokens: every in-force statute and regulation across Commonwealth and six state jurisdictions, scraped with permission, normalised, and licensed openly. That methodology (scrape from primary registers, normalise, license openly) is reproducible for other domains, like education and health, whose primary materials are public.

Question generation has three approaches: subject-matter experts writing each question from scratch (slow but highest quality); algorithmic generation seeded by primary materials (fast but requires validation); or a hybrid. AusLawBench is the SME-authored example for Australian legal: 55,000 instances and 18,677 unique citations from Monash and UCL, open on Hugging Face. SeaExam is the hybrid example for Indonesian, Thai and Vietnamese: an industry-academic partnership worth studying as a template for any equivalent Australian work.

Validation and versioning of the dataset are non-negotiable. Without validation, the scoring key itself may be wrong. Without versioning, scores from different implementations are not comparable. Domain experts review every question; the dataset is versioned on three axes: corpus, task-set, evaluation-protocol. Stanford CRFM’s HELM (Holistic Evaluation of Language Models) is the cleanest versioning example. MMLU (Massive Multitask Language Understanding) is the cautionary tale: three different scoring implementations have produced three different rankings of the same models.
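One way to enforce the three-axis versioning is to attach a version record to every score and refuse to compare scores whose versions differ. The sketch below assumes hypothetical class names and version strings, and the scores are placeholder values, not results from any real run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkVersion:
    corpus: str                # which snapshot of the primary materials
    task_set: str              # which set of questions and reference answers
    evaluation_protocol: str   # which prompts, metrics, and scoring code


@dataclass(frozen=True)
class ScoreRecord:
    model: str
    score: float
    version: BenchmarkVersion


def comparable(a: ScoreRecord, b: ScoreRecord) -> bool:
    """Two scores are comparable only when all three axes match."""
    return a.version == b.version


# Placeholder versions and scores, for illustration only.
v1 = BenchmarkVersion(corpus="2025.1", task_set="1.0", evaluation_protocol="f1-v1")
v2 = BenchmarkVersion(corpus="2025.1", task_set="1.0", evaluation_protocol="f1-v2")

run_a = ScoreRecord("model-a", 0.5, v1)
run_b = ScoreRecord("model-b", 0.5, v1)
run_c = ScoreRecord("model-c", 0.5, v2)

print(comparable(run_a, run_b))  # same three axes
print(comparable(run_a, run_c))  # protocol differs
```

A change on any single axis, here the evaluation protocol, is enough to make two scores non-comparable, which is the failure the MMLU example illustrates.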

Simple evaluations and complex evaluations

The framework half splits across three operations, not two. (The carve-up is operational, complementary to HELM’s metric-axes split or the lm-evaluation-harness’s output-type taxonomy.) Mechanical scoring measures string or embedding overlap. Verification looks an answer up against an authoritative source. Judge-based evaluation asks a model to assess another model’s output, either grounded in retrieved source text or ungrounded. Each operation catches a different kind of failure, so a procurement-grade harness uses all three.

Mechanical scoring is the spelling-test marker: it measures how closely the candidate’s answer matches the key, and nothing else. fmeval’s metrics (F1, exact match, BERTScore) work this way. Excellent at confirming dates, codes, numbers, names. Useless at catching a wrong-but-fluent answer. A model that denies section 1043A exists produces fluent, coherent text. F1 and BERTScore rate it respectably. The failure is factual; mechanical scoring is blind to it.
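The blind spot is easy to reproduce with a simplified token-overlap F1. The sketch below is a generic bag-of-words implementation, not fmeval’s own code, and the two sentences are illustrative rather than the prompts or answers from the experiment.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def precision_recall_f1(candidate: str, reference: str) -> tuple[float, float, float]:
    """Bag-of-words overlap between a candidate and a reference answer."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())  # shared tokens, counted with multiplicity
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative sentences, not taken from the experiment's dataset.
reference = "Section 1043A of the Corporations Act 2001 prohibits insider trading."
fabricated = "Section 1043A of the Corporations Act 2001 does not exist."

precision, recall, f1 = precision_recall_f1(fabricated, reference)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Seven of the ten tokens in each sentence are shared, so the fabricated denial scores an F1 of about 0.7 against the reference even though it asserts the opposite.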

Verification against an authoritative source is the fact-checker who phones the source. “Did the police actually say that?” “Is that section actually in the Act?” No interpretation, no scoring rubric, just a lookup. The question “does section 1043A of the Corporations Act 2001 (Cth) exist?” has a boolean answer from a single call against the Federal Register of Legislation. No LLM involved. This is the tier that would have caught the s.1043A failure in milliseconds. FActScore (Min et al., 2023) is the closest published precedent: atomic-fact decomposition with retrieval-based verification.
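A minimal sketch of the lookup tier follows. It assumes a local index of section numbers built in advance from the register; the KNOWN_SECTIONS dictionary, the regular expression, and the function names are hypothetical stand-ins, not a real Federal Register of Legislation API, and section 9999Z is a made-up number used only to show a miss.

```python
import re

# Hypothetical toy index standing in for a lookup built from the
# Federal Register of Legislation. A real index would hold every section.
KNOWN_SECTIONS = {
    "Corporations Act 2001 (Cth)": {"1043A"},
}

# Matches citations of the form "section 1043A of the <Act name> (Cth)".
CITATION = re.compile(r"[Ss]ection\s+(\d+[A-Z]*)\s+of\s+the\s+(.+?\(Cth\))")


def extract_citations(text: str) -> list[tuple[str, str]]:
    """Return (act, section) pairs cited in the text."""
    return [(m.group(2), m.group(1)) for m in CITATION.finditer(text)]


def section_exists(act: str, section: str, index=KNOWN_SECTIONS) -> bool:
    """Boolean lookup: no model, no rubric."""
    return section in index.get(act, set())


def verify(text: str) -> dict[tuple[str, str], bool]:
    """Map each cited (act, section) to whether the index contains it."""
    return {(act, sec): section_exists(act, sec) for act, sec in extract_citations(text)}


denial = "Section 1043A of the Corporations Act 2001 (Cth) does not exist."
invented = "Section 9999Z of the Corporations Act 2001 (Cth) allows this."

print(verify(denial))
print(verify(invented))
```

The lookup reports that section 1043A exists, which directly contradicts the candidate’s denial, and it flags the made-up section as absent. Neither check involves a model.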

Grounded judgment is the clever friend with the book opened to the right page. The judge receives retrieved authoritative text alongside the candidate output and assesses whether the proposition is supported by the source in front of it. A procurement-grade legal pipeline composes all three: extract the citation from the candidate’s output; check it against the Federal Register or AustLII (verification, no LLM); retrieve the relevant text; ask a grounded judge whether the proposition holds.
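The four steps can be sketched as a small pipeline with each stage injected as a function. Everything named here (Verdict, evaluate, SOURCES, stub_judge) is hypothetical; the judge is a stub that records its calls where a real system would call an LLM, and section 9999Z is a made-up number.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    citation: Optional[str]
    exists: bool
    supported: Optional[bool]  # None when the judge was never consulted


def evaluate(candidate: str,
             extract: Callable[[str], Optional[str]],
             exists: Callable[[str], bool],
             retrieve: Callable[[str], str],
             judge: Callable[[str, str], bool]) -> Verdict:
    citation = extract(candidate)                 # 1. extract the citation
    if citation is None:
        return Verdict(None, False, None)
    if not exists(citation):                      # 2. verification, no LLM
        return Verdict(citation, False, None)
    source_text = retrieve(citation)              # 3. retrieve the source text
    return Verdict(citation, True, judge(candidate, source_text))  # 4. grounded judge


# Hypothetical stand-ins for the register lookup and retrieval.
SOURCES = {"1043A": "(text of the provision as retrieved from the register)"}
judge_calls: list[str] = []


def extract_section(text: str) -> Optional[str]:
    match = re.search(r"[Ss]ection\s+(\d+[A-Z]*)", text)
    return match.group(1) if match else None


def stub_judge(candidate: str, source_text: str) -> bool:
    judge_calls.append(candidate)
    return False  # placeholder verdict; a real judge would be an LLM call


stages = (extract_section, SOURCES.__contains__, SOURCES.__getitem__, stub_judge)
invented = evaluate("Section 9999Z permits this.", *stages)
denial = evaluate("Section 1043A does not exist.", *stages)
print(invented, denial, sep="\n")
```

The ordering is the design choice: a citation that fails the lookup never reaches the judge, so the expensive and fallible tier only runs on claims that have already passed the cheap, deterministic one.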

Ungrounded judgment is the same clever friend with a sticky-note instead of the book. The judge sees the candidate’s answer and (often) a reference answer, but no retrieved primary-source text. Bedrock Evaluations ships this as the default: Builtin.Correctness takes a candidate response and a reference answer, and the judge applies its own parametric knowledge to produce a 1–5 score. When the judge and the candidate share blind spots, because both were trained predominantly on foreign legal text, they can miss the same fabrication: the judge is not reading the reference adversarially against its own priors. Relying on the judge’s general knowledge is the design choice that makes the service useful as a general-purpose tool. It offers around a dozen built-in LLM-as-Judge metrics, each with a stable Builtin.<Name> identifier, including Builtin.Correctness, which AWS defines as “Measures if the model’s response to the prompt is correct.” A judge hard-coded to one jurisdiction’s rules would be useless everywhere else. The generality is the feature.
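The difference between the two judge modes comes down to what the judge is handed. The sketch below builds both prompts side by side; the wording and function names are hypothetical and are not the prompts Bedrock Evaluations uses internally.

```python
def ungrounded_prompt(question: str, candidate: str, reference: str) -> str:
    """Judge sees the candidate and a reference answer, no primary source."""
    return (
        "Rate the candidate answer for correctness from 1 to 5.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
    )


def grounded_prompt(question: str, candidate: str, source_text: str) -> str:
    """Judge sees the candidate alongside retrieved authoritative text."""
    return (
        "Using only the source text below, say whether the candidate answer "
        "is SUPPORTED, CONTRADICTED, or NOT_IN_SOURCE.\n"
        f"Source text: {source_text}\n"
        f"Question: {question}\n"
        f"Candidate answer: {candidate}\n"
    )


question = "Does section 1043A of the Corporations Act 2001 (Cth) exist?"
candidate = "No, section 1043A does not exist."
reference = "Yes. Section 1043A is the insider-trading provision."
source_text = "(text of section 1043A as retrieved from the register)"

print(ungrounded_prompt(question, candidate, reference))
print(grounded_prompt(question, candidate, source_text))
```

In the grounded version the judge is asked a narrower question about the text in front of it, so its own priors about foreign law have less room to decide the outcome.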

The architecture generalises

The verification-plus-grounded-judgment pattern travels. The methodology is the same across domains; what changes is the substrate and the ratio of verification-to-judgment.

Medical. PBS, TGA ARTG, MBS, AHPRA register, and the Australian Immunisation Handbook are the verification layer: authoritative, structured, machine-accessible. Verification answers boolean questions: does this PBS item exist? Is this drug TGA-registered for this indication? Is this MBS item billable for this provider type? This tier catches a model that fabricates a TGA indication. The cost of a hallucinated drug interaction is patient harm, not a “4 out of 5”.

Education. ACARA’s Australian Curriculum, training.gov.au, the AQF, and the TEQSA and ASQA registers are the verification layer. A curriculum code either exists at the claimed year level or it does not. Grounded judgment then checks whether an assessment task aligns to a retrieved achievement standard. Pedagogy is interpretive; curriculum codes are not.

The verification-to-judgment ratio varies by domain. Legal sits roughly half and half. Medical needs both tiers heavily: dense authoritative registers for verification, complex interpretive questions for judgment. Education leans toward judgment.

What’s happening overseas

Australia does not have to start from scratch. Three international institutions are already building versions of this work, each offering a template the AISI (Australian AI Safety Institute) or a partner could emulate.

The Allen Institute for AI (Ai2) is the federally-backed compute-consortium shape. In late 2025, Ai2 announced the OMAI cluster: a $152M partnership with the US National Science Foundation and Nvidia to build open multimodal foundation models, anchored by the OLMo language model family and Molmo robotics work. Ai2 is a 501(c)(3) non-profit. The institutional pattern (non-profit anchor, federal compute, openly licensed artefacts) is the most directly transferable of the three templates.

The EleutherAI Institute is the lean policy-and-research shape. EleutherAI became a 501(c)(3) in 2023 after years as an open community group, running on under $3M annually with backing from Hugging Face, Stability AI, Lambda and Canva. Its focus is AI decision-making research, training datasets, and global policy on safety and transparency. An AU researcher or institution can contribute directly.

The AI Alliance is the multi-stakeholder federated shape. Founded by Meta and IBM with 200+ members, the Alliance launched Project Tapestry on 7 April 2026: a federated training platform on which institutions co-train a shared foundation model while retaining sovereign derivatives. Yann LeCun is the project’s Chief Science Advisor. Project Tapestry directly overlaps Australia’s sovereign-AI ambition and is worth tracking.

A possible future

I am not a legal, medical, or education subject matter expert. This work demonstrates the technical methodology for building jurisdiction-specific AI evaluations and benchmarks. Domain experts must own the creation, validation, and maintenance of production-grade evaluation datasets.

What is missing is coordination across a patchwork of government agencies, and clarity on who owns each piece. We can learn from deployment models overseas.

The same three-tier architecture can be extended to RAG pipelines (the model receives retrieved authoritative text at generation time) and to agents (the model takes actions in the world).

Imagine an Australia where every public-facing AI system has been tested for accuracy in our own context.