Building National Foundation Models: What It Takes, and What Actually Works
My previous post explored why Australia should pursue sovereign AI capability. This time, the focus is on how to actually build national foundation model capability: the services, decisions, and real-world examples that set successful teams apart.
What Does It Take? Real-World, Large-Scale Examples
Organisations pursuing advanced AI capabilities, whether national projects or startups, choose from a range of approaches, from on-premises data centres to cloud or multi-cloud solutions. Each option brings trade-offs in AI innovation, compliance frameworks, and cost models. The examples and recommendations below highlight how AWS infrastructure supports national AI model building. If you want to build from scratch, not just fine-tune off-the-shelf models, you need scale, reliability, and the right cloud-managed services. Consider these leaders:
- Climate Tech Startups: As detailed here, climate innovators use Amazon SageMaker HyperPod to train custom models for environmental forecasting, advanced material discovery, and carbon management. With on-demand access to thousands of GPUs, teams move from research to results rapidly, with no local hardware roadblocks, on infrastructure whose carbon footprint keeps shrinking thanks to AWS renewable energy investments.
- Singapore SEA-LION: SEA-LION (a family of multilingual LLMs for Southeast Asia) was trained entirely on AWS using managed GPU clusters. The team reports completing a 3B-parameter model in three months, with scaling unconstrained by traditional hardware shortages. Project summary here.
- UAE TII Falcon: The Falcon models were built from scratch on Amazon SageMaker, using distributed orchestration across thousands of GPUs. By relying on fully managed cloud resources, TII avoided the delays and scaling limits of on-premises setups; today, the models are accessible worldwide via SageMaker JumpStart and Bedrock Marketplace.
The Challenge of Owning GPUs On-Premises
Building large models on-premises isn’t just about buying enough servers. GPU hardware has a much shorter lifecycle than standard enterprise CPUs, evolving rapidly as AI performance demands climb. GPU-based servers may need to be refreshed every 2-3 years (or sooner to keep pace with new model architectures), while typical CPU-based infrastructure often runs much longer before depreciation or repurposing. Organisations running on-premises must therefore commit to frequent, expensive hardware refresh cycles and plan for accelerated depreciation just to stay competitive. And each new generation of GPUs launches with global demand outstripping supply.
What Do Model Builders Actually Need?
When organisations set out to build large language models from the ground up, the overwhelming demand from data scientists and ML engineers is this: they want to spend time mastering data and models—not wrestling with hardware, networking, security frameworks, or resource management. The real value is unlocked by tuning architectures, experimenting with new training recipes, and iterating rapidly—not troubleshooting servers or chasing compliance certifications.
Meeting this demand requires tackling several complex infrastructure and assurance needs:
- High-throughput, scalable storage ensures massive datasets and rapid checkpoints never become a bottleneck. In the cloud, this is delivered by Amazon FSx for Lustre, providing fast, parallel file system access at scale.
- Fast, reliable networking enables data to move between thousands of GPUs in sync; Elastic Fabric Adapter (EFA) provides the low-latency, high-bandwidth connections that distributed ML workloads require (a configuration sketch follows this list).
- Seamless management of massive GPU clusters is needed as experiments get bigger and more complex; SageMaker HyperPod lets teams automate and manage distributed training, freeing up specialists for actual model work (see the provisioning sketch after this list).
- Predictable, cost-effective GPU access is achieved through Amazon EC2 Capacity Blocks for ML, allowing teams to reserve GPU capacity at scale exactly when it is needed (see the reservation sketch after this list).
- Durable, accessible storage for all model outputs, checkpoints, and data is handled by Amazon S3 (the checkpointing sketch after this list shows the common FSx-then-S3 pattern).
- It’s increasingly common for organisations to design AI systems with portability in mind, leveraging open standards and tools that simplify data migration between environments. Technology and market offerings shift rapidly, so teams frequently reassess their infrastructure choices to take advantage of innovation, competition, and changing compliance standards.
- Confidence in security and compliance is essential for regulated and government workloads. Using AWS infrastructure assessed at the IRAP PROTECTED level, organisations in Australia can launch AI workloads knowing their environment already meets the Australian Government’s PROTECTED-level security requirements. This removes roadblocks, accelerates project onboarding, and ensures continued alignment with national privacy and security priorities.
- Australian organisations can store and process their data entirely within Australian AWS Regions. All data used to train national AI models remains on Australian soil, under Australian jurisdiction, meeting the most stringent data residency and privacy requirements.
- AWS Trainium2 delivers 30-40% better price performance than current GPU-based EC2 instances for training and deploying large language models with hundreds of billions to trillion+ parameters.
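To make the HyperPod item above concrete, here is a minimal sketch of provisioning a training cluster with boto3. The cluster name, instance type and count, IAM role ARN, and lifecycle-script location are illustrative placeholders rather than recommendations; a real deployment would also configure VPC, observability, and resilience settings.

```python
# Minimal sketch: provisioning a SageMaker HyperPod cluster with boto3.
# All names, counts, ARNs, and S3 paths are illustrative placeholders.
import boto3

sagemaker = boto3.client("sagemaker", region_name="ap-southeast-2")  # Sydney Region

response = sagemaker.create_cluster(
    ClusterName="national-llm-hyperpod",  # placeholder cluster name
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",
            "InstanceType": "ml.p5.48xlarge",  # GPU training nodes
            "InstanceCount": 16,
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://example-bucket/hyperpod-lifecycle/",  # setup scripts
                "OnCreate": "on_create.sh",  # entry point run on node creation
            },
        }
    ],
)
print("Cluster ARN:", response["ClusterArn"])
```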
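Reserving GPU capacity ahead of a training run follows a similar pattern. The sketch below, again with placeholder values, searches for an EC2 Capacity Block for ML offering and purchases the first match; in practice, teams compare offerings on price and start time.

```python
# Minimal sketch: reserving GPUs with Amazon EC2 Capacity Blocks for ML.
# Instance type, count, dates, and the chosen offering are illustrative.
from datetime import datetime, timezone
import boto3

ec2 = boto3.client("ec2")

offerings = ec2.describe_capacity_block_offerings(
    InstanceType="p5.48xlarge",
    InstanceCount=16,
    CapacityDurationHours=14 * 24,  # a two-week block for a pre-training run
    StartDateRange=datetime(2026, 3, 1, tzinfo=timezone.utc),
    EndDateRange=datetime(2026, 3, 31, tzinfo=timezone.utc),
)

offering = offerings["CapacityBlockOfferings"][0]  # first matching offering
reservation = ec2.purchase_capacity_block(
    CapacityBlockOfferingId=offering["CapacityBlockOfferingId"],
    InstancePlatform="Linux/UNIX",
)
print("Reserved:", reservation["CapacityReservation"]["CapacityReservationId"])
```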
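For networking, distributed training frameworks typically reach EFA through NCCL and the aws-ofi-nccl plugin. The sketch below shows environment variables commonly set before initialising the process group; treat the exact set as an assumption to validate against current AWS and NCCL guidance for your instance type.

```python
# Minimal sketch: steering NCCL collectives over EFA in a PyTorch job.
# The environment variables shown are indicative, not a definitive configuration.
import os

os.environ.setdefault("FI_PROVIDER", "efa")           # use the EFA libfabric provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")  # enable GPU-direct RDMA where supported
os.environ.setdefault("NCCL_DEBUG", "INFO")           # log which transport NCCL selects

import torch.distributed as dist

# Rank, world size, and master address are supplied by the launcher (e.g. torchrun).
dist.init_process_group(backend="nccl")
```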
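Finally, checkpoints are usually written to the parallel file system first and then copied to S3 for durability. A sketch of that pattern, with a hypothetical /fsx mount point, bucket, and key prefix:

```python
# Minimal sketch: write checkpoints to FSx for Lustre, then copy them to Amazon S3.
# The mount point, bucket name, and key prefix are hypothetical.
import boto3
import torch

s3 = boto3.client("s3")

def save_checkpoint(model, optimizer, step,
                    bucket="example-training-bucket", prefix="checkpoints"):
    local_path = f"/fsx/{prefix}/step_{step:07d}.pt"  # fast write to the Lustre mount
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
        },
        local_path,
    )
    # Durable, versionable copy in S3 for recovery and downstream evaluation.
    s3.upload_file(local_path, bucket, f"{prefix}/step_{step:07d}.pt")
```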

In every case, these supporting AWS services and compliance frameworks let data scientists and national AI builders focus on what they do best: harnessing local data and modelling expertise, while the complexity of infrastructure, security, and regulation remains invisible in the background.
Australian Investment: World-Class Infrastructure with Sustainability at Scale
Training large-scale GenAI models is energy intensive—running thousands of GPUs for weeks at a time quickly consumes large amounts of electricity. When organisations try to handle these workloads on-premises, the result is typically less efficient use of power, higher cooling requirements, and a much larger carbon footprint—especially when energy is drawn from traditional grids or older facilities.
In contrast, running GenAI workloads in the cloud leverages the global efficiency and renewable energy investments of hyperscale providers. AWS, for example, is investing AU$20 billion to expand its local data centres (PM’s announcement here), not only building AI-ready infrastructure in Melbourne and Sydney but also operating it as sustainably as possible. This expansion adds three new Amazon solar farms, bringing the total number of Australian renewable projects to eleven and generating enough carbon-free energy to power nearly 290,000 homes each year.
AWS’s efficiency extends to water management as well. In Australia, AWS data centres use no water for cooling 95.5% of the year, relying instead on free-air cooling that uses outside air directly. During peak summer temperatures, evaporative cooling draws water from on-site storage reserves fed by recycled or mains water. In 2024, AWS’s Water Usage Effectiveness (WUE) was 0.12 L/kWh in the Sydney Region and just 0.02 L/kWh in the Melbourne Region (roughly a tablespoon per kilowatt-hour), both well below the US industry average of 0.375 L/kWh. This efficiency has helped AWS reach 53% of its 2030 global goal of being “water positive,” meaning it will replenish more clean water to communities than it uses in operations.
With cloud, organisations gain access to world-class infrastructure while helping reduce overall environmental impact. Doing the same work on-premises means higher emissions and less transparency over energy sourcing.
Even the Biggest Model Builders Choose Cloud
The scale and reliability required for frontier AI development have led the world’s leading AI companies to build on AWS infrastructure, demonstrating that cloud is the de facto choice for serious model builders:
Anthropic and Project Rainier: AWS completed Project Rainier in October 2025, one of the world’s largest AI compute clusters, featuring nearly 500,000 Trainium2 chips across multiple US data centers and deployed in under a year. Amazon has invested $8 billion in Anthropic, which is actively using Project Rainier to train and deploy Claude, with plans to scale to over 1 million chips by year’s end. This infrastructure gives Anthropic more than five times the compute used to train its previous models, demonstrating AWS’s ability to deliver massive AI infrastructure at unprecedented speed.
OpenAI’s $38 Billion AWS Deal: In November 2025, OpenAI signed a $38 billion, seven-year deal with AWS to access hundreds of thousands of Nvidia GPUs, marking the ChatGPT maker’s first major partnership with AWS. The deal represents OpenAI’s commitment to scaling its infrastructure through proven cloud providers rather than attempting to build and manage data centers independently.
These partnerships underscore a critical reality: even organisations with virtually unlimited capital and the world’s top AI talent choose to build on cloud infrastructure rather than managing their own data centers. The complexity, speed, and scale required for frontier AI development make cloud the only practical path forward, for national AI builders in Australia just as much as for Silicon Valley AI labs.
The Edge Case: National Security and Intelligence
A small set of defence and intelligence projects truly needs control of every hardware and software layer. For those, AWS is building a dedicated Top Secret Cloud for the Australian Government, in partnership with the Australian Signals Directorate.
Summary: What It Really Takes to Build National AI Capability
Building national AI capability isn’t theoretical—it’s about execution. Nearly all use cases in government, research, health, and business can build and scale securely using cloud infrastructure. Local investment, IRAP PROTECTED compliance, and massive renewable energy backing make it even more compelling.
For those considering on-premises options, remember: the global GPU supply chain is fiercely competitive, and maintaining an up-to-date GPU fleet means frequent—and costly—hardware refreshes.