How Aviato Consulting Built a 1,000-Core Supercomputer on Demand for a Global Engineering Leader
In high-performance computing (HPC), elasticity is a game changer. It’s the difference between an engineering team waiting an entire week for simulation results and getting them back before the afternoon coffee break.
Aviato Consulting recently led a digital transformation project for a top engineering simulation firm, moving their most important workloads to Google Cloud. By migrating from aging on-premises hardware to a scalable Google Cloud environment, the client eliminated their infrastructure bottlenecks.
Executive Summary at a Glance
We deployed a scalable computing cluster that gives a global engineering firm instant access to over 1,000 CPUs for their heaviest simulation workloads.
- Client Industry: Industrial Engineering / Manufacturing
- Business Need: Eliminate on-premises infrastructure bottlenecks, fragile license management, and prohibitive queue latencies for Siemens STAR-CCM+ and FEA solvers.
- Solution: Automated Slurm cluster architecture built using the Google Cloud HPC Cluster Toolkit.
- Key Impact:
  - Increase in Simulation Speed: Achieved by running on 4th Gen Intel Xeon Scalable (Sapphire Rapids) nodes.
  - Reduction in Operational Costs: Achieved through scale-to-zero automation.
  - Reduction in Queue Wait Times: On-demand provisioning eliminated days of waiting for physical hardware availability.
From Capacity Limits to Performance on Demand
About the Client: The client has over two decades of engineering experience across multiple industries and specializes in complex areas such as fluid dynamics (CFD) and structural analysis (FEA). Companies around the world rely on their expertise to make sure their most critical design and engineering decisions are backed by solid data.
The Goal: The client needed a setup where engineers could run massive simulations and complex calculations instantly without worrying about hitting physical server limits or spending time managing a data center.
The Challenge
Our analysis identified three distinct hurdles that were dragging down operational efficiency:
- Oversubscribed On-Premises Systems: Existing hardware was operating beyond its intended capacity, leading to “queue gridlock” where critical projects were delayed by days or weeks.
- Fragile Licence Management: Legacy scheduling methods created significant compliance risks and administrative overhead when attempting to scale or reallocate resources across a global team.
- Operational Expenditures (OpEx): The cost of maintaining aging on-site systems, including power, cooling, physical security, and manual maintenance, had become a significant burden.
Our Solution
Aviato Consulting chose a strategy rooted in Infrastructure as Code (IaC), utilising the Google Cloud HPC Cluster Toolkit.
The “Pay-as-You-Go” Supercomputer: A Google Cloud Success Story
We used Google’s C4-series machines and high-speed Filestore to get the best possible speed for the lowest price. Unlike traditional setups that lock you into expensive contracts, this Google Cloud model lets the client pay only when they are actually running simulations.
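To make the pay-per-use argument concrete, the sketch below compares an always-on cluster with one that is billed only while jobs run. Every number in it (node count, hourly price, busy hours) is a hypothetical placeholder chosen for illustration, not the client's actual rates or usage.

```python
# Illustrative cost comparison: always-on cluster vs. pay-only-when-running.
# All figures below are assumptions for the example, not real pricing.

HOURS_PER_MONTH = 730  # average number of hours in a month


def monthly_cost(nodes: int, price_per_node_hour: float,
                 busy_hours: float, always_on: bool) -> float:
    """Return the monthly compute cost for a cluster.

    An always-on cluster is billed for every hour of the month;
    an elastic cluster is billed only for the hours it is running jobs.
    """
    billed_hours = HOURS_PER_MONTH if always_on else busy_hours
    return nodes * price_per_node_hour * billed_hours


# Assumed scenario: 6 nodes at $10 per node-hour, busy 120 hours a month.
fixed = monthly_cost(6, 10.0, 120, always_on=True)
elastic = monthly_cost(6, 10.0, 120, always_on=False)
print(f"always-on: ${fixed:,.0f}  pay-per-use: ${elastic:,.0f}")
```

The gap between the two figures grows as utilisation falls, which is why bursty simulation workloads tend to benefit most from this model.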
Hardened Network Security & Centralised Orchestration
- Secure Access: We designed a VPC with private subnets. Engineers connect to the cluster over SSH through Identity-Aware Proxy (IAP) rather than through publicly exposed endpoints.
- Centralised Licensing: We implemented a centralised license server, allowing nodes to fetch updates and licenses securely without being exposed to the open internet.
- Version Flexibility: The environment was engineered to be compatible with all versions of Siemens STAR-CCM+. We implemented an automated image pipeline that allows for “zero-touch” upgrades of STAR-CCM+ versions, ensuring the engineering team always has access to the latest solver optimisations.
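As a rough illustration of the IAP access path described above, the helper below assembles a `gcloud compute ssh` command that tunnels through IAP. The instance name, zone, and project ID are hypothetical placeholders; only the `gcloud` flags themselves are standard.

```python
import shlex
from typing import List


def iap_ssh_command(instance: str, zone: str, project: str) -> List[str]:
    """Build the argv for an SSH session tunnelled through IAP.

    The command is only constructed here, not executed, so the sketch
    runs without the gcloud CLI installed.
    """
    return [
        "gcloud", "compute", "ssh", instance,
        f"--zone={zone}",
        f"--project={project}",
        "--tunnel-through-iap",  # route SSH via Identity-Aware Proxy
    ]


# Hypothetical login node, zone, and project ID.
cmd = iap_ssh_command("hpc-login-0", "us-central1-a", "example-project")
print(shlex.join(cmd))
```

Because the tunnel is authorised through IAM, access can be granted or revoked per engineer without distributing VPN credentials or opening firewall ports to the internet.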
Multi-Tiered Costing & Workload Optimisation
To save the client money, we matched every task to the right level of power. Small jobs ran on cheaper machine types, while heavy simulations ran on high-performance nodes, so the client never paid for more Google Cloud capacity than a job needed.
- The “Accelerator” Tier (GPU Workloads): Utilising G2 instances (NVIDIA L4 GPUs) for accelerated solvers and visualisation.
- The “Sprint” Tier (C4D / High-Compute CPU): Powered by c4d-highcpu-384 nodes. These are the gold standard for Siemens solvers, offering the massive memory bandwidth required for fluid dynamics.
- The “Marathon” Tier (N2D / Cost-Effective CPU): Mapped to N2D standard VMs. By utilizing Google’s spare capacity, we delivered massive savings for less urgent parameter sweeps where cost per core is the priority.
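The tiering logic above can be sketched as a simple routing rule. This is a minimal illustration, assuming hypothetical partition names (`accelerator`, `sprint`, `marathon`) and two job attributes; the actual routing criteria used in the project are not described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """A simulation job with the two attributes this sketch routes on."""
    name: str
    needs_gpu: bool = False
    urgent: bool = False


def choose_partition(job: Job) -> str:
    """Map a job to a (hypothetical) Slurm partition name.

    GPU work goes to the accelerator tier, urgent CPU work to the
    high-performance tier, and everything else to the cheapest tier.
    """
    if job.needs_gpu:
        return "accelerator"
    if job.urgent:
        return "sprint"
    return "marathon"


for job in (Job("render", needs_gpu=True),
            Job("cfd-release", urgent=True),
            Job("param-sweep")):
    print(job.name, "->", choose_partition(job))
```

Defaulting to the cheapest tier means a job only lands on premium hardware when someone has explicitly flagged it as needing it.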
Specialised Accelerators
The Aviato “Cloud Burst” Framework: Using the Toolkit, we deployed a dedicated Slurm controller and login node that abstract the cloud’s complexity.
- Accelerator: Google Cloud HPC Cluster Toolkit. Effort saved: hours of manual infrastructure configuration.
- Accelerator: Aviato Elastic Scaling Scripts. Effort saved: hours of manual job management.
Results & Business Impact
By leveraging specialised infrastructure and Aviato’s automation, we delivered a system that is as powerful as a supercomputer but as flexible as a startup.
Does it get queued?
In a traditional environment, “queuing” means your job sits idle until someone else finishes. In our solution, while a “queue” exists in Slurm, it acts as a trigger rather than a barrier.
- The Magic: When an engineer submits a 1,000-core job, the Slurm controller invokes its ResumeProgram script, which calls the Google Compute Engine API to create the nodes the job needs.
- Provisioning: GCP spins up the required nodes in minutes. They boot, mount the storage, and start the job.
- Total Wait: The wait time has been reduced from days of hardware unavailability to minutes of boot time.
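The flow above can be sketched in a few lines. Slurm passes ResumeProgram a hostlist expression naming the nodes to power up; the script expands it and creates one VM per name. This is a simplified illustration, not the Toolkit's actual script: the node names and the 192-cores-per-node figure are assumptions, `create_node` is a hypothetical stand-in for the Compute Engine API call, and production scripts usually expand hostlists with `scontrol show hostnames` rather than a hand-written parser.

```python
import math
import re
from typing import Callable, List


def nodes_needed(cores_requested: int, cores_per_node: int) -> int:
    """Number of whole nodes required to satisfy a core request."""
    return math.ceil(cores_requested / cores_per_node)


def expand_hostlist(expr: str) -> List[str]:
    """Expand a simple Slurm-style hostlist such as 'sprint-[0-2,5]'.

    Handles one bracket group with comma-separated numbers and ranges;
    zero-padded indices and nested groups are out of scope here.
    """
    match = re.fullmatch(r"(.+)\[([\d,\-]+)\]", expr)
    if match is None:
        return [expr]  # a single plain hostname
    prefix, ranges = match.groups()
    names = []
    for part in ranges.split(","):
        start, _, end = part.partition("-")
        for index in range(int(start), int(end or start) + 1):
            names.append(f"{prefix}{index}")
    return names


def resume(expr: str, create_node: Callable[[str], None]) -> List[str]:
    """Create every node named in the hostlist and return the names."""
    names = expand_hostlist(expr)
    for name in names:
        create_node(name)  # hypothetical wrapper around the cloud API
    return names


# Assumed: 192 usable cores per node, so a 1,000-core job needs 6 nodes.
print(nodes_needed(1000, 192))
print(resume("sprint-[0-5]", lambda name: None))
```

The mirror-image SuspendProgram hook deletes idle nodes, which is what lets the cluster shrink back to zero between jobs.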
Profits & Personnel Shift
- Cost Efficiency: “Scaling to zero” ensures the client stops paying for compute nodes once a job finishes and the idle nodes are shut down. The use of N2D VMs for the “Marathon” tier reduced compute costs for non-urgent jobs.
- Operational Profit: The shift from CapEx to OpEx freed up capital for R&D.
- Personnel Change: The engineering team has been liberated from “hardware troubleshooting.” IT personnel have shifted from managing physical servers to optimising simulation parameters and data science models, increasing the firm’s overall high-value output.
The Power of Partnership
Aviato Consulting was the right fit for this transformation due to our unique intersection of cloud expertise and industrial engineering knowledge:
- Domain Expertise: Deep understanding of the Siemens STAR-CCM+ ecosystem and the specific memory bandwidth requirements of CFD/FEA solvers.
- Technical Proficiency: Advanced Google Cloud certifications and proven experience with the HPC Cluster Toolkit, Terraform, and automated software deployment pipelines.
Strategic Advantages
HPC in the cloud is about re-architecting for agility. By partnering with Aviato Consulting, you gain:
- Cloud-Native Expertise: Scalable, secure, and cost-optimised architectures designed for the most demanding solvers.
- Future-Proof Infrastructure: Automated upgrades and access to the latest silicon (Intel Sapphire Rapids, NVIDIA L4 GPUs) without the hardware refresh cycle.
Conclusion
Aviato Consulting modernized a global engineering firm’s simulation process by moving their heavy workloads from aging, slow on-premises servers to an automated Google Cloud environment. By replacing queue gridlock with an elastic system that scales to over 1,000 CPUs within minutes, engineers can now run complex fluid dynamics and structural tests on demand rather than waiting days for hardware to become available.
This migration served as a critical capability upgrade, allowing the engineering team to accelerate their R&D cycles without being constrained by fixed compute capacity. Financially, this allowed the client to transition from a rigid CapEx hardware refresh cycle to a flexible OpEx model, significantly reducing idle compute waste while bursting to 1,000+ cores only when required. Ultimately, the firm traded hardware headaches for a scalable supercomputer that only costs money when it’s actually working.
Legacy HPC infrastructure shouldn’t be the bottleneck for engineering innovation. If your solver queue times are impacting project delivery, let’s discuss how a tailored Google Cloud architecture can resolve it.