Senior Technical Operations & Deployment Engineer (GPU Cloud Infrastructure)

Europe

About Radian Arc
Radian Arc provides an infrastructure-as-a-service (IaaS) platform for running cloud gaming, artificial intelligence and machine learning applications inside telecommunication carrier networks. Our teams across the USA, Australia, Central Europe, Malaysia, Singapore and Japan offer telecom operators a GPU-based edge computing platform without the need for capital expenditure, facilitating low latency and improved economics for value-added services and the monetization of 5G investments.

What impact you will have

Mission: Install, validate, operate, and maintain regional and core GPU cloud deployments across datacenter environments.

This role owns the practical deployment and operational readiness of the platform stack in the field. It bridges infrastructure engineering, datacenter operations, networking, host systems, storage, and platform operations. The engineer is responsible for taking a validated architecture and BOM and turning it into a working production environment, including physical deployment coordination, rack and cabling validation, switch and host bring-up, firmware and BIOS validation, operating-system installation, GPU and DPU validation, storage integration, platform stack installation, acceptance testing, and operational handover.

The role is intentionally hands-on and cross-domain. It is neither a pure datacenter technician role nor a pure platform engineering role. It suits someone who can work in a datacenter, understand cabling and optics, debug host and network issues, validate GPU servers, support platform installation, and coordinate with engineering when problems arise.

The role is especially important as the platform evolves from regional deployments toward HGX-based GPU systems, east-west fabrics, and core AI infrastructure. This profile is focused on actual installation, commissioning, maintenance, and operational readiness.

What you'll do

Datacenter deployment and commissioning

  • Coordinate physical deployment activities with datacenter providers, integrators, logistics teams, and internal engineering.
  • Validate rack layouts, elevations, power feeds, airflow assumptions, cable paths, and labeling before installation.
  • Support rack-and-stack activities for GPU nodes, CPU nodes, storage nodes, switches, firewalls, routers, serial/OOB equipment, PDUs, and supporting infrastructure.
  • Validate fibre and copper cabling against the deployment design, including OOB, north-south, east-west, storage, and management networks.
  • Validate optics, transceivers, link speeds, breakout cables, port mappings, and redundancy assumptions.
  • Maintain accurate as-built documentation, including rack elevations, cable maps, port maps, serial numbers, asset records, IP allocations, and change records.

Host and hardware bring-up

  • Bring up GPU servers, platform servers, storage nodes, and supporting infrastructure.
  • Validate BIOS, BMC, firmware, NIC, DPU, GPU, NVMe, RAID/HBA, and platform firmware versions.
  • Configure and validate BMC access using Redfish/IPMI and OOB management networks.
  • Validate GPU visibility, PCIe topology, NUMA layout, thermals, power behavior, and hardware health.
  • Run hardware acceptance tests, burn-in tests, GPU stress tests, network tests, and storage validation before handover.
  • Troubleshoot hardware issues across servers, GPUs, DPUs, NICs, optics, cables, disks, memory, firmware, and BIOS.

Network deployment support

  • Support deployment and validation of OOB, north-south, storage, and east-west networking.
  • Work with networking engineering to apply and validate switch configurations.
  • Validate BGP, ECMP, VLAN/VRF segmentation, EVPN/VXLAN where applicable, OVS/OVN integration, and routing reachability.
  • Validate VyOS routers, OOB firewalls, transit routers, Citrix NetScaler/WAF, and customer connectivity.
  • Support RoCE/RDMA fabric validation for distributed AI workloads where applicable.
  • Troubleshoot practical network issues such as link flaps, optics issues, incorrect polarity, MTU mismatches, route leaks, VLAN errors, packet loss, PFC/ECN issues, and fabric congestion.
  • Support integration with NVIDIA Cumulus / Spectrum-X environments, and assist with Cisco or SONiC-based alternatives if those become part of the roadmap.

Platform stack installation and validation

  • Support installation and validation of the platform stack across regional and core deployments.
  • Install and validate host operating systems, kernel versions, NVIDIA drivers, Mellanox/NVIDIA OFED or inbox drivers, CUDA compatibility, Docker/containerd, KVM/QEMU, and platform agents.
  • Support CloudStack-based deployments and Kubernetes/KubeVirt-based deployments.
  • Validate GPU passthrough, SR-IOV, BlueField NIC/DPU behavior, VM networking, and container networking.
  • Support Kubernetes node registration, GPU Operator validation, CSI validation, CNI validation, and node lifecycle workflows.
  • Support storage integration with StorPool, Weka, local NVMe, or other supported storage platforms.
  • Execute acceptance tests and produce deployment readiness reports.

Operational maintenance and Day-2 support

  • Perform controlled maintenance activities such as firmware upgrades, switch upgrades, host OS updates, GPU driver updates, BIOS changes, and hardware replacements.
  • Support incident response for infrastructure issues affecting GPU nodes, hosts, networking, storage, or platform components.
  • Perform root-cause analysis for deployment and operational failures.
  • Maintain runbooks for installation, validation, upgrade, rollback, troubleshooting, and handover.
  • Work with engineering to turn repeated operational issues into automation, better validation, or platform improvements.
  • Participate in on-call or escalation rotations for regional and core environments where appropriate.

Platform observability and validation

  • Ensure telemetry is correctly configured for hosts, GPUs, DPUs, switches, storage, OOB devices, and platform components.
  • Validate Zabbix, Prometheus, Grafana, Loki, DCGM/NVML, NVIDIA NetQ or equivalent telemetry sources.
  • Confirm that deployment health checks, hardware alerts, performance dashboards, and operational alarms work before production handover.
  • Support performance baseline testing for GPU, network, storage, and host layers.
  • Assist with NCP-related validation and benchmarking.

Cross-team coordination

  • Work closely with the Senior Director of Infrastructure Operations.
  • Work closely with Staff Network, Staff Storage, Sr Hardware/Infrastructure, Sr Platform, Sr Fleet Automation, Observability, Product Engineering, Sales Engineering, and Service Delivery roles.
  • Provide field feedback into reference architectures, BOMs, rack layouts, cabling standards, deployment playbooks, and validation procedures.
  • Coordinate with external vendors including datacenter providers, systems integrators, server vendors, storage vendors, NVIDIA, and networking vendors.
  • Act as the practical field escalation point when architecture, BOM, datacenter conditions, and platform implementation do not align.

Technical Stack

Hardware and datacenter

  • GPU servers: L40S, RTX 6000 Pro, H200, B200/B300-class systems, HGX systems, and future NVL72-style rack-scale systems.
  • CPU/platform servers.
  • Storage nodes and JBODs.
  • PDUs, BMCs, serial/OOB, firewalls, routers, switches.
  • Rack layouts, power feeds, airflow, cold/hot aisle containment, cabling, optics.
  • Direct-to-chip (DTC) and direct liquid cooling (DLC).

Host and systems

  • Ubuntu Linux.
  • Linux networking.
  • BIOS/BMC/firmware lifecycle.
  • Redfish, IPMI.
  • NVIDIA drivers, CUDA, DCGM/NVML.
  • Mellanox/NVIDIA NICs, BlueField DPUs.
  • KVM/QEMU, VFIO, PCI passthrough.
  • Docker/containerd.

Networking

  • NVIDIA Spectrum/Cumulus.
  • VyOS.
  • OVS/OVN.
  • BGP, ECMP, VLAN, VRF, EVPN/VXLAN.
  • RoCE/RDMA.
  • SR-IOV.
  • Citrix NetScaler / WAF.
  • OOB and break-glass access.

Platform

  • CloudStack.
  • Kubernetes.
  • KubeVirt.
  • NVIDIA GPU Operator.
  • CSI/CNI integrations.
  • StorPool, Weka, local NVMe.
  • Terraform, Ansible, Bash, Python.

Observability

  • Zabbix.
  • Prometheus.
  • Grafana.
  • Loki.
  • DCGM Exporter.
  • NVIDIA NetQ or equivalent.
  • Logs, metrics, hardware health, and deployment validation dashboards.

What you'll need

  • Strong hands-on experience deploying and maintaining datacenter infrastructure.
  • Experience with GPU, HPC, AI cloud, private cloud, or high-density compute environments, including both air-cooled and liquid-cooled deployments.
  • Experience with DTC/DLC cooling concepts, including rack manifolds, CDU integration, facility water-loop requirements, coolant specifications, flow/pressure/temperature validation, leak detection, and cooling redundancy.
  • Familiarity with NVL72-style rack-scale architectures, including liquid-cooled GPU trays, NVLink/NVSwitch domains, in-rack networking, high-density power delivery, and OEM/NVIDIA validation requirements.
  • Comfortable working across physical infrastructure, Linux hosts, networking, storage, and platform software.
  • Experience bringing up servers from bare metal through production readiness.
  • Experience with rack layouts, cabling, optics, transceivers, power, OOB management, and datacenter handover.
  • Strong Linux troubleshooting skills.
  • Practical networking knowledge across VLANs, VRFs, BGP, ECMP, OVS/OVN, and routing.
  • Experience with GPU servers, NVIDIA drivers, firmware, PCIe topology, and hardware validation.
  • Experience with infrastructure automation using Bash, Python, Ansible, Terraform, or similar tooling.
  • Strong documentation discipline and ability to produce accurate as-built records and runbooks.
  • Ability to validate practical datacenter readiness for current and next-generation AI infrastructure, including power density, cooling model, rack depth/width, floor loading, containment, serviceability, and maintenance access.

Personal qualities

  • Very hands-on and practical.
  • Comfortable working in datacenters and remotely with smart-hands teams.
  • Strong troubleshooting mindset across hardware, network, host, and platform layers.
  • High attention to detail around cabling, labeling, asset records, and validation.
  • Calm under pressure during deployment windows and customer-impacting incidents.
  • Able to distinguish between a temporary field workaround and a permanent engineering fix.
  • Good communicator who can explain field issues clearly to engineering and leadership.

What we offer

  • An attractive compensation package reflecting your expertise and experience.
  • A great work environment characterised by friendliness, international diversity, flexibility, and a hybrid-friendly approach.
  • A place in a fast-growing scale-up with a mission to make a positive impact, offering an exciting career evolution.

Our job titles may span more than one job level. Actual base pay depends on a number of factors, such as transferable skills, work experience, business needs, and market demands.

Our inclusive responsibility
Radian Arc is committed to creating a diverse and inclusive environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, veteran status, or any other protected category under applicable law.
