Infrastructure Engineer – Platform
Company
Orcrist is building a next-generation data intelligence platform that handles petabyte-scale data with sub-second queries. Our product is a Kubernetes-based platform delivered as B2B SaaS or as a self-hosted on-prem solution, including air-gapped deployments. We enable customers across defense, law enforcement, and enterprise to turn mission-critical data into actionable intelligence. Our Platform team owns the infrastructure that powers every deployment, from the metal up.
Role
Kubernetes runs on something, and that something is yours. You’ll own the layer beneath our platform: bare-metal GPU servers, operating systems, networking, and storage across on-prem and fully air-gapped sites. You’ll design, build, and operate GPU server fleets and the NVIDIA software stack, then partner with our SRE and ML teams to deliver fast, reliable on-prem inference. Some of this work is hands-on at customer sites, where you’ll size, rack, and commission self-contained server environments that run with no internet uplink.
What you'll do
- Design, size, provision, and operate bare-metal GPU server fleets across on-prem and air-gapped environments (firmware/BIOS, BMC via Redfish/IPMI, OS, drivers) with zero-touch provisioning (PXE/iPXE, MAAS/Metal3/Tinkerbell) and automation (Ansible/Salt, Terraform/Pulumi).
- Own the NVIDIA GPU stack end to end: drivers, CUDA, GPU Operator, Container Toolkit, MIG, and DCGM, tuned for inference throughput, latency, and utilization.
- Build the bare-metal substrate Kubernetes runs on: node lifecycle, container runtime, GPU device plugins, node feature discovery, and kernel/NUMA tuning.
- Engineer data-center networking and resilient storage (VLANs/switching, RDMA, Ceph/ZFS/NVMe) sized to scale without replacing the core, with encryption at rest.
- Partner with ML and MLOps on on-prem inference serving (Triton, KServe, vLLM): model deployment, GPU scheduling and sharing, and performance tuning.
- Plan and run on-site build-outs: rack integration, power/UPS and cooling sizing, commissioning, capacity planning, runbooks, and operator handover.
About You
- 5+ years in bare-metal, HPC/GPU, data-center, or systems infrastructure engineering, with hands-on ownership of physical and compute infrastructure.
- Strong bare-metal Linux skills (RHEL/Rocky/Ubuntu): firmware, BMC, PXE, kernel and storage tuning, plus solid networking and storage fundamentals.
- Real experience with the NVIDIA GPU stack (drivers, CUDA, GPU Operator, MIG, DCGM) and serving models on GPUs in production.
- Comfortable operating in air-gapped or on-prem environments and traveling to customer sites for builds and deployments.
- Documentation-focused, methodical, and calm during hardware incidents.
- Eligible to work in Germany.
Nice‑to‑haves
- German language (B1+), NVIDIA DGX/HGX or Slurm experience, InfiniBand/RDMA fabrics, and inference optimization (TensorRT-LLM, vLLM, quantization).
- Certifications such as NVIDIA NCP-AIO, Red Hat RHCSA/RHCE, or CKA/CKS.
- Field-engineering experience and familiarity with secure or regulated deployment environments.
What We Offer
- Modern architecture & stack.
- Remote‑first in Germany with occasional team events in Berlin.
- Home office budget and great equipment.
- 30 days vacation.
- Direct impact on critical missions across private and public‑sector customers.
