About the job
PrimaLabs builds systems that help enterprises run large-scale AI workloads efficiently on real hardware. We focus on optimizing inference performance, cost, and reliability across modern accelerator platforms, working directly with enterprise customers deploying frontier models on next-generation GPUs and AI accelerators. Our platform continuously discovers optimal runtime configurations to maximize throughput, reduce latency, and improve cost efficiency.

Role Overview
PrimaLabs is hiring a Senior ML Systems Engineer to own the optimization engine that runs on real customer hardware. You will tune and benchmark inference systems across GPUs such as the NVIDIA H200 Tensor Core GPU, NVIDIA B200 Tensor Core GPU, and AMD Instinct MI300X. Your work will power PrimaLabs’ automated optimization stack, including runtime tuning, benchmarking pipelines, and integration with large-scale hyperparameter search frameworks such as DeepHyper. You will also work directly with customers during deployments, ensuring our system delivers measurable performance gains on real production infrastructure.
Key Responsibilities

Inference Runtime Optimization
• Tune and optimize inference systems using vLLM and SGLang
• Profile model performance across different hardware and runtime configurations
• Identify and eliminate performance bottlenecks (memory bandwidth, kernel inefficiencies, batching behavior)

Benchmarking & Performance Analysis
• Design and execute benchmark suites for real customer workloads
• Measure throughput, latency, memory utilization, and cost efficiency
• Build standardized benchmarking frameworks for new models and hardware

Optimization Infrastructure
• Build systems for large-scale configuration sweeps and automated tuning
• Integrate runtime parameters, hardware constraints, and workload characteristics into search pipelines
• Maintain and extend the DeepHyper-based optimization pipeline

Customer Deployments
• Work directly on enterprise deployments running on modern AI accelerators
• Support benchmarking and optimization during customer onboarding
• Deliver performance improvements tailored to customer hardware environments

Hardware-Aware Systems Engineering
• Optimize workloads across GPUs, including:
  • NVIDIA H200 Tensor Core GPU
  • NVIDIA B200 Tensor Core GPU
  • AMD Instinct MI300X
• Understand memory hierarchy, GPU scheduling, and model parallelism strategies

Required Background
• 5+ years of experience in ML infrastructure or high-performance ML systems
• Deep experience with LLM inference runtimes
• Strong skills in:
  • Performance profiling
  • GPU utilization optimization
  • Systems debugging
• Hands-on experience with:
  • vLLM, SGLang, or similar inference runtimes
  • GPU profiling tools
  • Python and systems-level debugging

Nice to Have
• Experience working with large-scale inference serving systems
• Familiarity with GPU kernel profiling tools (Nsight, ROCm profiler)
• Experience with distributed inference or model parallelism
• Exposure to hyperparameter optimization frameworks such as DeepHyper
• Previous work with cutting-edge AI hardware deployments

What Makes This Role Unique
• Work directly with next-generation AI hardware
• Solve real performance problems on enterprise deployments
• Build the core optimization engine of PrimaLabs
• Close collaboration with founders and direct impact on customer success
Requirements
- ML infrastructure
- High-performance ML systems
- LLM inference runtimes
- Performance profiling
- GPU utilization optimization
- Systems debugging
- vLLM
- SGLang
- GPU profiling tools
- Python
Qualifications
- 5+ years of experience in ML infrastructure or high-performance ML systems