About the job
The AI Infrastructure Engineer (L3) provides advanced engineering and architectural expertise for high‑performance AI and ML infrastructure. This role focuses on building, optimizing, and scaling GPU/accelerator environments and distributed systems for large‑scale training and inference workloads.
Responsibilities
- Deploy, configure, and manage GPU and AI accelerator platforms (NVIDIA A100/H100/L40, AMD Instinct, TPU).
- Troubleshoot GPU hardware and software issues, including failures, thermal throttling, PCIe/NVLink topology, and driver conflicts.
- Install, upgrade, and maintain GPU software stacks, including drivers, CUDA, cuDNN, TensorRT, and firmware.
- Perform capacity planning and resource optimization for AI training, fine‑tuning, and inference workloads.
- Optimize Linux systems (Ubuntu, RHEL, Rocky) for AI/HPC workloads through NUMA, kernel, and clock tuning.
- Manage distributed and high‑performance storage systems, including BeeGFS, Lustre, Ceph, and high‑throughput NFS.
- Operate high‑bandwidth, low‑latency networks, including InfiniBand, RoCE, RDMA, and NVLink.
- Administer Kubernetes GPU clusters, leveraging NVIDIA GPU Operator, device plugins, MIG, and node feature discovery.
- Support AI and HPC orchestration platforms, including Kubeflow, Ray, MLflow, and Slurm/PBS.
- Configure and manage GPU scheduling and sharing strategies, such as node pools, quotas, job queues, and fair‑share policies.
- Optimize distributed training workflows using NCCL, PyTorch Distributed, Horovod, and DeepSpeed (see the NCCL sketch below).
- Operate and tune LLM and inference runtimes, including vLLM, Triton Inference Server, and TensorRT‑LLM (see the vLLM sketch below).
- Monitor and tune GPU utilization, memory allocation, and container-level performance.
- Automate cluster provisioning and operations using Terraform, Helm, Kustomize, and GitOps (ArgoCD/Flux).
- Build automation for GPU diagnostics, node onboarding, and model deployment workflows (see the diagnostics sketch below).
- Implement observability and telemetry using Prometheus, Grafana, NVIDIA DCGM, and OpenTelemetry.
- Lead deep‑dive root cause analysis for GPU, network, storage, and orchestration issues.
- Provide L3 support and work with L1/L2 teams on escalations.
- Drive production readiness, patching, hotfix rollout, and reliability improvements across AI infrastructure.
- Handle troubleshooting and escalation for complex platform failures, including deep debugging of NCCL hangs and GPU fabric issues, and coordinate with OEMs and support vendors on critical issues.
- Review RCAs, architecture documents, and change plans.
- Act as technical advisor to leadership and customers.
Qualifications & Experience
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field.
- 8–12 years of overall infrastructure or platform engineering experience.
- 4–6 years of specialized experience supporting AI/ML workloads.
- Demonstrated experience in large‑scale GPU/accelerated computing and distributed systems.
- Strong experience with Kubernetes, containerization, and orchestration tools.
- Understanding of AI workloads and MLOps.
Certifications Required
- NVIDIA Certified Associate – AI Infrastructure
- NVIDIA NPN Certification
- NVIDIA Base Command Manager certification
- AWS Solutions Architect Associate
- CKA – Certified Kubernetes Administrator
- CKAD – Certified Kubernetes Application Developer
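To make the distributed-training responsibility concrete: a minimal sketch of the kind of NCCL sanity check an engineer in this role might run when validating a new GPU cluster, assuming PyTorch and torchrun are installed on the nodes; the script name and launch command are illustrative, not part of the posting.
```python
# allreduce_check.py -- minimal NCCL sanity check (illustrative).
# Launch with: torchrun --nproc_per_node=8 allreduce_check.py
import os

import torch
import torch.distributed as dist

def main():
    # torchrun populates RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank id; all_reduce sums the values across the job.
    # If this hangs, suspect the NCCL/fabric layer (NVLink, IB, RoCE) before the model code.
    t = torch.full((1,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        expected = sum(range(dist.get_world_size()))
        print(f"all_reduce sum = {t.item()} (expected {expected})")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```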
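For the inference-runtime responsibility, a minimal vLLM offline-inference sketch; the model name, parallelism, and memory settings are placeholders to be adapted to the cluster's GPUs.
```python
# vllm_smoke_test.py -- illustrative offline inference check (pip install vllm).
from vllm import LLM, SamplingParams

# Model and tensor_parallel_size are placeholders; match them to the node's GPU count.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,        # shard the model across 2 GPUs
    gpu_memory_utilization=0.90,   # fraction of each GPU's memory vLLM may claim
)
params = SamplingParams(temperature=0.7, max_tokens=64)
for out in llm.generate(["Explain NVLink in one sentence."], params):
    print(out.outputs[0].text)
```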
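For the GPU-diagnostics responsibility, a sketch of a node health probe using NVIDIA's NVML Python bindings; the 85 °C threshold is an assumed site policy for flagging thermal-throttling risk, not a vendor-specified limit.
```python
# gpu_health.py -- hypothetical node health probe using NVML (pip install nvidia-ml-py).
import pynvml

THERMAL_LIMIT_C = 85  # assumed site policy, not an NVIDIA value

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % GPU and memory activity
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes used/free/total
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        flag = " <-- HOT" if temp >= THERMAL_LIMIT_C else ""
        print(f"GPU{i} {name}: util={util.gpu}% mem={mem.used / 2**30:.1f} GiB temp={temp}C{flag}")
finally:
    pynvml.nvmlShutdown()
```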
Requirements
- AI Infrastructure
- GPU Orchestration
- Kubernetes
- Cloud Optimization
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or Information Technology
About the company
HCLTech offers a range of IT services, focusing on advanced engineering and architectural expertise for AI and ML infrastructure.