About the job
The AI Infrastructure Engineer (L3) provides advanced engineering and architectural expertise for high‑performance AI and ML infrastructure. This role focuses on building, optimizing, and scaling GPU/accelerator environments and distributed systems for large‑scale training and inference workloads.
Responsibilities
- Deploy, configure, and manage GPU and AI accelerator platforms (NVIDIA A100/H100/L40, AMD Instinct, TPU).
- Troubleshoot GPU hardware and software issues, including failures, thermal throttling, PCIe/NVLink topology, and driver conflicts.
- Install, upgrade, and maintain GPU software stacks, including drivers, CUDA, cuDNN, TensorRT, and firmware.
- Perform capacity planning and resource optimization for AI training, fine‑tuning, and inference workloads.
- Optimize Linux systems (Ubuntu, RHEL, Rocky) for AI/HPC workloads through NUMA, kernel, and clock tuning.
- Manage distributed and high‑performance storage systems, including BeeGFS, Lustre, Ceph, and high‑throughput NFS.
- Operate high‑bandwidth, low‑latency networks, including InfiniBand, RoCE, RDMA, and NVLink.
- Administer Kubernetes GPU clusters, leveraging NVIDIA GPU Operator, device plugins, MIG, and node feature discovery.
- Support AI and HPC orchestration platforms, including Kubeflow, Ray, MLflow, and Slurm/PBS.
- Configure and manage GPU scheduling and sharing strategies, such as node pools, quotas, job queues, and fair‑share policies.
- Optimize distributed training workflows using NCCL, PyTorch Distributed, Horovod, and DeepSpeed (see the NCCL sketch below).
- Operate and tune LLM and inference runtimes, including vLLM, Triton Inference Server, and TensorRT‑LLM (see the vLLM sketch below).
- Monitor and tune GPU utilization, memory allocation, and container-level performance.
- Automate cluster provisioning and operations using Terraform, Helm, Kustomize, and GitOps (ArgoCD/Flux).
- Build automation for GPU diagnostics, node onboarding, and model deployment workflows (see the diagnostics sketch below).
- Implement observability and telemetry using Prometheus, Grafana, NVIDIA DCGM, and OpenTelemetry.
- Lead deep‑dive root cause analysis for GPU, network, storage, and orchestration issues.
- Provide L3 support and work with L1/L2 teams on escalations.
- Drive production readiness, patching, hotfix rollout, and reliability improvements across AI infrastructure.
- Handle troubleshooting and escalation for complex platform failures, including deep debugging of NCCL hangs and GPU fabric issues, and coordinate with OEMs and support vendors on critical issues.
- Review RCAs, architecture documents, and change plans.
- Act as technical advisor to leadership and customers.
Qualifications & Experience
- Bachelor’s degree in Computer Science, Engineering, Information Technology, or a related field.
- 8–12 years of overall infrastructure or platform engineering experience.
- 4–6 years of specialized experience supporting AI/ML workloads.
- Demonstrated experience in large‑scale GPU/accelerated computing and distributed systems.
- Strong experience with Kubernetes, containerization, and orchestration tools.
- Understanding of AI workloads and MLOps.
Certifications Required
- NVIDIA Certified Associate – AI Infrastructure
- NVIDIA NPN Certification
- NVIDIA Base Command Manager certification
- AWS Solutions Architect Associate
- CKA – Certified Kubernetes Administrator
- CKAD – Certified Kubernetes Application Developer
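To make the distributed-training responsibility concrete: a minimal sketch of the kind of NCCL sanity check an engineer in this role might run when validating a new GPU cluster, assuming PyTorch and torchrun are installed on the nodes; the script name and launch command are illustrative, not part of the posting.
```python
# allreduce_check.py -- minimal NCCL sanity check (illustrative).
# Launch with: torchrun --nproc_per_node=8 allreduce_check.py
import os

import torch
import torch.distributed as dist

def main():
    # torchrun populates RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes its rank id; all_reduce sums the values across the job.
    # If this hangs, suspect the NCCL/fabric layer (NVLink, IB, RoCE) before the model code.
    t = torch.full((1,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    if dist.get_rank() == 0:
        expected = sum(range(dist.get_world_size()))
        print(f"all_reduce sum = {t.item()} (expected {expected})")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```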
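For the inference-runtime responsibility, a minimal vLLM offline-inference sketch; the model name, parallelism, and memory settings are placeholders to be adapted to the cluster's GPUs.
```python
# vllm_smoke_test.py -- illustrative offline inference check (pip install vllm).
from vllm import LLM, SamplingParams

# Model and tensor_parallel_size are placeholders; match them to the node's GPU count.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    tensor_parallel_size=2,        # shard the model across 2 GPUs
    gpu_memory_utilization=0.90,   # fraction of each GPU's memory vLLM may claim
)
params = SamplingParams(temperature=0.7, max_tokens=64)
for out in llm.generate(["Explain NVLink in one sentence."], params):
    print(out.outputs[0].text)
```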
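For the GPU-diagnostics responsibility, a sketch of a node health probe using NVIDIA's NVML Python bindings; the 85 °C threshold is an assumed site policy for flagging thermal-throttling risk, not a vendor-specified limit.
```python
# gpu_health.py -- hypothetical node health probe using NVML (pip install nvidia-ml-py).
import pynvml

THERMAL_LIMIT_C = 85  # assumed site policy, not an NVIDIA value

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)  # % GPU and memory activity
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)         # bytes used/free/total
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        flag = " <-- HOT" if temp >= THERMAL_LIMIT_C else ""
        print(f"GPU{i} {name}: util={util.gpu}% mem={mem.used / 2**30:.1f} GiB temp={temp}C{flag}")
finally:
    pynvml.nvmlShutdown()
```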
Requirements
- AI Infrastructure
- GPU Orchestration
- Kubernetes
- Cloud Optimization
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or Information Technology
About the company
HCLTech offers a range of IT services, focusing on advanced engineering and architectural expertise for AI and ML infrastructure.