Senior Generative AI Engineer
About the job
About the Role
We are building the next generation of multimodal foundation models that unify 3D volumetric data and natural language understanding. Our work extends beyond traditional 2D vision–language models (e.g., CLIP, LLaVA, SigLIP, Chitrarth, Patram) into models that can reason over, align, and generate insights from complex 3D structures, including scientific, medical, geospatial, and industrial volumes.
This role is ideal for researchers who have contributed to multimodal alignment, cross-lingual VLMs, document understanding, or Indian-language foundation models, and are excited to bring those innovations into 3D reasoning and semantic grounding. You will help define a new class of models at the frontier of AI.
What You Will Work On
- Design and train 3D–text foundation models capable of interpreting MRI/CT/voxel grids, LiDAR cubes, or multi-view 3D data paired with textual descriptions.
- Extend concepts from CLIP-style contrastive learning, vision encoders, multilingual encoders, and instruction tuning to volumetric domains.
- Develop architectures that align 3D latent representations with language embeddings from multilingual LLMs.
- Create novel multimodal datasets, including synthetic 3D volumes, annotated natural-language corpora, and domain-specific ontologies.
- Publish at top venues (NeurIPS, ICML, CVPR, EMNLP, ACL) and collaborate with universities and research partners.
- Push the frontier on cross-lingual multimodal grounding, making 3D understanding accessible across India's linguistic diversity.
- Work with product teams to translate foundational breakthroughs into real-world applications in healthcare, robotics, manufacturing, and scientific discovery.
Why This Role Appeals to Multimodal + CLIP Authors
- Opportunity to extend 2D multimodal learning to 3D, a wide-open research space.
- Freedom to experiment with contrastive learning, cross-modal retrieval, tokenization strategies for volumes, and multilingual conditioning.
- Work with a team that values publication, benchmark building, and open science, similar to the CM-CLIP, Chitrarth, or Patram ecosystems.
- Direct impact on creating the first India-built 3D multimodal foundation model.
- Access to high-compute clusters (A100/H100), data partnerships, and funding for ambitious long-horizon research.
What We're Looking For
- Research experience in multimodal learning, transformers, contrastive vision–language models, or LLM fine-tuning.
- Demonstrated contributions to CLIP-like models, VLMs, document understanding, Indian-language foundation models, or multilingual NLP.
- Strong skills in PyTorch/JAX, the Hugging Face ecosystem, and large-scale training.
- Experience with (or desire to learn) 3D data representations: voxels, meshes, NeRFs, point clouds, medical imaging, or multi-view embeddings.
- Ability to build and articulate research hypotheses, run ablations, and produce clear publications.
- Passion for pushing India's leadership in foundation model research.
Nice to Have
- Background in 3D geometry, 3D CNNs, diffusion models, or implicit neural representations.
- Experience with Indian languages, code-mixed data, or multilingual VLMs.
- Prior work with weakly supervised or noisy multimodal datasets.
- Contributions to open-source multimodal libraries or benchmarks.
What We Offer
- Compensation commensurate with contributions.
- Compute-rich environment.
- Full support for publishing and attending top conferences.
- Collaboration with leading researchers in language, vision, and 3D AI.
- A mission-driven setting where your work advances scientific and societal impact.
Requirements
- 3D Volumetric Data
- Natural Language Understanding
- Multimodal Models
- PyTorch
- JAX
Preferred Technologies
- 3D Volumetric Data
- Natural Language Understanding
- Multimodal Models
- PyTorch
- JAX
Benefits
- Compensation commensurate with contributions
- Compute-rich environment
- Support for publishing
- Collaboration with leading researchers