Company: AMD

Location: Hyderabad, TS, India

Career Level: Mid-Senior Level

Industries: Technology, Software, IT, Electronics

Description

WHAT YOU DO AT AMD CHANGES EVERYTHING

At AMD, our mission is to build great products that accelerate next-generation computing experiences—from AI and data centers, to PCs, gaming and embedded systems. Grounded in a culture of innovation and collaboration, we believe real progress comes from bold ideas, human ingenuity and a shared passion to create something extraordinary. When you join AMD, you'll discover the real differentiator is our culture. We push the limits of innovation to solve the world's most important challenges—striving for execution excellence, while being direct, humble, collaborative, and inclusive of diverse perspectives. Join us as we shape the future of AI and beyond. Together, we advance your career.

Senior Staff Engineer – Training Software Release & Performance Infrastructure

THE ROLE:

AMD is looking for a highly motivated senior individual contributor to help build and scale a training software release and performance validation capability in Hyderabad. You will own critical pieces of infrastructure and execution for training software stack releases, nightly performance validation, and regression triage for large-scale AI workloads on AMD Instinct™ accelerators.

THE PERSON:

This role is pivotal to ensuring the quality, stability, and performance competitiveness of AMD's AI training software ecosystem across PyTorch, JAX, Megatron-LM, Torchtitan, and related frameworks. You will work hands-on across CI, benchmarking, automation, triage, and cross-stack debugging spanning frameworks, ROCm components, kernels, drivers, and compilers.

KEY RESPONSIBILITIES:

Training Stack Release Ownership (Hands-on IC)
- Own key aspects of end-to-end training software stack releases, including release readiness checks, validation planning, sign-off inputs, and issue tracking.
- Build and maintain release qualification workflows: reproducible test environments, checklists, dashboards, and reporting.
- Define and implement quality/performance gates for internal and external releases (e.g., pass/fail criteria, perf thresholds, stability metrics).
Nightly Performance Validation & Regression Detection
- Design, implement, and operate nightly performance regression testing for representative LLM and multimodal training workloads.
- Own benchmark methodology for repeatability and signal quality (noise reduction, run-to-run variance, statistical checks, and alerting).
- Develop automation for triage routing, trend analysis, and “known issue” suppression to reduce false positives.
Failure Triage & Root-Cause Analysis Across the Stack
- Drive debugging and root-cause analysis of test failures and performance regressions.
- Produce high-signal defect reports: minimal repros, bisects, perf traces, and clear ownership recommendations.
- Partner with component owners to ensure fixes land quickly and are validated end-to-end.
Infrastructure, CI/CD, and Benchmarking Engineering
- Build/extend CI systems for distributed multi-node training validation (scheduling, orchestration, artifact capture, result publishing).
- Improve reliability of test infrastructure: environment management, dependency pinning, containerization, and fleet health checks.
- Create and maintain dashboards and tooling for performance tracking, regression attribution, and release status.
Cross-Team Technical Collaboration
- Work closely with global teams across frameworks (PyTorch, JAX, Megatron-LM, Torchtitan), ROCm drivers, kernels, compiler, and hardware, as well as performance modeling and customer enablement teams.
- Communicate clearly with stakeholders in the U.S., China, and India on regression impact, release risks, and priorities.
- Provide technical leadership through design reviews, documentation, and mentoring (without direct people management requirements).

PREFERRED EXPERIENCE:

- Strong background in systems engineering or ML infrastructure, with experience working on large, complex software stacks.
- Hands-on experience with deep learning training frameworks such as PyTorch, JAX, Megatron-LM, or similar.
- Experience building or operating CI/CD pipelines, test automation, and production-grade validation workflows.
- Demonstrated ability to perform performance benchmarking and regression analysis (profiling, tracing, statistical rigor, reproducibility).
- Solid understanding of GPU/accelerator performance, distributed training concepts, and multi-node systems (networking/collectives/data parallelism).
- Strong communication skills and experience collaborating across global, cross-functional teams.
- Experience supporting production training releases or nightly CI for large ML platforms.
- Familiarity with LLM training, MoE models, and large-scale distributed training stacks.
- Experience working close to hardware/software co-design or performance optimization (kernel-level tuning, compiler interactions, memory/communication bottlenecks).
- Experience building new infrastructure or owning a new technical charter from scratch (driving technical direction as an IC).

ACADEMIC CREDENTIALS:

Master's degree in Computer Science, Computer Engineering, Electrical Engineering, or equivalent.

Benefits offered are described in "AMD benefits at a glance."

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.

AMD may use Artificial Intelligence to help screen, assess or select applicants for this position. AMD's “Responsible AI Policy” is available here.

This posting is for an existing vacancy.

Senior Staff Engineer – Training Software Release & Performance Infrastructure Job Listing at AMD in Hyderabad, TS (Job ID 80519-en-us)
