Job Description:
Key Responsibilities
Establish Kubeflow as a production-grade service offering.
Establish AI Platform as a production-grade service offering.
Partner with Cortex Platform product teams to deliver comprehensive integrated solutions, e.g.:
ML Pipeline (TFX & KFP) orchestration on Kubeflow Pipelines
Model Serving on Kubeflow Serving
Distributed Training and Tuning on AI Platform Training
Deployments on Kubeflow and AI Platform Notebooks
Partner with Platform teams (CAT, Security, Compute, etc.) to ensure security and data handling compliance for Kubeflow and AI Platform.
Establish tooling as needed to support the team mission, e.g. tools to facilitate automated provisioning of new Kubeflow clusters on-demand.
Provide documentation, best practices, recommendations, and other guidance on using Kubeflow and AI Platform for the use cases above.
Establish SLAs for ML Infra system components.
Participate in 24/7 on-call rotations (PagerDuty) for mission-critical ML Infra system components.


Key Skill Sets
Expertise in Kubernetes and Docker
Expertise in Kubeflow, particularly in systems operations (Kubernetes, Istio, etc.), debugging, deployment on GKE, and utilization
Expertise in AI Platform, particularly in integration, debugging, and utilization
Deep understanding of the ML Engineering process and tooling ecosystems, particularly as they map to GCP
Ability to develop and contribute to tools written in Python and/or Bash
Ability to quickly and deeply understand customer and engineering partner requirements and work with various teams to adapt solutions accordingly
Ability to collaborate with upstream projects (e.g. Kubeflow) and vendors (e.g. Google) to report bugs, request feature enhancements, and contribute code and documentation upstream