Site Reliability Engineer
100% Remote
Mandatory skills: CI/CD; AWS and/or GCP; Python, Bash, or Groovy; monitoring tools like Datadog; Ansible; JMeter.
Key Responsibilities
• Support and enhance observability (monitoring, logging, alerting) across production systems
• Help maintain SLIs/SLOs for key services
• Participate in evaluating services for production readiness
• Collaborate with development teams to identify reliability risks and improve system architecture
• Contribute to automation of operations, including CI/CD pipelines, incident response, and infrastructure provisioning
• Participate in incident response and on-call rotations for critical services
• Contribute to post-incident analysis and drive reliability improvements
• Partner with security, infrastructure, and product teams to support performance, compliance, and operational excellence
Must-Haves
• Willingness to work onsite and participate in a 24/7 on-call rotation as needed
• 5+ years of experience managing and supporting high-traffic digital platforms
• Strong experience with CI/CD pipelines and deployment automation
• Experience with cloud platforms such as AWS and/or GCP
• Solid scripting skills (e.g., Python, Bash, Groovy)
• Hands-on experience with observability and monitoring tools like Datadog, New Relic, AppDynamics, or similar
• Understanding of web, mobile, and OTT architectures
• Experience supporting large-scale websites, mobile and OTT applications, microservices, APIs, and distributed systems
• Experience with infrastructure-as-code tools such as Ansible, Terraform, or Chef
• Familiarity with performance testing tools like JMeter or k6
• Hands-on experience with debugging tools like Charles Proxy or Fiddler
Preferred Qualifications
• Experience working with CDNs (e.g., Akamai) and reverse proxies (e.g., NGINX, Varnish)
• Exposure to video streaming platforms
• Familiarity with application/infrastructure security controls and best practices
• Certifications in SRE, DevOps, or Performance Engineering are a plus