We are seeking an experienced Site Reliability Engineer (SRE) with 12+ years of hands-on experience in building, maintaining, and improving large-scale, highly available systems. The ideal candidate will have strong skills in automation, cloud infrastructure, performance optimization, monitoring, and incident response.
- Design, implement, and maintain highly reliable, scalable, and secure systems.
- Develop automation to reduce manual operational tasks and eliminate recurring issues.
- Build and maintain CI/CD pipelines to support continuous delivery and deployment.
- Manage cloud infrastructure (AWS, Azure, or GCP), including networking, security, and scaling.
- Create and maintain monitoring, logging, and alerting systems using modern tooling.
- Lead incident response, root-cause analysis, and post-incident reviews.
- Improve system performance and reliability through capacity planning and performance tuning.
- Work closely with software engineering teams to ensure smooth production operations.
- Implement infrastructure-as-code using Terraform, Ansible, or similar tools.
- Ensure compliance with security and operational standards.
- 12+ years of experience in Site Reliability, DevOps, or Production Engineering roles.
- Strong hands-on experience with cloud platforms (AWS, Azure, or GCP).
- Expertise in CI/CD pipelines and automation tools such as Jenkins, GitLab, or GitHub Actions.
- Proficiency with containerization and orchestration (Docker, Kubernetes).
- Experience with monitoring tools such as Prometheus, Grafana, ELK, Splunk, or Datadog.
- Strong scripting/programming skills (Python, Bash, Go, or similar).
- Familiarity with networking concepts, load balancing, and distributed systems.
- Solid understanding of security best practices and infrastructure governance.
- Experience managing high-availability systems in production environments.