Job Description

Qualifications:

  • 5+ years of experience in Site/System Reliability, DevOps, or related roles.
  • Strong skills in Linux/Unix administration and scripting (Python, Bash, or similar).
  • Proficiency with cloud platforms (AWS, Azure, GCP) and with containerization and container orchestration (Docker, Kubernetes).
  • Knowledge of networking fundamentals (TCP/IP, DNS, load balancing).
  • Experience with monitoring tools (Prometheus, Grafana, DataDog).
  • Exposure to observability tools and log aggregation systems.
  • Experience with CI/CD systems (Jenkins, GitHub Actions, GitLab CI).
  • Knowledge of security practices (IAM, encryption, secrets management).
  • Experience with incident management frameworks and SRE principles.
  • Knowledge of performance tuning and capacity planning.
  • Strong analytical and problem-solving skills.

Role and responsibilities:

  • Design, implement, and maintain monitoring, logging, and alerting systems.
  • Define and track Service Level Indicators (SLIs), Objectives (SLOs), and Agreements (SLAs).
  • Conduct post-incident reviews and implement preventive measures.
  • Automate deployment, scaling, and operational tasks using Infrastructure-as-Code tools (Terraform, Ansible, CloudFormation).
  • Implement CI/CD pipelines and release management processes.
  • Optimize infrastructure for reliability, performance, and cost efficiency.
  • Respond to production incidents, perform root cause analysis, and implement solutions.
  • Collaborate with development teams to ensure system robustness.
  • Maintain runbooks and operational documentation.
  • Partner with software developers, QA, DevOps, and product teams to improve system reliability.
  • Promote best practices in coding, testing, and deployment.
  • Advocate for proactive measures to prevent outages and reduce operational toil.
  • Ensure systems adhere to security, compliance, and governance standards.
  • Participate in vulnerability assessments and remediation planning.

Equal Opportunity Employer
We are an equal opportunity employer. All aspects of employment, including the decision to hire, promote, discipline, or discharge, will be based on merit, competence, performance, and business needs. We do not discriminate on the basis of race, color, religion, marital status, age, national origin, ancestry, physical or mental disability, medical condition, pregnancy, genetic information, gender, sexual orientation, gender identity or expression, citizenship/immigration status, veteran status, or any other status protected under federal, state, or local law.

             
