Job Description:

ROLE: SRE Developer
LOCATION: San Francisco / LA / Seattle, WA (Remote)
DURATION: Long Term Contract

 

Responsibilities:

·        Ensure the reliability, availability, and performance of services through stability and automation product development, disaster recovery planning, emergency response, chaos engineering, and system resilience improvements.

·        Manage services, including operational support, 24x7 troubleshooting, and automation design and development, including deployment.

·        Troubleshoot and diagnose issues, then propose and implement solutions to reduce how often they recur.

·        Meet service-level agreements (SLAs) or service-level objectives (SLOs) by measuring and monitoring service availability, performance, and overall system health.

·        Provide production system management, change management, and incident response, including emergency response and postmortems.

·        On-call rotation is required.

 

Minimum qualifications:

·        Bachelor's degree or higher in Computer Science or a related field

·        Must be a responsible self-starter with strong interpersonal skills, comfortable with ambiguity, and an excellent communicator and problem solver, with 5 to 7 years’ experience in technical operations, DevOps, and/or infrastructure support and excellent Linux skills

·        3+ years of hands-on experience supporting an application stack through the Linux CLI

·        3+ years of application troubleshooting experience working with Linux internals (kernel, processes, threads, memory, etc.)

·        3+ years of bash/shell scripting for automation

·        Good understanding of the TCP and UDP protocols to support the Linux platform

·        5+ years of experience in one or more of the following types of systems at their newest versions:

·        Prior experience with configuration and maintenance of common applications such as DNS, Nginx, Docker, Kubernetes, and MySQL

·        Working knowledge of scripting and programming languages, including bash, Python, and Go

·        Experience supporting infrastructure and services ranging from on-prem to public cloud environments (GCP or AWS)

·        Available on a 24x7x365 basis when needed for production-impacting incidents or key customer events

·        Familiarity with Redis and/or MongoDB, Kafka, RocketMQ, HDFS, Mesos, YARN, Spark, Hive, Terraform, and/or Elasticsearch

·        Familiarity with Git

·        Experience in debugging and automating routine tasks

·        Oracle Cloud support, automation experience, and technical writing and design experience are a plus

·        Excellent team player focused on getting things done

·        Experience supporting or managing systems at scale (tens of thousands to hundreds of thousands of instances) is a big plus

             
