Job Description:
ROLE : SRE Developer
LOCATION : San Francisco / LA / Seattle, WA (Remote)
DURATION : Long Term Contract
Responsibilities:
· Ensure the reliability, availability, and performance of services through stability and automation product development, disaster recovery planning, emergency response, chaos engineering, and system resilience improvements
· Manage services, with responsibility for operational support, 24x7 troubleshooting, and automation design and development, including deployment
· Troubleshoot and diagnose issues, then propose and implement solutions to reduce how often they recur
· Meet service-level agreements (SLAs) or service-level objectives (SLOs) by measuring and monitoring service availability, performance, and overall system health.
· Provide production system management, change management, and incident response, including emergency response and postmortems.
· On-call rotation is required.
Minimum qualifications:
· Bachelor's degree or above, majoring in Computer Science or related fields
· Responsible self-starter with strong interpersonal skills, comfortable with ambiguity, an excellent communicator and problem solver, with 5 to 7 years’ experience in technical operations, DevOps, and/or infrastructure support and excellent Linux skills.
· 3+ years of hands-on experience supporting an application stack through the Linux CLI
· 3+ years of application troubleshooting experience working with Linux internals (kernel, process, thread, memory, etc.)
· 3+ years of bash/shell scripting for automation
· Good understanding of TCP/UDP protocols to support the Linux platform
· 5+ years of experience in one or more of the following types of systems at their newest versions:
· Prior experience with configuration and maintenance of common applications such as DNS, Nginx, Docker, Kubernetes, and MySQL
· Working knowledge of shell scripting in bash, as well as Python and Go
· Experience supporting infrastructure and services ranging from on-prem to public cloud environments (GCP or AWS)
· Available on a 24x7x365 basis when needed for production-impacting incidents or key customer events
· Familiarity with Redis and/or MongoDB, Kafka, RocketMQ, HDFS, Mesos, YARN, Spark, Hive, Terraform, and/or Elasticsearch
· Familiarity with Git
· Experience in debugging and automating routine tasks
· Oracle Cloud support, automation, technical writing, and design experience are a plus.
· Excellent team player focused on getting things done
· Experience supporting or managing systems at scale (tens of thousands to hundreds of thousands of instances) is a big plus