Job Description:
Position: Site Reliability Engineer (SRE) - Observability & Incident Response
Duration: 6 Months
Location: Remote

Job Description: 
  • For a large financial services client, we are seeking Site Reliability Engineers (SREs) with expertise in the AWS cloud.
  • In this role, your responsibilities span traditional IT and software development for a portfolio of one or more applications as you bridge the gap between developers and IT operations in a mature DevOps culture.
  • As an SRE, your responsibilities for your application portfolio are ever expanding, from observability and incident response to automation and software development that improves resiliency and application functionality.
  • The ideal candidate will come from a Java software engineering background and will have solid experience in AWS, error budgeting, reliability models, toil elimination, observability, and incident management.
Responsibilities:
  • Work with architecture and development teams to create resilient and reliable architecture and fault-tolerant application design using performance engineering and chaos engineering principles.
  • Develop and deploy tools and utilities to automate manual operational tasks in production.
  • Analyze production utilization and incident patterns, identify areas for improvement, and implement automation to improve productivity and to avoid manual tasks and recurring incidents.
  • Work with application stakeholders to define non-functional requirements (NFRs) covering performance, scalability, availability, resiliency, and reliability, including Service Level Objectives, Service Level Indicators, and Error Budgets.
  • Independently determine the needs of the customer while identifying and resolving conflicting or complementary needs across customer groups.
  • Develop strategies to address non-functional requirements throughout the software or product development life cycle.
  • Own incidents related to NFRs: update SOPs to capture the right set of metrics and logs for root cause analysis (RCA), perform RCA of incidents, identify solutions, and ensure permanent closure of incidents.
  • Apply advanced skills, knowledge, and experience to design and develop software solutions that meet customer needs.
  • Use a process-driven approach to lead the design of solutions.
  • Implement new software technology and coordinate simultaneous implementation tasks across teams.
  • May maintain or oversee the maintenance of existing software.
Required Experience and Qualifications:
  • Bachelor’s degree and 7+ years of relevant professional experience.
  • Strong AWS Cloud experience (4+ years)
  • Current or recent Site Reliability Engineer experience (2+ years)
  • Experience with incident management and response (2+ years)
  • Observability experience with Splunk, Dynatrace, or Datadog, including building dashboards (2+ years)
  • Excellent verbal and written communication skills, with experience presenting information and ideas to an audience in a way that is engaging and easy to understand
Additional Helpful Experience:

  • AWS Certified Solutions Architect, AWS Certified SysOps Administrator, Splunk Certified Developer, Dynatrace, Sun Certified Java Programmer.
  • Automation experience with Selenium, Blue Prism, or Ansible.
  • Expertise in resiliency and in building fault-tolerant design patterns.
  • Experience collaborating cross-functionally on availability / performance issues in order to identify root cause, determine areas for improvement, and drive those actions to closure through effective solutions.
  • Extensive knowledge of principles, advanced techniques, and theories to suggest and implement solutions on a specific project, program, or product.
  • Adept at managing project plans, resources, and people to ensure successful project completion in an Agile/Scrum environment.
  • Experience mentoring teams in the writing of performance and chaos engineering strategies and scripts.
  • Skilled as a full stack developer with a focus on cross-platform optimization and responsiveness of applications.
  • Strong understanding and knowledge of Java/J2EE technologies and frameworks – UI/JavaScript frameworks, Spring Boot/Spring Cloud frameworks, REST, microservices, and server-side frameworks.
  • Knowledge of cloud technologies and containerization using Docker and Kubernetes.
  • Experience with DevOps/CI/CD tools, including Jenkins, Jules, and automated deployment tools.
  • Working knowledge of at least one Unix operating system.
  • Knowledge of performance tuning of enterprise-level Java/J2EE applications (web and application server configuration, JVM parameter tuning, GC and heap size, message broker).
  • Experience implementing resiliency design patterns using Hystrix, Resilience4j, a service mesh, or similar frameworks, and validating them with Chaos Monkey-type frameworks.
  • Skilled in cloud technologies and cloud computing, including Amazon Web Services (AWS) offerings, development, and networking platforms.
  • Experience defining, measuring, and improving reliability metrics (SLO/SLI), observability (monitoring, logging, and tracing solutions), operations processes (incident and problem management), and operations toil reduction through automation.
  • Experience designing, building, and implementing dashboards covering application and infrastructure health, using tools such as Splunk, Dynatrace, and Datadog, to provide a single-pane view of all critical business and operational information to relevant stakeholders.
             
