Job Description:

Conduct post-mortem reviews of system downtime with internal stakeholders, and put short- and long-term solutions in place to eliminate repeat occurrences.
Conduct risk analysis to review system shortcomings that present risk of downtime for application stacks. Continuously improve our internal processes and controls to ensure optimal performance.
Implement DevOps changes and rollouts, and shepherd deployments to a successful outcome.
Combine software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that our internally critical and externally visible systems have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Use configuration management tools to create repeatable environments.
Create dashboards that communicate overall system health, and alert on it, in a form less technical colleagues can follow.
Develop system configuration management templates, and audit systems against those templates over the system lifecycle.
Work with developers to quickly identify and address issues to provide smooth code rollouts and seamless change back-out when there are problems.


Perform code deployments.
Install, configure and maintain middleware and ESB environments.
Install, configure and maintain cloud hosting technologies.
Install, configure and maintain API gateways.
Perform routine load testing of our systems.
Optimize platform builds and automation.
Perform system and service performance tuning, troubleshooting and debugging.
Use tools like Puppet, Satellite, Jenkins, Hudson, ELK Stack, Terraform, Ansible, Salt and Splunk.


Bachelor's degree in Computer Science or a similar area. Relevant experience may be considered in lieu of a degree.
Minimum of four (4) years of Linux systems administration experience.
Working knowledge of, and experience with, networking fundamentals.
Expert-level skill in scripting and automation.
Expertise in high-availability and load-balancing technologies.
Willingness to document technical processes and share knowledge with others. Capable of following and writing process and procedure documentation, as well as training other users on complex topics.
Ability to interact with colleagues from all levels of the organization, both technical and non-technical, and communicate technical ideas effectively.
Proven ability to work independently with minimal supervision.