Job Description:

Job Title: Site Reliability Engineer (SRE) (multiple openings)

Location: Plano, TX (onsite required from day 1, or within one to two months)

Mandatory Skills:

Jenkins, Puppet, Dynatrace, AppDynamics, Kubernetes, monitoring tools, cloud, AWS, Java, microservices, Ubuntu (Linux), Maven, Grafana



• Responsible for how code is deployed, configured, and monitored, as well as the availability, latency, change management, emergency response, and capacity management of services already in or going to production.


• Design, code, test, and deliver software to automate manual operational work and to develop self-service, auto-detection, and self-healing capabilities.

• Develop software for reliability and scale, ensuring minimal refactoring or changes.

• Define, monitor, and defend SLOs.

• Deploy closed-loop remediation (continuous testing and remediation) to fix problems in pre-production before software is released to production.

• Build custom tooling from scratch to meet specific needs in the incident management workflow.

• Resolve complex incidents across public cloud, private cloud, third-party, and on-premises technology.

• Leverage Chaos Engineering to find and prevent future problems and to confirm that fixes from past incidents function as intended.

• Focus on end-user experience and partner with development teams to implement changes that increase uptime and performance based on empirical evidence.

• Troubleshoot priority incidents, facilitate blameless post-incident evaluations, and ensure permanent closure of incidents.

• Identify application patterns and analytics in support of better service level objectives.

• Design performance tests, identify bottlenecks, optimization opportunities, and capacity demands, and present solutions for continuous improvement.

• Design best-in-class monitoring frameworks to accomplish end-to-end flow monitoring and noiseless alerting.

• Design automated software and product upgrades, change management, and release management solutions.



• Bachelor’s degree or equivalent experience in a software engineering discipline.

• 2-3 years of SRE or systems engineering experience.

• Expertise in designing, coding, testing, and delivering software in at least one technology stack, e.g., Java, Python, C++, or Go.

• Deep knowledge of Internet protocols and web services technologies, e.g., HTTP, DNS, TCP/UDP, SOAP, JSON, Apache, Tomcat, and REST.

• Experience working with containers, e.g., Docker, Kubernetes, Cloud Foundry, etc.

• Experience working with automation tools, e.g., Ansible, Puppet, Selenium, etc.

• In-depth OS experience, e.g., RHEL, Ubuntu, Windows Server, with strong debugging, troubleshooting, and problem-solving skills.

• Testing and build automation with a continuous integration/continuous delivery (CI/CD) pipeline, e.g., Travis CI, Maven, Gradle, Groovy, Git, Terraform, Jenkins, etc.

• Experience deploying and managing services on modern cloud platforms, e.g., AWS, GCP, Azure.

• Strong experience using industry-standard monitoring tools, e.g., AppDynamics, Dynatrace, APICA, Splunk, ELK, Fluentd, Prometheus, Kibana, Elasticsearch, Grafana, Nagios, Datadog, New Relic, etc.

• Advanced understanding of the application monitoring stack (logs, events, metrics, and alerts) and the ability to visualize and set up end-to-end observability.

• Certification in one or more cloud technologies, e.g., AWS, Azure, GCP, or Red Hat, is a big plus.

