Job Description:

Job Title: DevOps Lead with Incident Management Experience

Location: Philadelphia, PA (Onsite)

Duration: Long Term

Position Type: Contract


LinkedIn ID is a must


Education: BE/B.Tech, MCA, BSc (IT), BCA

Must Have:  

7+ years of experience managing server-level incidents and running incident management programs, preferably in large-scale environments

5+ years of experience with DevOps tools such as Ansible, Kubernetes, Puppet, Chef, Jenkins, Docker, SVN, and Git to integrate automation and manage applications

Expert in Linux system administration; must be RHEL certified

Experience creating and modifying Python, Ruby, and shell scripts for application-level tasks

Experience designing and implementing servers on the OpenStack platform with Terraform

Experience working with JIRA and ServiceNow to plan, track, support, and close requests, tickets, and incidents

Working experience installing and configuring monitoring tools such as Splunk, Kibana, Grafana, OP5, and Prometheus across different environments

Experience working in a sprint-based environment


Additional Notes:

Outstanding written and verbal communication and presentation skills, excellent listening skills, and a high degree of empathy

Good analytical and problem-solving skills to troubleshoot system problems and analyze complex architectural environments

ITSM expertise and ITIL certification are an added advantage

Role and Responsibilities:

· Facilitate end-to-end (E2E) coordination of critical incidents occurring in the client environment (business applications and infrastructure)

· Take responsibility for 24/7 application-level and server-level monitoring using multiple tools. Build proactive monitoring for all environments and self-healing for repeat incidents in Nagios, OP5, Splunk, and Grafana

· Initiate bridge calls and coordinate with the resolver groups until the issue is resolved

· Provide senior-level troubleshooting and coordinate technical restoration actions and plans for Major Incidents and P1 and P2 incidents in a multi-supplier ecosystem

· Ensure outages are driven end to end by maintaining authority during technical bridges, for faster and successful resolution

· Perform Root Cause Analysis (RCA) for domain-related incidents; create and maintain recovery playbooks and Standard Operating Procedures (SOPs) for commonly occurring customer patterns and issues

· Liaise with High Priority Incident Manager counterparts and stay current on industry trends to develop and promote best practices

· Use GSH, Puppet, and Ansible to test, triage, and fix bugs in lower environments, with frequent feedback from production infrastructure and applications

· Proactively monitor SLA performance and report on it accurately. Ensure MTTD, MTTT, and MTTR targets are met, and develop ways to improve product quality using Splunk and Grafana

· Work on user stories in sprints and interact with stakeholders to ensure business requirements are met

· Participate in tabletop and simulated war games, including monitoring the affected DC, failing over to the redundant site, running health checks, noting any issues observed during the activity, coordinating with other teams, and communicating with stakeholders

· Adhere to all best practices with respect to ITSM/ITIL standards

