Job Description:

Job Title: Systems Engineer (Site Reliability Engineer (SRE)) 

Job ID: 37137 

Location: Bothell, WA 98011 

Duration: 12 Months with possible extensions 

Interview Process: Phone/WebEx 

Number of Positions: 6 

Locations: Bothell, WA; Dallas, TX; Atlanta, GA; San Ramon, CA; Chicago, IL; St. Louis, MO. Candidates will need to be onsite once COVID restrictions are lifted.

**Contract-to-Hire position**

Required Skills: 

Site Reliability Engineer: 

Java/Python/Shell scripts:

Production Support / Operations environment: 

Docker/Kubernetes/Cloud:

UNIX/Networking/troubleshooting: 

Agile/Lean Agile/Scaled Agile: 

Quantum Metric/TeaLeaf: 

Dynatrace/AppDynamics/Introscope: 

Kibana/Grafana: 

EFK stack (preferred): 

Top 5 Skills / Additional Job Posting Description Details:

• 3+ years’ experience in Java, Python, Shell scripts, Node.js, React.js 

• 5+ years’ experience in Production Support / Operations environment 

• 2+ years’ experience using Docker, Kubernetes and Cloud environments 

• 2+ years of strong UNIX, Networking and troubleshooting knowledge 

• 3+ years of experience in Agile, Lean Agile and/or Scaled Agile methodologies 

• 2+ years of experience with a Customer Experience Analytics tool such as Quantum Metric or TeaLeaf

• Solid understanding of and experience with Application Performance Monitoring tools such as Dynatrace, AppDynamics, Introscope, etc.

• Experience with visualization tools like Kibana and Grafana. EFK stack experience preferred. 

• Excellent communication and collaboration skills. Able to speak confidently and work with people from different backgrounds. Should be able to facilitate meetings independently and articulate ideas and issues thoroughly.

About the Job

Our Digital Operations team is looking for a Site Reliability Engineer (SRE) who is passionate about the customer experience and has the analytical and multitasking abilities to thrive in a fast-paced environment. The SRE is responsible for ensuring that, as new features and applications are introduced to production, essential aspects of reliability such as availability, resiliency, latency, efficiency, change management, monitoring, emergency response, and capacity planning are addressed alongside development of the new features/applications. The SRE will develop automation code and scripts to proactively address customer issues, reduce mean time to repair, and improve application availability. The position also includes collaborating closely with feature delivery teams as a bridge between development and operations by applying a software engineering mindset to system administration. This position will split time between operations/on-call duties and developing systems and software that help increase site reliability and performance to deliver business value. The Software Engineer/SRE will need intimate knowledge of the current state of data-center and cloud infrastructure, CI/CD pipeline tools, Kubernetes, and Site Reliability Engineering practices, and the ability to implement the plan for the desired future state. Attention to detail and strong analytical skills are required, along with a “Customer-First” attitude!

Roles & Responsibilities:  

1) Design, develop, implement, and document end-to-end solutions across IT and Network organizations.  

2) Lead planning with third-party IVR vendors to create interfaces/APIs that support integrations with the Hosted Integrated Contact Services platform.  

3) Provide requirements as part of solutions to automate operations, eliminate manual reentry, improve cycle time and meet work center needs and product needs.  

4) Develop end-to-end flows to support Sales/Contracting, Ordering, Provisioning, Maintenance, Reporting, external user interfaces, system test, and User Acceptance Test.  

5) Troubleshoot network problems with third-party vendors and customers.  

6) Investigate issues across multiple systems and create network reports. 

Responsibilities and Day-to-Day View

• Build software to help operations and support teams - Proactively build and implement services to make operations more effective and reduce toil. This ranges from adjusting monitoring and alerting to automating scripts and code in production. The candidate may be tasked with building a homegrown tool from scratch to help with issues in software delivery or to resolve impacts from outages/incidents. 

• Fix support escalation issues; optimize on-call rotations and processes - Improve system reliability through the optimization of on-call processes. Add automation and context to alerts, leading to better real-time collaborative response from on-call responders. Additionally, update runbooks, tools and documentation to help prepare on-call teams for future incidents. 

• Document “tribal” knowledge - Gain exposure to systems in both staging and production, and take part in work with software development, support, IT operations and on-call duties, to build up historical knowledge over time. Instead of siloing this knowledge, keep documentation and runbooks constantly up to date so that teams get the information they need right when they need it.

• Conduct post-incident reviews - Run thorough and transparent post-incident reviews to keep teams honest and ensure that everyone is conducting reviews, documenting their findings and taking action on their learnings. Take action items for building or optimizing parts of the SDLC or incident. 

             
