Job Description:
Sr. Site Reliability Engineer with .NET/Windows/C++ or C#
A candidate with strong DevOps experience, some SRE experience, and a .NET development background may also be a fit.
Interview: all remote video calls

JOB SUMMARY:
Site Reliability Engineers (SREs) are embedded directly with our product and engineering teams, working closely with them to design, develop, ship, and motivate the creation of software and systems that increase product reliability and organizational efficiency. SREs drive adoption of modern reliability practices such as defining SLIs/SLOs, establishing error budgets, participating in on-call rotations, conducting blameless postmortems, chaos testing, and taking end-to-end ownership with the teams they work with.

MINIMUM QUALIFICATIONS AND REQUIREMENTS:
· Bachelor’s degree and 5+ years of relevant experience, or Master’s degree with 4 years of prior relevant experience. At least 3 years of relevant professional experience may be substituted for a bachelor’s degree (i.e., a minimum of 8 years of relevant experience)
· 5+ years’ experience with C++, C#, ASP.NET, JavaScript, and various scripting languages (PowerShell, Python, Bash, etc.)
· 5+ years’ experience with orchestration tools such as ServiceNow, Chef, SCCM, Puppet, Terraform, etc.
· 4+ years’ experience as a Site Reliability Engineer
· Proficiency in distributed systems (architectures, microservices)
· Proficiency in Container Orchestration tools (Docker, Kubernetes)
· Proficiency in Cloud Services and Architecture (AWS, Azure)
· Proficiency in modern monitoring tools (New Relic, PRTG)
· Proficiency in logging services (Splunk, ELK Stack)
· Proficiency in networking concepts (TCP/IP, routing, firewalls, network security, triaging, packet loss)
· Knowledge of database technologies and basic DBA skills (Mongo, Postgres, MySQL)
· Knowledge of PCI Security Standards
· Experience working in a SaaS environment at scale
· Strong understanding of the Software Development Life Cycle

PRINCIPAL DUTIES AND RESPONSIBILITIES:
50% of your time will be spent on the following:
· Ensure sufficient logging, monitoring and alerting strategies around availability, latency, and overall system health.
· Partner with teams to establish SLIs/SLOs for their services.
· Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
· Participate in and co-host incident reviews and blameless postmortems.
· Continuously improve Incident and Problem Management practices, automation, tools, and implementation.
· Drive incidents and problems to root cause; document, follow up on, and track remedial and preventive actions end to end.
· Participate in an on-call rotation supporting 24/7 rapid response to both customer-impacting incidents and leading indicators of system distress.
· Collaborate on projects to improve IT services by creating documentation, providing recommendations, and upgrading or expanding systems.
· Solve novel and exciting problems with an emphasis on reducing toil.
· Mentor and coach junior engineers.
· All other duties as assigned.
             
