Job Description:

The ideal candidate will collaborate with the core teams, combining software and engineering practices to strengthen application and system reliability as well as operational support. Advanced knowledge of system architecture, networking, application development, testing, and operational stability will help transform the way the teams operate today. The candidate will possess advanced scripting and coding capabilities to develop artifacts that correlate alerts and events ingested from diverse monitoring sources, and will leverage AI/ML to automate recovery actions.

Five or more years of experience as a Site Reliability Engineer
Architect a new framework to establish an SRE Model across multiple teams
Develop new processes to prevent problem recurrence and automate recoveries
Enhance SLO trending and centralized reporting (e.g., Grafana dashboard integration)
Identify opportunities to improve architecture/engineering practices
Mentor staff to replace manual processes with automation
Collaborate across all levels of the organization to drive the SRE model
Advanced experience in supporting enterprise container-based platforms
Strong Systems & Network Architecture experience
Experience with cloud technologies, including architecting, developing, or maintaining solutions in public cloud environments (AWS/OCI/GCP)
Data ingestion & enrichment – Webhooks, REST API design, JSON, XML, SMTP
CI/CD – Deployment pipeline experience (Jenkins, Ansible)
DevOps container/orchestration tools (Kubernetes, Docker, Puppet, etc.)
Good knowledge of Python, Bash, or similar scripting languages
Experience with Configuration Management systems
Knowledge of Unix/Linux-based systems and experience troubleshooting applications running on these systems
Experience with software lifecycle including design, implementation, and delivery
Expertise in designing, analyzing and troubleshooting large-scale distributed systems
Ability to apply a systematic approach to solve problems with a sense of ownership and focus
Effective communication skills with the ability to articulate technical details to different audiences

Requirements (emphasis on Moogsoft)
Installation, Infra & Config:
Linux Systems Administration and Operations experience.
Network Administration experience.
JavaScript experience.
Familiarity with the Moogsoft installation procedures.

Integrations & Dev:
Familiarity with Webhooks, REST API, JSON, XML, SMTP.
Development experience with a popular scripting language (Python) and Unix Shell Scripting.
Familiarity with SQL queries.
Proficient in Jenkins & Ansible.
Proficient in Grafana reporting tools.

Clustering & Workflows:
Familiarity with Operations (SRE) workflows, responsibilities and organizational structures.
Familiarity with predetermined and dynamic correlation, entropy, anomaly detection concepts.
Strong SQL/Percona DB experience.
Experienced communicator and collaborator.

Platform Monitoring:
Systems Administration and Operations experience.
Network Administration experience.
Development experience with a popular scripting language (Python, Go, Ruby), JavaScript, and Unix Shell Scripting.
Familiarity with Moogsoft components and data flows.
Understanding of monitoring and metrics concepts (volume, performance, capacity).

