Job Description:
Site Reliability Engineers (SREs) fill the mission-critical role of ensuring that our complex, web-scale systems are healthy, monitored, automated, and designed to scale. You will use your background as an operations generalist to work closely with our development teams from the early stages of design all the way through identifying and resolving production issues. The ideal candidate is passionate about an operations role that requires deep knowledge of both the application and the product, and believes that automation is a key component of operating large-scale systems.

Responsibilities:
Serve as a primary point of contact responsible for the overall health, performance, and capacity of one or more services
Gain deep knowledge of our complex applications
Assist in the roll-out and deployment of new product features and installations to facilitate our rapid iteration and constant growth
Develop tools to improve our ability to rapidly deploy and effectively monitor custom applications in a large-scale UNIX environment
Work closely with development teams to ensure that platforms are designed with "operability" in mind
Function well in a fast-paced, rapidly changing environment
Participate in a 24x7 rotation for second-tier escalations


Basic Qualifications:
BA/BS Degree in Computer Science or related technical discipline, or related practical experience
4+ years of experience with Python, Go, Ruby, Java, or a similar programming language
2+ years of experience with Linux systems engineering and administration
1+ years of experience with cloud (Azure, AWS, GCP) architecture and administration


Preferred Qualifications:
Python experience, specifically for systems automation
Experience with configuration management software like Puppet, Chef, CFEngine, Ansible, Terraform, or Airflow
Experience dealing with time series data for service insights and reporting
Strong interpersonal communication skills (including listening, speaking, and writing) and ability to work well in a diverse, team-focused environment with other SREs, Engineers, Product Managers, etc.
Good RESTful API and systems design sensibilities
Experience with general performance tuning and optimization of all aspects of platforms and services (systems, network, code)
Experience in troubleshooting that spans systems, network, and code
Broad understanding of Internet protocols and network programming
The primary scope of work is Airflow and Python programming and managing Azure resources.
             
