Job Description:
Position: HPC System Administrator
Location: Raleigh, NC



Principal Duties and Responsibilities
Cluster and Systems Administration: Manage and administer production systems used by researchers.
Maintain software environments: install and configure open-source and commercial scientific applications.
Deploy and maintain hardware and/or cloud solutions for research scientific computing. This includes CPU- and GPU-based grid compute, high-speed networking and GPFS data storage.
Leverage industry-standard system monitoring and reporting tools to ensure the maintainability, scalability and availability of the infrastructure environment.
Assist researchers in running applications (support, installation, configuration).
Troubleshoot scheduler submission problems, manage space usage, and assist with user access and Linux command line help. Applications include R, Python, etc.
Analyze and resolve customer and technical problems: tuning cluster scheduling parameters, memory/CPU contention, scientific application compilation and run-time issues.
Analyze results of server monitoring and implement changes to improve performance, processing and utilization. Propose, maintain and enforce policies, practices and security procedures.
Provide break/fix support, setup/installation support, escalation support, and solutions support.
Develop and maintain system documentation as well as user-facing knowledge base articles and how-to guides.
Maintain the inventory and tracking of HPC computer-related equipment.
Perform other duties as required by the situation and circumstances.

Skills/Abilities/Competencies Required
Must be capable of contributing within a team, exhibit a high level of initiative, and have an eagerness to learn new technologies.
Demonstrated ability to provide systems administration and HPC cluster and scheduler troubleshooting to a community with diverse computing needs.
Knowledge of PuppetLabs Puppet management software and cloud computing platforms is desirable but not required.
Candidate must possess advanced knowledge and understanding of Linux server configurations (RHEL, CentOS, or equivalent), including networking, and of systems scripting (Unix shells, Python, Bash).
Knowledge and understanding of security and monitoring software packages, including but not limited to O/S, network, and application monitoring (Nagios, Argus, Ganglia) and intrusion detection.
Understanding of common network protocols such as DHCP, DNS, SMTP, and HTTP.
Ability to multitask and prioritize work requirements, keeping team and management informed.
Excellent interpersonal skills to communicate effectively with cross-functional teams, including technical and non-technical staff at all levels of the organization.