This hands-on HPC platform engineer role is responsible for designing and reviewing a future-proof HPC compute platform in collaboration with the computation infrastructure team. You will maintain the stability of the existing production HPC platform while establishing standards and processes for future HPC platforms and a uniform, repeatable deployment strategy. You will play a key role in troubleshooting the HPC computing platform, including software, hardware, container virtualization, network, and other software issues, even when these extend beyond the platform. Your main duties include defining robust processes for software and platform development, installation, and upgrades (Ansible, K8s, containers, Docker, shell scripts).
- Support HPC software applications in both research and production environments
- Identify, design, and implement architecture solutions that efficiently and effectively meet the high-throughput needs of image-processing computing infrastructure
- Enhance, debug, and maintain legacy computation software systems.
- Analyze the performance of the computation system to help identify performance bottlenecks.
- Analyze and debug software issues and provide technical support.
- Implement unit tests and follow good practices in integration testing, regression testing, and documentation.
- Collaborate on and evaluate designs and solutions for cloud applications, hardware, and software.
- Apply parallel computing techniques on multi-core computational systems
- Strong collaboration skills with manufacturing and design teams
- Maintain and create the Linux OS environment playbooks used in software deployment.
- Support development teams in San Jose and at other sites when they experience potential software platform issues
- Identify the implications of moving from one software version to the next.
- Develop automated tests that can be reused across platform changes and upgrades to ensure no regressions are introduced.
- Be able to work with Linux and Python for test execution and scripting purposes.
- Associate’s or Bachelor’s Degree or equivalent experience in related field
- 10+ years of IT experience
- 10 years of experience designing and architecting Linux environments (specifically Linux HPC)
- Experience with Load Sharing Facility (Platform LSF) is highly desirable
- Experience with the IBM HPC (High Performance Computing) platform is highly desirable
- Experience administering and supporting Ansible or another configuration management system (CMS)
- General experience with MMR (management, monitoring, and reporting), specifically Nagios and/or the ELK stack, is desirable
- Experience configuring and maintaining SELinux and firewalld
- User maintenance tasks with knowledge of the integration with Active Directory
- Ability to set up and install a full Linux, Apache, MySQL, MongoDB, and PHP (LAMP) environment from scratch
- Ability to set up and administer a Subversion/Scmbug/Bugzilla system for version control
- Knowledge of Linux networking setup
- Understanding of Yum and RPM for package management
- Ability to write scripts in one of the major shell scripting languages for use in cron and for system administration
- Understanding of Postfix configuration
- Understanding of Samba
- Understanding of SSH, RSA keys and their setup
- Understanding of the Linux init process
- Understanding of how to monitor CPU, memory, disk space, and overall performance is essential
- Strong communication skills and the ability to work well with others are essential
- Understanding of cloud technologies, such as AWS, Azure, etc.
- Experience with IBM HPC, GPFS, and TSM (Tivoli Storage Manager)
- Background in Perl, grep, sed, awk
- Understanding of how enterprise server hardware is set up and how to add devices to the configuration
- Expertise with high-performance networking, ideally with MPI, NCCL, RDMA, and/or InfiniBand
- Experience with GPUs in large scale networks strongly preferred
- Deep understanding of TCP/IP and the Linux networking stack
- Experience developing high-quality software in a general-purpose programming language (Python, C, C++, Go, etc.)
- Experience with virtualization and container architecture in cloud environments
- Experience configuring, administering, and supporting network storage subsystems (e.g. IBM, NetApp)
- Working knowledge of Microsoft Windows system administration, in order to communicate effectively with other members of the SysAdmin team
- Red Hat certification is a definite plus
- Experience working with EMC, IBM, or other enterprise storage technologies
Core skills & Competencies:
- Ability to collaborate with others
- Excellent communication skills
This job description reflects management’s assignment of essential functions. Nothing in this job description restricts management’s right to assign or reassign duties and responsibilities to this job at any time.
Caris Life Sciences is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, religion, color, national origin, gender, gender identity, sexual orientation, age, status as a protected veteran, among other things, or status as a qualified individual with disability.
Interested parties, please email your resume and cover letter to Valarie Perez at email@example.com. #LI-VP1