Dammam 7 HPC Cluster Aramco Saudi Oct 2020 – Present
Project description
The following activities are managed within the project scope:
Supporting the STCS project delivery team in delivering the cluster to the operations team:
- Helping set the delivery criteria and ensuring the deliverables are delivered correctly
- Supporting various performance tests and preparing the cluster for production
- Participating in developing monitoring systems and automated availability reports
- Supporting the installation and configuration of a replicated central authentication system for the different categories of cluster nodes
- Supporting the implementation team in integrating the authentication system into the cluster OS images
- Designing and implementing central replicated DNS servers to maintain unified hostname schemas for the processing cluster, storage cluster, and general-purpose application nodes
- Configuring the cluster to comply with Aramco security requirements
Operating the Dammam 7 cluster:
- Installing and configuring HPC applications and creating module files
- Assisting Aramco users in installing their applications and troubleshooting errors
- Maintaining the cluster's job scheduler and troubleshooting the causes of job failures
- Updating the OS images to add the required packages
- Tuning the scheduler configuration: controlling the maximum resources a job is allowed to use, and adding cgroups configuration plus prolog and epilog scripts for job tracking, cleanup, and stability
- Adding nodes to the GPFS cluster and creating filesets as required
- Stabilizing the cluster, adding more health checks to the nodes, and configuring triggers to exclude faulty nodes
- Installing additional monitoring tools (InfluxDB, Grafana) for visibility into the different cluster components
Supporting the STCS application team in developing utility applications:
- Deploying a full development environment on the STCS cloud
- Providing the STCS development team with API collections to help integrate the application with Slurm
- Participating in the design process and choosing the most secure way to integrate the application with Aramco databases and with the Dammam 7 cluster
- Assisting the development team in creating job templates and passing input to the jobs from the GUI application
- Assisting the development team in improving the application and adding functionality for managing jobs through the GUI application
Visualization HPC Clusters Aramco Saudi Jul 2020 – Aug 2020
Project description
The following activities are managed within the project scope:
Deployment Planning
HPC Cluster Implementation
- All implementation phases are managed via Ansible playbooks to automate provisioning and configuration.
- Deployment of each cluster's operating system image on up to 136 nodes.
- High availability (HA) for the management nodes.
HPC Cluster UAEU Mar 2020 – Apr 2020
Project description
Implementation of the HPC cluster stack using the following technologies:
- Storage cluster HA on RHEL (Pacemaker and Corosync).
- PowerVault ME4 series storage.
- Cluster management using Bright Computing version 9.0.
- Cluster workload management with PBS Pro, Altair Access Web, and Altair Control.
- Mellanox InfiniBand network.
- Cluster benchmarking with HPL and a burn check for the cluster.
HPC Cluster UAEU Jun 2018 – Jul 2018
Project description
Implementation of the HPC cluster stack using the following technologies:
- Cluster management using Bright Computing version 7.3.
- Cluster workload management with PBS Pro.
- HPL benchmarking and a burn check for the cluster.
Cloud Revamp VFE Feb 2019 – Dec 2019
Project description
- Implementing a Cloud Service Provider using the VMware technology stack, with an automation layer built on vRealize Orchestrator.
- Gathering and discussing VFE's business needs and translating them into workflows, using the built-in workflows and chaining them together.
- Providing the prerequisites for the testing environment with the iQuest Hybris marketplace.
- Providing iQuest with a simple workflow document showing the input and output parameters, along with the IDs of the workflows that will be triggered from the Hybris side.
Technologies:
- vRO, vCD, NSX-v, vROps, ESXi, and vCenter