Mohamed ElSayed Kandil

Principal HPC Systems Engineer
Alexandria

Summary

Detail-oriented team player with strong organizational skills. Ability to handle multiple projects simultaneously with a high degree of accuracy. Organized and dependable candidate successful at managing multiple priorities with a positive attitude. Willingness to take on added responsibilities to meet team goals.

Overview

6 years of professional experience
5 years of post-secondary education

Work History

Principal HPC Systems Engineer

Brightskies Technologies
01.2022 - Current
  • Set up high-performance computing (HPC) platforms
  • Perform technical planning, hardware sizing, application workload assessment, system integration, verification and validation, and supportability and effectiveness analyses for total systems
  • Manage and monitor all installed systems and infrastructure
  • Install, configure, test and maintain operating systems, application software and system management tools
  • Proactively ensure the highest levels of systems and infrastructure availability
  • Monitor and test application performance for potential bottlenecks, identify possible solutions, and work with developers to implement those fixes
  • Maintain security, backup, and redundancy strategies
  • Write and maintain custom scripts to increase system efficiency and reduce the manual intervention that tasks require
  • Liaise with vendors and other IT personnel for problem resolution
  • Lead a small team of engineers and guide them through projects.

Senior HPC Systems Engineer

Brightskies Technologies
10.2017 - 01.2020
  • Design and conduct POCs to showcase the proposed HPC solutions
  • Deploy & administer turnkey HPC solutions
  • Strong experience with cluster management tools such as xCAT, Bright Computing, and CMU
  • Strong experience in deploying and administering the PBSpro workload manager, Torque, Altair Access Web, and Altair Control
  • Compile and run HPC benchmarks, optimize the results, and identify cluster bottlenecks and problems
  • Installation and configuration of HPC production clusters
  • Advanced knowledge of the distributed file systems BeeGFS and Lustre
  • Automating configuration and provisioning Infrastructure using Ansible
  • Containerize applications using Docker
  • Continuous integration and deployment using GitLab CI and Jenkins.

Cloud Systems Engineer

Brightskies Technologies
10.2017 - 01.2020
  • Implementation of a POC for an on-demand HPC cluster with Bright Computing OpenStack
  • Competent in implementing OpenStack infrastructure (Horizon, Nova, Glance, Keystone, Swift, Cinder, Neutron)
  • Installation and configuration of virtualized environments based on vSphere 6
  • Installation and configuration of NSX
  • Installation and configuration of vCloud Director and integration with Active Directory
  • Managing vCloud Director infrastructure through the service provider portal and tenant portal
  • Integration of vCloud Director with the vROps Tenant App
  • Upgrade of vCloud Director from version 9.5 to 9.7
  • Installation and configuration of vRealize Orchestrator
  • Implementation and configuration of plugins and connections from vRO to the components of the VMware cloud stack (vCenter, NSX, VCD, AD)
  • Building designs and flowcharts to automate workload creation through vRO for VCD products offered as IaaS

Application System Engineer

Pharmaoverseas
05.2017 - 10.2017
  • Responsible for managing the ERP environment: application, database (DB2), operating systems (AIX, SUSE), servers (IBM Power Servers), and storage (FlashSystem)
  • Configuring, monitoring, tuning and troubleshooting the ERP technical environment
  • Collaborate to resolve SAP transport and source code problems
  • Install new servers and rebuild existing ones, configuring hardware, peripherals, services, settings, directories, and storage in accordance with standards and project/operational requirements
  • Perform daily system monitoring, verifying the integrity and availability of all hardware, server resources, systems and key processes, reviewing system and application logs, and verifying completion of scheduled jobs such as backups
  • Perform regular security monitoring to identify any possible intrusions
  • Implement an optimal ERP configuration to maximize system performance and availability
  • Install and configure all required SAP database servers, Operating system and application servers.

Education

B.Sc. - Communication and Electronics Engineering

Alexandria University
01.2008 - 01.2012

Professional Diploma

Information Technology Institute (ITI)
01.2016 - 01.2017

Post-graduate

Skills

    Bright Computing, xCAT, CMU

Projects

  

Dammam 7 HPC Cluster Aramco Saudi Oct 2020 – Present

Project description

The following activities are managed within the project scope:

Supporting the STCS project delivery team in delivering the cluster to the operations team:

  • Supporting the definition of delivery criteria and ensuring the deliverables are delivered correctly
  • Supporting various performance tests and preparing the cluster for production
  • Participating in developing monitoring systems and automated availability reports
  • Supporting the installation and configuration of a replicated central authentication system for the different categories of cluster nodes
  • Supporting the implementation team in integrating the authentication system into the cluster OS images
  • Designing and implementing central replicated DNS servers to maintain unified hostname schemas for the processing cluster, storage cluster, and general-purpose application nodes
  • Configuring the cluster to comply with Aramco security requirements

Operating Dammam 7 cluster

  • Installing and configuring HPC applications and creating module files
  • Assisting Aramco users in installing their applications and troubleshooting errors
  • Maintaining the cluster's job scheduler and troubleshooting the causes of job failures
  • Changing the OS images to add the required packages
  • Tuning the scheduler configuration: capping the maximum allowed resources and adding cgroups configurations, prologs, and epilogs for job tracking, cleanup, and stability
  • Adding nodes to the GPFS cluster and creating filesets as required
  • Stabilizing the cluster, adding more health checks to the nodes, and configuring triggers to exclude faulty nodes
  • Installing additional monitoring tools (InfluxDB, Grafana) for visibility into the different cluster components

Supporting the STCS application team in developing utility apps

  • Deploying a full development environment on the STCS cloud environment
  • Providing the STCS development team with API collections to help integrate the application with Slurm
  • Participating in the design process and choosing the most secure way to integrate the application with Aramco databases and with the Dammam 7 cluster
  • Assisting the development team in creating job templates and passing input to jobs from the GUI application
  • Assisting the development team in improving the application and adding more functionality for managing jobs through the GUI application


Visualization HPC Clusters Aramco Saudi Jul 2020 – Aug 2020

Project description

The following activities are managed within the project scope:

Deployment Planning

HPC Cluster Implementation

  • All implementation phases are managed via Ansible playbooks to automate provisioning and configuration.
  • Operating system image deployment for each cluster, on up to 136 nodes.
  • High availability (HA) for the management nodes.


HPC Cluster UAEU Mar 2020 – Apr 2020

Project description

Implementation of the HPC cluster stack using the following technologies:

  • Storage cluster HA with RHEL (Pacemaker and Corosync).
  • Storage box: PowerVault ME4 series.
  • Cluster management using Bright Computing version 9.0.
  • Cluster workload management with PBSpro, Altair Access Web, and Altair Control.
  • Mellanox for the InfiniBand network.
  • Cluster benchmarking with HPL and a burn check for the cluster.


HPC Cluster UAEU Jun 2018 – Jul 2018

Project description

Implementation of the HPC cluster stack using the following technologies:

  • Cluster management using Bright Computing version 7.3.
  • Cluster workload management with PBSpro.
  • HPL and a burn check for the cluster.


Cloud Revamp VFE Feb 2019 – Dec 2019

Project description

  • Implementation of a cloud service provider using the VMware technology stack and an automation layer built with vRealize Orchestrator.
  • Gather and discuss VFE's business needs and translate them into workflows, using the built-in workflows and chaining them together.
  • Provide the prerequisites for the testing environment with the iQuest Hybris marketplace.
  • Provide iQuest with a simple workflow document showing the input and output parameters, along with the IDs of the workflows that will be triggered from the Hybris side.

Technologies:

  • vRO, VCD, NSX-v, vROps, ESXi, and vCenter

Accomplishments


  • Oracle Cloud Infrastructure Certified Architect Associate
  • Oracle Cloud Infrastructure Foundations Certified Associate
  • Red Hat Certified Engineer (RHCE), certificate number 180-054-934
  • Red Hat Certified System Administrator (RHCSA), certificate number 180-054-934
  • Introduction to Computer Science and Programming (edX, MITx)
