Cluster Use


The cluster is assembled from multiple generations of nodes with a variety of configurations.  Among all 26 nodes, 8 are equipped with advanced GPUs and 5 with entry-level GPUs.  See the table below.

Node configurations

Node      | NVIDIA GPU model        | RAM   | CUDA | Best for
----------|-------------------------|-------|------|-----------------------------------------------------
master2   | 2x NVIDIA A40 48GB      | 1TB   | 13.1 | Testing and pipeline development
node3     | 1x NVIDIA RTX A2000 6GB | 4TB   | 13.0 | GPU-accelerated data processing that needs large RAM
node8     | 2x Tesla P100 16GB      | 512GB | 12.3 | Older applications
node9     | 1x NVIDIA A40 48GB      | 512GB | 13.1 | General GPU-accelerated applications
node15-18 | Quadro P1000 4GB        | 192GB | 12.8 | GPU-accelerated data processing via SGE
node23    | Tesla P100 16GB         | 192GB | 10.2 | Basic machine learning and data processing
node24    | 8x NVIDIA A40 48GB      | 1TB   | 13.0 | Machine learning and AI training
node27,28 | 4x NVIDIA A40 48GB      | 1TB   | 13.0 | Machine learning and AI training
node29    | 2x NVIDIA L40S 48GB     | 1.5TB | 13.2 | Machine learning and AI training with large matrices

You log in to one of our three head nodes. 

For test runs, programming, non-node-specific debugging, and interactive GUI data management and processing, please stay on your head node.

If your software or pipeline supports SGE or SLURM, you may submit jobs from any node.  It is best to run your submission commands inside a screen session, so that you can reattach later and check that everything is running well.
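As a sketch, assuming an SGE queue (use `sbatch` instead of `qsub` on SLURM); the script name, resource request, and pipeline command below are placeholders to adapt to your own job:

```shell
# Write a minimal SGE submission script (all names and values are examples).
cat > run_pipeline.sh <<'EOF'
#!/bin/bash
#$ -cwd                 # run from the submission directory
#$ -l h_vmem=8G         # request 8 GB of RAM per slot; adjust to your job
./my_pipeline.sh        # placeholder for your actual workload
EOF
chmod +x run_pipeline.sh

# Submit from inside a named screen session so you can reattach later:
#   screen -S submit          # start the session
#   qsub run_pipeline.sh      # submit (sbatch run_pipeline.sh on SLURM)
#   Ctrl-a d                  # detach; the session keeps running
#   screen -r submit          # reattach later to check on your jobs
```

Detaching with screen means a dropped SSH connection will not kill a submission loop that is still feeding jobs to the scheduler.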

For non-SGE applications that can run non-interactively, especially those that take a long time or occupy a significant amount of CPU power, please avoid running them on the head nodes.
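One way to keep such a job off the head node is to start it on a compute node, detached from your terminal and at reduced priority.  This is a sketch: `node11` and `long_task.sh` are placeholders, and the last lines demonstrate the same `nohup`/`nice` pattern on a trivial stand-in command:

```shell
# On a free compute node (node11 is a placeholder), launch the job
# detached from the terminal and at lower CPU priority:
#   ssh node11
#   nohup nice -n 10 ./long_task.sh > task.log 2>&1 &
#   exit                      # the job keeps running after you log out

# The same pattern on a trivial stand-in command:
nohup nice -n 10 sh -c 'echo finished' > task.log 2>&1 &
wait                          # here only so this demo can inspect the log
cat task.log
```

Because stdout and stderr go to `task.log`, you can check progress later with `tail -f task.log` after logging back in.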

Node specifications and recommended applications

Node      | OS               | Specification                                       | Best for
----------|------------------|-----------------------------------------------------|-----------------------------------------------------------
master1   | Rocky Linux 9.6  | 2x Intel 32 core, 384GB                             | Dedicated file server, not for data processing use
master2   | Rocky Linux 9.7  | 2x AMD EPYC 28 core, 1TB, 2x A40 GPU                | Management console, head node, not for data processing use
node3     | Rocky Linux 10.0 | 2x Intel Xeon 32 core, 4TB, A2000 GPU               | Large-RAM node, collaborator reserved
node4-6   | N/A              | Retired; vacant for new nodes                       | N/A
node7     | Rocky Linux 9.6  | Intel 20 core, 128GB                                | Gateway node for backup architecture, facility connection
node8     | Rocky Linux 9.6  | Intel 56 core, 512GB, 2x NVIDIA Tesla P100 16GB GPU | GPU-accelerated applications, Emory login node
node9     | Rocky Linux 10.1 | 2x AMD EPYC 28 core, 512GB, A40 GPU                 | GPU-accelerated applications
node10    | Rocky Linux 9.7  | 2x Intel Xeon 20 core, 64GB                         | Backup server #2
node11-14 | Rocky Linux 9.6  | 2x AMD EPYC 32 core, 512GB                          | CPU-intensive applications
node15-18 | Rocky Linux 9.6  | 2x Intel 16 core, 384GB, NVIDIA Quadro P1000 GPU    | CPU-intensive applications that can use some GPU help
node19-22 | Rocky Linux 9.6  | 2x Intel 10 core, 128GB                             | CPU-intensive applications
node23    | CentOS 7.6       | 2x Intel 14 core, 192GB, NVIDIA Tesla P100 16GB GPU | Emory login node, collaborator reserved
node24    | Rocky Linux 9.6  | 2x AMD EPYC 32 core, 1TB, 8x A40 GPU                | Machine learning, AI
node25    | Rocky Linux 9.7  | 2x AMD EPYC 28 core, 256GB                          | Backup server #1
node26    | Rocky Linux 9.7  | 2x Intel Xeon 24 core, 256GB                        | Backup server #3
node27-28 | Rocky Linux 9.6  | 2x AMD EPYC 32 core, 1TB, 4x A40 GPU                | Machine learning, AI, file servers, Emory login nodes
node29    | Rocky Linux 9.7  | 2x AMD EPYC 24 core, 1.5TB, 2x L40S GPU             | All applications

Will jobs behave differently on different nodes?  Yes and no.

Yes:

  • Different nodes are equipped with different generations of CPUs, with or without GPUs, and slightly different operating systems.  So your code may not behave exactly the same on each node, especially in terms of processing speed.
  • Some software is limited by license to certain nodes; for example, lcmodel is limited to node7.
  • Due to the complexity of user requests and the architectural differences among nodes, it has not been possible to keep all OS and software package versions consistent across the cluster.

No:

  • Data file systems and user profiles are mounted consistently across the cluster, as are locale and system profiles.  So you should not have path or file accessibility issues across the cluster.
  • Major managed data processing software packages are mirrored across the cluster and should produce the same results given the same input.

Baseline: in general, you should get the same results if you run your pipeline on a different node without changing your code, though it may finish in a very different amount of time, subject to the following pitfalls.

Pitfalls:

  • If your software automatically decides whether to use a GPU, or optimizes based on the number of cores or total RAM, you may get different results on different nodes;
  • Your code may run on some nodes but crash on others if it was compiled on a newer OS version, consumes too much RAM, or was optimized for a specific hardware architecture.
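Most of these differences can be spotted with a quick environment check before running on an unfamiliar node.  The commands below are standard Linux tools; `nvidia-smi` is only present on GPU nodes, hence the fallback:

```shell
# Print the facts most likely to change your results between nodes:
hostname                                           # which node am I on?
grep PRETTY_NAME /etc/os-release                   # OS version differs across nodes
nproc                                              # core count; auto-parallel software keys off this
free -h | head -n 2                                # total and available RAM
nvidia-smi -L 2>/dev/null || echo "no GPU here"    # GPU model(s), if any
```

Running this once per node family and keeping the output alongside your results makes it much easier to explain discrepancies later.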