Troubleshooting


A failure to run may have several causes, especially if you obtained the script from someone else.  Please check the following list:

  1. Your script may refer to executables that are not in your search path.  Please locate the required executables and add them to your search path.
  2. Your script may need scripts, executables, or files in another user's directory that you are not authorized to access.  Please contact the owner and/or the system administrator to obtain the files or to be granted access to them.
  3. Your script may require files, executables, or scripts that are not yet available on this cluster.  Please obtain them and place or install them properly.  If a system component is missing, please ask the system administrator to install it for you.
  4. Your program might have been compiled for a different architecture, or for a different version or flavor of Linux.  Please find the correct build, or obtain a copy of the source code and compile it locally.
  5. Your program might require a special license to run.  Please contact the author for a license.
  6. Your program or data set may exceed the RAM or storage limits of the system.  Please reduce its requirements by optimizing your RAM allocation and storage usage, then try again.
  7. Your code may be buggy.  Please debug your code.
  8. If none of the above applies, or you cannot identify the cause yourself, please collect the script/program location, the full command line you attempted to run, and the errors you got, put them in a text file or an email, and send it to the system administrator, who will help you find out what has gone wrong.
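
The first two checks in the list above can be partly automated.  The following is a minimal sketch in Python, not a tool provided on this cluster: the function name check_dependencies and the example tool and file names are hypothetical placeholders, to be replaced with what your own script actually needs.

```python
import os
import shutil


def check_dependencies(executables, files):
    """Return (missing_executables, unreadable_files).

    executables: command names the script calls (looked up on the search path).
    files: paths the script must be able to read.
    """
    # shutil.which returns None when a command is not found on the search path.
    missing = [name for name in executables if shutil.which(name) is None]
    # os.access with R_OK is False for files that do not exist or that the
    # current user may not read (for example, files in another user's directory).
    unreadable = [path for path in files if not os.access(path, os.R_OK)]
    return missing, unreadable


if __name__ == "__main__":
    # Hypothetical names for illustration only.
    tools = ["python3", "my_aligner"]
    inputs = ["/path/to/shared/reference.dat"]

    missing, unreadable = check_dependencies(tools, inputs)
    for name in missing:
        print("Not in search path:", name)
    for path in unreadable:
        print("Missing or not readable:", path)
```

A name reported as not in the search path points to item 1 or item 3; a path reported as missing or not readable points to item 2 or item 3.  The output can also go into the text file you send to the system administrator.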

A program that runs slowly or appears stuck may have several causes.  Please check for the following issues:

  1. You may have set a stopping criterion that is too strict to reach, or your data may not converge under your algorithm.  Please check your intermediate results for indicators.  If no indicators are available, please kill your process and modify your program to write such indicators to a text file, to help you debug.
  2. Your program may need a data array too big to fit in the physical RAM of the node.  You can check this with the "ps ux" command by looking at the %MEM column.  If the value is close to 100%, your program has probably pushed a significant portion of its data to swap space (virtual memory), an area on the local hard drive reserved to keep the system from halting when RAM is exhausted.  If this is the case, please kill your process and rerun it on a node with more RAM, or optimize your code and/or parameters to use less RAM.
  3. Your program may be doing disk I/O too frequently.  RAM can be as fast as 12.8GB/s, while an NFS-attached storage array can do 500MB/s at most, a difference of about 25 times.  Since RAID can process only one I/O request per 10us, frequent disk I/O can drastically slow down your program.  A telltale sign is thousands of intermediate files appearing and disappearing in your working directory.  If this is the case, you may optimize your program to reduce I/O, keep intermediate data in RAM as much as possible, or use /dev/shm as the folder for intermediate data that will be deleted when your pipeline is done.  Be sure to clean up /dev/shm after use: it is virtual storage carved out of physical RAM, and if it stays occupied for a long time it will impair the normal functioning of the node.
  4. You may be running a program that needs a GPU on a node without one.  Please kill it and restart it on a node with a GPU that meets the requirements of your program.
  5. Your program might be running on an already saturated node.  To check this, please read the "LOAD" column of the "qhost" output.  If this is the case, please kill your process and restart it on a less crowded node.  If all nodes are busy, please inform the system administrator.
  6. If none of the above applies, or you cannot figure out why, please first try killing the process and re-running it.  If the re-run gets stuck too, please document the command line and the location of your data, code, and node, and send this to the system administrator, who will be able to help you find out why.
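
For item 3, one way to make sure /dev/shm is always cleaned up is to let the program create and remove its own scratch folder.  The following is a minimal sketch in Python, assuming your pipeline can be pointed at an arbitrary folder for its intermediate files; the helper name scratch_dir and the "scratch_" prefix are hypothetical, and the fallback to the default temporary location when /dev/shm is unavailable is an added assumption, not a cluster policy.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def scratch_dir(base="/dev/shm"):
    """Create a private scratch folder under base and always remove it."""
    # Assumption: fall back to the default temporary location when the
    # RAM-backed folder does not exist or is not writable on this machine.
    if not (os.path.isdir(base) and os.access(base, os.W_OK)):
        base = None
    path = tempfile.mkdtemp(prefix="scratch_", dir=base)
    try:
        yield path
    finally:
        # Runs even if the pipeline raises an error, so the RAM held by
        # the intermediate files is released.
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    with scratch_dir() as tmp:
        intermediate = os.path.join(tmp, "step1.tmp")
        with open(intermediate, "w") as handle:
            handle.write("intermediate data\n")
        print("Working in", tmp)
    # At this point the scratch folder and everything in it is gone.
    print("Cleaned up:", not os.path.exists(tmp))
```

Files written under /dev/shm count against the node's RAM, so this only helps when the intermediate data is small enough to fit alongside the program's own memory use (see item 2).  The cleanup does not run if the process is killed with "kill -9", so check /dev/shm by hand after a forced kill.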