EL9 Known Issues


ConnectX-3 HCAs and MPI

A regression appears to have affected one of the legacy Mellanox InfiniBand drivers between EL7 and EL9.  MPI on nodes with ConnectX-3 HCAs (farm and sciml nodes numbered below 1960) no longer scales well past about 128 ranks.  Replacing those HCAs is unlikely because the Farm sees very little MPI usage.  Farm19 nodes numbered 60 and above have ConnectX-4 HCAs, which use the current, non-legacy Mellanox driver, and hosts whose first two digits are 20 or greater have ConnectX-5 or newer HCAs.
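The numbering rule above can be sketched as a small shell helper.  This is only an illustration: hca_for_node is a hypothetical function, not a site-provided tool, and it assumes you pass the four-digit node number (for example 1960) rather than a full hostname.

```shell
#!/bin/bash
# Sketch: map a four-digit node number to the HCA generation described above.
# hca_for_node is a hypothetical helper name, not an existing command.
hca_for_node() {
    local num=$1
    if (( num < 1960 )); then
        # Affected by the legacy-driver regression under EL9.
        echo "ConnectX-3"
    elif (( num < 2000 )); then
        # Higher-numbered Farm19 nodes (60 and above).
        echo "ConnectX-4"
    else
        # First two digits are 20 or greater.
        echo "ConnectX-5 or newer"
    fi
}

hca_for_node 1959
hca_for_node 1960
hca_for_node 2301
```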

To run an MPI job on the Farm that avoids the ConnectX-3 HCAs, select nodes with ConnectX-4 HCAs or any newer node generation, for example, --constraint="[farm23|cx4ib]".  The only Farm nodes with ConnectX-4 IB HCAs are the higher-numbered subset of Farm19 that do not have ConnectX-3 HCAs, so you do not need to specify farm19&cx4ib to ensure that the "Matching OR" square-bracket expression doesn't mix node generations.
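As a minimal sketch of how that constraint might be passed at submission time, the snippet below assembles an sbatch command line and prints it for review instead of submitting it.  The feature names farm23 and cx4ib come from the example above; the rank count and the job script name mpi_job.sh are placeholders, not site defaults.

```shell
#!/bin/bash
# Sketch: build an sbatch command that keeps an MPI job off ConnectX-3 nodes.

# "Matching OR": every allocated node must carry the same one of these features.
constraint='[farm23|cx4ib]'
ntasks=256               # placeholder rank count, above the ~128-rank limit
job_script=mpi_job.sh    # placeholder batch script name

# Print the command rather than running it, so it can be checked first.
sbatch_cmd="sbatch --ntasks=${ntasks} --constraint=\"${constraint}\" ${job_script}"
echo "${sbatch_cmd}"
```

Quoting the constraint matters because the square brackets and the | character are otherwise interpreted by the shell.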

Other issues?

Please submit a ServiceNow incident for any new issues found.