Lightning scheduler resources
9/10/2023

How severe does this issue affect your experience of using Ray?
High: It blocks me from completing my task.

Hello! I am trying to deploy my Tune application on Slurm following this tutorial. Therefore, if I have 4 nodes, each with 4 GPUs and 12 CPUs, my batch script is the following:

    #SBATCH --job-name=test

    nodes=$(scontrol show hostnames "$SLURM_JOB_NODELIST")
    head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
    echo "IPV6 address detected. We split the IPV4 address as $head_node_ip"
    srun --nodes=1 --ntasks=1 -w "$head_node" \
        ray start --include-dashboard=True --head --node-ip-address="$head_node_ip" --port=$port \

My test script is inspired by the tutorial on using PyTorch Lightning with Tune, and the model used is exactly the same LightningMNISTClassifier. The relevant lines are:

    model = LightningMNISTClassifier(config, data_dir)
    ray.init(address=os.environ, _node_ip_address=os.environ)
    train_fn_with_parameters = tune.with_parameters(train_mnist_tune, ...)
    print("Best hyperparameters found were: ", results.get_best_result().config)

The cluster is correctly set up and the first trials are assigned to the 4 nodes. But only the head node actually runs the training, and the other 3 nodes seem blocked somewhere without giving any error. As you can see in the following log, the 3 nodes are blocked after "Initializing distributed":

    (train_mnist_tune pid=3679869, ip=10.148.8.9) GPU available: True (cuda), used: True
    (train_mnist_tune pid=3005886) All distributed processes registered.
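For reference, here is a minimal sketch of how this kind of Tune wiring typically fits together, assuming the Tuner API that the results.get_best_result() fragment above suggests. The search-space values, the "ip_head" environment-variable key, and the stub body of train_mnist_tune are placeholders, not the actual values from the script; the real training function builds the tutorial's LightningMNISTClassifier and reports metrics from a Lightning Trainer.

    import os
    import ray
    from ray import tune
    from ray.air import session


    def train_mnist_tune(config, data_dir=None, num_epochs=10, num_gpus=0):
        # Stand-in for the tutorial's training function: the real one builds a
        # LightningMNISTClassifier and a pl.Trainer that reports metrics to Tune.
        # A dummy metric is reported here so the wiring below can be exercised.
        session.report({"loss": 1.0 / (1.0 + config["lr"])})


    # The original script connects to the Slurm-launched cluster via environment
    # variables (the exact keys are not shown above; "ip_head" is an assumption).
    # With the variable unset, this falls back to starting a local Ray instance.
    ray.init(address=os.environ.get("ip_head"))

    # Placeholder search space and data location; the tutorial's real values differ.
    config = {
        "lr": tune.loguniform(1e-4, 1e-1),
        "batch_size": tune.choice([32, 64, 128]),
    }
    data_dir = os.path.expanduser("~/mnist_data")

    # Bind the fixed arguments once so each trial only receives the hyperparameters.
    train_fn_with_parameters = tune.with_parameters(
        train_mnist_tune, data_dir=data_dir, num_epochs=10, num_gpus=4
    )

    # Reserve the resources each trial needs. With 4 GPUs and 12 CPUs per node,
    # requesting a whole node per trial would look like this; trials stay pending
    # unless the cluster actually provides these resources.
    trainable_with_resources = tune.with_resources(
        train_fn_with_parameters, resources={"cpu": 12, "gpu": 4}
    )

    tuner = tune.Tuner(
        trainable_with_resources,
        param_space=config,
        tune_config=tune.TuneConfig(metric="loss", mode="min", num_samples=4),
    )
    results = tuner.fit()

    print("Best hyperparameters found were: ", results.get_best_result().config)

With a request of 4 GPUs and 12 CPUs per trial, each trial would occupy a full node, which matches the observation above that the first trials are placed on the 4 nodes.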