Add example for distributed pytorch geometric (graph learning) with Ray AIR
This only showcases distributed training, with data small enough that each training worker can load it individually. Distributed data ingest is out of scope for this PR.
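Roughly, the pattern looks like the sketch below. This is not the example script added in this PR; the dataset (Cora), model, and config values are illustrative assumptions, and only the Ray AIR `TorchTrainer` wiring is the point.

```python
import torch
import torch.nn.functional as F
from torch_geometric.datasets import Planetoid
from torch_geometric.nn import GCNConv

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model


class GCN(torch.nn.Module):
    def __init__(self, num_features: int, num_classes: int):
        super().__init__()
        self.conv1 = GCNConv(num_features, 16)
        self.conv2 = GCNConv(16, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        return self.conv2(x, edge_index)


def train_loop_per_worker(config):
    # The dataset is small, so every worker simply loads the full graph itself
    # (no distributed data ingest, as noted above).
    dataset = Planetoid(root="/tmp/Cora", name="Cora")
    data = dataset[0]

    # prepare_model() wraps the model in DistributedDataParallel and moves it
    # to the right device for this worker.
    model = prepare_model(GCN(dataset.num_features, dataset.num_classes))
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

    for epoch in range(config["num_epochs"]):
        model.train()
        optimizer.zero_grad()
        out = model(data.x, data.edge_index)
        loss = F.cross_entropy(out[data.train_mask], data.y[data.train_mask])
        loss.backward()
        optimizer.step()
        session.report({"epoch": epoch, "loss": loss.item()})


trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"num_epochs": 10},
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
print(result.metrics)
```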
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
In https://github.com/ray-project/ray/blob/ray-1.11.0/docker/ray-ml/Dockerfile, the order of the pip install commands currently matters (potentially a lot). It would be better to run one big pip install command to avoid ending up with a broken environment.
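As a rough illustration only (the package list below is made up and is not the actual contents of that Dockerfile), consolidating into a single install step would look something like:

```dockerfile
# Illustrative only: one pip invocation lets the resolver see all requirements
# together, instead of several order-sensitive RUN layers that can silently
# overwrite each other's pins.
RUN pip install --no-cache-dir \
        "torch==1.9.0" \
        "torchvision" \
        "xgboost" \
        "lightgbm" \
    && pip check
```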
Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
* Add an RLlib Tune experiment to UserTest suite (a sketch of such a test script follows this list).
* Add ray.init()
* Move example script to example/tune/, so it can be imported as module.
* add __init__.py so our new module will get included in the Python wheel.
* Add block device to RLlib test instances.
* Reduce disk size a little bit.
* Add metrics reporting
* Allow a max of 5 workers to accommodate all the worker tasks.
* revert disk size change.
* Minor updates
* Trigger build
* set max num workers
* Add a compute cfg for autoscaled cpu and gpu nodes.
* use 1gpu instance.
* install tblib for debugging worker crashes.
* Manually upgrade to PyTorch 1.9.0
* -y
* torch=1.9.0
* install torch on driver
* bump timeout
* Write a more informative result dict.
* Revert changes to compute config files that are not used.
* add smoke test
* update
* reduce timeout
* Reduce the # of envs per worker to 1.
* Small fix for getting trial_states
* Trigger build
* simplify result dict
* lint
* more lint
* fix smoke test
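The commits above amount to a small RLlib Tune script run as a user/release test. A hedged sketch of what such a script might look like follows; the `--smoke-test` flag, config values, and result-dict keys are assumptions for illustration, not the actual test added here.

```python
import argparse
import json

import ray
from ray import tune

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # Assumed flag: a quick, cheap run used for the smoke test mentioned above.
    parser.add_argument("--smoke-test", action="store_true")
    args = parser.parse_args()

    ray.init()  # connect to the (autoscaled) cluster started for the test

    analysis = tune.run(
        "PPO",
        config={
            "env": "CartPole-v0",
            "framework": "torch",
            "num_workers": 5,        # matches "Allow a max of 5 workers" above
            "num_envs_per_worker": 1,
        },
        stop={"training_iteration": 1 if args.smoke_test else 100},
    )

    # Summarize per-trial states into a small, informative result dict.
    result = {
        "trial_states": {t.trial_id: t.status for t in analysis.trials},
        "last_rewards": {
            t.trial_id: (t.last_result or {}).get("episode_reward_mean")
            for t in analysis.trials
        },
    }
    print(json.dumps(result, indent=2))
```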
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>