## Why are these changes needed?
This is part of the Redis removal project. This PR focuses on backing internal KV with the GCS KV.
- A GCS client is introduced.
- Internal KV is updated to use the GCS RPC client based KV.
- Related code is updated.
A follow-up PR will update the components that currently use Redis to use internal KV.
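
A minimal sketch of the layering described above, not the actual implementation. All names here (`KVBackend`, `InMemoryGcsKV`, `InternalKV`) are hypothetical, and an in-memory dict stands in for the GCS RPC client so the example is self-contained.

```python
from typing import Dict, Optional, Protocol


class KVBackend(Protocol):
    """Hypothetical storage interface that internal KV delegates to."""

    def get(self, key: bytes) -> Optional[bytes]: ...

    def put(self, key: bytes, value: bytes, overwrite: bool) -> bool: ...

    def delete(self, key: bytes) -> bool: ...


class InMemoryGcsKV:
    """Stand-in for a GCS RPC client; a real client would issue RPCs."""

    def __init__(self) -> None:
        self._store: Dict[bytes, bytes] = {}

    def get(self, key: bytes) -> Optional[bytes]:
        return self._store.get(key)

    def put(self, key: bytes, value: bytes, overwrite: bool) -> bool:
        # Refuse to clobber an existing key unless the caller asks for it.
        if key in self._store and not overwrite:
            return False
        self._store[key] = value
        return True

    def delete(self, key: bytes) -> bool:
        return self._store.pop(key, None) is not None


class InternalKV:
    """Hypothetical internal KV facade; it only talks to the backend."""

    def __init__(self, backend: KVBackend) -> None:
        self._backend = backend

    def get(self, key: bytes) -> Optional[bytes]:
        return self._backend.get(key)

    def put(self, key: bytes, value: bytes, overwrite: bool = True) -> bool:
        return self._backend.put(key, value, overwrite)

    def delete(self, key: bytes) -> bool:
        return self._backend.delete(key)
```

Because callers only see `InternalKV`, swapping the backend (for example from Redis to GCS) would not require changes in the components that use it.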
## Related issue number
https://github.com/ray-project/ray/issues/19443
## Why are these changes needed?
In this test case, the following can happen:
1. Actor creation first uses all the resources on the local node, which is a GPU node.
2. The actor that needs a GPU can then no longer be scheduled, since there is only one GPU node.
This is only a short-term fix: the test now tries to connect only to the head node with CPU resources.
## Related issue number
#19438
* Add an RLlib Tune experiment to UserTest suite.
* Add ray.init()
* Move example script to example/tune/, so it can be imported as module.
* add __init__.py so our new module will get included in python wheel.
* Add block device to RLlib test instances.
* Reduce disk size a little bit.
* Add metrics reporting
* Allow max of 5 workers to accommodate all the worker tasks.
* revert disk size change.
* Minor updates
* Trigger build
* set max num workers
* Add a compute cfg for autoscaled cpu and gpu nodes.
* use 1gpu instance.
* install tblib for debugging worker crashes.
* Manually upgrade to pytorch 1.9.0
* -y
* torch=1.9.0
* install torch on driver
* bump timeout
* Write a more informational result dict.
* Revert changes to compute config files that are not used.
* add smoke test
* update
* reduce timeout
* Reduce the # of env per worker to 1.
* Small fix for getting trial_states
* Trigger build
* simplify result dict
* lint
* more lint
* fix smoke test
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
* Fix QMix, SAC, and MADDPG too.
* Unpin gym and deprecate Pendulum-v0
Many tests in RLlib depended on Pendulum-v0,
however in gym 0.21, Pendulum-v0 was deprecated
in favor of Pendulum-v1. This may change reward
thresholds, so we may have to rerun
all of the Pendulum-v1 benchmarks, or use another
environment instead. The same applies to
FrozenLake-v0 and FrozenLake-v1.
Lastly, all of the RLlib tests have
been moved to Python 3.7.
* Add gym installation based on Python version.
Pin Python <= 3.6 to gym 0.19 due to install
issues with Atari ROMs in gym 0.20.
* Reformatting
* Fixing tests
* Move atari-py install conditional to req.txt
* migrate to new ale install method
* Make parametric_actions_cartpole return float32 actions/obs
* Adding type conversions if obs/actions don't match space
* Add utils to make elements match gym space dtypes
Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
This reverts commit f1eedb15b6.
## Why are these changes needed?
A node should not apply node resource updates from GCS about itself, since it maintains its own local view.
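
A minimal sketch of the idea, assuming a simplified per-node cluster view. `ClusterView` and `on_gcs_resource_update` are hypothetical names for illustration and do not correspond to the raylet's real classes.

```python
from typing import Dict

Resources = Dict[str, float]


class ClusterView:
    """Hypothetical per-node view of cluster resources."""

    def __init__(self, self_node_id: str, local: Resources) -> None:
        self.self_node_id = self_node_id
        # The entry for this node is maintained locally and is authoritative.
        self.nodes: Dict[str, Resources] = {self_node_id: dict(local)}

    def on_gcs_resource_update(self, node_id: str, resources: Resources) -> bool:
        """Apply a broadcast update; return False if it was ignored."""
        if node_id == self.self_node_id:
            # A broadcast about ourselves may be stale, so keep the local view.
            return False
        self.nodes[node_id] = dict(resources)
        return True
```

With this filter, a stale broadcast cannot make the node believe it has free resources that it has already handed out.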
## Related issue number
#19438
This reverts commit a907168184.
## Why are these changes needed?
The reverted PR seems to have caused a large perf regression in `placement_group_test_2.py`: the test took 128s before the PR was merged and 315s after.
## Related issue number
* Unpin gym and deprecate Pendulum-v0
Many tests in RLlib depended on Pendulum-v0,
however in gym 0.21, Pendulum-v0 was deprecated
in favor of Pendulum-v1. This may change reward
thresholds, so we may have to rerun
all of the Pendulum-v1 benchmarks, or use another
environment instead. The same applies to
FrozenLake-v0 and FrozenLake-v1.
Lastly, all of the RLlib tests and Tune tests have
been moved to Python 3.7.
* fix tune test_sampler::testSampleBoundsAx
* fix re-install ray for py3.7 tests
Co-authored-by: avnishn <avnishn@uw.edu>
## Why are these changes needed?
If an actor failover is triggered but the RPC connection between the caller and the crashed actor instance is not disconnected automatically, subsequent tasks sent to the new actor instance may never be executed. The root cause is that the sequence numbers of tasks sent to the new actor instance do not start from 0. Details can be found in #14727.
This PR fixes it by ensuring all inflight actor tasks fail immediately when actor failover is detected (via actor state notifications).
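
A toy model of the problem and the fix, not the real core worker code. `ActorInstance` and `Caller` are hypothetical, and resetting the counter to 0 on failover is an assumption of this sketch: the description only states that inflight tasks are failed immediately.

```python
from typing import Dict, List


class ActorInstance:
    """Toy receiver: executes tasks strictly in sequence order, starting at 0."""

    def __init__(self) -> None:
        self.expected_seq = 0
        self.pending: Dict[int, str] = {}
        self.executed: List[str] = []

    def receive(self, seq: int, task: str) -> None:
        self.pending[seq] = task
        # Drain every task that is next in line; later ones keep waiting.
        while self.expected_seq in self.pending:
            self.executed.append(self.pending.pop(self.expected_seq))
            self.expected_seq += 1


class Caller:
    """Toy sender that numbers its tasks."""

    def __init__(self, actor: ActorInstance) -> None:
        self.actor = actor
        self.next_seq = 0
        self.inflight: Dict[int, str] = {}  # sent but never delivered
        self.failed: List[str] = []

    def submit(self, task: str, deliver: bool = True) -> None:
        seq = self.next_seq
        self.next_seq += 1
        if deliver:
            self.actor.receive(seq, task)
        else:
            # Models a task stuck on a connection to a dead instance.
            self.inflight[seq] = task

    def on_actor_restarted(self, new_actor: ActorInstance) -> None:
        # Fail everything inflight right away instead of waiting for the
        # connection to break, then (assumption) restart numbering from 0.
        self.failed.extend(self.inflight.values())
        self.inflight.clear()
        self.next_seq = 0
        self.actor = new_actor
```

If the caller keeps counting from where it left off, the new instance receives a first task whose sequence number is greater than 0 and waits indefinitely for the missing ones.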
## Related issue number
Closes #14727
## Why are these changes needed?
When GCS broadcasts a node resource change, the raylet also uses it to update the local node, which leaves the local node instance and `nodes_` inconsistent:
1. The local node has used all of some pg resource.
2. GCS broadcasts node resources.
3. The local node now appears to have resources.
4. The scheduler picks the local node.
5. The local node can't schedule the task.
6. Since there is only one type of job and the local node hasn't finished any tasks, it goes back to step 4, so it hangs.
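
The loop in steps 4 to 6 can be sketched as a toy simulation. `simulate` and its parameters are hypothetical, and the retry count is bounded here only so that the example terminates.

```python
def simulate(view_free: float, truly_free: float, demand: float,
             max_attempts: int = 5) -> str:
    """Toy scheduler loop over a single local node.

    view_free:  what the cluster view (overwritten by the broadcast) claims.
    truly_free: what the local node's own bookkeeping says is available.
    """
    for _ in range(max_attempts):
        # Step 4: the scheduler trusts the cluster view when picking a node.
        if view_free < demand:
            return "queued"  # the view is honest, so the task simply waits
        # Step 5: the local node checks its real availability.
        if truly_free >= demand:
            return "scheduled"
        # Step 6: no task finishes, nothing changes, so we retry.
    return "stuck"
```

Only the combination of a view that claims free resources and a node that has none produces the `"stuck"` outcome, which corresponds to the hang described above.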
## Related issue number
#19438
This PR includes a script for building wheels for Macs with M1 processors. It roughly follows the pattern of the other scripts, with a few differences:
- Manually installs nvm
- Uses miniforge conda to install python/pip instead of python foundation .pkgs
- Doesn't pin numpy (we probably shouldn't be pinning it in the other scripts either...)
- Commit detection falls back to git instead of erroring
All of these changes were made so that the script works on a laptop, which comes with a subset of the dependencies that the x86 buildkite image comes with.
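
A rough sketch of the commit-detection fallback, written in Python for illustration rather than taken from the build script. `detect_commit` is a hypothetical helper, and reading `BUILDKITE_COMMIT` first is an assumption about how the CI path works.

```python
import subprocess
from typing import Mapping, Optional


def detect_commit(env: Mapping[str, str], repo_dir: str = ".") -> Optional[str]:
    """Return the commit to build: CI variable first, then git, else None."""
    # Assumed CI path: Buildkite exports the commit as an environment variable.
    commit = env.get("BUILDKITE_COMMIT")
    if commit:
        return commit
    # Laptop path: ask git instead of erroring out.
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, or repo_dir is not a usable repository.
        return None
    return result.stdout.strip() or None
```

Passing the environment in as a mapping (for example `detect_commit(os.environ)`) keeps the helper easy to exercise without touching the real process environment.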