104 commits

Author SHA1 Message Date
Chen Shen
97582a802d
[Core] update protobuf to 3.19.4 (#25648)
The error message in #25638 indicates we should use protobuf > 3.19.0 to generate code so that it works with Python protobuf >= 4.21.1. Try generating wheels to see if this works.
2022-06-18 16:06:56 -07:00
Yi Cheng
9fe3c815ec
[serve] Integrate GCS fault tolerance with ray serve. (#25637)
This PR integrates GCS fault tolerance with Ray Serve.

- Add a 5s timeout for KV operations.

Rollback should be added to all methods; that will come in a follow-up.

Basic tests for the KV timeout in serve and deploy are added.
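
A minimal sketch of the timeout idea, assuming a blocking KV call that is run on a worker thread and abandoned after 5 seconds. The helper name `kv_call_with_timeout` and the dict used as a stand-in store are hypothetical and are not Ray Serve's actual API:

```python
import concurrent.futures
import time

KV_TIMEOUT_S = 5.0  # the 5s value comes from the commit message


def kv_call_with_timeout(fn, *args, timeout_s=KV_TIMEOUT_S):
    """Run a blocking KV call and stop waiting after `timeout_s` seconds.

    Hypothetical helper: raises concurrent.futures.TimeoutError on timeout.
    The worker thread is not cancelled; the caller only stops waiting for it.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a call that may be stuck while GCS is down.
        pool.shutdown(wait=False)


# A plain dict stands in for the KV store in this illustration.
fake_store = {"serve:config": b"v1"}
value = kv_call_with_timeout(fake_store.get, "serve:config")
```

A caller would typically catch the timeout and retry or surface an error instead of hanging while GCS is unavailable.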
2022-06-17 23:50:39 -07:00
Stephanie Wang
293c122302
[dataset] Use polars for sorting (#25454) 2022-06-17 12:26:46 -07:00
Simon Mo
ef1b565699
[CI] Pin starlette and fastapi version (#25604) 2022-06-09 13:55:18 -07:00
Pamphile Roy
0bbc3379bd
Fix SciPy pinning (#25148)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2022-06-08 10:26:59 -07:00
Yi Cheng
aabe9e73ef
Revert "[Serve] Depend on uvicorn[standard] instead of uvicorn so that it pulls in uvloop (#25027)" (#25530)
This reverts commit 9a510f92cf.
2022-06-06 16:41:42 -07:00
Florian Boucault
9a510f92cf
[Serve] Depend on uvicorn[standard] instead of uvicorn so that it pulls in uvloop (#25027) 2022-06-06 14:23:00 -07:00
Yi Cheng
cb1f08a3c1
[core] Basic end-2-end multi-node tests for GCS HA in CI. (#25114)
In this PR we simulate the case where Serve continues to function even when GCS is down, and reconfiguration continues to work once GCS is back.

To stay close to the real-world case, Docker is used for isolation:

- It starts a head node (0 CPUs) and a worker node.
- It tries the basic functionality and makes sure it works.
- It kills GCS and makes sure everything still works.
- It starts GCS and makes sure reconfiguration continues to work.

These are the basic cases for Serve HA. We'll add more once we get better integrations.
2022-06-02 02:41:38 +00:00
SangBin Cho
ca75570f51
Revert "Revert "Revert "[dataset] Use polars for sorting (#24523)" (#24781)" (#25173)" (#25341)
This reverts commit 61676f26d3.
2022-06-01 10:49:12 -07:00
Stephanie Wang
61676f26d3
Revert "Revert "[dataset] Use polars for sorting (#24523)" (#24781)" (#25173)
Polars is significantly faster than the current pyarrow-based sort. This PR uses polars for the internal sort implementation if available. No API changes needed.
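
The "use polars if available" pattern can be sketched as an optional import with a fallback. This is an illustration only: the function name `sort_values` is hypothetical, and the fallback here is Python's built-in `sorted` rather than the pyarrow-based sort the PR refers to:

```python
try:
    import polars as pl  # optional dependency
except ImportError:
    pl = None


def sort_values(values):
    """Sort a list of numbers, using polars when it is installed.

    Hypothetical helper; the real change sorts dataset blocks internally.
    """
    if pl is not None and values:
        return pl.Series("values", values).sort().to_list()
    # Fallback when polars is not installed (or the input is empty).
    return sorted(values)


result = sort_values([3, 1, 2])
```

Because the choice is made inside the helper, callers see the same interface either way, which matches the "no API changes needed" note.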

On my laptop, this makes sorting 1GB about 2x faster:

without polars

$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 50.23415923118591
...
Stage 2 sort: executed in 38.59s

        Substage 0 sort_map: 100/100 blocks executed
        * Remote wall time: 864.21ms min, 1.94s max, 1.4s mean, 140.39s total
        * Remote cpu time: 634.07ms min, 825.47ms max, 719.87ms mean, 71.99s total
        * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
        * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

        Substage 1 sort_reduce: 100/100 blocks executed
        * Remote wall time: 125.66ms min, 2.3s max, 1.09s mean, 109.26s total
        * Remote cpu time: 96.17ms min, 1.34s max, 725.43ms mean, 72.54s total
        * Output num rows: 178073 min, 2313038 max, 1250000 mean, 125000000 total
        * Output size bytes: 1446844 min, 18793434 max, 10156250 mean, 1015625046 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

with polars

$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 24.097432136535645
...
Stage 2 sort: executed in 14.02s

        Substage 0 sort_map: 100/100 blocks executed
        * Remote wall time: 165.15ms min, 595.46ms max, 398.01ms mean, 39.8s total
        * Remote cpu time: 349.75ms min, 423.81ms max, 383.29ms mean, 38.33s total
        * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
        * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

        Substage 1 sort_reduce: 100/100 blocks executed
        * Remote wall time: 21.21ms min, 472.34ms max, 232.1ms mean, 23.21s total
        * Remote cpu time: 29.81ms min, 460.67ms max, 238.1ms mean, 23.81s total
        * Output num rows: 114079 min, 2591410 max, 1250000 mean, 125000000 total
        * Output size bytes: 912632 min, 20731280 max, 10000000 mean, 1000000000 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

Related issue number

Closes #23612.
2022-05-27 10:43:51 -07:00
Kai Fricke
6dac517554
[ci] Protobuf < 4 only in requirements.txt to unblock CI (#25214) 2022-05-26 11:18:14 +02:00
xwjiang2010
8703d5e9d0
[air preprocessor] Add limit to OHE. (#24893) 2022-05-23 22:37:15 -07:00
Sven Mika
37799751df
[Serve + RLlib] Fix serve tutorial_rllib for Win. PyGame needs to be installed as of gym==0.23. (#25080) 2022-05-23 17:43:35 +02:00
Sven Mika
09886d7ab8
[RLlib] Upgrade gym 0.23 (#24171) 2022-05-23 08:18:44 +02:00
mwtian
502c3e132d
Revert "[Core] allow using grpcio > 1.44.0 (#23722)" (#24935)
This reverts commit b02029b29f.
2022-05-18 18:16:39 -07:00
Chen Shen
2be45fed5e
Revert "[dataset] Use polars for sorting (#24523)" (#24781)
This reverts commit c62e00e.

See if reverting this resolves the linux://python/ray/tests:test_actor_advanced failure.
2022-05-13 12:09:12 -07:00
Stephanie Wang
c62e00ed6d
[dataset] Use polars for sorting (#24523)
Polars is significantly faster than the current pyarrow-based sort. This PR uses polars for the internal sort implementation if available. No API changes needed.

On my laptop, this makes sorting 1GB about 2x faster:

without polars

$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 50.23415923118591
...
Stage 2 sort: executed in 38.59s

        Substage 0 sort_map: 100/100 blocks executed
        * Remote wall time: 864.21ms min, 1.94s max, 1.4s mean, 140.39s total
        * Remote cpu time: 634.07ms min, 825.47ms max, 719.87ms mean, 71.99s total
        * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
        * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

        Substage 1 sort_reduce: 100/100 blocks executed
        * Remote wall time: 125.66ms min, 2.3s max, 1.09s mean, 109.26s total
        * Remote cpu time: 96.17ms min, 1.34s max, 725.43ms mean, 72.54s total
        * Output num rows: 178073 min, 2313038 max, 1250000 mean, 125000000 total
        * Output size bytes: 1446844 min, 18793434 max, 10156250 mean, 1015625046 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

with polars

$ python release/nightly_tests/dataset/sort.py --partition-size=1e7 --num-partitions=100
Dataset size: 100 partitions, 0.01GB partition size, 1.0GB total
Finished in 24.097432136535645
...
Stage 2 sort: executed in 14.02s

        Substage 0 sort_map: 100/100 blocks executed
        * Remote wall time: 165.15ms min, 595.46ms max, 398.01ms mean, 39.8s total
        * Remote cpu time: 349.75ms min, 423.81ms max, 383.29ms mean, 38.33s total
        * Output num rows: 1250000 min, 1250000 max, 1250000 mean, 125000000 total
        * Output size bytes: 10000000 min, 10000000 max, 10000000 mean, 1000000000 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

        Substage 1 sort_reduce: 100/100 blocks executed
        * Remote wall time: 21.21ms min, 472.34ms max, 232.1ms mean, 23.21s total
        * Remote cpu time: 29.81ms min, 460.67ms max, 238.1ms mean, 23.81s total
        * Output num rows: 114079 min, 2591410 max, 1250000 mean, 125000000 total
        * Output size bytes: 912632 min, 20731280 max, 10000000 mean, 1000000000 total
        * Tasks per node: 100 min, 100 max, 100 mean; 1 nodes used

Related issue number

Closes #23612.
2022-05-12 18:35:50 -07:00
Kai Yang
4a999777fa
[Core] Allow accepting gRPC HTTP proxy via env variable (#23526) 2022-05-10 11:30:46 +08:00
Kai Fricke
5d9bf4234a
[air] Example to track runs with Weights & Biases (#24459)
This PR
- adds an example of how to run Ray Train and log results to Weights & Biases
- adds functionality to the W&B plugin to store checkpoints
- fixes a bug introduced in #24017
- adds a CI utility script to set up credentials
- adds a CI utility script to remove test state from external services cc @simon-mo
2022-05-06 15:52:37 +01:00
mwtian
b02029b29f
[Core] allow using grpcio > 1.44.0 (#23722) 2022-05-04 19:06:11 -07:00
Siyuan (Ryans) Zhuang
309fef68c5
[core] Fix internal storage S3 bugs (#24167)
* fix storage

* fix windows
2022-04-27 09:57:14 -07:00
SangBin Cho
30ab5458a7
[State Observability] Tasks and Objects API (#23912)
This PR implements ray list tasks and ray list objects APIs.

NOTE: You can ignore the merge conflict for now. It is because the first PR was reverted. There's a fix PR open now.
2022-04-21 18:45:03 -07:00
mwtian
2a5c40a149
[Core] remove Windows compatibility for Redis (#23991)
There should be no references to Redis in Python anymore, except in parts of the bootstrap code path.

closes #23982
2022-04-19 09:16:47 -07:00
Avnish Narayan
c9df6ce70c
[RLlib] Pinning gym to 0.21 to fix test issues (#24000) 2022-04-19 08:33:31 +02:00
Akash Patel
8eb99428ce
remove unmaintained blist (#23957)
This PR removes the unused `blist` dependency, which was causing issues on the `py310` upgrade path.
2022-04-17 16:06:04 -07:00
Kai Fricke
e3bd59882d
[air] Move storage handling to pyarrow.fs.FileSystem (#23370) 2022-04-13 14:31:30 -07:00
Kai Fricke
d27e73f851
[ci] Pin prometheus_client to fix current test outages (#23749)
What: Pins prometheus_client to < 0.14.0, hopefully fixing today's CI outages
Why: New version of the python client (https://github.com/prometheus/client_python/releases) breaks our CI
2022-04-06 14:22:22 -07:00
Chen Shen
44114c8422
[CI] pin click version to fix broken test. #23544 2022-03-29 00:44:48 -07:00
ddelange
e109c13b83
[ci] Clean up ray-ml requirements (#23325)
In https://github.com/ray-project/ray/blob/ray-1.11.0/docker/ray-ml/Dockerfile, the order of pip install commands currently matters (potentially a lot). It would be good to run one big pip install command to avoid ending up with a broken env.

Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>
2022-03-25 15:59:54 +00:00
Yi Cheng
e3051ebf67
[ci] Fix grpcio 1.44 break test_output (#22494)
This PR limits grpcio to <= 1.42, which fixes test_output.
2022-02-22 13:59:25 -08:00
Matti Picus
dfe4706d73
re-remove unused opencv-python-headless (#22470)
PR #16929 removed opencv-python-headless.
PR #17158 added it back but did not use it. This was noted by [a reviewer](https://github.com/ray-project/ray/pull/17158#issuecomment-982976429) since it breaks python3.9 (no wheel is available for installation).
2022-02-22 09:45:30 -08:00
Amog Kamsetty
04feea4afe
[rllib] Upper bound gym version (#22510)
gym released 0.22 today, which is breaking many of the RLlib tests and examples. This temporarily pins the gym version.
2022-02-18 17:39:22 -08:00
Jialing He
4c73560b31
[runtime env] Support clone virtualenv from an existing virtualenv (#22309)
Before this PR, we couldn't run Ray inside a virtualenv, because `runtime_env` did not support creating a new virtualenv from an existing virtualenv.

More details: https://github.com/ray-project/ray/pull/21801#discussion_r796848499

Co-authored-by: 捕牛 <hejialing.hjl@antgroup.com>
2022-02-15 12:51:01 -06:00
Chen Shen
bb6cb0898b
[Dataset] avoid pyarrow 7.0.0 for dataset (#22253) 2022-02-11 00:32:47 -08:00
Liu Bao
824453dd17
[runtime env] Create virtualenv for pip runtime env. (#21801) 2022-02-10 12:25:18 -06:00
Max Pumperla
092598774a
[Docs] Executable notebook tutorial (#22030)
We're introducing [MyST Notebooks](https://myst-nb.readthedocs.io/en/latest/index.html) here and demonstrating how they work by rewriting (and extending) the RLlib Serve tutorial. Benefits:

- [x] Write notebooks in markdown. Can be converted into other formats e.g. with `jupytext`
- [x] Tutorials like this have a binderhub link added to the top nav (launch button).
- [x] Notebooks get executed when docs are built, so it's impossible to have stale docs.
- [x] But locally those builds are cached so that you don't have to wait too long.
- [x] The notebook cell outputs can be shown, hidden or removed. In particular, we can now avoid adding expected code output as comments in our scripts (which might get outdated).

We're also clarifying #22022.

Old tutorial: [here](https://docs.ray.io/en/latest/serve/tutorials/rllib.html)
New tutorial (preview): [here](https://ray--22030.org.readthedocs.build/en/22030/serve/tutorials/rllib.html)

Co-authored-by: simon-mo <simon.mo@hey.com>
2022-02-03 08:13:04 +00:00
Archit Kulkarni
26057c433f
[CI] pin uvicorn to 0.16.0 to fix serve (#21612) 2022-01-14 16:00:51 -08:00
mwtian
cf6a54ca46
[CI] pin pytest-asyncio (#21579) 2022-01-13 11:35:30 -08:00
Akash Patel
cbcd03b779
Upgrade cython to 0.29.26 for py310 (#21244) 2021-12-26 20:26:08 -08:00
Scott Graham
7153d58cbd
Updates to azure autoscaler for authentication and dependency updates (#19603)
* updating azure autoscaler versions and backwards compatibility, and moving to azure-identity based authentication

* adding azure sdk rqmts for tests

* updating azure test requirements and adding wrapper function for azure sdk function resolution

* adding docstring to get_azure_sdk_function

Co-authored-by: Scott Graham <scgraham@microsoft.com>
2021-12-16 09:23:32 -08:00
Hankpipi
67518bdc50
[serve] Reconfiguration bug fix (#20315)
As described in #18884, reconfiguration can mutate replica state mid-query. This PR addresses the problem by adding a read/write lock to each replica.
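
The read/write lock idea can be sketched with `asyncio`: queries take the lock as readers and may overlap, while reconfiguration takes it as a writer and waits until no query is in flight. The class and method names below are hypothetical and this is not the code from the PR:

```python
import asyncio
from contextlib import asynccontextmanager


class AsyncReadWriteLock:
    """Minimal reader/writer lock (illustrative; writers can starve)."""

    def __init__(self):
        self._cond = asyncio.Condition()
        self._readers = 0
        self._writing = False

    @asynccontextmanager
    async def reader(self):
        async with self._cond:
            await self._cond.wait_for(lambda: not self._writing)
            self._readers += 1
        try:
            yield
        finally:
            async with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @asynccontextmanager
    async def writer(self):
        async with self._cond:
            await self._cond.wait_for(
                lambda: not self._writing and self._readers == 0
            )
            self._writing = True
        try:
            yield
        finally:
            async with self._cond:
                self._writing = False
                self._cond.notify_all()


class Replica:
    """Hypothetical replica whose config must not change mid-query."""

    def __init__(self, config):
        self.config = config
        self.lock = AsyncReadWriteLock()

    async def handle_query(self):
        async with self.lock.reader():
            before = self.config
            await asyncio.sleep(0.01)  # simulate work during the query
            return before == self.config  # config stayed stable

    async def reconfigure(self, new_config):
        async with self.lock.writer():
            self.config = new_config


async def main():
    replica = Replica({"threshold": 1})
    results = await asyncio.gather(
        replica.handle_query(),
        replica.reconfigure({"threshold": 2}),
        replica.handle_query(),
    )
    return results, replica.config


results, final_config = asyncio.run(main())
```

Each query observes a single, unchanging config for its whole duration, and the reconfiguration is applied only once in-flight queries have finished.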

Co-authored-by: yuzihao.2001 <yuzihao.2001@bytedance.com>
2021-12-07 18:53:45 -08:00
shrekris-anyscale
a91ddbdeb9
Add smart_open dependency to ray[default] (#20420) 2021-11-18 10:00:30 -06:00
Simon Mo
5f2b035bba
Pin Redis version to < 4.0.0 (#20430)
## Why are these changes needed?

This pin is needed to fix `test_output` on master, which broke when 4.0.0 was released. 

It may also fix the windows build (unsure). 

2021-11-16 10:48:36 -08:00
Teofilo Zosa
abf0eb53cc
Fix aiohttp 3.8.0 breaking changes (and unpin from 3.7) (#20261) 2021-11-11 15:35:20 -08:00
Tobias Kaymak
893f57591d
[serve] Add Google Cloud Storage as a backend (#20104) 2021-11-10 19:45:19 -08:00
Avnish Narayan
026bf01071
[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535)
* Fix QMix, SAC, and MADDPG too.

* Unpin gym and deprecate pendulum v0

Many tests in RLlib depended on pendulum v0,
however in gym 0.21, pendulum v0 was deprecated
in favor of pendulum v1. This may change reward
thresholds, so we may have to rerun all of the
pendulum v1 benchmarks or use another environment
instead. The same applies to frozen lake v0 and
frozen lake v1.

Lastly, all of the RLlib tests have
been moved to Python 3.7.

* Add gym installation based on python version.

Pin python<= 3.6 to gym 0.19 due to install
issues with atari roms in gym 0.20

* Reformatting

* Fixing tests

* Move atari-py install conditional to req.txt

* migrate to new ale install method

Make parametric_actions_cartpole return float32 actions/obs

Adding type conversions if obs/actions don't match space

Add utils to make elements match gym space dtypes
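
The "gym installation based on python version" step can be illustrated with a small selector. The function name and the exact requirement strings are assumptions made for this sketch; the commit message only states that Python <= 3.6 is pinned to gym 0.19 and that gym is otherwise upgraded to 0.21:

```python
import sys


def gym_requirement(version_info=sys.version_info):
    """Return a pip requirement string for gym (hypothetical helper).

    Assumed pins: gym 0.19 for Python <= 3.6, gym 0.21 otherwise.
    """
    if tuple(version_info[:2]) <= (3, 6):
        return "gym==0.19"
    return "gym==0.21"


requirement_for_this_interpreter = gym_requirement()
```

In a requirements file the same split is usually expressed with environment markers on `python_version` rather than in code.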

Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-03 16:24:00 +01:00
Oscar Knagg
5a05e89267
[Core] Add TLS/SSL support to gRPC channels (#18631) 2021-10-20 22:39:11 -07:00
Eric Liang
1bb2b1fc49
[hotfix] Pin pyspark dep to 3.1.2 2021-10-18 13:10:06 -07:00
Matti Picus
f372bb07aa
Enable dashboard on Windows (#19319) 2021-10-14 14:42:22 -07:00
Antoni Baum
3cb0862152
Fix double gym in requirements (#19357) 2021-10-13 21:43:41 +01:00