Commit graph

10224 commits

Author SHA1 Message Date
Amog Kamsetty
adb8d77b2b
[Deps] Bump tensorflow on Docker image and add Codeowners (#20041) 2021-11-05 00:58:34 -07:00
dependabot[bot]
60e9737679
[tune](deps): Bump mlflow in /python/requirements/ml (#19913)
Bumps [mlflow](https://github.com/mlflow/mlflow) from 1.19.0 to 1.21.0.
- [Release notes](https://github.com/mlflow/mlflow/releases)
- [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/mlflow/mlflow/compare/v1.19.0...v1.21.0)

---
updated-dependencies:
- dependency-name: mlflow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-11-04 23:37:01 -07:00
dependabot[bot]
9897ee0eab
[tune](deps): Bump onnxruntime in /python/requirements/ml (#19666)
Bumps [onnxruntime](https://github.com/microsoft/onnxruntime) from 1.8.0 to 1.9.0.
- [Release notes](https://github.com/microsoft/onnxruntime/releases)
- [Changelog](https://github.com/microsoft/onnxruntime/blob/master/docs/ReleaseManagement.md)
- [Commits](https://github.com/microsoft/onnxruntime/compare/v1.8.0...v1.9.0)

---
updated-dependencies:
- dependency-name: onnxruntime
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-11-04 23:34:48 -07:00
dependabot[bot]
f214c4a4ab
[tune](deps): Bump datasets from 1.11.0 to 1.14.0 in /python/requirements/ml (#19645)
* [tune](deps): Bump datasets in /python/requirements/ml

Bumps [datasets](https://github.com/huggingface/datasets) from 1.11.0 to 1.14.0.
- [Release notes](https://github.com/huggingface/datasets/releases)
- [Commits](https://github.com/huggingface/datasets/compare/1.11.0...1.14.0)

---
updated-dependencies:
- dependency-name: datasets
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update requirements_tune.txt

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-11-04 23:33:55 -07:00
Clark Zinzow
6ade6f0be6
[Datasets] Multi-aggregations [1/3]: Add basic support for groupby multi-aggregations. (#20044) 2021-11-04 22:48:49 -07:00
mwtian
fb0ede38ba
[CI] [macOS] avoid installing latest setuptools (#20064) 2021-11-04 21:35:03 -07:00
architkulkarni
c5175073b2
[runtime env] Add garbage collection for conda envs (#20072) 2021-11-04 23:13:34 -05:00
Edward Oakes
360993612c
[serve] Remove lingering backend references (#20085) 2021-11-04 20:32:13 -05:00
Eric Liang
6102912494
Dataset doc updates (#19815) 2021-11-04 18:13:40 -07:00
SangBin Cho
44b38e9aa1
Add Chaos testing fixture + test actor tasks chaos test in CI (#19975)
* Basic CI tests done

* Fix an issue

* shutdown to conftest

* Addressed code review.
2021-11-04 16:27:35 -07:00
Simon Mo
4d583da7d5
[Serve] Add verbose log for nightly test only (#20088) 2021-11-04 16:15:22 -07:00
Edward Oakes
65161fe9b4
[job submission] Move HTTP routes to /api/jobs prefix (#19995) 2021-11-04 17:45:25 -05:00
javi-redondo
11371768c1
Update Ray client docs with working_dir explanation (#18294) 2021-11-04 14:52:28 -07:00
SangBin Cho
56bab61fba
[Placement group] Raise an exception when invalid resources are specified with the placement group. (#19680)
* done

* Make it work

* Fix issues

* done

* try

* done

* Fix remaining bugs.
2021-11-04 14:41:00 -07:00
Eric Liang
585d472fdf
Add configuration context to dataset (#19907) 2021-11-04 14:36:51 -07:00
Alex Wu
4ffb7ccfac
[scheduler][cleanup] Remove one cpu optimization (#20022)
* .

* remove test

* Update cluster_task_manager.cc

* Update cluster_task_manager.cc

* lint

* lint

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-11-04 14:18:13 -07:00
Edward Oakes
49d308138f
[serve] Rename backend_state -> deployment_state (#20040) 2021-11-04 15:46:45 -05:00
javi-redondo
5781f44cc9
Update config.rst to reflect min nodes on autoscaling up (#15589) 2021-11-04 13:40:50 -07:00
Yi Cheng
04f60c998e
[nightly] Fix pytest missing in nightly test (#20076)
## Why are these changes needed?
In the nightly test we see
```
Command returned non-success status: 1; Command logs:
Traceback (most recent call last):
  File "dask_on_ray/large_scale_test.py", line 17, in <module>
    from ray._private.test_utils import monitor_memory_usage
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/test_utils.py", line 18, in <module>
    import pytest
ModuleNotFoundError: No module named 'pytest'
```
This PR fixes this error.
2021-11-04 13:38:05 -07:00
Philipp Moritz
a64e32c53b
[docs] Fix broken links in documentation and add linkcheck to documentation (#20030)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-11-04 13:19:43 -07:00
Sven Mika
50c30f89c6
[Tune; RLlib] Move Tune tests that use RLlib into separate buildkite job. (#20016) 2021-11-04 20:40:57 +01:00
Jiao
6cfb52ff1d
[job submission] Add stop API + subprocess cleanup (#19860) 2021-11-04 13:59:47 -05:00
Alex Wu
36a214386f
[docs] PyData ray dataset talk (#20038) 2021-11-04 10:33:18 -07:00
Yi Cheng
65d3054a09
[build] fix the wrong flag for gcs ha test (#20052)
## Why are these changes needed?
It should be `RAY_gcs_grpc_based_pubsub` instead of `Ray_gcs_grpc_based_pubsub`
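
Environment variable names are case-sensitive on Linux and macOS, so a flag exported with the wrong capitalization is silently ignored. A minimal sketch of that failure mode follows. It uses a plain dict in place of the real environment, and the `read_flag` helper is hypothetical: it does not reproduce how Ray parses its configuration.

```python
def read_flag(name, environ):
    """Return True if `name` is present in `environ` with a truthy value.

    Hypothetical helper for illustration only; not Ray's actual parser.
    """
    return environ.get(name, "").lower() in ("1", "true")


# Flag exported with the wrong prefix capitalization: the lookup misses it.
wrong_env = {"Ray_gcs_grpc_based_pubsub": "true"}
misspelled_seen = read_flag("RAY_gcs_grpc_based_pubsub", wrong_env)

# Flag exported with the expected name: the lookup finds it.
right_env = {"RAY_gcs_grpc_based_pubsub": "true"}
correct_seen = read_flag("RAY_gcs_grpc_based_pubsub", right_env)
```

Here `misspelled_seen` is `False` and `correct_seen` is `True`, which is why the misnamed flag had no effect on the build.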

2021-11-04 09:59:11 -07:00
Yi Cheng
7bb4c87780
[gcs] use gcs kv in internal kv (#19933)
## Why are these changes needed?
This is part of the Redis removal project. This PR focuses on using gcs kv in internal kv.

- A gcs client is introduced.
- Internal kv is updated to use the gcs rpc client based kv.
- Related code is updated.

A separate PR will update the components that use Redis to use internal kv.

## Related issue number
https://github.com/ray-project/ray/issues/19443
2021-11-04 09:57:39 -07:00
Chen Shen
5c0e012ba3
[Core][Core-worker] de-escalate the error message when worker is accessed after shutdown. (#20049)
* de-escalate

* lint
2021-11-04 09:56:39 -07:00
Yi Cheng
b3b88a46f7
[pg] Fix the test case which hangs because of scheduling deadlock (#20048)
## Why are these changes needed?
In this test case, the following sequence could happen:

1. Actor creation first uses all resources on the local node, which is a GPU node.
2. The actor that needs a GPU can then no longer be scheduled, since there is only one GPU node.

This is only a short-term fix: the test now connects only to the head node, which has CPU resources.

## Related issue number
#19438
2021-11-04 09:56:23 -07:00
Amog Kamsetty
f67b526b7a
[Tune] Fix PTL tutorial docs (#19999) 2021-11-04 09:21:28 -07:00
xwjiang2010
f1179cbccd
[tune] Remove unused clean_trial_placement_group. (#19960) 2021-11-04 08:55:42 -07:00
architkulkarni
bcb63961d9
[runtime env] Add plugin name to internal URI format and add GC for py_modules (#20009) 2021-11-04 10:16:14 -05:00
Jiajun Yao
7e013366ac
[scheduler] Update local object store usage (#20026)
* Update local object store usage
2021-11-04 08:11:32 -07:00
Sven Mika
4cb23d1c95
[Tune; Testing] Revert to 3.7 (undone by accident by previous PR); + some minor comment cleanups. (#20031) 2021-11-04 10:58:34 +01:00
mwtian
a26474156d
Use GCC 9 in GPU docker (#20024) 2021-11-03 22:53:17 -07:00
SangBin Cho
8d115b96b5
[Tests] Try deflaking test placement group mini integration. (#19886)
* done

* fix
2021-11-03 20:54:59 -07:00
Qing Wang
4373aa1e3b
Support generating a UUID string as the anonymous namespace for Java worker. (#19986)
Why are these changes needed?
For the Java worker, we generate a UUID string as the namespace if the user does not specify a namespace for the job.
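
The fallback described above can be sketched as follows. This is a Python illustration only, since the change itself lives in the Java worker. The function name `resolve_namespace` is hypothetical, and treating an empty string as "not specified" is an assumption.

```python
import uuid
from typing import Optional


def resolve_namespace(user_namespace: Optional[str]) -> str:
    """Return the user's namespace, or a fresh UUID string if none was given.

    Hypothetical helper; the real logic is in the Java worker.
    """
    if user_namespace:  # assumption: None and "" both mean "not specified"
        return user_namespace
    # uuid4() is random, so each anonymous job gets its own namespace.
    return str(uuid.uuid4())


named = resolve_namespace("my_namespace")
anonymous = resolve_namespace(None)
```

Because each generated namespace is a fresh random UUID, two jobs that omit the namespace do not end up sharing one.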

Related issue number
#16474
2021-11-04 11:40:17 +08:00
Alex Wu
b52cc6ecdc
[scheduler] Fix dynamically increasing resources (#20036) 2021-11-03 17:28:44 -07:00
gjoliver
2c1fa459d4
[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807)
* Add an RLlib Tune experiment to UserTest suite.

* Add ray.init()

* Move example script to example/tune/, so it can be imported as module.

* add __init__.py so our new module will get included in python wheel.

* Add block device to RLlib test instances.

* Reduce disk size a little bit.

* Add metrics reporting

* Allow max of 5 workers to accommodate all the worker tasks.

* revert disk size change.

* Minor updates

* Trigger build

* set max num workers

* Add a compute cfg for autoscaled cpu and gpu nodes.

* use 1gpu instance.

* install tblib for debugging worker crashes.

* Manually upgrade to pytorch 1.9.0

* -y

* torch=1.9.0

* install torch on driver

* bump timeout

* Write a more informational result dict.

* Revert changes to compute config files that are not used.

* add smoke test

* update

* reduce timeout

* Reduce the # of env per worker to 1.

* Small fix for getting trial_states

* Trigger build

* simplify result dict

* lint

* more lint

* fix smoke test

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-03 17:04:27 -07:00
Edward Oakes
91c730efd0
[serve] Rename backend -> deployment in replica.py (#20020) 2021-11-03 17:46:10 -05:00
Amog Kamsetty
ede9d0ed76
[CI] Pin keras (#20032)
* try fix

* try again

* revert back

* add todo
2021-11-03 15:32:10 -07:00
Clark Zinzow
a0841106ff
[Datasets] Follow-up to groupby standard deviation PR (#20035) 2021-11-03 13:56:34 -07:00
SangBin Cho
eacaff5d8d
[Core] Try scheduling tasks when the local resource creation is updated (#20019)
## Why are these changes needed?

Check https://github.com/ray-project/ray/pull/19996/files#r741963616
2021-11-03 12:29:51 -07:00
Richard Hamnett
f4256a4ddc
[Doc] Update installation.rst for nightly build (#20034)
Ensure clean removal of previous ray nightly before updating.
2021-11-03 12:05:07 -07:00
mwtian
f83195a1e1
[Build] Add GCS HA builds (#20008)
## Why are these changes needed?
Add builds for Python tests with GCS pubsub enabled.
2021-11-03 11:58:16 -07:00
Clark Zinzow
665954d48c
Add standard deviation aggregation. (#20010) 2021-11-03 11:38:23 -07:00
Yi Cheng
555b87d552
[gcs] Enable grpc based broadcasting by default (#19716)
## Why are these changes needed?
This is part of the Redis removal project. This PR enables gRPC-based broadcasting by default.

## Related issue number

#19438
2021-11-03 10:02:37 -07:00
Alex Wu
3d7d341dd0
[test] Fix test_actor_scheduling_not_block_with_placement_group (missing num_cpus=1) (#20006) 2021-11-03 09:08:50 -07:00
Jiajun Yao
5de4a38948
[CI] Run Java CI on Mac (#19757)
Why are these changes needed?
Enable Java tests on Mac CI to avoid more breakages.

Related issue number
Closes #19700
2021-11-03 23:40:05 +08:00
Avnish Narayan
026bf01071
[RLlib] Upgrade gym version to 0.21 and deprecate pendulum-v0. (#19535)
* Fix QMix, SAC, and MADDPG too.

* Unpin gym and deprecate pendulum v0

Many tests in RLlib depended on pendulum v0,
but gym 0.21 deprecated pendulum v0
in favor of pendulum v1. This may change reward
thresholds, so we may have to rerun
all of the pendulum v1 benchmarks or use another
environment instead. The same applies to frozen
lake v0 and frozen lake v1.

Lastly, all of the RLlib tests have
been moved to Python 3.7.

* Add gym installation based on python version.

Pin python<= 3.6 to gym 0.19 due to install
issues with atari roms in gym 0.20

* Reformatting

* Fixing tests

* Move atari-py install conditional to req.txt

* migrate to new ale install method

Make parametric_actions_cartpole return float32 actions/obs

Adding type conversions if obs/actions don't match space

Add utils to make elements match gym space dtypes

Co-authored-by: Jun Gong <jungong@anyscale.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-11-03 16:24:00 +01:00
Edward Oakes
e1e0cb5eaa
[serve] Rename backend tag -> deployment name (#19997) 2021-11-03 09:49:52 -05:00
Edward Oakes
b2ddea255d
[job submission] Add job submission ID + status to /api/snapshot (#19994) 2021-11-03 09:49:28 -05:00