Commit graph

10433 commits

Author SHA1 Message Date
xwjiang2010
866fa9590f
[tune] clean up legacy branch in update_avail_resources. (#20071) 2021-11-05 10:28:46 -07:00
matthewdeng
78e9ff7c91
[train][datasets] add example for big data training (#20042)
* [train][datasets] add example for big data training

* add title docstring

* lint and dependencies

* add dask_ml requirement
2021-11-05 09:28:48 -07:00
Chen Shen
320f9dc234
[Core][CoreWorker] increase the default port range (#19541)
* increase the port range

* Update doc/source/configure.rst

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2021-11-05 09:25:44 -07:00
Alex Wu
146b3d6bcc
[scheduler] Include depth and function descriptor in scheduling class (#20004) 2021-11-05 08:19:48 -07:00
Simon Mo
3d5cbc6e62
[Serve] Fix HTTP error handling behavior and add tests (#20093) 2021-11-05 10:15:54 -05:00
Sven Mika
a931076f59
[RLlib] Tf2 + eager-tracing same speed as framework=tf; Add more test coverage for tf2+tracing. (#19981) 2021-11-05 16:10:00 +01:00
gjoliver
1341bb59bf
[RLlib; Release testing] long_running_tests should use RLlib's app_config. (#20095) 2021-11-05 15:18:56 +01:00
SangBin Cho
8299aae918
[Placement Group] Add stats to pg scheduling (#19841)
* Add an e2e stats to pg scheduling

* Fix bugs.

* fix a bug.

* Revert "fix a bug."

This reverts commit dd7e03d1346fa39e54898effaaf8a2771103176e.

* done except unit tests.

* done except unit tests.

* Add unit tests.

* Address code review.

* done

* Fix

* done

* Fixed the test
2021-11-05 06:51:42 -07:00
Sven Mika
f3397b6f48
[RLlib] Minor fixes/cleanups; chop_into_sequences now handles nested data. (#19408) 2021-11-05 14:39:28 +01:00
Amog Kamsetty
adb8d77b2b
[Deps] Bump tensorflow on Docker image and add Codeowners (#20041) 2021-11-05 00:58:34 -07:00
dependabot[bot]
60e9737679
[tune](deps): Bump mlflow in /python/requirements/ml (#19913)
Bumps [mlflow](https://github.com/mlflow/mlflow) from 1.19.0 to 1.21.0.
- [Release notes](https://github.com/mlflow/mlflow/releases)
- [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/mlflow/mlflow/compare/v1.19.0...v1.21.0)

---
updated-dependencies:
- dependency-name: mlflow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-11-04 23:37:01 -07:00
dependabot[bot]
9897ee0eab
[tune](deps): Bump onnxruntime in /python/requirements/ml (#19666)
Bumps [onnxruntime](https://github.com/microsoft/onnxruntime) from 1.8.0 to 1.9.0.
- [Release notes](https://github.com/microsoft/onnxruntime/releases)
- [Changelog](https://github.com/microsoft/onnxruntime/blob/master/docs/ReleaseManagement.md)
- [Commits](https://github.com/microsoft/onnxruntime/compare/v1.8.0...v1.9.0)

---
updated-dependencies:
- dependency-name: onnxruntime
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-11-04 23:34:48 -07:00
dependabot[bot]
f214c4a4ab
[tune](deps): Bump datasets from 1.11.0 to 1.14.0 in /python/requirements/ml (#19645)
* [tune](deps): Bump datasets in /python/requirements/ml

Bumps [datasets](https://github.com/huggingface/datasets) from 1.11.0 to 1.14.0.
- [Release notes](https://github.com/huggingface/datasets/releases)
- [Commits](https://github.com/huggingface/datasets/compare/1.11.0...1.14.0)

---
updated-dependencies:
- dependency-name: datasets
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update requirements_tune.txt

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-11-04 23:33:55 -07:00
Clark Zinzow
6ade6f0be6
[Datasets] Multi-aggregations [1/3]: Add basic support for groupby multi-aggregations. (#20044) 2021-11-04 22:48:49 -07:00
mwtian
fb0ede38ba
[CI] [macOS] avoid installing latest setuptools (#20064) 2021-11-04 21:35:03 -07:00
architkulkarni
c5175073b2
[runtime env] Add garbage collection for conda envs (#20072) 2021-11-04 23:13:34 -05:00
Edward Oakes
360993612c
[serve] Remove lingering backend references (#20085) 2021-11-04 20:32:13 -05:00
Eric Liang
6102912494
Dataset doc updates (#19815) 2021-11-04 18:13:40 -07:00
SangBin Cho
44b38e9aa1
Add Chaos testing fixture + test actor tasks chaos test in CI (#19975)
* Basic CI tests done

* Fix an issue

* shutdown to conftest

* Addressed code review.
2021-11-04 16:27:35 -07:00
Simon Mo
4d583da7d5
[Serve] Add verbose log for nightly test only (#20088) 2021-11-04 16:15:22 -07:00
Edward Oakes
65161fe9b4
[job submission] Move HTTP routes to /api/jobs prefix (#19995) 2021-11-04 17:45:25 -05:00
javi-redondo
11371768c1
Update Ray client docs with working_dir explanation (#18294) 2021-11-04 14:52:28 -07:00
SangBin Cho
56bab61fba
[Placement group] Raise an exception when invalid resources are specified with the placement group. (#19680)
* done

* Make it work

* Fix issues

* done

* try

* done

* Fix remaining bugs.
2021-11-04 14:41:00 -07:00
Eric Liang
585d472fdf
Add configuration context to dataset (#19907) 2021-11-04 14:36:51 -07:00
Alex Wu
4ffb7ccfac
[scheduler][cleanup] Remove one cpu optimization (#20022)
* .

* remove test

* Update cluster_task_manager.cc

* Update cluster_task_manager.cc

* lint

* lint

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-11-04 14:18:13 -07:00
Edward Oakes
49d308138f
[serve] Rename backend_state -> deployment_state (#20040) 2021-11-04 15:46:45 -05:00
javi-redondo
5781f44cc9
Update config.rst to reflect min nodes on autoscaling up (#15589) 2021-11-04 13:40:50 -07:00
Yi Cheng
04f60c998e
[nightly] Fix pytest missing in nightly test (#20076)
## Why are these changes needed?
In the nightly test we see
```
Command returned non-success status: 1; Command logs:
Traceback (most recent call last):
  File "dask_on_ray/large_scale_test.py", line 17, in <module>
    from ray._private.test_utils import monitor_memory_usage
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/test_utils.py", line 18, in <module>
    import pytest
ModuleNotFoundError: No module named 'pytest'
```
This PR fixes this error.

## Related issue number
2021-11-04 13:38:05 -07:00
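The traceback in 04f60c998e above comes from a module-level `import pytest` in a utility module that is also imported outside of test runs. The commit message does not say how the PR resolves it, so the following is only a sketch of one common way to tolerate a missing optional dependency; the helper name `optional_import` is hypothetical and is not part of Ray.

```python
import importlib
from types import ModuleType
from typing import Optional


def optional_import(name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if it is not installed.

    Hypothetical helper for illustration only; not Ray's API.
    """
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        return None


# A module that is only needed by test helpers can be looked up lazily,
# so importing the utility module itself never fails.
pytest_module = optional_import("pytest")
if pytest_module is None:
    print("pytest is not installed; test-only helpers are disabled")
```

The other obvious option is to install the missing package in the environment that runs the nightly test.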
Philipp Moritz
a64e32c53b
[docs] Fix broken links in documentation and add linkcheck to documentation (#20030)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-11-04 13:19:43 -07:00
Sven Mika
50c30f89c6
[Tune; RLlib] Move Tune tests that use RLlib into separate buildkite job. (#20016) 2021-11-04 20:40:57 +01:00
Jiao
6cfb52ff1d
[job submission] Add stop API + subprocess cleanup (#19860) 2021-11-04 13:59:47 -05:00
Alex Wu
36a214386f
[docs] PyData ray dataset talk (#20038) 2021-11-04 10:33:18 -07:00
Yi Cheng
65d3054a09
[build] fix the wrong flag for gcs ha test (#20052)
## Why are these changes needed?
It should be `RAY_gcs_grpc_based_pubsub` instead of `Ray_gcs_grpc_based_pubsub`

## Related issue number
2021-11-04 09:59:11 -07:00
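The flag fix in 65d3054a09 above matters because environment variable names are matched exactly, including case, on POSIX systems. A minimal sketch, assuming the flag is read from an environment-style mapping; the function `flag_enabled` and the set of accepted truthy values are assumptions for illustration, not Ray's actual config parsing.

```python
from typing import Mapping


def flag_enabled(name: str, environ: Mapping[str, str]) -> bool:
    """Look up `name` exactly as spelled; no case folding is applied."""
    return environ.get(name, "").strip().lower() in {"1", "true"}


# Hypothetical environments: one with the misspelled prefix from the
# commit message, one with the corrected prefix.
misspelled = {"Ray_gcs_grpc_based_pubsub": "true"}
corrected = {"RAY_gcs_grpc_based_pubsub": "true"}

print(flag_enabled("RAY_gcs_grpc_based_pubsub", misspelled))  # False
print(flag_enabled("RAY_gcs_grpc_based_pubsub", corrected))   # True
```

With the misspelled name the lookup silently falls back to the default, so a test can appear to pass without ever enabling the feature it was meant to exercise.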
Yi Cheng
7bb4c87780
[gcs] use gcs kv in internal kv (#19933)
## Why are these changes needed?
This is part of the Redis removal project. This PR focuses on using gcs kv in internal kv.

- a gcs client is introduced
- internal kv is updated to use the gcs rpc client based kv
- related code is updated

A separate PR will update the components that currently use Redis to use internal kv.

## Related issue number
https://github.com/ray-project/ray/issues/19443
2021-11-04 09:57:39 -07:00
Chen Shen
5c0e012ba3
[Core][Core-worker] de-escalate the error message when worker is accessed after shutdown. (#20049)
* de-escalate

* lint
2021-11-04 09:56:39 -07:00
Yi Cheng
b3b88a46f7
[pg] Fix the test case which hangs because of scheduling deadlock (#20048)
## Why are these changes needed?
In this test case, the following could happen:

1. Actor creation first uses all resources on the local node, which is a GPU node.
2. The actor that needs a GPU then cannot be scheduled, since there is only one GPU node.

This is only a short-term fix: the test now connects only to the head node, which has CPU resources.

## Related issue number
#19438
2021-11-04 09:56:23 -07:00
Amog Kamsetty
f67b526b7a
[Tune] Fix PTL tutorial docs (#19999) 2021-11-04 09:21:28 -07:00
xwjiang2010
f1179cbccd
[tune] Remove unused clean_trial_placement_group. (#19960) 2021-11-04 08:55:42 -07:00
architkulkarni
bcb63961d9
[runtime env] Add plugin name to internal URI format and add GC for py_modules (#20009) 2021-11-04 10:16:14 -05:00
Jiajun Yao
7e013366ac
[scheduler] Update local object store usage (#20026)
* Update local object store usage

* Update local object store usage

* Update local object store usage
2021-11-04 08:11:32 -07:00
Sven Mika
4cb23d1c95
[Tune; Testing] Revert to 3.7 (undone by accident by previous PR); + some minor comment cleanups. (#20031) 2021-11-04 10:58:34 +01:00
mwtian
a26474156d
Use GCC 9 in GPU docker (#20024) 2021-11-03 22:53:17 -07:00
SangBin Cho
8d115b96b5
[Tests] Try deflaking test placement group mini integration. (#19886)
* done

* fix
2021-11-03 20:54:59 -07:00
Qing Wang
4373aa1e3b
Support generating a UUID string as the anonymous namespace for Java worker. (#19986)
Why are these changes needed?
For the Java worker, we generate a UUID string as the namespace if the user does not specify a namespace for the job.

Related issue number
#16474
2021-11-04 11:40:17 +08:00
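The fallback described in 4373aa1e3b above can be sketched as follows. This is a Python illustration of the idea only, not the Java worker's code; the function name `resolve_namespace` is hypothetical, and the choice of a random (version 4) UUID is an assumption.

```python
import uuid
from typing import Optional


def resolve_namespace(user_namespace: Optional[str]) -> str:
    """Return the user's namespace, or a fresh UUID string if none was given.

    Hypothetical helper; the real Java worker logic may differ.
    """
    if user_namespace:
        return user_namespace
    # Assumption: a random UUID serves as the anonymous namespace.
    return str(uuid.uuid4())


print(resolve_namespace("analytics"))  # analytics
print(resolve_namespace(None))         # a random UUID string, different each call
```

Because each anonymous job gets its own UUID, jobs that omit a namespace do not end up sharing one by accident.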
Alex Wu
b52cc6ecdc
[scheduler] Fix dynamically increasing resources (#20036) 2021-11-03 17:28:44 -07:00
gjoliver
2c1fa459d4
[RLlib] Add an RLlib Tune experiment to UserTest suite. (#19807)
* Add an RLlib Tune experiment to UserTest suite.

* Add ray.init()

* Move example script to example/tune/, so it can be imported as module.

* add __init__.py so our new module will get included in python wheel.

* Add block device to RLlib test instances.

* Reduce disk size a little bit.

* Add metrics reporting

* Allow max of 5 workers to accommodate all the worker tasks.

* revert disk size change.

* Minor updates

* Trigger build

* set max num workers

* Add a compute cfg for autoscaled cpu and gpu nodes.

* use 1gpu instance.

* install tblib for debugging worker crashes.

* Manually upgrade to pytorch 1.9.0

* -y

* torch=1.9.0

* install torch on driver

* bump timeout

* Write a more informational result dict.

* Revert changes to compute config files that are not used.

* add smoke test

* update

* reduce timeout

* Reduce the # of env per worker to 1.

* Small fix for getting trial_states

* Trigger build

* simplify result dict

* lint

* more lint

* fix smoke test

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-11-03 17:04:27 -07:00
Edward Oakes
91c730efd0
[serve] Rename backend -> deployment in replica.py (#20020) 2021-11-03 17:46:10 -05:00
Amog Kamsetty
ede9d0ed76
[CI] Pin keras (#20032)
* try fix

* try again

* revert back

* add todo
2021-11-03 15:32:10 -07:00
Clark Zinzow
a0841106ff
[Datasets] Follow-up to groupby standard deviation PR (#20035) 2021-11-03 13:56:34 -07:00
SangBin Cho
eacaff5d8d
[Core] Try scheduling tasks when the local resource creation is updated (#20019)
## Why are these changes needed?

Check https://github.com/ray-project/ray/pull/19996/files#r741963616

## Related issue number
2021-11-03 12:29:51 -07:00