5582 commits

Author SHA1 Message Date
architkulkarni
923131ba37
[runtime env] Enable reference counting for URIs for actors (#20165) 2021-11-10 10:52:03 -08:00
matthewdeng
790e22f9ad
[tune] move force_on_current_node to ml_utils (#20211) 2021-11-10 10:21:24 -08:00
DK.Pino
20f126896e
[Placement Group] [Test] Add fractional resources test for placement group (#20185)
* add fractional resources test

* lint
2021-11-10 07:25:49 -08:00
Kai Fricke
4e3e213549
[tune] Allow more versatile experiment analysis loading (#20181) 2021-11-10 11:46:27 +00:00
DK.Pino
2c41936a39
[Placement Group] [Test] Fix pg.ready hang forever when gcs restarting (#20063)
* fixed

* lint

* fix comment

* revert previous fix code
2021-11-10 00:53:42 -08:00
SangBin Cho
3bae6b94b3
[test] Fix flaky chaos_test.py (#20202)
* Fix

* fix lint
2021-11-10 00:23:55 -08:00
Edward Oakes
5475bb054c
[job submission] Redirect stdout + stderr to a single log file (#20208) 2021-11-09 22:34:12 -08:00
Jiajun Yao
5ffa0bb01f
Listen on 127.0.0.1 if node ip is 127.0.0.1 (#20190) 2021-11-09 20:24:05 -08:00
Sungho Joo
dc51af798c
[RLlib] Minor fix on json encoding during worker sampling (#20134)
* import custom json encoder from util and improve encoder default function

* linting
2021-11-09 16:46:41 -08:00
matthewdeng
33af739bf2
[train] add placement group support (#20091)
* [train] add placement group support

* fix additional resources

* fix tests

* add comment to add_workers
2021-11-09 16:36:07 -08:00
Edward Oakes
f6399e3389
[job submission] Remove jobs intermediate directory for logs (#20192) 2021-11-09 16:20:40 -08:00
Edward Oakes
39b3eb9763
[serve] Don't halt main control loop due to exceptions in snapshot logic (#20151) 2021-11-09 14:46:15 -08:00
Zyiqin-Miranda
333d0b43fd
[autoscaler] AWS Autoscaler CloudWatch Integration (#18619) 2021-11-09 11:48:55 -08:00
SangBin Cho
b0550aa440
[Core] Fix the named actor get or create race condition (#20126)
* Fix done.

* Fixed.

* clean up

* Done
2021-11-09 02:27:54 -08:00
Edward Oakes
c04e5af1eb
[job submission] Rename log files to job-driver-{job_id}.{out,err} (#20170) 2021-11-08 23:10:56 -08:00
Edward Oakes
50f2cf8a74
[job submission] Allow passing job_id, return DOES_NOT_EXIST when applicable (#20164) 2021-11-08 23:10:27 -08:00
Jiao
d46caa9856
[job submission] Remove test_utils dependency (#20168)
Co-authored-by: Jiao Dong <jiaodong@anyscale.com>
2021-11-08 23:08:43 -08:00
SangBin Cho
5c4fb4dc91
[Core]Chaos testing nightly (#20059)
* Done initial stage.

* lint

* .

* Finished.

* Fix lint
2021-11-08 21:57:53 -08:00
Stephanie Wang
ffcc5935d7
[core] Evict lineage to bound memory usage (#19946)
* bound lineage

* Bound lineage in bytes

* test

* Lineage evicted error

* Lineage evicted

* lint

* test

* test

* comment

* doc

* x

* x

* x

* x
2021-11-08 21:53:40 -08:00
architkulkarni
e5e62d8991
[runtime env] Fix runtime env conda test and enable it in CI (#20121) 2021-11-08 18:33:19 -08:00
Lixin Wei
8e666ca1e9
[Core] Fix Used Memory Calculation (#20127)
* fix memory

* fix
2021-11-08 17:36:32 -08:00
Kai Fricke
9c2b8c8501
[tune] Deprecate DurableTrainable (#19880) 2021-11-08 20:56:07 +00:00
Amog Kamsetty
f8430e6eca
[CI] Pin shortuuid to fix CI (#20153) 2021-11-08 12:08:32 -08:00
Amog Kamsetty
b1f24768a1
[Tune] More fixes to PTL Tutorial (#20065)
* ptl-fix-2

* improve

* fix
2021-11-08 09:13:44 -08:00
Gagandeep Singh
31812d026c
Bumped time limit for test_worker_startup_count in test_basic_3.py (#20056)
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2021-11-08 09:02:28 -08:00
Kai Yang
e84391d1d3
[Core] Encode job ID in randomized task IDs for user-created threads (#19320)
## Why are these changes needed?

Currently, `WorkerContext::GetCurrentTaskID()` returns a random task ID in user-created threads, and the returned task ID doesn't include the job ID. As a result, subsequent non-actor tasks, their return values, and objects created by `ray.put()` don't include the job ID either. This makes it hard to find the correct job ID from a task or object ID.

This PR updates the task ID generation code to always encode the job ID.

A side effect of this PR is a higher probability of task ID collision in user-created threads, because the job ID part is now fixed. Without this PR, a collision is expected after about `sqrt(pi * 256 ^ 12 / 2)` ~= 352 trillion tasks; with this PR, after about `sqrt(pi * 256 ^ 8 / 2)` ~= 5 billion tasks. This should be OK because the job ID part of task IDs in non-user-created threads is always fixed, so it won't be worse than in non-user-created threads.
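
The two estimates above follow the birthday-problem approximation `sqrt(pi * N / 2)`, where `N` is the number of possible random values. As a rough check, the sketch below evaluates it for 12 and 8 random bytes, the sizes implied by the formulas in this message. The function name `expected_ids_before_collision` is hypothetical and is not part of Ray.

```python
import math


def expected_ids_before_collision(random_bytes: int) -> float:
    """Approximate count of uniformly random IDs drawn before the first collision.

    Birthday-problem estimate: sqrt(pi * N / 2), with N = 256 ** random_bytes.
    (Hypothetical helper for illustration only; not a Ray API.)
    """
    id_space = 256 ** random_bytes
    return math.sqrt(math.pi * id_space / 2)


# 12 random bytes: the task ID part is fully random (without this PR).
without_pr = expected_ids_before_collision(12)
# 8 random bytes: part of the ID is taken by the fixed job ID (with this PR).
with_pr = expected_ids_before_collision(8)

print(f"without this PR: {without_pr:.3e} tasks")
print(f"with this PR:    {with_pr:.3e} tasks")
```

Under these assumptions, removing 4 random bytes shrinks the estimate by a factor of `256 ** 2`, since the bound scales with the square root of the ID space.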

## Related issue number

## Checks

- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(
2021-11-08 21:00:40 +08:00
Linsong Chu
e189d8d4bc
[workflow] fix s3 storage path (#20115)
## Why are these changes needed?

This fixes two path-related issues when S3 is used as the storage backend:
1. A leading slash is added to the path due to the behavior of `parse.urlparse`.
2. When `step_id=""`, double slashes are added to the path.

Details are explained in https://github.com/ray-project/ray/issues/20114
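
The two path issues listed above can be reproduced with the standard library alone. The sketch below is an illustration, not the code changed in this PR: the bucket name, the file name, and the helpers `naive_key` and `s3_key` are all hypothetical.

```python
from urllib import parse


def naive_key(url: str, step_id: str, filename: str) -> str:
    """Join path parts without cleanup, reproducing both issues (hypothetical helper)."""
    parsed = parse.urlparse(url)
    # parsed.path keeps the leading slash, e.g. "/workflows".
    return "/".join([parsed.path, step_id, filename])


def s3_key(url: str, step_id: str, filename: str) -> str:
    """Build an object key with no leading slash and no empty segments (hypothetical helper)."""
    parsed = parse.urlparse(url)
    prefix = parsed.path.lstrip("/")  # issue 1: drop the leading slash
    parts = [p for p in (prefix, step_id, filename) if p]  # issue 2: skip an empty step_id
    return "/".join(parts)


url = "s3://my-bucket/workflows"  # assumed example URL
print(naive_key(url, "", "output.pkl"))
print(s3_key(url, "", "output.pkl"))
```

Filtering out empty segments before joining is one way to handle both cases in a single place; the actual fix may take a different approach.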

## Related issue number

https://github.com/ray-project/ray/issues/20114
https://github.com/ray-project/ray/issues/19027
2021-11-07 15:57:33 -08:00
dependabot[bot]
adf39941f4
[data](deps): Bump dask[complete] (#20125)
Bumps [dask[complete]](https://github.com/dask/dask) from 2021.9.1 to 2021.11.0.
- [Release notes](https://github.com/dask/dask/releases)
- [Changelog](https://github.com/dask/dask/blob/main/docs/release-procedure.md)
- [Commits](https://github.com/dask/dask/compare/2021.09.1...2021.11.0)

---
updated-dependencies:
- dependency-name: dask[complete]
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-11-07 11:55:39 -08:00
xwjiang2010
866fa9590f
[tune] clean up legacy branch in update_avail_resources. (#20071) 2021-11-05 10:28:46 -07:00
matthewdeng
78e9ff7c91
[train][datasets] add example for big data training (#20042)
* [train][datasets] add example for big data training

* add title docstring

* lint and dependencies

* add dask_ml requirement
2021-11-05 09:28:48 -07:00
Chen Shen
320f9dc234
[Core][CoreWorker] increase the default port range (#19541)
* increase the port range

* Update doc/source/configure.rst

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2021-11-05 09:25:44 -07:00
Alex Wu
146b3d6bcc
[scheduler] Include depth and function descriptor in scheduling class (#20004) 2021-11-05 08:19:48 -07:00
Simon Mo
3d5cbc6e62
[Serve] Fix HTTP error handling behavior and add tests (#20093) 2021-11-05 10:15:54 -05:00
SangBin Cho
8299aae918
[Placement Group] Add stats to pg scheduling (#19841)
* Add an e2e stats to pg scheduling

* Fix bugs.

* fix a bug.

* Revert "fix a bug."

This reverts commit dd7e03d1346fa39e54898effaaf8a2771103176e.

* done except unit tests.

* done except unit tests.

* Add unit tests.

* Address code review.

* done

* Fix

* done

* Fixed the test
2021-11-05 06:51:42 -07:00
Amog Kamsetty
adb8d77b2b
[Deps] Bump tensorflow on Docker image and add Codeowners (#20041) 2021-11-05 00:58:34 -07:00
dependabot[bot]
60e9737679
[tune](deps): Bump mlflow in /python/requirements/ml (#19913)
Bumps [mlflow](https://github.com/mlflow/mlflow) from 1.19.0 to 1.21.0.
- [Release notes](https://github.com/mlflow/mlflow/releases)
- [Changelog](https://github.com/mlflow/mlflow/blob/master/CHANGELOG.rst)
- [Commits](https://github.com/mlflow/mlflow/compare/v1.19.0...v1.21.0)

---
updated-dependencies:
- dependency-name: mlflow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-11-04 23:37:01 -07:00
dependabot[bot]
9897ee0eab
[tune](deps): Bump onnxruntime in /python/requirements/ml (#19666)
Bumps [onnxruntime](https://github.com/microsoft/onnxruntime) from 1.8.0 to 1.9.0.
- [Release notes](https://github.com/microsoft/onnxruntime/releases)
- [Changelog](https://github.com/microsoft/onnxruntime/blob/master/docs/ReleaseManagement.md)
- [Commits](https://github.com/microsoft/onnxruntime/compare/v1.8.0...v1.9.0)

---
updated-dependencies:
- dependency-name: onnxruntime
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2021-11-04 23:34:48 -07:00
dependabot[bot]
f214c4a4ab
[tune](deps): Bump datasets from 1.11.0 to 1.14.0 in /python/requirements/ml (#19645)
* [tune](deps): Bump datasets in /python/requirements/ml

Bumps [datasets](https://github.com/huggingface/datasets) from 1.11.0 to 1.14.0.
- [Release notes](https://github.com/huggingface/datasets/releases)
- [Commits](https://github.com/huggingface/datasets/compare/1.11.0...1.14.0)

---
updated-dependencies:
- dependency-name: datasets
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* Update requirements_tune.txt

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Amog Kamsetty <amogkam@users.noreply.github.com>
2021-11-04 23:33:55 -07:00
Clark Zinzow
6ade6f0be6
[Datasets] Multi-aggregations [1/3]: Add basic support for groupby multi-aggregations. (#20044) 2021-11-04 22:48:49 -07:00
mwtian
fb0ede38ba
[CI] [macOS] avoid installing latest setuptools (#20064) 2021-11-04 21:35:03 -07:00
architkulkarni
c5175073b2
[runtime env] Add garbage collection for conda envs (#20072) 2021-11-04 23:13:34 -05:00
Edward Oakes
360993612c
[serve] Remove lingering backend references (#20085) 2021-11-04 20:32:13 -05:00
Eric Liang
6102912494
Dataset doc updates (#19815) 2021-11-04 18:13:40 -07:00
SangBin Cho
44b38e9aa1
Add Chaos testing fixture + test actor tasks chaos test in CI (#19975)
* Basic CI tests done

* Fix an issue

* shutdown to conftest

* Addressed code review.
2021-11-04 16:27:35 -07:00
Simon Mo
4d583da7d5
[Serve] Add verbose log for nightly test only (#20088) 2021-11-04 16:15:22 -07:00
SangBin Cho
56bab61fba
[Placement group] Raise an exception when invalid resources are specified with the placement group. (#19680)
* done

* Make it work

* Fix issues

* done

* try

* done

* Fix remaining bugs.
2021-11-04 14:41:00 -07:00
Eric Liang
585d472fdf
Add configuration context to dataset (#19907) 2021-11-04 14:36:51 -07:00
Alex Wu
4ffb7ccfac
[scheduler][cleanup] Remove one cpu optimization (#20022)
* .

* remove test

* Update cluster_task_manager.cc

* Update cluster_task_manager.cc

* lint

* lint

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-11-04 14:18:13 -07:00
Edward Oakes
49d308138f
[serve] Rename backend_state -> deployment_state (#20040) 2021-11-04 15:46:45 -05:00
Philipp Moritz
a64e32c53b
[docs] Fix broken links in documentation and add linkcheck to documentation (#20030)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-11-04 13:19:43 -07:00