Commit graph

4594 commits

Author SHA1 Message Date
architkulkarni
8587f9d738
[Core] [runtime env] Fix conda/pip filepaths relative to working_dir (#16186) 2021-06-24 16:43:25 -05:00
architkulkarni
4637298d36
Delete conda env before creating to deflake test_runtime_env_complicated (#16628) 2021-06-24 12:13:26 -05:00
architkulkarni
e8c25a2fa4
[Core] [runtime env] Merge child's runtime_env["env_vars"] with that of parent (#16553) 2021-06-24 12:13:13 -05:00
Simon Mo
aabdfe2989
[Serve] Fix HTTP headers (#16647) 2021-06-24 11:59:43 -05:00
Amog Kamsetty
53d16365b0
[Release] Convert Horovod and SGD release tests (#15999) 2021-06-24 15:56:02 +01:00
Kai Fricke
ef97bdd407
[release] Fix app config: Install latest releases. Bump xgboost-ray version (#16581) 2021-06-24 12:56:21 +01:00
Gabriele Oliaro
3e2f608145
Work stealing! (#15475)
* work_stealing one commit squash

* using random task id to request workers

* inlining methods in direct_task_transport.h

* faster checking for presence of stealable tasks in RequestNewWorkerIfNeeded

* linting

* fixup! using random task id to request workers

* estimating number of tasks to steal based only on tasks in flight

* linting

* fixup! linting

* backup of changes

* fixed issue in scheduling queue test after merge

* linting

* redesigned work stealing. compiles but not tested

* all tests passing locally

* fixup! all tests passing locally

* fixup! fixup! all tests passing locally

* fixed big bug in StealTasksIfNeeded

* rev1

* rev2 (before removing the work_stealing param)

* removed work_stealing flag, fixed existing unit tests

* added unit tests; need to figure out how to assign distinct worker ids in GrantWorkerLease

* fixed work stealing test

* revisions, added two more unit/regression tests

* test
2021-06-23 17:08:28 -07:00
Frank Luan
9249287a36
Object spilling threshold (#16558)
* Object spilling threshold

* clang-format

* Make tests more lenient

* Fix tests

* Fix tests

* Address comments

* Fix tests lint

* Refactor

* Fix tests

* Fix cpp tests

* Address comments
2021-06-23 16:54:41 -07:00
SangBin Cho
f816f613c7
[Test] Handle flaky tests (#16602)
* Handle flaky tests.

* lint

* tag more

* add test_scheduling

* Remove global gc

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2021-06-23 16:24:12 -07:00
Amog Kamsetty
b9e5ca4c18
[tune] Deflake mnist_ptl_mini (#16555) 2021-06-23 14:26:40 -07:00
SangBin Cho
ccb02dacb6
Mark the global gc test unflaky (#16601) 2021-06-23 13:38:32 -07:00
architkulkarni
9cb65d5e2f
[Core] Move wheel URL utils from test_utils to utils (#16386) 2021-06-23 13:41:02 -05:00
chenk008
82d92d0d61
[Core]Use worker shim PID to check worker registration (#16398) 2021-06-22 21:12:53 -07:00
Kai Fricke
a1765ac627
[tune] move to local parameter registry for tune.with_parameters() (#16611) 2021-06-22 17:58:11 -07:00
Chris K. W
b4f2cbce02
[Client] Disconnect on dataclient error (#16588)
* disconnect when main thread finds dataclient shut down, update error messages

* Add test_dataclient_disconnect to small tests

* drop unused var

* add __main__ section to test

* avoid direct ray import

* rerun
2021-06-22 16:46:10 +03:00
Tao Wang
d1db4744e3
[large scale]Get next job id from gcs instead of redis - python part (#16528) 2021-06-22 14:06:30 +08:00
Stephanie Wang
e7b752cf33
[core] Fix bug in task dependency management for duplicate args (#16365)
* Pytest

* Skip on windows

* C++
2021-06-21 22:32:04 -07:00
SangBin Cho
5efeb5334b
Revert "Same worker id in python and c++ (#16568)" (#16600)
This reverts commit 9b5c0c32da.
2021-06-21 18:58:31 -07:00
Ian Rodney
d3832ab2e1
[Client] Fix gRPC Timeout Options (#16554) 2021-06-21 14:25:41 -07:00
Alex Wu
9b5c0c32da
Same worker id in python and c++ (#16568)
* .

* .

* test

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-21 13:22:52 -07:00
Siyuan (Ryans) Zhuang
b7995f66a4
[Workflow] Sync mode fault tolerance (#16282) 2021-06-21 10:05:27 -07:00
Qinghao Hu
d922a79385
[sgd] DataParallel after Apex init. (#15645)
* [FIX] DataParallel after Apex init.

* lint

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-06-20 22:44:15 -07:00
lanlin
e5b50fcc9d
[tune] allow to read trial results from json files in Analysis (#15915)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2021-06-20 20:41:48 -07:00
Dmitri Gekhtman
cb878b6514
[doc][kubernetes] K8s doc updates (#16570) 2021-06-20 19:38:34 -07:00
Eric Liang
a0da009645
Allocate inbound object chunks using CreateRequestQueue instead of immediate allocation (#16523) 2021-06-20 09:22:12 -07:00
Yorick van Zweeden
db7e2c8f21
Remove outdated code from PopulationBasedTrainingReplay (#16564)
Co-authored-by: Yorick van Zweeden <git@yorickvanzweeden.nl>
2021-06-20 15:22:52 +02:00
Amog Kamsetty
e6d9f0b393
[Dask] Support Dask 2021.06.1 (#16547) 2021-06-19 18:22:23 -07:00
Achal Shah
eadee8aba7
[docs] Update API docs for ray.init (#16533)
The incorrect indentation caused the docs to render weirdly:

https://docs.ray.io/en/master/package-ref.html
2021-06-18 18:02:44 -07:00
Alex Wu
319d4fb164
Job timestamp should always be in milliseconds (fixed) (#16548)
* .

* Revert "Revert "Job timestamp should always be in milliseconds (#16455)" (#16545)"

This reverts commit 5030ed8588.

* .

* .

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-18 17:07:21 -07:00
Amog Kamsetty
416cf3a2e7
Revert "Revert "Enable TryCreateImmediately to use the fallback allocation" (#16542)" (#16544)
This reverts commit 36fd741e6f.
2021-06-18 15:39:37 -07:00
Jiao
39cc81c633
[serve] Fix ray serve shutdown to properly go through controller (#16524) 2021-06-18 17:18:04 -05:00
architkulkarni
3ba1cb851e
[Core] [runtime env] Print message on driver when installing conda or pip (#16516) 2021-06-18 16:02:46 -05:00
Amog Kamsetty
e6fa8c0015
[Hotfix] [Dask] Fix Dask Pin (#16552)
* dask-pin-36

* fix
2021-06-18 13:31:50 -07:00
Amog Kamsetty
904232b4f8
[Dask] Pin dask version to 2021.06.0 (#16546) 2021-06-18 12:40:14 -07:00
Alex Wu
5030ed8588
Revert "Job timestamp should always be in milliseconds (#16455)" (#16545)
This reverts commit 1df19a04fe.
2021-06-18 12:37:05 -07:00
Amog Kamsetty
36fd741e6f
Revert "Enable TryCreateImmediately to use the fallback allocation" (#16542)
This reverts commit 41cf2e3d50.
2021-06-18 12:22:18 -07:00
Frank Luan
7588938e3c
Sorting benchmark (#16327)
* [WIP] Sorting benchmark

* Separate num_mappers and num_reducers

* Add tests

* Fix tests

* flake8

* flake8

* yapf

* Skip test on Windows

* Fix OS X test failure; test Windows again

* oops
2021-06-18 10:54:18 -07:00
Eric Liang
41cf2e3d50
Enable TryCreateImmediately to use the fallback allocation 2021-06-18 10:49:34 -07:00
architkulkarni
6498ca3995
[Core] [runtime env] Don't delete working_dir from runtime env (#16475) 2021-06-18 10:15:20 -05:00
Chris K. W
a2c842ee3c
[Client] Add separate error message if dataclient has disconnected before a request is sent (#16508)
* Add earlier error message

* Adjust error message
2021-06-18 08:06:25 -07:00
Kai Fricke
172d33be02
[tune] Use unbuffered training when checkpoint_at_end is used. (#16504) 2021-06-18 14:19:14 +01:00
Kai Fricke
e13f0a4d91
[tune] Add option to keep random values constant over grid search (#16501) 2021-06-18 11:30:27 +01:00
Chris K. W
c91a1b1f92
[Client] Add warnings when user schedules many tasks with ray client (#16454)
* Add warnings when user schedules many tasks with ray client

* add test_client_warnings to BUILD

* better variable names

* use util.debug.log_once()

* batching -> explanation of batching

* Switch to warnings.warn

* Add links to Ray Design Pattern doc with code snippets

* Cleaner linking and refer to sections directly

* Better testNoWarning

* add sys.exit(pytest.main(...))

* Update python/ray/util/client/worker.py

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* Update python/ray/util/client/worker.py

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>

* better error messages

* Switch links to new readthedocs sections

* Revert "Switch links to new readthedocs sections"

This reverts commit d3785bf50459d89fb3f13966a030e954799309a8.

Co-authored-by: Ameer Haj Ali <ameerh@berkeley.edu>
2021-06-18 13:17:37 +03:00
Alex Wu
6696c0c165
Revert "[Placement Group] Support infeasible placement groups for Placement Group. (#16188)" (#16509)
This reverts commit 7f91cfedd5.
2021-06-17 11:04:01 -07:00
architkulkarni
8d9a41af55
[Core] [runtime env] Merge actor/task's runtime env with JobConfig's runtime env (#16378) 2021-06-17 11:20:32 -05:00
Antoni Baum
f8e9f171df
[tune] Add add_evaluated_point method (#16485) 2021-06-17 11:30:48 +01:00
Kai Fricke
e547a27944
[tune] Track live trials in a set in the TrialRunner to reduce linear scans (#15811) 2021-06-17 01:36:07 -07:00
Alex Wu
1df19a04fe
Job timestamp should always be in milliseconds (#16455)
* .

* .

* .

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-17 00:05:55 -07:00
DK.Pino
7f91cfedd5
[Placement Group] Support infeasible placement groups for Placement Group. (#16188)
* init

* update comment

* update logical

* ut failing

* compile passing

* add ut

* lint

* fix comment

* lint

* fix ut and typo

* fix ut and typo

* lint

* typo
2021-06-16 21:48:39 -07:00
Alex Wu
45357ff590
[core] Fix multi-node placement group/job config bugs (#16345)
* .

* .

* seems to work?

* seems to work?

* .

* implement delete

* implement delete

* .

* tests

* .

* .

* .

* fix

* .

* .

* .

* .

* fix

* fix

* bump timeout

* bump timeout

* .

Co-authored-by: Alex Wu <alex@anyscale.com>
2021-06-16 21:12:20 -07:00