Commit graph

9876 commits

Author SHA1 Message Date
Qing Wang
1047914ee0
[Java] Skip javadoc when deploying. (#19428) 2021-10-17 15:21:13 +08:00
Hao Zhang
c96c2e9b5f
[Collective] Enhance the collective group GC a bit (#19402) 2021-10-15 18:47:54 -07:00
Yi Cheng
a3dc07b1ee
[core] Fix some legacy issues (#19392)
## Why are these changes needed?
There are some issues left from previous PRs.

- Put the gcs_actor_scheduler_mock_test back
- Add comment for named actor creation behavior
- Fix the comment for some flags. 

## Related issue number
2021-10-15 18:06:01 -07:00
Chen Shen
a9c34d55e3
Throw if infinite (#19418) 2021-10-15 18:01:53 -07:00
Gagandeep Singh
d226cbf21a
Added StartupToken to identify a process at startup (#19014)
* Added StartupToken to identify a process at startup

* Applied linting formats

* Addressed reviews

* Fixing worker_pool_test

* Fixed worker_pool_test

* Applied linting formatting

* Added documentation for StartupToken

* Fixed linting

* Reordered initialisation of WorkerPool members

* Fixed Python docs

* Fixing bugs in cluster_mode_test

* Fixing Java tests

* Create and set shim process after verifying startup_token

* shim_process.GetId() -> worker_shim_pid

* Improvements in startup token and modifying java files

* update io_ray_runtime_RayNativeRuntime.h

* Fixed java tests by adding startup-token to conf

* Applied linting

* Increased arg count for startup_token

* Attempt to fix streaming tests

* Type correction

* applied linting

* Corrected index of startup token arg

* Modified mock_worker.cc to accept startup tokens

* Applied linting

* Applied linting changes from CI

* Removed override from worker.h

* Applied linting from scripts/format.sh

* Addressed reviews and applied scripts/format.sh

* Applied linting script from ci/travis

* Removed unrequired methods from public scope

* Applied linting
2021-10-15 15:13:13 -07:00
Chen Shen
acfbf4c170
Fix from Dask bug in Datasets (#19409) 2021-10-15 15:04:52 -07:00
Gagandeep Singh
07064cddf9
Re-enabling tests from test_basic (#19384)
Why are these changes needed?
Related issue number
#19177

Quoting #19177 (comment) here,

The following tests fail when not skipped,

=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic.py::test_user_setup_function - subprocess.CalledProcessErro...
FAILED python\ray\tests\test_basic.py::test_disable_cuda_devices - subprocess.CalledProcessErr...
FAILED python\ray\tests\test_basic.py::test_wait_timing - assert (1634209333.6099107 - 1634209...

Results (395.22s):
      36 passed
       3 failed
         - ray\tests/test_basic.py:197 test_user_setup_function
         - ray\tests/test_basic.py:220 test_disable_cuda_devices
         - ray\tests/test_basic.py:265 test_wait_timing
=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic_3.py::test_fair_queueing - AssertionError: 23

Results (198.33s):
       1 failed
         - ray\tests/test_basic_3.py:169 test_fair_queueing
The following test passed when not skipped. Opening a PR to verify that.

def test_oversized_function(ray_start_shared_local_modes)
2021-10-15 14:02:57 -07:00
Kai Fricke
bb38c5cb1f
[tune] Fix result buffering case check (fixes bug introduced in #19140) (#19399) 2021-10-15 10:43:34 +01:00
Siyuan (Ryans) Zhuang
0d4b0ded27
[Serialization] Update cloudpickle to v2.0.0 (#19383)
* update cloudpickle to v2.0.0
2021-10-15 02:37:29 -07:00
Hao Zhang
4b92f34ada
[Collective] Remove an unnecessary cuda.stream.synchronize (#19400) 2021-10-14 21:33:59 -07:00
SangBin Cho
9bfe43198f
Use cleaner code for the map (#19386) 2021-10-14 21:18:42 -07:00
Matti Picus
f372bb07aa
Enable dashboard on Windows (#19319) 2021-10-14 14:42:22 -07:00
Kai Fricke
e17b23fa5b
[ci/release] Add support for RAY_WHEELS url (#19364) 2021-10-14 21:40:01 +01:00
architkulkarni
b3ccec5d76
[runtime_env] Fix bug when all working_dir contents are excluded with Ray Client (#19377) 2021-10-14 11:20:45 -07:00
Carlo Grisetti
30fe93d285
[Windows] Use correct interpreter and fix prometheus atomic file rename (#19171) 2021-10-14 10:29:21 -07:00
Kai Fricke
e07d0953ea
[ci/release] Undo faulty change to many_ppo num_samples (#19388) 2021-10-14 10:27:31 -07:00
Eric Liang
13d4ad6100
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00
SangBin Cho
4edb3c4746
[Test] Add complicated threaded actor tests (#19374)
Why are these changes needed?
There are only 2 simple threaded actor tests in the Ray repo. This PR adds more complicated threaded actor tests to make sure threaded actors are well tested.

The third test prints a lot of

(pid=42032) [2021-10-13 19:02:36,102 E 42032 10779969] core_worker.cc:270: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
which was the bug @scv119 fixed. Maybe we can start debugging this to determine when it happens and fix the real shutdown bugs.

2021-10-14 09:06:11 -07:00
Antoni Baum
e9df253f5d
[CI/docs] Remove [default] from xgboost-ray (#19186)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:29:55 +01:00
Kai Fricke
9cee83c919
[tune] PBT: Add burn-in period (#19321) 2021-10-14 16:28:29 +01:00
Edward Oakes
888fb24c25
Remove deprecated ray.services package (#18475)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:28:16 +01:00
Kai Fricke
312dc369a7
Revert "[Hotfix] Revert "[tune/wip] Exclude trial checkpoints in experiment sync"" (#19285)
This reverts commit a92f1fedf4.
and fixes the failing test
2021-10-14 11:18:48 +01:00
Qing Wang
2cc164e616
[Java] Fix incomplete core worker dynamic library. (#19342)
* Fix incomplete core worker dynamic library.

* Fix lint.
2021-10-14 14:42:05 +08:00
mwtian
12100015d9
[Lint] Disable modernize-use-override (#19368)
This lint rule cannot apply only to changed lines because currently Ray has `-Winconsistent-missing-override` as a build flag. Either all or none of the member functions from a derived class can have the `override` / `final` annotation.
2021-10-13 20:20:08 -07:00
Carlo Grisetti
5cee8a1985
[release tests] Switch from yaml.load to yaml.safe_load (#19365) 2021-10-13 17:27:25 -07:00
Edward Oakes
2ac81f336a
[serve] Remove BackendConfig broadcasting (#19154) 2021-10-13 16:25:34 -07:00
Chen Shen
b8c201b7cb
[Core][CoreWorker] Make WorkerContext thread safe, fix race condition. (#19343)
Why are these changes needed?
The theory around #19270 is that retry logic sends two create-actor requests to the same threaded actor. Specifically:

1. The first request arrives and calls CoreWorkerDirectTaskReceiver::HandleTask; it is queued for execution by the thread pool.
2. The second request arrives and calls CoreWorkerDirectTaskReceiver::HandleTask again, before the first request has been executed and has called worker_context_.SetCurrentTask.
3. This defeats the current dedupe logic and causes SetMaxActorConcurrency to be called twice, which fails the RAY_CHECK.

In this PR, we fix the dedupe logic by adding SetCurrentActorId and calling it in the task execution thread. This ensures the dedupe logic works for threaded actors.

We also noticed that WorkerContext is not actually thread safe in threaded actors, so this PR makes it thread safe as well.
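
The race described above can be illustrated with a minimal Python sketch. It is not the Ray C++ implementation: `WorkerContext`, `try_claim_actor`, `set_max_concurrency`, and `handle_create_actor` are hypothetical names chosen for illustration. The assumption is that a duplicate create request is dropped when the actor ID has already been recorded, and that the check and the write happen atomically under one lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class WorkerContext:
    """Hypothetical stand-in for a worker context shared across threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._current_actor_id = None
        self.concurrency_set_count = 0

    def try_claim_actor(self, actor_id):
        """Atomically record the actor ID; return False if one is already set."""
        with self._lock:
            if self._current_actor_id is not None:
                return False
            self._current_actor_id = actor_id
            return True

    def set_max_concurrency(self):
        """Must run once per actor; counts calls so the demo can check that."""
        with self._lock:
            self.concurrency_set_count += 1


def handle_create_actor(ctx, actor_id):
    """Runs on the execution thread; duplicate requests are dropped here."""
    if not ctx.try_claim_actor(actor_id):
        return "duplicate"
    ctx.set_max_concurrency()
    return "created"


def run_demo(num_requests=8):
    """Submit several retried create requests for the same actor."""
    ctx = WorkerContext()
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(
            pool.map(lambda _: handle_create_actor(ctx, "actor-1"),
                     range(num_requests))
        )
    return ctx, results
```

Because the check and the write share one critical section, exactly one request wins regardless of how the thread pool interleaves them.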

Related issue number
Closes #19270

2021-10-13 16:12:36 -07:00
Linsong Chu
b86a5fcb96
[workflow] fix workflow user metadata return when None is given (#19356)
## Why are these changes needed?

Quick fix for metadata put. Currently, when workflow-level metadata is not given, `null` is written to `user_run_metadata.json`; with this fix, `{}` is written instead.
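
A minimal sketch of the behavior change, using only the standard `json` module. `dump_user_metadata` is a hypothetical helper and not the function in the Ray workflow code; it only shows why serializing `None` yields `null` and how defaulting to an empty dict yields `{}`.

```python
import json


def dump_user_metadata(metadata=None):
    """Serialize user metadata, treating a missing value as an empty dict."""
    # json.dumps(None) would produce "null"; default to {} instead.
    return json.dumps({} if metadata is None else metadata)
```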
## Related issue number

original issue: https://github.com/ray-project/ray/issues/17090
original PR: https://github.com/ray-project/ray/pull/19195
2021-10-13 15:52:12 -07:00
Yi Cheng
1dc03cd49d
[nightly] Put many nodes actor test back (#19313)
## Why are these changes needed?
There are two issues fixed in this PR:
- make sure wait for session count alive node
- upgrade the machine to match what's tested in oss ray.

## Related issue number
https://github.com/ray-project/ray/issues/19084
2021-10-13 15:51:12 -07:00
matthewdeng
d998373968
[release] fix test by pinning filelock (#19334)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-13 22:27:04 +01:00
architkulkarni
b0716f66ae
[runtime env] Fix handling of runtime env with None fields (#19300) 2021-10-13 13:57:55 -07:00
Jiao
893f76daf9
[serve] Add serve FT nightly test to buildkite (#19361) 2021-10-13 13:56:55 -07:00
Antoni Baum
3cb0862152
Fix double gym in requirements (#19357) 2021-10-13 21:43:41 +01:00
Omkar Pangarkar
f1b9b16ae9
[tune] Fix DistributedTrainable restore (#19349) 2021-10-13 21:29:05 +01:00
Carlo Grisetti
da7a485786
[Windows] use dynamic temp path (#19096) 2021-10-13 13:02:45 -04:00
hazeone
c2f0035fd2
[Java]Support getGpuIds API (#19031)
Add a Java getGpuIds() API, which is the same as get_gpu_ids in Python. We can get the deviceId if we've allocated a GPU to a worker.
2021-10-13 23:40:26 +08:00
Kai Fricke
bde9e058da
Revert "[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib (#18183)" (#19351)
This reverts commit 74ee99ff99.
2021-10-13 13:06:36 +01:00
Linsong Chu
ce64e6dc45
[workflow] add metadata put in workflow (#19195)
## Why are these changes needed?

Add metadata to workflow.  Currently there is no option for the user to attach any metadata to a step or workflow run, and workflow running metrics (except status) are neither captured nor checkpointed.

We are adding various kinds of metadata, including:

1. step-level user metadata.  can be set with `step.options(metadata={})`
2. step-level pre-run metadata.  this captures pre-run metadata such as step_start_time, more metrics can be added later.
3. step-level post-run metadata.  this captures post-run metadata such as step_end_time, more metrics can be added later.
4. workflow-level user metadata. can be set with  `workflow.run(metadata={})`
5. workflow-level pre-run metadata.  this captures pre-run metadata such as workflow_start_time, more metrics can be added later.
6. workflow-level post-run metadata.  this captures post-run metadata such as workflow_end_time, more metrics can be added later.
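
The three step-level kinds of metadata listed above could be collected roughly as follows. This is a plain-Python sketch that does not use the Ray workflow API: `run_step_with_metadata` and the dictionary keys other than `step_start_time` and `step_end_time` are assumptions made for illustration.

```python
import time


def run_step_with_metadata(step_fn, user_metadata=None):
    """Run a step and return its result with user, pre-run and post-run metadata."""
    pre_run = {"step_start_time": time.time()}   # captured before the step runs
    result = step_fn()
    post_run = {"step_end_time": time.time()}    # captured after the step returns
    metadata = {
        "user_metadata": user_metadata or {},    # empty dict when none is given
        "pre_run_metadata": pre_run,
        "post_run_metadata": post_run,
    }
    return result, metadata
```

The same wrapper shape would apply at the workflow level, with `workflow_start_time` and `workflow_end_time` in place of the step timestamps.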

## Related issue number

https://github.com/ray-project/ray/issues/17090

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-10-12 21:01:24 -07:00
Clark Zinzow
1b179adfa1
[Core] [Hotfix] Handle logging redirected to stdout when configuring log file (#19301) 2021-10-12 19:03:21 -07:00
SangBin Cho
84118c9659
Revert "Revert "[Placement Group] Fix the high load bug from the plac… (#19330) 2021-10-12 19:02:30 -07:00
Eric Liang
430a5f4a21
[doc] Bump dataset to beta for 1.8 and add backlink to SGD (#19332) 2021-10-12 18:32:29 -07:00
Clark Zinzow
df6d06bd41
Fix for LazyBlockList refactor. (#19333) 2021-10-12 18:18:45 -07:00
Jasha10
53e791d136
[Docs] Fix Typo in walkthrough (#19335)
There is one backtick too many in walkthrough.rst, which causes a formatting issue.
2021-10-12 17:47:28 -07:00
Amog Kamsetty
09d8049584
[SGD] Make actor creation async (#19325)
* fix

* fix

* fix
2021-10-12 16:15:59 -07:00
Jiajun Yao
d99b095eac
Set default max_pending_lease_requests_per_scheduling_category to 1 (#19328) 2021-10-12 15:59:32 -07:00
Eric Liang
9f1cd9e867
[docs] Document fake multi-node autoscaler (#19329) 2021-10-12 15:59:07 -07:00
Amog Kamsetty
f6f2435b91
[SGD] Sgd v2 Dataset Integration (#17626)
* wip

* wip

* wip

* draft

* disable tf autosharding

* wip

* wip

* wip

* wip

* add example

* wip

* wip

* wip

* use dataset.split

* add unit tests

* add linear example

* concatenate tensors and fix example

* WIP tune example

* add tensorflow example

* wip

* random_shuffle_each_window

* fault tolerance test

* GPU, examples, CI

* formatting

* fix

* Update python/ray/util/sgd/v2/tests/test_trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* wip

* type hints

* wip

* update user guide

* fix

* fix immediate issues

* update example

* update

* fix tune gpu test

* fix resources for smoke test - 1 CPU for dataset tasks

* update tests, docs, examples

* Apply suggestions from code review

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* address comments

* add warning

* fix tests

* minor doc updates

* update example in doc

* configure tests

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* Update python/ray/data/dataset.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* fix docstring

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-10-12 14:03:10 -07:00
Carlo Grisetti
7651cc782a
Change prometheus warning filename source (#19275)
* Change prometheus warning filename source

* Fix linting
2021-10-12 14:02:51 -07:00
Eric Liang
8c152bd17c
Revert "[Placement Group] Fix the high load bug from the placement group (#19277)" (#19327)
This reverts commit 4360b99803.
2021-10-12 12:41:51 -07:00
Akash Patel
b897b7b3be
add missing <memory> include (#19083) 2021-10-12 12:03:07 -07:00