Guyang Song
c04fb62f1d
[C++ worker] set native library path for shared library search ( #19376 )
2021-10-18 16:03:49 +08:00
Hao Zhang
c96c2e9b5f
[Collective] Enhance the collective group GC a bit ( #19402 )
2021-10-15 18:47:54 -07:00
Chen Shen
a9c34d55e3
Throw if infinite ( #19418 )
2021-10-15 18:01:53 -07:00
Gagandeep Singh
d226cbf21a
Added StartupToken to identify a process at startup ( #19014 )
...
* Added StartupToken to identify a process at startup
* Applied linting formats
* Addressed reviews
* Fixing worker_pool_test
* Fixed worker_pool_test
* Applied linting formatting
* Added documentation for StartupToken
* Fixed linting
* Reordered initialisation of WorkerPool members
* Fixed Python docs
* Fixing bugs in cluster_mode_test
* Fixing Java tests
* Create and set shim process after verifying startup_token
* shim_process.GetId() -> worker_shim_pid
* Improvements in startup token and modifying java files
* update io_ray_runtime_RayNativeRuntime.h
* Fixed java tests by adding startup-token to conf
* Applied linting
* Increased arg count for startup_token
* Attempt to fix streaming tests
* Type correction
* applied linting
* Corrected index of startup token arg
* Modified, mock_worker.cc to accept startup tokens
* Applied linting
* Applied linting changes from CI
* Removed override from worker.h
* Applied linting from scripts/format.sh
* Addressed reviews and applied scripts/format.sh
* Applied linting script from ci/travis
* Removed unrequired methods from public scope
* Applied linting
2021-10-15 15:13:13 -07:00
Chen Shen
acfbf4c170
Fix from Dask bug in Datasets ( #19409 )
2021-10-15 15:04:52 -07:00
Gagandeep Singh
07064cddf9
Re-enabling tests from test_basic ( #19384 )
...
Why are these changes needed?
Related issue number
#19177
Quoting #19177 (comment) here,
The following tests fail when not skipped,
=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic.py::test_user_setup_function - subprocess.CalledProcessErro...
FAILED python\ray\tests\test_basic.py::test_disable_cuda_devices - subprocess.CalledProcessErr...
FAILED python\ray\tests\test_basic.py::test_wait_timing - assert (1634209333.6099107 - 1634209...
Results (395.22s):
36 passed
3 failed
- ray\tests/test_basic.py:197 test_user_setup_function
- ray\tests/test_basic.py:220 test_disable_cuda_devices
- ray\tests/test_basic.py:265 test_wait_timing
=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic_3.py::test_fair_queueing - AssertionError: 23
Results (198.33s):
1 failed
- ray\tests/test_basic_3.py:169 test_fair_queueing
The following test passed when not skipped. Opening a PR to verify that.
def test_oversized_function(ray_start_shared_local_modes)
2021-10-15 14:02:57 -07:00
Kai Fricke
bb38c5cb1f
[tune] Fix result buffering case check (fixes bug introduced in #19140 ) ( #19399 )
2021-10-15 10:43:34 +01:00
Siyuan (Ryans) Zhuang
0d4b0ded27
[Serialization] Update cloudpickle to v2.0.0 ( #19383 )
...
* update cloudpickle to v2.0.0
2021-10-15 02:37:29 -07:00
Hao Zhang
4b92f34ada
[Collective] Remove an unnecessary cuda.stream.synchronize ( #19400 )
2021-10-14 21:33:59 -07:00
Matti Picus
f372bb07aa
Enable dashboard on Windows ( #19319 )
2021-10-14 14:42:22 -07:00
architkulkarni
b3ccec5d76
[runtime_env] Fix bug when all working_dir contents are excluded with Ray Client ( #19377 )
2021-10-14 11:20:45 -07:00
Carlo Grisetti
30fe93d285
[Windows] Use correct interpreter and fix prometheus atomic file rename ( #19171 )
2021-10-14 10:29:21 -07:00
Eric Liang
13d4ad6100
[data] Preserve epoch by default when using rewindow() ( #19359 )
2021-10-14 09:17:36 -07:00
SangBin Cho
4edb3c4746
[Test] Add complicated threaded actor tests ( #19374 )
...
Why are these changes needed?
There are only 2 simple threaded actor tests in the Ray repo. This PR adds more complicated threaded actor tests to make sure threaded actors are well tested.
The third test prints many occurrences of
(pid=42032) [2021-10-13 19:02:36,102 E 42032 10779969] core_worker.cc:270: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
which was the bug @scv119 fixed. We could start debugging this to determine when it happens and fix the real shutdown bugs.
2021-10-14 09:06:11 -07:00
Antoni Baum
e9df253f5d
[CI/docs] Remove [default] from xgboost-ray ( #19186 )
...
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:29:55 +01:00
Kai Fricke
9cee83c919
[tune] PBT: Add burn-in period ( #19321 )
2021-10-14 16:28:29 +01:00
Edward Oakes
888fb24c25
Remove deprecated ray.services package ( #18475 )
...
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:28:16 +01:00
Kai Fricke
312dc369a7
Revert "[Hotfix] Revert "[tune/wip] Exclude trial checkpoints in experiment sync"" ( #19285 )
...
This reverts commit a92f1fedf4 and fixes the failing test.
2021-10-14 11:18:48 +01:00
Edward Oakes
2ac81f336a
[serve] Remove BackendConfig broadcasting ( #19154 )
2021-10-13 16:25:34 -07:00
Linsong Chu
b86a5fcb96
[workflow] fix workflow user metadata return when None is given ( #19356 )
...
## Why are these changes needed?
Quick fix for metadata put. Currently, when workflow-level metadata is not given, `null` is written to `user_run_metadata.json`. With this fix, `{}` is written instead.
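The difference between the two outputs can be reproduced with the standard `json` module alone. The following is a minimal sketch, not the code from this PR; the helper name `dump_user_metadata` is hypothetical and only illustrates defaulting a missing value to an empty dict before serialization.

```python
import json


def dump_user_metadata(metadata=None):
    """Serialize user metadata, writing an empty object when none is given.

    Hypothetical helper for illustration; not part of the Ray workflow API.
    """
    if metadata is None:
        # Without this default, json.dumps(None) would produce "null".
        metadata = {}
    return json.dumps(metadata)


# Serializing None directly yields the JSON literal null.
print(json.dumps(None))
# Defaulting to an empty dict yields an empty JSON object instead.
print(dump_user_metadata())
```

Consumers that expect a JSON object can then read the file without special-casing a `null` payload.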
## Related issue number
original issue: https://github.com/ray-project/ray/issues/17090
original PR: https://github.com/ray-project/ray/pull/19195
2021-10-13 15:52:12 -07:00
architkulkarni
b0716f66ae
[runtime env] Fix handling of runtime env with None fields ( #19300 )
2021-10-13 13:57:55 -07:00
Antoni Baum
3cb0862152
Fix double gym in requirements ( #19357 )
2021-10-13 21:43:41 +01:00
Omkar Pangarkar
f1b9b16ae9
[tune] Fix DistributedTrainable restore ( #19349 )
2021-10-13 21:29:05 +01:00
Carlo Grisetti
da7a485786
[Windows] use dynamic temp path ( #19096 )
2021-10-13 13:02:45 -04:00
Kai Fricke
bde9e058da
Revert "[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib ( #18183 )" ( #19351 )
...
This reverts commit 74ee99ff99.
2021-10-13 13:06:36 +01:00
Linsong Chu
ce64e6dc45
[workflow] add metadata put in workflow ( #19195 )
...
## Why are these changes needed?
Add metadata to workflow. Currently there is no option for users to attach any metadata to a step or workflow run, and workflow running metrics (except status) are neither captured nor checkpointed.
We are adding several kinds of metadata, including:
1. step-level user metadata. can be set with `step.options(metadata={})`
2. step-level pre-run metadata. this captures pre-run metadata such as step_start_time, more metrics can be added later.
3. step-level post-run metadata. this captures post-run metadata such as step_end_time, more metrics can be added later.
4. workflow-level user metadata. can be set with `workflow.run(metadata={})`
5. workflow-level pre-run metadata. this captures pre-run metadata such as workflow_start_time, more metrics can be added later.
6. workflow-level post-run metadata. this captures post-run metadata such as workflow_end_time, more metrics can be added later.
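The three step-level kinds of metadata listed above can be pictured as one record per step. This is a rough sketch under stated assumptions, not the implementation from this PR: the function `run_step_with_metadata` and the record keys `user_metadata`, `pre_run_metadata`, and `post_run_metadata` are hypothetical, and only the field names `step_start_time` and `step_end_time` come from the description.

```python
import time


def run_step_with_metadata(func, user_metadata=None):
    """Run a step and collect user, pre-run, and post-run metadata.

    Hypothetical helper for illustration; the real workflow code may
    store and checkpoint these values differently.
    """
    record = {
        # User-supplied metadata, defaulting to an empty dict.
        "user_metadata": user_metadata if user_metadata is not None else {},
        # Captured immediately before the step body runs.
        "pre_run_metadata": {"step_start_time": time.time()},
    }
    result = func()
    # Captured immediately after the step body returns.
    record["post_run_metadata"] = {"step_end_time": time.time()}
    return result, record


result, record = run_step_with_metadata(lambda: 42, {"owner": "alice"})
print(result, sorted(record))
```

The workflow-level metadata would follow the same shape, with `workflow_start_time` and `workflow_end_time` in place of the step timestamps.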
## Related issue number
https://github.com/ray-project/ray/issues/17090
Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-10-12 21:01:24 -07:00
Clark Zinzow
1b179adfa1
[Core] [Hotfix] Handle logging redirected to stdout when configuring log file ( #19301 )
2021-10-12 19:03:21 -07:00
SangBin Cho
84118c9659
Revert "Revert "[Placement Group] Fix the high load bug from the plac… ( #19330 )
2021-10-12 19:02:30 -07:00
Clark Zinzow
df6d06bd41
Fix for LazyBlockList refactor. ( #19333 )
2021-10-12 18:18:45 -07:00
Amog Kamsetty
09d8049584
[SGD] Make actor creation async ( #19325 )
...
* fix
* fix
* fix
2021-10-12 16:15:59 -07:00
Eric Liang
9f1cd9e867
[docs] Document fake multi-node autoscaler ( #19329 )
2021-10-12 15:59:07 -07:00
Amog Kamsetty
f6f2435b91
[SGD] Sgd v2 Dataset Integration ( #17626 )
...
* wip
* wip
* wip
* draft
* disable tf autosharding
* wip
* wip
* wip
* wip
* add example
* wip
* wip
* wip
* use dataset.split
* add unit tests
* add linear example
* concatenate tensors and fix example
* WIP tune example
* add tensorflow example
* wip
* random_shuffle_each_window
* fault tolerance test
* GPU, examples, CI
* formatting
* fix
* Update python/ray/util/sgd/v2/tests/test_trainer.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* wip
* type hints
* wip
* update user guide
* fix
* fix immediate issues
* update example
* update
* fix tune gpu test
* fix resources for smoke test - 1 CPU for dataset tasks
* update tests, docs, examples
* Apply suggestions from code review
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
* address comments
* add warning
* fix tests
* minor doc updates
* update example in doc
* configure tests
* Update doc/source/raysgd/v2/user_guide.rst
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
* Update python/ray/data/dataset.py
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
* fix docstring
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-10-12 14:03:10 -07:00
Carlo Grisetti
7651cc782a
Change prometheus warning filename source ( #19275 )
...
* Change prometheus warning filename source
* Fix linting
2021-10-12 14:02:51 -07:00
Eric Liang
8c152bd17c
Revert "[Placement Group] Fix the high load bug from the placement group ( #19277 )" ( #19327 )
...
This reverts commit 4360b99803.
2021-10-12 12:41:51 -07:00
Lixin Wei
f2f9c749cb
[Build] Add an Option to Skip Bazel Build ( #19265 )
2021-10-12 12:01:58 -07:00
Eric Liang
0ab6749602
Support iter_epochs for Datasets ( #19217 )
2021-10-12 11:05:00 -07:00
SangBin Cho
4360b99803
[Placement Group] Fix the high load bug from the placement group ( #19277 )
2021-10-12 11:04:14 -07:00
Clark Zinzow
6ca3c02041
[Datasets] Parallelize Parquet metadata fetches. ( #19211 )
2021-10-12 11:02:30 -07:00
dependabot[bot]
74ee99ff99
[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib ( #18183 )
...
* [RLlib](deps): Bump tensorflow in /python/requirements/rllib
Bumps [tensorflow](https://github.com/tensorflow/tensorflow ) from 2.5.0 to 2.6.0.
- [Release notes](https://github.com/tensorflow/tensorflow/releases )
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md )
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.0...v2.6.0 )
---
updated-dependencies:
- dependency-name: tensorflow
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* wip.
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-12 17:56:36 +02:00
SangBin Cho
2c93708324
Migrating to flat hash map [Raylet] ( #19220 )
...
* done
* Fix all unit tests
* done
* .
* Fix the build issue
* fix the compilation bug
2021-10-12 07:41:51 -07:00
Wansoo Kim
0f6d4661d7
[tune] Port all MNIST examples to specify data_dir ( #19033 )
...
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-12 15:36:06 +01:00
gjoliver
5d14904b9b
[Tune] catch HTTPError when logging to wandb. ( #19314 )
2021-10-12 14:38:17 +01:00
Kai Fricke
d8d8901192
[ci/tune] Remove deprecated jenkins_only tag from test tags ( #19287 )
2021-10-12 10:05:46 +01:00
Chris K. W
35230ea9fa
[client] deflake test_stdout_log_stream ( #19232 )
...
* deflake test_stdout_log_stream
* add assert message
2021-10-11 22:22:39 -07:00
architkulkarni
cc16e8f8c5
[runtime env] Validate "excludes" field ( #19302 )
2021-10-11 20:05:22 -07:00
Jiao
85b8a6de5f
[Serve] Add nightly test for Serve failure recovery ( #19125 )
2021-10-11 18:33:20 -07:00
Carlo Grisetti
c2377fb725
[Serve] Call without loop parameter if python 3.10+ ( #19298 )
2021-10-11 18:31:13 -07:00
Eric Liang
6cacc54774
[RFC] Fake multi-node mode for autoscaler ( #18987 )
2021-10-11 18:27:29 -07:00
SangBin Cho
0d7a7a06c0
[Placement group] Warm up the cluster before running the unit test #19286 ( #19286 )
2021-10-11 16:26:52 -07:00
Carlo Grisetti
2d0355548e
[Dashboard] Try to work around aiohttp 4.0.0 breaking changes ( #19120 )
2021-10-11 16:25:52 -07:00