5294 commits

Author SHA1 Message Date
Guyang Song
c04fb62f1d
[C++ worker] set native library path for shared library search (#19376) 2021-10-18 16:03:49 +08:00
Hao Zhang
c96c2e9b5f
[Collective] Enhance the collective group GC a bit (#19402) 2021-10-15 18:47:54 -07:00
Chen Shen
a9c34d55e3
Throw if infinite (#19418) 2021-10-15 18:01:53 -07:00
Gagandeep Singh
d226cbf21a
Added StartupToken to identify a process at startup (#19014)
* Added StartupToken to identify a process at startup

* Applied linting formats

* Addressed reviews

* Fixing worker_pool_test

* Fixed worker_pool_test

* Applied linting formatting

* Added documentation for StartupToken

* Fixed linting

* Reordered initialisation of WorkerPool members

* Fixed Python docs

* Fixing bugs in cluster_mode_test

* Fixing Java tests

* Create and set shim process after verifying startup_token

* shim_process.GetId() -> worker_shim_pid

* Improvements in startup token and modifying java files

* update io_ray_runtime_RayNativeRuntime.h

* Fixed java tests by adding startup-token to conf

* Applied linting

* Increased arg count for startup_token

* Attempt to fix streaming tests

* Type correction

* applied linting

* Corrected index of startup token arg

* Modified mock_worker.cc to accept startup tokens

* Applied linting

* Applied linting changes from CI

* Removed override from worker.h

* Applied linting from scripts/format.sh

* Addressed reviews and applied scripts/format.sh

* Applied linting script from ci/travis

* Removed unrequired methods from public scope

* Applied linting
2021-10-15 15:13:13 -07:00
Chen Shen
acfbf4c170
Fix from Dask bug in Datasets (#19409) 2021-10-15 15:04:52 -07:00
Gagandeep Singh
07064cddf9
Re-enabling tests from test_basic (#19384)
Why are these changes needed?
Related issue number
#19177

Quoting #19177 (comment) here,

The following tests fail when not skipped,

=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic.py::test_user_setup_function - subprocess.CalledProcessErro...
FAILED python\ray\tests\test_basic.py::test_disable_cuda_devices - subprocess.CalledProcessErr...
FAILED python\ray\tests\test_basic.py::test_wait_timing - assert (1634209333.6099107 - 1634209...

Results (395.22s):
      36 passed
       3 failed
         - ray\tests/test_basic.py:197 test_user_setup_function
         - ray\tests/test_basic.py:220 test_disable_cuda_devices
         - ray\tests/test_basic.py:265 test_wait_timing
=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic_3.py::test_fair_queueing - AssertionError: 23

Results (198.33s):
       1 failed
         - ray\tests/test_basic_3.py:169 test_fair_queueing
The following test passed when not skipped. Opening a PR to verify that.

def test_oversized_function(ray_start_shared_local_modes)
2021-10-15 14:02:57 -07:00
Kai Fricke
bb38c5cb1f
[tune] Fix result buffering case check (fixes bug introduced in #19140) (#19399) 2021-10-15 10:43:34 +01:00
Siyuan (Ryans) Zhuang
0d4b0ded27
[Serialization] Update cloudpickle to v2.0.0 (#19383)
* update cloudpickle to v2.0.0
2021-10-15 02:37:29 -07:00
Hao Zhang
4b92f34ada
[Collective] Remove an unnecessary cuda.stream.synchronize (#19400) 2021-10-14 21:33:59 -07:00
Matti Picus
f372bb07aa
Enable dashboard on Windows (#19319) 2021-10-14 14:42:22 -07:00
architkulkarni
b3ccec5d76
[runtime_env] Fix bug when all working_dir contents are excluded with Ray Client (#19377) 2021-10-14 11:20:45 -07:00
Carlo Grisetti
30fe93d285
[Windows] Use correct interpreter and fix prometheus atomic file rename (#19171) 2021-10-14 10:29:21 -07:00
Eric Liang
13d4ad6100
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00
SangBin Cho
4edb3c4746
[Test] Add complicated threaded actor tests (#19374)
Why are these changes needed?
There are only two simple threaded actor tests in the Ray repo. This PR adds more complicated threaded actor tests to make sure the feature is well tested.

The third test prints many messages like

(pid=42032) [2021-10-13 19:02:36,102 E 42032 10779969] core_worker.cc:270: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
which was the bug @scv119 fixed. Maybe we can start debugging this to determine when it happens and fix the real shutdown bugs.

2021-10-14 09:06:11 -07:00
Antoni Baum
e9df253f5d
[CI/docs] Remove [default] from xgboost-ray (#19186)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:29:55 +01:00
Kai Fricke
9cee83c919
[tune] PBT: Add burn-in period (#19321) 2021-10-14 16:28:29 +01:00
Edward Oakes
888fb24c25
Remove deprecated ray.services package (#18475)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:28:16 +01:00
Kai Fricke
312dc369a7
Revert "[Hotfix] Revert "[tune/wip] Exclude trial checkpoints in experiment sync"" (#19285)
This reverts commit a92f1fedf4 and fixes the failing test.
2021-10-14 11:18:48 +01:00
Edward Oakes
2ac81f336a
[serve] Remove BackendConfig broadcasting (#19154) 2021-10-13 16:25:34 -07:00
Linsong Chu
b86a5fcb96
[workflow] fix workflow user metadata return when None is given (#19356)
## Why are these changes needed?

Quick fix for metadata put. Currently, when workflow-level metadata is not given, `null` is written to `user_run_metadata.json`; this fix makes it write `{}` instead.
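
A minimal sketch of the behavior described above, not the code from this PR. The helper name `serialize_user_metadata` is hypothetical; it only illustrates why a missing value has to be mapped to an empty dict before serialization, since `json.dumps(None)` produces `null`.

```python
import json


def serialize_user_metadata(metadata):
    """Return JSON text for user metadata, mapping None to an empty object.

    Hypothetical helper for illustration; it is not part of Ray's workflow code.
    """
    # json.dumps(None) would produce "null", so substitute an empty dict first.
    return json.dumps({} if metadata is None else metadata)


print(json.dumps(None))                          # null
print(serialize_user_metadata(None))             # {}
print(serialize_user_metadata({"owner": "me"}))  # {"owner": "me"}
```

Checking `is None` rather than truthiness keeps the helper from touching any value other than a missing one.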
## Related issue number

original issue: https://github.com/ray-project/ray/issues/17090
original PR: https://github.com/ray-project/ray/pull/19195
2021-10-13 15:52:12 -07:00
architkulkarni
b0716f66ae
[runtime env] Fix handling of runtime env with None fields (#19300) 2021-10-13 13:57:55 -07:00
Antoni Baum
3cb0862152
Fix double gym in requirements (#19357) 2021-10-13 21:43:41 +01:00
Omkar Pangarkar
f1b9b16ae9
[tune] Fix DistributedTrainable restore (#19349) 2021-10-13 21:29:05 +01:00
Carlo Grisetti
da7a485786
[Windows] use dynamic temp path (#19096) 2021-10-13 13:02:45 -04:00
Kai Fricke
bde9e058da
Revert "[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib (#18183)" (#19351)
This reverts commit 74ee99ff99.
2021-10-13 13:06:36 +01:00
Linsong Chu
ce64e6dc45
[workflow] add metadata put in workflow (#19195)
## Why are these changes needed?

Add metadata to workflow. Currently there is no option for users to attach metadata to a step or workflow run, and workflow running metrics (except status) are neither captured nor checkpointed.

We are adding several kinds of metadata:

1. Step-level user metadata. It can be set with `step.options(metadata={})`.
2. Step-level pre-run metadata. This captures pre-run metadata such as step_start_time; more metrics can be added later.
3. Step-level post-run metadata. This captures post-run metadata such as step_end_time; more metrics can be added later.
4. Workflow-level user metadata. It can be set with `workflow.run(metadata={})`.
5. Workflow-level pre-run metadata. This captures pre-run metadata such as workflow_start_time; more metrics can be added later.
6. Workflow-level post-run metadata. This captures post-run metadata such as workflow_end_time; more metrics can be added later.
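
The three step-level kinds listed above can be sketched in plain Python. This is an illustration only and does not use or mirror Ray's workflow implementation: the function `run_step_with_metadata` and the dictionary keys `user_metadata`, `pre_run_metadata`, and `post_run_metadata` are assumptions; only `step_start_time` and `step_end_time` come from the description.

```python
import time


def run_step_with_metadata(step_fn, user_metadata=None):
    """Run step_fn and return (result, metadata).

    Hypothetical sketch: names and layout are assumptions, not Ray's API.
    """
    # Pre-run metadata is captured immediately before the step executes.
    pre_run = {"step_start_time": time.time()}
    result = step_fn()
    # Post-run metadata is captured immediately after the step returns.
    post_run = {"step_end_time": time.time()}
    metadata = {
        "user_metadata": {} if user_metadata is None else user_metadata,
        "pre_run_metadata": pre_run,
        "post_run_metadata": post_run,
    }
    return result, metadata


result, metadata = run_step_with_metadata(lambda: 6 * 7, {"owner": "demo"})
print(result)                     # 42
print(metadata["user_metadata"])  # {'owner': 'demo'}
```

The workflow-level kinds would follow the same pattern, wrapping the whole run instead of a single step.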

## Related issue number

https://github.com/ray-project/ray/issues/17090

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-10-12 21:01:24 -07:00
Clark Zinzow
1b179adfa1
[Core] [Hotfix] Handle logging redirected to stdout when configuring log file (#19301) 2021-10-12 19:03:21 -07:00
SangBin Cho
84118c9659
Revert "Revert "[Placement Group] Fix the high load bug from the plac… (#19330) 2021-10-12 19:02:30 -07:00
Clark Zinzow
df6d06bd41
Fix for LazyBlockList refactor. (#19333) 2021-10-12 18:18:45 -07:00
Amog Kamsetty
09d8049584
[SGD] Make actor creation async (#19325)
* fix

* fix

* fix
2021-10-12 16:15:59 -07:00
Eric Liang
9f1cd9e867
[docs] Document fake multi-node autoscaler (#19329) 2021-10-12 15:59:07 -07:00
Amog Kamsetty
f6f2435b91
[SGD] Sgd v2 Dataset Integration (#17626)
* wip

* wip

* wip

* draft

* disable tf autosharding

* wip

* wip

* wip

* wip

* add example

* wip

* wip

* wip

* use dataset.split

* add unit tests

* add linear example

* concatenate tensors and fix example

* WIP tune example

* add tensorflow example

* wip

* random_shuffle_each_window

* fault tolerance test

* GPU, examples, CI

* formatting

* fix

* Update python/ray/util/sgd/v2/tests/test_trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* wip

* type hints

* wip

* update user guide

* fix

* fix immediate issues

* update example

* update

* fix tune gpu test

* fix resources for smoke test - 1 CPU for dataset tasks

* update tests, docs, examples

* Apply suggestions from code review

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* address comments

* add warning

* fix tests

* minor doc updates

* update example in doc

* configure tests

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* Update python/ray/data/dataset.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* fix docstring

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-10-12 14:03:10 -07:00
Carlo Grisetti
7651cc782a
Change prometheus warning filename source (#19275)
* Change prometheus warning filename source

* Fix linting
2021-10-12 14:02:51 -07:00
Eric Liang
8c152bd17c
Revert "[Placement Group] Fix the high load bug from the placement group (#19277)" (#19327)
This reverts commit 4360b99803.
2021-10-12 12:41:51 -07:00
Lixin Wei
f2f9c749cb
[Build] Add an Option to Skip Bazel Build (#19265) 2021-10-12 12:01:58 -07:00
Eric Liang
0ab6749602
Support iter_epochs for Datasets (#19217) 2021-10-12 11:05:00 -07:00
SangBin Cho
4360b99803
[Placement Group] Fix the high load bug from the placement group (#19277) 2021-10-12 11:04:14 -07:00
Clark Zinzow
6ca3c02041
[Datasets] Parallelize Parquet metadata fetches. (#19211) 2021-10-12 11:02:30 -07:00
dependabot[bot]
74ee99ff99
[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib (#18183)
* [RLlib](deps): Bump tensorflow in /python/requirements/rllib

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.0 to 2.6.0.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.0...v2.6.0)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* wip.

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-12 17:56:36 +02:00
SangBin Cho
2c93708324
Migrating to flat hash map [Raylet] (#19220)
* done

* Fix all unit tests

* done

* .

* Fix the build issue

* fix the compilation bug
2021-10-12 07:41:51 -07:00
Wansoo Kim
0f6d4661d7
[tune] Port all MNIST examples to specify data_dir (#19033)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-12 15:36:06 +01:00
gjoliver
5d14904b9b
[Tune] catch HTTPError when logging to wandb. (#19314) 2021-10-12 14:38:17 +01:00
Kai Fricke
d8d8901192
[ci/tune] Remove deprecated jenkins_only tag from test tags (#19287) 2021-10-12 10:05:46 +01:00
Chris K. W
35230ea9fa
[client] deflake test_stdout_log_stream (#19232)
* deflake test_stdout_log_stream

* add assert message
2021-10-11 22:22:39 -07:00
architkulkarni
cc16e8f8c5
[runtime env] Validate "excludes" field (#19302) 2021-10-11 20:05:22 -07:00
Jiao
85b8a6de5f
[Serve] Add nightly test for Serve failure recovery (#19125) 2021-10-11 18:33:20 -07:00
Carlo Grisetti
c2377fb725
[Serve] Call without loop parameter if python 3.10+ (#19298) 2021-10-11 18:31:13 -07:00
Eric Liang
6cacc54774
[RFC] Fake multi-node mode for autoscaler (#18987) 2021-10-11 18:27:29 -07:00
SangBin Cho
0d7a7a06c0
[Placement group] Warm up the cluster before running the unit test (#19286) 2021-10-11 16:26:52 -07:00
Carlo Grisetti
2d0355548e
[Dashboard] Try to work around aiohttp 4.0.0 breaking changes (#19120) 2021-10-11 16:25:52 -07:00