Commit graph

5599 commits

Author SHA1 Message Date
Gagandeep Singh
0b82135d2d
Use 127.0.0.1 in win32 as node ip addr (#19362) 2021-10-18 15:51:15 -07:00
Ian Rodney
74db390d15
[Docker] Fix Rsync (#19020)
* rsync down

* Rsync up, but not delete

* test fixes

* Explicit rsync -e

* Better copy check

* quick comment

* Additional fix to rsync_up
2021-10-18 14:35:22 -07:00
Kai Fricke
6798bdbb5d
Revert "Revert "[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib"" (#19352)
This reverts commit bde9e058da.
2021-10-18 22:29:16 +01:00
Eric Liang
1bb2b1fc49
[hotfix] Pin pyspark dep to 3.1.2 2021-10-18 13:10:06 -07:00
mwtian
9742abb749
[Debugging] Print Python stack trace in addition to C++ stack trace, when Python worker crashes (#19423)
Why are these changes needed?
Right now, the failure signal handler registered in the Python worker is skipped on crashes such as segfaults, because the C++ core worker overrides the failure signal handler and does not call the previously registered one. This prevents the Python stack trace from being printed on crashes. The fix is to make the C++ failure signal handler call the previous signal handler registered in Python. For example, with the script below, which segfaults,

import ray
ray.init()

@ray.remote
def f():
    import ctypes
    ctypes.string_at(0)

ray.get(f.remote())
Ray currently only prints the following stack trace:

(pid=26693) *** SIGSEGV received at time=1634418743 ***
(pid=26693) PC: @     0x7fff203d9552  (unknown)  _platform_strlen
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: *** SIGSEGV received at time=1634418743 ***
(pid=26693) [2021-10-16 14:12:23,331 E 26693 12194577] logging.cc:313: PC: @     0x7fff203d9552  (unknown)  _platform_strlen
With this change, the Python stack trace is printed in addition to the stack trace above:

(pid=26693) Fatal Python error: Segmentation fault
(pid=26693)
(pid=26693) Stack (most recent call first):
(pid=26693)   File "/Users/mwtian/opt/anaconda3/envs/ray/lib/python3.7/ctypes/__init__.py", line 505 in string_at
(pid=26693)   File "stack.py", line 7 in f
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/worker.py", line 425 in main_loop
(pid=26693)   File "/Users/mwtian/work/ray-project/ray/python/ray/workers/default_worker.py", line 212 in <module>
This should make debugging crashes in Python workers easier, for both users and Ray devs.

Also, try to initialize the symbolizer in GCS, Raylet and the core worker. This is a no-op on macOS and some Linux environments (e.g. Ray on Ubuntu 20.04 already produces symbolized stack traces), but should make Ray more likely to produce symbolized stack traces on other platforms.
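
The fix described above relies on handler chaining: a handler installed later still calls the one registered before it. Below is a minimal Python-level sketch of that idea, not Ray's C++ implementation. The helper name `install_chained_handler` and the use of SIGINT are assumptions made for illustration, and the snippet is assumed to run on the main thread.

```python
import signal


def install_chained_handler(signum, handler):
    """Install `handler` for `signum` and chain to the handler registered before it.

    Hypothetical helper for illustration; it is not part of Ray.
    """
    previous = signal.getsignal(signum)

    def chained(sig, frame):
        handler(sig, frame)
        # SIG_DFL, SIG_IGN and None are not callable, so only real handlers are chained.
        if callable(previous):
            previous(sig, frame)

    signal.signal(signum, chained)
    return previous


calls = []

# Stands in for the handler the Python worker registered first.
original = signal.signal(signal.SIGINT, lambda sig, frame: calls.append("first"))

# Stands in for a handler installed later, which must not swallow the first one.
install_chained_handler(signal.SIGINT, lambda sig, frame: calls.append("second"))

signal.raise_signal(signal.SIGINT)  # requires Python 3.8+

# Restore whatever was installed before this sketch ran.
if original is not None:
    signal.signal(signal.SIGINT, original)

print(calls)
```

Without the `callable(previous)` branch, only the later handler would run, which mirrors the situation the commit message describes.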
2021-10-18 09:05:08 -07:00
Guyang Song
c04fb62f1d
[C++ worker] set native library path for shared library search (#19376) 2021-10-18 16:03:49 +08:00
Hao Zhang
c96c2e9b5f
[Collective] Enhance the collective group GC a bit (#19402) 2021-10-15 18:47:54 -07:00
Chen Shen
a9c34d55e3
Throw if infinite (#19418) 2021-10-15 18:01:53 -07:00
Gagandeep Singh
d226cbf21a
Added StartupToken to identify a process at startup (#19014)
* Added StartupToken to identify a process at startup

* Applied linting formats

* Addressed reviews

* Fixing worker_pool_test

* Fixed worker_pool_test

* Applied linting formatting

* Added documentation for StartupToken

* Fixed linting

* Reordered initialisation of WorkerPool members

* Fixed Python docs

* Fixing bugs in cluster_mode_test

* Fixing Java tests

* Create and set shim process after verifying startup_token

* shim_process.GetId() -> worker_shim_pid

* Improvements in startup token and modifying java files

* update io_ray_runtime_RayNativeRuntime.h

* Fixed java tests by adding startup-token to conf

* Applied linting

* Increased arg count for startup_token

* Attempt to fix streaming tests

* Type correction

* applied linting

* Corrected index of startup token arg

* Modified mock_worker.cc to accept startup tokens

* Applied linting

* Applied linting changes from CI

* Removed override from worker.h

* Applied linting from scripts/format.sh

* Addressed reviews and applied scripts/format.sh

* Applied linting script from ci/travis

* Removed unrequired methods from public scope

* Applied linting
2021-10-15 15:13:13 -07:00
Chen Shen
acfbf4c170
Fix from Dask bug in Datasets (#19409) 2021-10-15 15:04:52 -07:00
Gagandeep Singh
07064cddf9
Re-enabling tests from test_basic (#19384)
Why are these changes needed?
Related issue number
#19177

Quoting #19177 (comment) here,

The following tests fail when not skipped,

=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic.py::test_user_setup_function - subprocess.CalledProcessErro...
FAILED python\ray\tests\test_basic.py::test_disable_cuda_devices - subprocess.CalledProcessErr...
FAILED python\ray\tests\test_basic.py::test_wait_timing - assert (1634209333.6099107 - 1634209...

Results (395.22s):
      36 passed
       3 failed
         - ray\tests/test_basic.py:197 test_user_setup_function
         - ray\tests/test_basic.py:220 test_disable_cuda_devices
         - ray\tests/test_basic.py:265 test_wait_timing
=================================== short test summary info ====================================
FAILED python\ray\tests\test_basic_3.py::test_fair_queueing - AssertionError: 23

Results (198.33s):
       1 failed
         - ray\tests/test_basic_3.py:169 test_fair_queueing
The following test passed when not skipped. Opening a PR to verify that.

def test_oversized_function(ray_start_shared_local_modes)
2021-10-15 14:02:57 -07:00
Kai Fricke
bb38c5cb1f
[tune] Fix result buffering case check (fixes bug introduced in #19140) (#19399) 2021-10-15 10:43:34 +01:00
Siyuan (Ryans) Zhuang
0d4b0ded27
[Serialization] Update cloudpickle to v2.0.0 (#19383)
* update cloudpickle to v2.0.0
2021-10-15 02:37:29 -07:00
Hao Zhang
4b92f34ada
[Collective] Remove an unnecessary cuda.stream.synchronize (#19400) 2021-10-14 21:33:59 -07:00
Matti Picus
f372bb07aa
Enable dashboard on Windows (#19319) 2021-10-14 14:42:22 -07:00
architkulkarni
b3ccec5d76
[runtime_env] Fix bug when all working_dir contents are excluded with Ray Client (#19377) 2021-10-14 11:20:45 -07:00
Carlo Grisetti
30fe93d285
[Windows] Use correct interpreter and fix prometheus atomic file rename (#19171) 2021-10-14 10:29:21 -07:00
Eric Liang
13d4ad6100
[data] Preserve epoch by default when using rewindow() (#19359) 2021-10-14 09:17:36 -07:00
SangBin Cho
4edb3c4746
[Test] Add complicated threaded actor tests (#19374)
Why are these changes needed?
There are only 2 simple threaded actor tests in the Ray repo. This PR adds more complicated threaded actor tests to make sure threaded actors are well tested.

The third test prints many messages like

(pid=42032) [2021-10-13 19:02:36,102 E 42032 10779969] core_worker.cc:270: The global worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
which was the bug @scv119 fixed. Maybe we can start debugging this to find out when it happens and fix the real shutdown bugs.

2021-10-14 09:06:11 -07:00
Antoni Baum
e9df253f5d
[CI/docs] Remove [default] from xgboost-ray (#19186)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:29:55 +01:00
Kai Fricke
9cee83c919
[tune] PBT: Add burn-in period (#19321) 2021-10-14 16:28:29 +01:00
Edward Oakes
888fb24c25
Remove deprecated ray.services package (#18475)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-14 16:28:16 +01:00
Kai Fricke
312dc369a7
Revert "[Hotfix] Revert "[tune/wip] Exclude trial checkpoints in experiment sync"" (#19285)
This reverts commit a92f1fedf4.
and fixes the failing test
2021-10-14 11:18:48 +01:00
Edward Oakes
2ac81f336a
[serve] Remove BackendConfig broadcasting (#19154) 2021-10-13 16:25:34 -07:00
Linsong Chu
b86a5fcb96
[workflow] fix workflow user metadata return when None is given (#19356)
## Why are these changes needed?

Quick fix for metadata put. Currently, when workflow-level metadata is not given, `null` is written to `user_run_metadata.json`; with this fix, `{}` is written instead.
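
The behavior change can be sketched in a few lines. This is an illustration only, not the code from the PR; the helper name `dump_user_metadata` is hypothetical.

```python
import json


def dump_user_metadata(metadata=None):
    """Serialize user metadata, writing an empty object when none is given.

    Hypothetical helper; the real workflow storage code may look different.
    """
    return json.dumps({} if metadata is None else metadata)


# Serializing None directly is what produces `null`.
print(json.dumps(None))
# With the None check, an empty JSON object is produced instead.
print(dump_user_metadata())
```

Writing `{}` lets readers of the file treat the result as a dictionary without a separate `None` check.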
## Related issue number

original issue: https://github.com/ray-project/ray/issues/17090
original PR: https://github.com/ray-project/ray/pull/19195
2021-10-13 15:52:12 -07:00
architkulkarni
b0716f66ae
[runtime env] Fix handling of runtime env with None fields (#19300) 2021-10-13 13:57:55 -07:00
Antoni Baum
3cb0862152
Fix double gym in requirements (#19357) 2021-10-13 21:43:41 +01:00
Omkar Pangarkar
f1b9b16ae9
[tune] Fix DistributedTrainable restore (#19349) 2021-10-13 21:29:05 +01:00
Carlo Grisetti
da7a485786
[Windows] use dynamic temp path (#19096) 2021-10-13 13:02:45 -04:00
Kai Fricke
bde9e058da
Revert "[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib (#18183)" (#19351)
This reverts commit 74ee99ff99.
2021-10-13 13:06:36 +01:00
Linsong Chu
ce64e6dc45
[workflow] add metadata put in workflow (#19195)
## Why are these changes needed?

Add metadata to workflows. Currently, users have no way to attach metadata to a step or workflow run, and workflow run metrics (except status) are neither captured nor checkpointed.

We are adding several kinds of metadata:

1. Step-level user metadata. Can be set with `step.options(metadata={})`.
2. Step-level pre-run metadata. This captures pre-run metadata such as step_start_time; more metrics can be added later.
3. Step-level post-run metadata. This captures post-run metadata such as step_end_time; more metrics can be added later.
4. Workflow-level user metadata. Can be set with `workflow.run(metadata={})`.
5. Workflow-level pre-run metadata. This captures pre-run metadata such as workflow_start_time; more metrics can be added later.
6. Workflow-level post-run metadata. This captures post-run metadata such as workflow_end_time; more metrics can be added later.
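
As a rough picture of how the step-level items in the list above fit together, here is a plain-Python sketch that does not use Ray. The function `run_step_with_metadata` and the layout of the returned dictionary are assumptions for illustration; only the keys `step_start_time` and `step_end_time` come from the description above.

```python
import time


def run_step_with_metadata(func, *args, user_metadata=None, **kwargs):
    """Run `func` and collect user, pre-run and post-run metadata around it.

    Hypothetical helper; it only illustrates the three kinds of step metadata.
    """
    pre_run = {"step_start_time": time.time()}
    result = func(*args, **kwargs)
    post_run = {"step_end_time": time.time()}
    metadata = {
        "user_metadata": dict(user_metadata or {}),
        "pre_run": pre_run,
        "post_run": post_run,
    }
    return result, metadata


def add(a, b):
    return a + b


result, metadata = run_step_with_metadata(add, 1, 2, user_metadata={"owner": "demo"})
print(result, sorted(metadata))
```

The workflow-level items would follow the same pattern, with `workflow_start_time` and `workflow_end_time` recorded around the whole run.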

## Related issue number

https://github.com/ray-project/ray/issues/17090

Co-authored-by: Yi Cheng <chengyidna@gmail.com>
2021-10-12 21:01:24 -07:00
Clark Zinzow
1b179adfa1
[Core] [Hotfix] Handle logging redirected to stdout when configuring log file (#19301) 2021-10-12 19:03:21 -07:00
SangBin Cho
84118c9659
Revert "Revert "[Placement Group] Fix the high load bug from the plac… (#19330) 2021-10-12 19:02:30 -07:00
Clark Zinzow
df6d06bd41
Fix for LazyBlockList refactor. (#19333) 2021-10-12 18:18:45 -07:00
Amog Kamsetty
09d8049584
[SGD] Make actor creation async (#19325)
* fix

* fix

* fix
2021-10-12 16:15:59 -07:00
Eric Liang
9f1cd9e867
[docs] Document fake multi-node autoscaler (#19329) 2021-10-12 15:59:07 -07:00
Amog Kamsetty
f6f2435b91
[SGD] Sgd v2 Dataset Integration (#17626)
* wip

* wip

* wip

* draft

* disable tf autosharding

* wip

* wip

* wip

* wip

* add example

* wip

* wip

* wip

* use dataset.split

* add unit tests

* add linear example

* concatenate tensors and fix example

* WIP tune example

* add tensorflow example

* wip

* random_shuffle_each_window

* fault tolerance test

* GPU, examples, CI

* formatting

* fix

* Update python/ray/util/sgd/v2/tests/test_trainer.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* wip

* type hints

* wip

* update user guide

* fix

* fix immediate issues

* update example

* update

* fix tune gpu test

* fix resources for smoke test - 1 CPU for dataset tasks

* update tests, docs, examples

* Apply suggestions from code review

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* address comments

* add warning

* fix tests

* minor doc updates

* update example in doc

* configure tests

* Update doc/source/raysgd/v2/user_guide.rst

Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>

* Update python/ray/data/dataset.py

Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>

* fix docstring

Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
2021-10-12 14:03:10 -07:00
Carlo Grisetti
7651cc782a
Change prometheus warning filename source (#19275)
* Change prometheus warning filename source

* Fix linting
2021-10-12 14:02:51 -07:00
Eric Liang
8c152bd17c
Revert "[Placement Group] Fix the high load bug from the placement group (#19277)" (#19327)
This reverts commit 4360b99803.
2021-10-12 12:41:51 -07:00
Lixin Wei
f2f9c749cb
[Build] Add an Option to Skip Bazel Build (#19265) 2021-10-12 12:01:58 -07:00
Eric Liang
0ab6749602
Support iter_epochs for Datasets (#19217) 2021-10-12 11:05:00 -07:00
SangBin Cho
4360b99803
[Placement Group] Fix the high load bug from the placement group (#19277) 2021-10-12 11:04:14 -07:00
Clark Zinzow
6ca3c02041
[Datasets] Parallelize Parquet metadata fetches. (#19211) 2021-10-12 11:02:30 -07:00
dependabot[bot]
74ee99ff99
[RLlib](deps): Bump tensorflow from 2.5.0 to 2.6.0 in /python/requirements/rllib (#18183)
* [RLlib](deps): Bump tensorflow in /python/requirements/rllib

Bumps [tensorflow](https://github.com/tensorflow/tensorflow) from 2.5.0 to 2.6.0.
- [Release notes](https://github.com/tensorflow/tensorflow/releases)
- [Changelog](https://github.com/tensorflow/tensorflow/blob/master/RELEASE.md)
- [Commits](https://github.com/tensorflow/tensorflow/compare/v2.5.0...v2.6.0)

---
updated-dependencies:
- dependency-name: tensorflow
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* wip.

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: sven1977 <svenmika1977@gmail.com>
2021-10-12 17:56:36 +02:00
SangBin Cho
2c93708324
Migrating to flat hash map [Raylet] (#19220)
* done

* Fix all unit tests

* done

* .

* Fix the build issue

* fix the compilation bug
2021-10-12 07:41:51 -07:00
Wansoo Kim
0f6d4661d7
[tune] Port all MNIST examples to specify data_dir (#19033)
Co-authored-by: Kai Fricke <kai@anyscale.com>
2021-10-12 15:36:06 +01:00
gjoliver
5d14904b9b
[Tune] catch HTTPError when logging to wandb. (#19314) 2021-10-12 14:38:17 +01:00
Kai Fricke
d8d8901192
[ci/tune] Remove deprecated jenkins_only tag from test tags (#19287) 2021-10-12 10:05:46 +01:00
Chris K. W
35230ea9fa
[client] deflake test_stdout_log_stream (#19232)
* deflake test_stdout_log_stream

* add assert message
2021-10-11 22:22:39 -07:00
architkulkarni
cc16e8f8c5
[runtime env] Validate "excludes" field (#19302) 2021-10-11 20:05:22 -07:00