Commit graph

11456 commits

Author SHA1 Message Date
Hao Chen
78597d3089
[train] Minor fixes on Ray Train user guide doc (#22379)
Fixes some typos and format issues.
2022-02-15 10:09:27 -08:00
Matti Picus
199bf558e2
move slow test from small (timeout 60s) to medium (timeout 300s) (#22167) 2022-02-15 09:55:30 -08:00
Gagandeep Singh
7dc097a947
Unskipped tests for Windows (#21809)
These tests are passing without issues on my Windows machine, so unskipping them to check on CI.
I will push the linting changes separately so that the test suite runs twice, to confirm that the flakiness is gone.

Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
2022-02-15 09:04:47 -08:00
Gagandeep Singh
a8341dfc29
Replace queue.Queue with multiprocessing.JoinableQueue (#21860)
The reason for not using `queue.Queue` for multiprocessing purposes on Windows is explained at https://stackoverflow.com/a/37244276 and in the second reply to https://stackoverflow.com/a/37245300
The reason for using `multiprocessing.JoinableQueue` over `multiprocessing.Queue` is explained at https://stackoverflow.com/a/30725121

AFAIK, this is because on Windows each process gets its own `Queue`, so nothing is shared among those processes. When `multiprocessing.Queue` is used, changes to it are shared via pipes internally, along with proper locks.
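A minimal sketch of the pattern described here, using only the standard library. The names `square_worker` and `run_with_worker_process` are illustrative and do not come from the Ray code base. A `JoinableQueue` carries work to a worker, and `join()` blocks until every item has been acknowledged with `task_done()`.

```python
import multiprocessing as mp


def square_worker(tasks, results):
    """Consume integers from `tasks` until a None sentinel arrives."""
    while True:
        item = tasks.get()
        try:
            if item is None:
                break
            results.put(item * item)
        finally:
            # Every get() must be matched by task_done() so join() can return.
            tasks.task_done()


def run_with_worker_process(items):
    """Square `items` in a separate process and return the results."""
    tasks = mp.JoinableQueue()
    results = mp.Queue()
    worker = mp.Process(target=square_worker, args=(tasks, results))
    worker.start()
    for item in items:
        tasks.put(item)
    tasks.put(None)  # sentinel: tells the worker to stop
    tasks.join()  # wait until every queued item has been acknowledged
    # Drain results before joining the worker to avoid blocking on a full pipe.
    squares = [results.get() for _ in items]
    worker.join()
    return squares
```

On Windows, where child processes are spawned rather than forked, a call such as `run_with_worker_process([1, 2, 3])` should sit under an `if __name__ == "__main__":` guard.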
2022-02-15 09:01:17 -08:00
ZhuSenlin
37ef372a10
Use shared_ptr instead of object in cluster_scheduling_resources_ to reduce rehash cost. (#22376)
1. For scheduling optimization, we should encapsulate `SchedulingResources`, `GcsNodeInfo`, and other node-related information into a `NodeContext`, which requires `SchedulingResources` to be shareable. This PR does not include the `NodeContext` transformation logic; it only makes `SchedulingResources` shareable.
2. `cluster_scheduling_resources_` holds raw `SchedulingResources` objects, which adds overhead on rehash (even though `std::move` is used during rehash).

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-02-15 23:43:59 +08:00
Kai Fricke
c866131cc0
[tune] Retry cloud sync up/down/delete on fail (#22029) 2022-02-15 12:27:29 +00:00
Jun Gong
b729a9390f
[RLlib] Add example commands for using setup-dev.py with RLlib for improved dev setup stability and developer experience. (#22380) 2022-02-15 12:00:36 +01:00
dependabot[bot]
35ae459434
[tune](deps): Bump flaml from 0.6.7 to 0.9.7 in /python/requirements/ml (#22071)
* [tune](deps): Bump flaml from 0.6.7 to 0.9.6 in /python/requirements/ml

Bumps [flaml](https://github.com/microsoft/FLAML) from 0.6.7 to 0.9.6.
- [Release notes](https://github.com/microsoft/FLAML/releases)
- [Commits](https://github.com/microsoft/FLAML/compare/v0.6.7...v0.9.6)

---
updated-dependencies:
- dependency-name: flaml
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-02-15 09:53:58 +00:00
Sven Mika
5ca6a56e16
[RLlib] Bug fix: eval-workers in offline RL setup have no env, even though eval_config includes env key. (#22350) 2022-02-15 09:32:43 +01:00
Akash Patel
ae6068277b
update grpc to 1.43 (#21866)
Add patch for newer setuptools; it can be removed once grpc 1.44 is released.

Why are these changes needed?
With grpc updated to 1.43, one of the patches is not needed.

Patch needed when building locally for newer setuptools version. See grpc/grpc#28392 for more details.
Also needed as a prereq to #21221
2022-02-15 00:20:56 -08:00
mwtian
59d9e20a4c
Revert "Revert "[Release 1.11.0][Core] avoid unnecessary work during event st… (#22144)" (#22284)
This reverts commit 6235b6d7e9.

Looks like windows://python/ray/tests:test_dataclient_disconnect has a similar level of flakiness as before the revert. This seems unrelated, and the test needs to be fixed in another way.
2022-02-15 00:20:28 -08:00
Jun Gong
6f5afcbce9
[RLlib] Docs enhancements: Setup-dev instructions; Ray datasets integration. (#22239) 2022-02-15 09:09:24 +01:00
Steven Morad
5d52b599aa
[RLlib] Fix zero gradients for ppo-clipped vf (#22171) 2022-02-15 08:57:18 +01:00
Yi Cheng
2fbbd21351
[workflow] Fix event loop not found in non-main thread (#22363)
By default, an event loop is only set in the main thread, so workflow fails when it is called from any other thread, which can happen when it is invoked from a library (for example, Ray Serve).
This PR fixes it.
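The underlying asyncio behavior can be sketched with the standard library alone. This is an illustration of the general technique (creating a loop explicitly in a worker thread), not the code from this PR; `add_one` and `run_in_worker_thread` are hypothetical names.

```python
import asyncio
import threading


async def add_one(x):
    await asyncio.sleep(0)
    return x + 1


def run_in_worker_thread(coro_fn, *args):
    """Run a coroutine function to completion inside a non-main thread."""
    outcome = {}

    def target():
        # Worker threads start without a current event loop,
        # so one is created and installed explicitly.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        try:
            outcome["value"] = loop.run_until_complete(coro_fn(*args))
        finally:
            loop.close()

    thread = threading.Thread(target=target)
    thread.start()
    thread.join()
    return outcome["value"]
```

Creating a fresh loop per thread keeps the loops independent, at the cost of not sharing tasks with any loop running in the main thread.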
2022-02-14 23:31:32 -08:00
matthewdeng
8f9e0d7f6b
[train] add TorchTensorboardProfilerCallback (#22345)
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`.

```
  | File "ray_sgd_training.py", line 18, in <module>
  | from ray import train
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
  | from ray.train.callbacks import TrainingCallback
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
  | from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
  | File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
  | from torch.profiler import profile
  | ModuleNotFoundError: No module named 'torch.profiler'
```

A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes:
1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed:

```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
    "Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
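The deferred-import pattern from item 2 can be sketched as follows. `LazyBackendProfiler` is a hypothetical stand-in, not the actual `TorchWorkerProfiler` implementation; the point is only that the optional dependency is imported in `__init__` rather than at module import time.

```python
import importlib


class LazyBackendProfiler:
    """Hypothetical class that imports its backend only when instantiated."""

    def __init__(self, backend_module="torch.profiler"):
        try:
            # Importing here means that merely importing this module
            # never requires the optional backend to be installed.
            self._backend = importlib.import_module(backend_module)
        except ImportError as exc:
            raise ImportError(
                f"LazyBackendProfiler requires the optional module "
                f"{backend_module!r}; install it to use this class."
            ) from exc
```

With this layout, users who never construct the class are unaffected by a missing backend, and users who do get an actionable error message.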
2022-02-14 16:16:55 -08:00
Eric Liang
35a157948e
Lay the groundwork for lazy dataset optimization (no behavior changes) (#22233)
This PR refactors Dataset execution to enable lazy mode in the future, which can reduce memory usage in large-scale ingest pipelines. There should be no behavior changes in this PR. Many of the optimizations are also punted for future work.
2022-02-14 15:03:58 -08:00
Jialing He
192f9de421
[runtime env] Introduce async Manager.create (#22311) 2022-02-14 16:26:47 -06:00
Matti Picus
845861fdc1
[runtime env] use pytest tmp_path, os.path.sep, and unskip most tests for windows (#22342) 2022-02-14 16:04:10 -06:00
Archit Kulkarni
0e350c0074
[runtime env] [Doc] Add two ways of installing dependencies: cluster launcher, and runtime env (#20780)
We shouldn't promote Runtime Environments as the only way to do things until all Core nightly and release tests are run using runtime environments. 

This PR adds the prior approach (using cluster launcher commands) to the doc on equal footing, describing the differences between the two.

Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2022-02-14 16:03:48 -06:00
Clark Zinzow
53c4c7b1be
[Datasets] Expose TableRow as public API; minimize copies/type conversions on row-based ops. (#22305)
This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made:
1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions.
2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.
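To illustrate item 2, here is a small sketch of what deriving from `collections.abc.Mapping` provides. `SimpleRow` is a hypothetical class backed by a plain dict, not Ray's `TableRow`; only the three abstract methods are implemented, and the mixin methods come for free.

```python
from collections.abc import Mapping


class SimpleRow(Mapping):
    """Hypothetical read-only row keyed by column name."""

    def __init__(self, columns, values):
        self._data = dict(zip(columns, values))

    # The three abstract methods required by Mapping:
    def __getitem__(self, key):
        return self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


row = SimpleRow(["id", "name"], [7, "ray"])
# keys(), items(), values(), get(), __contains__, and __eq__ are
# supplied by Mapping on top of the three methods above.
print(list(row.items()))
```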
2022-02-14 12:56:17 -08:00
dependabot[bot]
767b349b99
[data](deps): Bump dask[complete] (#22334)
Bumps [dask[complete]](https://github.com/dask/dask) from 2022.1.0 to 2022.2.0.
- [Release notes](https://github.com/dask/dask/releases)
- [Changelog](https://github.com/dask/dask/blob/main/docs/release-procedure.md)
- [Commits](https://github.com/dask/dask/compare/2022.01.0...2022.02.0)

---
updated-dependencies:
- dependency-name: dask[complete]
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-02-14 12:44:20 -08:00
Edward Oakes
610930ae6a
[serve] Improve health check failure semantics (#22297) 2022-02-14 14:04:03 -06:00
Yi Cheng
8a3bd6c275
[gcs/ha] Enable HA flags by default (#21608)
PR to enable all three flags for GCS HA:
- RAY_bootstrap_with_gcs=1 
- RAY_gcs_grpc_based_pubsub=1 
- RAY_gcs_storage=memory
2022-02-14 11:13:17 -08:00
Clark Zinzow
443416907e
[Datasets] Fix boolean tensor column representation and slicing. (#22323)
This PR fixes our {NumPy, Pandas} <--> Arrow interop for boolean tensor columns. NumPy and Pandas represent boolean arrays with a byte per boolean, while Arrow bit-packs booleans with 8 booleans per byte. Previously, when casting NumPy arrays to tensor columns, we were interpreting NumPy's boolean array buffers as being bit-packed when they were not. This PR completes support by packing and unpacking bits for boolean arrays when creating a boolean tensor column from an ndarray and when creating an ndarray from a boolean tensor column, respectively.
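The difference between the two layouts can be sketched in pure Python. This is an illustration only, assuming least-significant-bit-first order within each byte, as in Arrow's bitmap layout; `pack_bits` and `unpack_bits` are hypothetical helpers, not functions from Ray, NumPy, or Arrow.

```python
def pack_bits(flags):
    """Pack a sequence of booleans into bytes, 8 per byte, LSB first."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed, count):
    """Recover `count` booleans from LSB-first bit-packed bytes."""
    return [bool((packed[i // 8] >> (i % 8)) & 1) for i in range(count)]


flags = [True, False, True, True, False, False, False, False, True]
packed = pack_bits(flags)
# Nine byte-per-boolean values fit into two packed bytes.
assert unpack_bits(packed, len(flags)) == flags
```

The element count has to travel alongside the packed bytes, because the padding bits in the last byte are otherwise indistinguishable from real `False` values.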
2022-02-14 10:36:35 -08:00
Chen Shen
db5de9c35c
[scheduler-refactor 2/n] move actor reporting into helper class too (#22333)
* move this

* address comments
2022-02-14 02:13:14 -08:00
Alex Wu
276ff2b7ed
[docs][autoscaler] Add maintainers for node providers (#22237)
This PR adds documentation for the maintainers of the various node providers.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-02-12 11:31:32 -08:00
Max Pumperla
d594b668bb
[docs] [tune] hyperopt notebook (#22315) 2022-02-12 02:46:03 -08:00
Eric Liang
85d6946c95
Split test_dataset.py into two (#22303) 2022-02-12 00:21:25 -08:00
Amog Kamsetty
4cbbc81f4c
[Train] Add support for trainer.best_checkpoint and Trainer.load_checkpoint_path (#22306)
Closes #22226
2022-02-11 22:29:37 -08:00
SangBin Cho
640d92c385
It seems like the S3 read sometimes fails; #22214. I found out the file actually does exist in S3, so it is highly likely a transient error. This PR adds a retry mechanism to avoid the issue.
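A retry mechanism for transient read failures might look like the sketch below. It is a generic helper with exponential backoff, not the code added by this PR; `read_with_retry` and its parameters are assumptions for illustration.

```python
import time


def read_with_retry(read_fn, attempts=3, base_delay_s=0.5, retry_on=(OSError,)):
    """Call `read_fn`, retrying on the given exception types."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return read_fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # Back off exponentially: base, 2 * base, 4 * base, ...
            time.sleep(base_delay_s * 2 ** attempt)
```

Restricting `retry_on` to error types that are plausibly transient avoids masking real failures such as a genuinely missing file.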
2022-02-12 11:58:58 +09:00
Yi Cheng
531e215921
[gcs] Fix in_memory_store not handling nullptr callback issue (#22321)
The in-memory store does not handle a nullptr callback correctly, which leads to a GCS crash in node failure tests. This PR fixes it.
2022-02-11 18:35:40 -08:00
Kaushik B
8515fdd6db
[tune] Update Lightning examples to support PTL 1.5 (#20562)
Helps resolve the issues users are facing when running Lightning examples with Ray Tune: PyTorchLightning/pytorch-lightning#10407

Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
2022-02-11 17:45:06 -08:00
Amog Kamsetty
e8e35169c6
[Train] Allow train methods to be called outside of the session (#21969)
Updates to address @worldveil's feedback:

- Include `import train.torch` in the docs.
- Allow methods in session.py to be called outside of the session with sensible defaults. These will no longer raise an error.

Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
2022-02-11 17:42:55 -08:00
jialin
851b853352
add optional empty lines filter in read_text (#22298)
`ray.data.read_text()` currently doesn't handle empty lines. This PR adds a flag that enables an empty-line filter.
With this change, `read_text` returns only non-empty lines by default, unless `drop_empty_line` is set to False.
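The described behavior can be sketched with a plain-file reader. `read_text_lines` is a hypothetical helper, not `ray.data.read_text`; treating whitespace-only lines as empty is an assumption made for this illustration.

```python
def read_text_lines(path, drop_empty_line=True, encoding="utf-8"):
    """Return the lines of a text file, optionally skipping empty ones."""
    with open(path, encoding=encoding) as f:
        lines = [line.rstrip("\n") for line in f]
    if drop_empty_line:
        # Assumption: a line containing only whitespace counts as empty.
        lines = [line for line in lines if line.strip() != ""]
    return lines
```

Defaulting the flag to True matches the commit's stated default; callers that need exact line positions would pass `drop_empty_line=False`.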

Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jialin Liu <jialin.liu@bytedance.com>
2022-02-11 14:49:45 -08:00
Jun Gong
cbd24503b6
[RLlib] Add A3C to RLlib performance regression tests. (#22316) 2022-02-11 21:18:53 +01:00
Edward Oakes
49b3e6c53c
[serve] Support user-provided health check via def check_health(self) method (#22178) 2022-02-11 12:53:37 -06:00
matthewdeng
2c204a755b
[train] add minimal installation test suite (#22300)
Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path.

**Example:** If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.
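One way to check for such accidental imports is sketched below. This is a different, illustrative approach and not the test suite added in this PR: it imports a package in a fresh interpreter and reports which "forbidden" modules ended up in `sys.modules`. The helper name `leaked_imports` is hypothetical.

```python
import json
import subprocess
import sys


def leaked_imports(package, forbidden):
    """Import `package` in a fresh interpreter; return forbidden modules it loaded."""
    # Note: the child itself imports importlib, json, and sys,
    # so those names cannot be checked meaningfully.
    child_code = (
        "import importlib, json, sys\n"
        f"importlib.import_module({package!r})\n"
        f"print(json.dumps([m for m in {list(forbidden)!r} if m in sys.modules]))\n"
    )
    completed = subprocess.run(
        [sys.executable, "-c", child_code],
        check=True,
        capture_output=True,
        text=True,
    )
    # Use the last line in case the imported package prints during import.
    return json.loads(completed.stdout.strip().splitlines()[-1])
```

A check in this style would call something like `leaked_imports("ray.train", ["torch", "tensorflow", "horovod"])` and assert that the result is empty; that call is shown only as an example and assumes Ray is installed.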
2022-02-11 10:09:00 -08:00
Archit Kulkarni
da57012cbc
Add comment to periodic CI pipeline to update release process doc when updating test suites (#22037)
This PR adds a comment to build_pipeline.py reminding anyone who makes changes to the test suites to also update the release process doc if necessary.

This is an action item from the Ray 1.10.0 release retrospective.
2022-02-11 11:14:24 -06:00
Archit Kulkarni
a65f35b867
[Doc] [Jobs] Add ray dashboard docs to jobs doc (#22222)
To use Jobs on a remote cluster, you need to set up port forwarding.  When using the cluster launcher, the `ray dashboard` command provides this automatically.  This PR adds a how-to to the docs for this feature.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
2022-02-11 11:01:37 -06:00
Archit Kulkarni
1c0c2aaba2
[runtime env] Add test for scheduling task after failed job env (#22224)
Adds a test to make sure a failed job runtime env creation doesn't hang the cluster (i.e., tasks can still be scheduled on the job, as long as the tasks' runtime env can be created). Test requested by @rkooo567, good idea!
2022-02-11 11:01:16 -06:00
ZhuSenlin
358771c636
Optimize MultiItemCallback and MapCallback to reduce data copy when GCS load data after restart (#22307)
After GCS restarts, metadata is loaded from Redis. Currently the Redis callback takes its argument as `const &`, which requires copying the loaded data. Changing it to `&&` and using `std::move` reduces the data copy.
2022-02-11 16:57:16 +08:00
Chen Shen
bb6cb0898b
[Dataset] avoid pyarrow 7.0.0 for dataset (#22253) 2022-02-11 00:32:47 -08:00
Eric Liang
02add259ca
Add more details to the internal error for "worker cannot find registered function" (#22302)
This adds some more debug information for this internal error that shouldn't happen.
2022-02-10 23:20:17 -08:00
Qing Wang
49d725b0c7
[Java] Add Java release guideline. (#22288)
Add a Java release guideline to help with the Ray Java release process.
2022-02-11 14:56:20 +08:00
Edward Oakes
dd097b7a9b
[serve] Fix HTTP proxy controller namespace bug (#22287)
Closes https://github.com/ray-project/ray/issues/22265

This was caused by implicitly inferring the namespace from within the HTTP proxy when calling `get_handle`. This makes me think we really need to simplify the namespace handling logic.
2022-02-10 21:05:35 -06:00
Clark Zinzow
13c8e10b3b
[Datasets] Unrevert NaN handling. (#22291)
Reverts #22258, unreverting #20787. 

The fix is in the ["Fix tests" commit](b559da2407), where we switch to using the test utility DataFrame equality comparison, which properly handles NaN comparisons. The underlying cause of this test break is explained [here](https://github.com/ray-project/ray/pull/22258#issuecomment-1035404700).
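Why a naive equality check breaks on NaN can be shown in pure Python. The helpers `values_equal` and `rows_equal` are hypothetical and stand in for a NaN-aware comparison; they are not the test utility referenced above.

```python
import math

# IEEE 754 NaN never compares equal, not even to another NaN.
assert float("nan") != float("nan")


def values_equal(a, b):
    """Compare two values, treating NaN as equal to NaN."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


def rows_equal(left, right):
    """Element-wise comparison of two equally long sequences."""
    return len(left) == len(right) and all(
        values_equal(a, b) for a, b in zip(left, right)
    )
```

A comparison built on plain `==` reports two otherwise identical rows as different whenever both contain NaN in the same position, which is the failure mode a NaN-aware utility avoids.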
2022-02-10 16:19:53 -08:00
Amog Kamsetty
09e46066eb
[Train] Fix accuracy calculation for CIFAR example (#22292)
Same as #21689 except for cifar
2022-02-10 15:06:31 -08:00
Archit Kulkarni
94f73de23c
Revert "[Serve] [Windows] Unskip all but test_redeploy_single_replica in test_deploy.py (#21391)" (#22299)
This reverts commit 000c56f764.
2022-02-10 14:49:34 -08:00
Chen Shen
0866a5558f
[Dataset][nightly-test] pin pyarrow==4.0.1 for dataset related tests (#22277)
* pin pyarrow==4.0.1

* address comments
2022-02-10 14:22:41 -08:00
Edward Oakes
48adb6f7bb
[serve] Introduce DeploymentStatus, poll for statuses instead of using async goals (#22121) 2022-02-10 12:33:04 -08:00