Currently, when the `"conda"` field of `runtime_env` is specified, we automatically insert the currently running Ray wheel into the conda dependencies (in the nested `pip` list). This Ray wheel is specified by a URL to Amazon S3, which is where we store our Ray wheels.
Unfortunately, the M1 wheels are currently built manually and uploaded directly to PyPI, and this happens only once per stable release (in contrast to non-M1 wheels, which are auto-built and uploaded to S3 for every commit on master and release branches). So prior to this PR, if you tried to use the `"conda"` field on M1, it would fail with a message saying it couldn't find the appropriate wheel for the platform.
To fix this, when the Ray cluster is running on an M1 Mac, the only thing we can do for now is to insert `"ray=={ray.__version__}"` as our `pip` specifier, instead of the (nonexistent) S3 URL.
The downsides of this approach are that (1) nightly wheels and wheels built from commits on master remain unsupported for M1, and (2) we cannot end-to-end test this codepath on a new stable version of Ray before that version is actually released to PyPI. However, this PR adds a unit test.
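The fallback described above can be sketched roughly as follows. This is a minimal illustration, not Ray's actual implementation: the helper names `is_m1_mac`, `ray_pip_requirement`, and `inject_ray_into_conda`, the version string, and the placeholder wheel URL are all assumptions made for the example.

```python
import copy
import platform
import sys


def is_m1_mac(system=None, machine=None):
    """Return True on Apple Silicon macOS (arguments can be overridden in tests)."""
    system = sys.platform if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "darwin" and machine == "arm64"


def ray_pip_requirement(ray_version, s3_wheel_url, on_m1):
    """Pick the pip requirement to inject for Ray (hypothetical helper)."""
    if on_m1:
        # No per-commit M1 wheel exists on S3, so fall back to the PyPI release.
        return f"ray=={ray_version}"
    return s3_wheel_url


def inject_ray_into_conda(conda_env, requirement):
    """Return a copy of a conda env dict with `requirement` in its nested pip list."""
    env = copy.deepcopy(conda_env)
    deps = env.setdefault("dependencies", [])
    for dep in deps:
        if isinstance(dep, dict) and "pip" in dep:
            dep["pip"].append(requirement)
            return env
    # No nested pip section yet: create one.
    deps.append({"pip": [requirement]})
    return env


env = {"dependencies": ["python=3.8", "pip", {"pip": ["requests"]}]}
# The URL is a placeholder, not a real wheel location.
req = ray_pip_requirement("1.9.2", "https://example.com/ray.whl", on_m1=True)
patched = inject_ray_into_conda(env, req)
```

Because the M1 branch pins an exact released version, it can only resolve once that version exists on PyPI, which is the limitation noted in point (2).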
This adds a test for potential resource deadlocks in experiments with heterogeneous PGFs. If the PGF of a later trial becomes ready before that of a previous trial, we could run into a deadlock. This is currently avoided, but untested, flagging the code path for removal in #21387.
After this change, in GCS bootstrapping mode, Redis is no longer started and `address` is treated as the GCS address of the Ray cluster.
Co-authored-by: Yi Cheng <chengyidna@gmail.com>
Co-authored-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
We use `trial.checkpoint` to restore a perturbed trial. Currently, `trial.checkpoint` looks at both in-memory and persistent checkpoints to find the most recent one, where "most recent" is defined by iteration. This may no longer be a valid assumption in the PBT case, since `trial_low_quantile` may have an iter=2 persistent checkpoint as well as an iter=1 in-memory checkpoint (perturbed from `trial_upper_quantile`).
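A toy illustration of the problem, assuming a simplified `Checkpoint` record and a hypothetical `most_recent_by_iteration` helper (neither is Tune's real data structure or API):

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    storage: str       # "memory" or "persistent"
    iteration: int     # training iteration at which the checkpoint was taken
    source_trial: str  # trial whose state the checkpoint contains


def most_recent_by_iteration(checkpoints):
    """Pick the checkpoint with the highest iteration (the assumption in question)."""
    return max(checkpoints, key=lambda c: c.iteration)


# trial_low_quantile saved its own state at iteration 2 ...
own = Checkpoint("persistent", 2, "trial_low_quantile")
# ... and was then handed an iteration-1 checkpoint from the better trial.
exploited = Checkpoint("memory", 1, "trial_upper_quantile")

picked = most_recent_by_iteration([own, exploited])
```

Here `picked` is the trial's own iter=2 checkpoint, even though the in-memory checkpoint from `trial_upper_quantile` is the one the perturbed trial should restore from.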
This PR refactors several components to support switching to GCS address bootstrapping later:
- Treat address from `ray.init()` and `ray` CLI as bootstrap address instead of assuming it is Redis address.
- Ray client servers support `--address` flag instead of `--redis-address`.
- A few other miscellaneous cleanups.
Also, add a test for starting a non-head node with `ray start`.
Inheriting from `abc.ABC` is more readable than setting the metaclass to `abc.ABCMeta`.
Relevant snippet from the Python 3.4 release notes:
> New class ABC has ABCMeta as its meta class. Using ABC as a base class has essentially the same effect as specifying metaclass=abc.ABCMeta, but is simpler to type and easier to read. (Contributed by Bruno Dupuis in bpo-16049.)
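For comparison, a minimal sketch of the two spellings side by side. The class names `ShapeWithMetaclass`, `Shape`, and `Square` are made up for the example.

```python
import abc


class ShapeWithMetaclass(metaclass=abc.ABCMeta):
    """Older spelling: set the metaclass explicitly."""

    @abc.abstractmethod
    def area(self):
        ...


class Shape(abc.ABC):
    """Equivalent spelling: inherit from abc.ABC."""

    @abc.abstractmethod
    def area(self):
        ...


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side ** 2
```

Both base classes end up with `abc.ABCMeta` as their metaclass, and neither can be instantiated until `area` is implemented.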
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Matthew Deng <matthew.j.deng@gmail.com>
If we use `os.environ` to set environment variables in tests, then our tests become coupled. By using `monkeypatch`, we can safely set environment variables while ensuring our tests remain decoupled.
For more information, see the [monkeypatching documentation](https://docs.pytest.org/en/6.2.x/monkeypatch.html#monkeypatching-environment-variables).
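A minimal sketch of the pattern, meant to be run under pytest. The helper `read_log_level` and the variable name `MY_APP_LOG_LEVEL` are hypothetical and exist only for the example; `monkeypatch` is pytest's built-in fixture.

```python
import os


def read_log_level():
    # Hypothetical code under test: reads a setting from the environment.
    return os.environ.get("MY_APP_LOG_LEVEL", "INFO")


def test_log_level_override(monkeypatch):
    # The change is undone automatically when this test finishes.
    monkeypatch.setenv("MY_APP_LOG_LEVEL", "DEBUG")
    assert read_log_level() == "DEBUG"


def test_log_level_default(monkeypatch):
    # raising=False avoids an error if the variable is not set.
    monkeypatch.delenv("MY_APP_LOG_LEVEL", raising=False)
    assert read_log_level() == "INFO"
```

Had the first test assigned to `os.environ` directly without cleaning up, the second test's result would depend on execution order.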
Expands the `to_torch` method for Datasets with:
* An ability to choose to output a list/dict of feature tensors instead of just one (through setting `feature_columns` to be a list of lists or a dict of lists)
* An ability to choose whether the label should be unsqueezed or not
* An ability to pass `None` as the label (for prediction).
Furthermore, this changes how the `feature_column_dtypes` argument works. Previously, it took a list of dtypes, one for each feature. However, as the tensor was concatenated in the end, only one dtype mattered (the biggest one). Now, this argument expects a single dtype, which is applied to the features tensor (or a list/dict of dtypes if `feature_columns` is a list of lists or a dict of lists).
Unit tests for all cases are included.
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
When a list with mixed types is passed to `tune.choice`, its elements are coerced to a single dtype during sampling (because `numpy.random.choice` converts the list to an array internally). This behaviour is unintentional and surprising. This PR fixes this issue.
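The coercion can be reproduced with NumPy alone. The sketch below also shows one common way to avoid it, namely sampling an index rather than an element; this is an assumption about a reasonable approach, not a description of how this PR implements the fix.

```python
import numpy as np

categories = [1, 2.5, "three"]

# np.random.choice first converts the list to an array, which forces a
# single dtype (a string dtype here), so even 1 and 2.5 come back as strings.
coerced = np.random.choice(categories)

# Sampling an index and looking it up in the original list keeps each
# element's original Python type.
rng = np.random.default_rng(0)
preserved = categories[rng.integers(len(categories))]
```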
This PR contains most of the fixes @iycheng made in #21232 to make tests pass with GCS bootstrapping, by supporting both Redis and GCS addresses as the bootstrap address. The main change is to use `address_info["address"]` to obtain the bootstrap address to pass to `ray.init()`, instead of using `address_info["redis_address"]`. In a subsequent PR, `address_info["address"]` will return the Redis or GCS address depending on whether GCS is used to bootstrap.
This PR introduces a canonical implementation for stopping trials by collecting logic scattered across `process_trial_result` back into `stop_trial`. This way, we know what is expected (e.g. which callbacks are invoked and when).
This PR also corrects the current faulty logic in which the `on_trial_complete` callback is invoked before `on_trial_checkpoint`, which is the source of the Syncer cleanup issues.
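The intended ordering can be illustrated with a toy runner. `RecordingCallback` and `ToyRunner` are hypothetical stand-ins written for this sketch and do not mirror Tune's real classes or signatures; only the callback names `on_trial_checkpoint` and `on_trial_complete` come from the description above.

```python
class RecordingCallback:
    """Records the order in which callback hooks fire."""

    def __init__(self):
        self.events = []

    def on_trial_checkpoint(self, trial):
        self.events.append(("checkpoint", trial))

    def on_trial_complete(self, trial):
        self.events.append(("complete", trial))


class ToyRunner:
    """Keeps all stop logic in one method so the callback order is explicit."""

    def __init__(self, callbacks):
        self.callbacks = callbacks

    def stop_trial(self, trial, has_pending_checkpoint):
        # Handle any outstanding checkpoint first ...
        if has_pending_checkpoint:
            for cb in self.callbacks:
                cb.on_trial_checkpoint(trial)
        # ... and only then report completion, so cleanup runs last.
        for cb in self.callbacks:
            cb.on_trial_complete(trial)


cb = RecordingCallback()
ToyRunner([cb]).stop_trial("trial_1", has_pending_checkpoint=True)
```

After the call, `cb.events` lists the checkpoint event before the completion event.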
This PR passes the GCS address to the worker and also updates the pubsub unit test.
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
This is part of Redis removal. This PR enables the global accessor to start from GCS.
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
This is part of Redis removal. This PR enables the log monitor and monitor to bootstrap from GCS.
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
The `run_dir` argument in `ray.train.backend.BackendExecutor.start_training` isn't used, but it causes the following error: if your host computer and job cluster use different operating systems, you get a pathlib error because, for example, you can't instantiate a `pathlib.WindowsPath` on a Linux system.
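The underlying pathlib restriction can be reproduced locally. The sketch below is only an illustration; the helper `can_instantiate` is a made-up name, and the example picks whichever concrete path class does not belong to the current OS.

```python
import os
import pathlib

# Pick the concrete path class that does NOT match the current OS.
foreign_cls = pathlib.PosixPath if os.name == "nt" else pathlib.WindowsPath


def can_instantiate(path_cls, value="some/dir"):
    """Return True if `path_cls` can be constructed on this system."""
    try:
        path_cls(value)
    except NotImplementedError:
        # pathlib refuses to build concrete paths for another OS.
        return False
    return True


native_ok = can_instantiate(pathlib.Path)
foreign_ok = can_instantiate(foreign_cls)

# Pure paths do no I/O and can be created for any flavour on any system.
pure = pathlib.PureWindowsPath("C:/runs/exp1")
```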
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
This is part of the GCS HA project. This PR bootstraps the dashboard with the GCS address instead of Redis.
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>
This is part of #21129
This PR tries to cover the cpp/ray part of the bootstrap. Some updates there:
- Remove the unused function/tests.
- Some API updates.
Co-authored-by: mwtian <81660174+mwtian@users.noreply.github.com>