These tests pass without issues on my Windows machine, so I am unskipping them to check on CI.
I will push the linting changes separately so the test suite runs twice, confirming that the flakiness is gone.
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
The reason for not using `queue.Queue` for multiprocessing purposes on Windows is given at https://stackoverflow.com/a/37244276 and in the second reply to https://stackoverflow.com/a/37245300
And the reason for using `multiprocessing.JoinableQueue` over `multiprocessing.Queue` is https://stackoverflow.com/a/30725121
AFAIK, this is because on Windows each process gets its own copy of the `Queue`, so nothing is shared among those processes. When `multiprocessing.Queue` is used, changes to it are shared internally via pipes, with proper locking.
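As a minimal sketch of the pattern (not the code in this PR), `multiprocessing.JoinableQueue` lets the parent process block until a worker process has acknowledged every item, which is what makes cross-process coordination reliable under Windows' spawn-based multiprocessing:

```python
import multiprocessing as mp

def worker(queue):
    # Consume items until the parent process exits (daemon=True below).
    while True:
        item = queue.get()
        try:
            print(f"processed {item}")
        finally:
            queue.task_done()  # signal that this item is fully handled

if __name__ == "__main__":  # required on Windows (spawn start method)
    queue = mp.JoinableQueue()
    proc = mp.Process(target=worker, args=(queue,), daemon=True)
    proc.start()
    for i in range(5):
        queue.put(i)
    queue.join()  # blocks until task_done() has been called for every item
```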
By default, an event loop is only set in the main thread. This makes workflows fail when they are invoked from a thread other than the main thread, which can happen when they are called from a library (for example, Ray Serve).
This PR fixes it.
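A minimal sketch of the general workaround, not necessarily the exact change in this PR: create and set a new event loop when the current (non-main) thread does not have one.

```python
import asyncio
import threading

def get_or_create_event_loop() -> asyncio.AbstractEventLoop:
    """Return this thread's event loop, creating and registering one if needed."""
    try:
        return asyncio.get_event_loop()
    except RuntimeError:
        # Non-main threads have no event loop set by default.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        return loop

def run_in_thread():
    loop = get_or_create_event_loop()
    print(loop.run_until_complete(asyncio.sleep(0, result="ok")))

threading.Thread(target=run_in_thread).start()
```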
The [original PR](https://github.com/ray-project/ray/pull/21864) was [reverted](https://github.com/ray-project/ray/pull/22117) because it caused `torch` (more specifically, `torch>=1.8.1`) to be required to use `ray.train`.
```
| File "ray_sgd_training.py", line 18, in <module>
| from ray import train
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/__init__.py", line 2, in <module>
| from ray.train.callbacks import TrainingCallback
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/__init__.py", line 8, in <module>
| from ray.train.callbacks.profile import TorchTensorboardProfilerCallback
| File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/train/callbacks/profile.py", line 6, in <module>
| from torch.profiler import profile
| ModuleNotFoundError: No module named 'torch.profiler'
```
A [minimal installation test suite](https://github.com/ray-project/ray/pull/22300) was added to detect this. Further, in this PR we make the following changes:
1. Move `TorchWorkerProfiler` to `ray.train.torch` so all torch imports are centralized.
2. Add import validation logic to `TorchWorkerProfiler.__init__` so an exception will only be raised if the user tries to initialize a `TorchWorkerProfiler` without having a valid version of `torch` installed:
```
>>> import ray
>>> import ray.train
>>> import ray.train.torch
>>> from ray.train.torch import TorchWorkerProfiler
>>> twp = TorchWorkerProfiler()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/matt/workspace/ray/python/ray/train/torch.py", line 365, in __init__
"Torch Profiler requires torch>=1.8.1. "
ImportError: Torch Profiler requires torch>=1.8.1. Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler.
```
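The validation itself is the usual lazy-import pattern; a simplified sketch (not the exact code in `ray/train/torch.py`):

```python
class TorchWorkerProfiler:
    def __init__(self, trace_dir=None):
        # Import inside __init__ so that `import ray.train.torch` itself
        # never requires torch.profiler to be installed.
        try:
            from torch.profiler import profile  # noqa: F401
        except ImportError:
            raise ImportError(
                "Torch Profiler requires torch>=1.8.1. "
                "Run `pip install 'torch>=1.8.1'` to use TorchWorkerProfiler."
            )
        self.trace_dir = trace_dir
```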
This PR refactors Dataset execution to enable lazy mode in the future, which can reduce memory usage in large-scale ingest pipelines. There should be no behavior changes in this PR. Many of the optimizations are also punted for future work.
We shouldn't promote Runtime Environments as the only way to do things until all Core nightly and release tests are run using runtime environments.
This PR adds the prior approach (using cluster launcher commands) to the doc on equal footing, describing the differences between the two.
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
This PR properly exposes `TableRow` as a public API (API docs + the "Public" tag), since it's already exposed to the user in our row-based ops. In addition, the following changes are made:
1. During row-based ops, we also choose a batch format that lines up with the current dataset format in order to eliminate unnecessary copies and type conversions.
2. `TableRow` now derives from `collections.abc.Mapping`, which lets `TableRow` better interop with code expecting a mapping, and includes a few helpful mixins so we only have to implement `__getitem__`, `__iter__`, and `__len__`.
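For reference, the `Mapping` contract only requires those three methods; the mixins then provide `keys()`, `values()`, `items()`, `get()`, `__contains__`, and `__eq__` for free. A minimal sketch of the idea (not the actual `TableRow` implementation):

```python
from collections.abc import Mapping

class Row(Mapping):
    """Read-only, dict-like view over a single table row."""

    def __init__(self, columns):
        self._columns = columns  # e.g. {"a": 1, "b": 2}

    def __getitem__(self, key):
        return self._columns[key]

    def __iter__(self):
        return iter(self._columns)

    def __len__(self):
        return len(self._columns)

row = Row({"a": 1, "b": 2})
assert row["a"] == 1 and "b" in row and dict(row) == {"a": 1, "b": 2}
```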
This PR fixes our {NumPy, Pandas} <--> Arrow interop for boolean tensor columns. NumPy and Pandas represent boolean arrays with a byte per boolean, while Arrow bit-packs booleans with 8 booleans per byte. Previously, when casting NumPy arrays to tensor columns, we were interpreting NumPy's boolean array buffers as being bit-packed when they were not. This PR completes support by packing and unpacking bits for boolean arrays when creating a boolean tensor column from an ndarray and when creating an ndarray from a boolean tensor column, respectively.
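As a rough sketch of the conversion (the real logic lives in the tensor extension code), `np.packbits`/`np.unpackbits` with LSB-first bit order do the packing and unpacking:

```python
import numpy as np

bools = np.array([True, False, True, True, False, True, False, False, True])

# NumPy/Pandas layout: one byte per boolean.
assert bools.dtype == np.bool_ and bools.nbytes == 9

# Arrow-style layout: bit-packed, LSB-first, so 9 booleans fit in 2 bytes.
packed = np.packbits(bools, bitorder="little")
assert packed.nbytes == 2

# Going back requires remembering the original length, since unpackbits
# always yields a multiple of 8 values.
unpacked = np.unpackbits(packed, bitorder="little")[: len(bools)].astype(bool)
assert np.array_equal(unpacked, bools)
```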
To help resolve the issues users are facing when running Lightning examples with Ray Tune: PyTorchLightning/pytorch-lightning#10407
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Updates to address @worldveil's feedback:
- Include the `train.torch` import in the docs.
- Allow methods in `session.py` to be called outside of a session, returning sensible defaults instead of raising an error.
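A minimal sketch of the "sensible default" behavior, using hypothetical helper names rather than the actual `session.py` functions:

```python
# Hypothetical sketch of the "default instead of raise" pattern.
_session = None  # set while a training session is active


def world_rank() -> int:
    """Rank of the current worker, or 0 when called outside a session."""
    if _session is None:
        return 0
    return _session.world_rank


def report(**metrics) -> None:
    """Report metrics; a no-op when called outside a session."""
    if _session is None:
        return
    _session.report(**metrics)
```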
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
`ray.data.read_text()` currently doesn't take care of empty lines; this PR adds a flag to enable an empty-line filter.
With this change, `read_text` only returns non-empty lines by default, unless `drop_empty_line` is set to `False`.
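Expected usage, assuming the flag keeps the name used above and using a hypothetical input path:

```python
import ray

# Default: blank lines are filtered out.
ds = ray.data.read_text("s3://bucket/input.txt")

# Opt out of the filter to preserve empty lines.
ds_with_blanks = ray.data.read_text("s3://bucket/input.txt", drop_empty_line=False)
```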
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jialin Liu <jialin.liu@bytedance.com>
Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path.
**Example:** If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.
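One way to check this, as a hedged sketch (not necessarily how the new suite is written): import `ray.train` in a fresh interpreter and assert that no backend module was pulled in as a side effect.

```python
import subprocess
import sys

# Run in a fresh interpreter so previously imported modules can't mask a regression.
check = (
    "import sys; import ray.train; "
    "assert 'torch' not in sys.modules, 'ray.train should not import torch'"
)
subprocess.run([sys.executable, "-c", check], check=True)
```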
Adds a test to make sure a failed job runtime env creation doesn't hang the cluster (i.e. tasks can still be scheduled on the job, as long as the tasks' runtime env can be created). Test requested by @rkooo567, good idea!
Closes https://github.com/ray-project/ray/issues/22265
This was caused by implicitly inferring the namespace from within the HTTP proxy when calling `get_handle`. This makes me think we really need to simplify the namespace handling logic.
Currently, Serve deployments must store their class or function definitions in the `Deployment` object's `func_or_class` attribute. However, the declarative API must be able to initiate deployments using only their import path. This allows users to separately define their functions or classes, and pull these functions and classes into their clusters via [remote URIs](https://docs.ray.io/en/releases-1.9.2/handling-dependencies.html#remote-uris). With this change, `Deployment` objects can store an import path string as their `func_or_class`. This import path is then used to import the deployment's code definition when the `Deployment`'s replica is created.
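Under the hood this is a standard dotted-path import; a simplified sketch of resolving such an import path (names here are illustrative, not the actual `Deployment` internals):

```python
import importlib

def import_from_path(import_path: str):
    """Resolve a string like 'my_module.submodule.MyDeployment' to the object."""
    module_name, _, attr_name = import_path.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)

# e.g. deployment_def = import_from_path("my_project.models.MyModel")
```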
When logs are not intended for the current driver, skip the warning about too many logs being generated, and clear the log-rate counters.
Ideally the log subscriber should only subscribe to logs from the current job plus system logs, but that change carries risk and we can do it in another PR.
Code formatting is disabled in several modules with the explanation
> [The module] ignores yapf because yapf doesn't allow comments right after code blocks,
but we put comments right after code blocks to prevent large white spaces
in the documentation.
Since we no longer use YAPF, it may be possible to re-enable code formatting on
these modules. I've added "FIXME" comments requesting developers to check
whether code formatter appeasements are still necessary.
For consistency and safety, we fix the port to an explicit 6379 in all default and example configs for Ray on K8s.
Documentation is updated to recommend matching Ray versions in operator and Ray cluster.
After this PR (https://github.com/ray-project/ray/pull/22156), for some reason the driver script contains strings that cannot be encoded as ASCII. It seems that using UTF-8 solves the problem.
Ensures that we don't log a warning message about an infeasible resource demand when that custom resource is a placement group (since the placement group resource likely just needs to be created by the raylet).
Co-authored-by: Alex Wu <alex@anyscale.com>
For public SDK APIs, change the import path from
```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```
to
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```
`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
## Diff Summary
The current implementation of `DAGNode` pre-binds inputs, and the signature of `def execute(self)` doesn't take user input yet. This PR extends the interface to take user input and marks DAG entrypoint methods as the first stop for all user requests in a DAG. This is needed to unblock the next step of the Serve pipeline implementation: serving user requests.
Closes #22196, #22197
Notable changes:
- Added a `DAG_ENTRY_POINT` flag in the Ray DAG API to annotate DAG entrypoint functions (functions or class methods only). All marked functions receive identical user input as the first layer of the DAG.
- Changed the implementations of `ClassNode` and `FunctionNode` accordingly, so execution differs depending on whether a node is marked as an entrypoint.
- Added a `kwargs_to_resolve` kwarg to the `DAGNode` interface to handle args that subclasses need in order to resolve their implementation, without exposing all of these terms at the parent class level.
  - This is particularly important for `ClassMethodNode` binding, so we can track the method name, the parent `ClassNode`, and the previous class method call.
- Changed the implementation of `_copy()` to handle execution of `kwargs_to_resolve`.
- Changed the implementation of `_apply_and_replace_all_child_nodes()` to fetch `DAGNode` types in `kwargs_to_resolve`.
- Added pretty-printed lines for `kwargs_to_resolve`.
If we run multiple jobs in the same process (this is basically the behavior of Python tests), they should be isolated in the sense that the system config for job 1 shouldn't affect the config for job 2.
```python
ray.init(_system_config={})
# job 1
ray.shutdown()
ray.init(_system_config={})
# job 2
ray.shutdown()
```
Currently this is not the case, since `RayConfig` is a static variable and is shared across drivers in the same process. This PR resets the configs to their default values before applying the job-specific `_system_config`.
Note: this is a backward-incompatible change if a user depends on the current behavior, but I'm not aware of such a case.
The Staroid node provider has been abandoned and unmaintained for quite some time. Because there are no active maintainers, the original contributors cannot be reached, and there is no clear interest, we are no longer officially endorsing or supporting this node provider.
Co-authored-by: Alex Wu <alex@anyscale.com>
This is a down-scoped change. For the full overview of the Tune control loop, see [`Tune control loop refactoring`](https://docs.google.com/document/d/1RDsW7SVzwMPZfA0WLOPA4YTqbRyXIHGYmBenJk33HaE/edit#heading=h.2za3bbxbs5gn)
1. Previously there were separate waits on PG readiness and on other events. As a result, there were quite a few timing tweaks that are inefficient, hard to understand, and hard to unit test. This PR consolidates them into a single wait handled by TrialRunner in each step.
- A few event types are introduced; their mapping onto scenarios:
* PG_READY --> A trial should be placed onto it. If somehow there is no trial to place there, the PG will be put in `_ready` momentarily. This is because resources have historically been conceptualized as a pull-based model.
* NO_RUNNING_TRIALS_TIME_OUT --> possibly an insufficient-resources case
* TRAINING_RESULT
* SAVING_RESULT
* RESTORING_RESULT
* YIELD --> This just means training is simply taking a very long time. We need to punt back to the main loop to print out status info, etc.
2. Previously, TrialCleanup was not very efficient and could race between `Trainable.stop()` and `return_placement_group`. This PR streamlines the trial cleanup process by explicitly letting `Trainable.stop()` finish before `return_placement_group(pg)` is called. Note that graceful shutdown is still needed in cases like `pause_trial`, where checkpointing to memory needs time to happen before the actor is gone.
3. Quite a few environment variables (timing tweaks) are removed, which I consider OK to do without a deprecation cycle.
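Conceptually, the consolidated wait looks roughly like the sketch below; the names are illustrative and not the actual TrialRunner/executor code.

```python
from enum import Enum, auto


class ExecutorEventType(Enum):
    # Illustrative mirror of the event types listed above.
    PG_READY = auto()
    NO_RUNNING_TRIALS_TIMEOUT = auto()
    TRAINING_RESULT = auto()
    SAVING_RESULT = auto()
    RESTORING_RESULT = auto()
    YIELD = auto()


def step(runner):
    """One TrialRunner step: a single wait, then dispatch on the event type."""
    event = runner.executor.get_next_event()  # hypothetical single wait
    if event.type == ExecutorEventType.PG_READY:
        runner.place_pending_trial(event.placement_group)
    elif event.type == ExecutorEventType.YIELD:
        pass  # fall through to the main loop to print status info
    else:
        runner.process_trial_result(event)
```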