S3 reads sometimes fail (#22214). The file does exist in S3, so this is most likely a transient error. This PR adds a retry mechanism to work around the issue.
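A minimal sketch of the kind of retry wrapper described above, assuming exponential backoff with a little jitter. The helper name `retry_call` and its parameters (`max_attempts`, `base_delay_s`, `retry_on`, `sleep`) are hypothetical and are not taken from the PR itself.

```python
import random
import time


def retry_call(fn, max_attempts=3, base_delay_s=0.5, retry_on=(OSError,), sleep=time.sleep):
    """Call fn(), retrying on the given exception types with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Wait 1x, 2x, 4x, ... the base delay, plus up to 10% random jitter.
            delay = base_delay_s * 2 ** (attempt - 1)
            sleep(delay * (1 + 0.1 * random.random()))
```

The `sleep` parameter is injectable only so that the wrapper can be exercised in tests without real waiting.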
Helps resolve the issues users are facing when running Lightning examples with Ray Tune: PyTorchLightning/pytorch-lightning#10407
Co-authored-by: Amog Kamsetty <amogkamsetty@yahoo.com>
Updates to address @worldveil's feedback:
Include import train.torch in the docs
Allow methods in session.py to be called outside of the session with sensible defaults. These will no longer raise an error.
Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
`ray.data.read_text()` currently doesn't handle empty lines. This PR adds a flag that enables an empty-line filter.
With this change, `read_text` returns only non-empty lines by default; set `drop_empty_line` to `False` to keep them.
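The filtering behavior can be illustrated without Ray. The sketch below uses a hypothetical standalone helper, `split_lines`, and assumes that whitespace-only lines also count as empty; the actual `read_text` implementation may differ.

```python
def split_lines(text, drop_empty_line=True):
    """Split text into lines, optionally dropping empty (or whitespace-only) ones."""
    lines = text.splitlines()
    if drop_empty_line:
        # Assumption: a line containing only whitespace is treated as empty.
        lines = [line for line in lines if line.strip() != ""]
    return lines


kept = split_lines("a\n\nb\n")                          # empty line filtered out
raw = split_lines("a\n\nb\n", drop_empty_line=False)    # empty line preserved
```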
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Jialin Liu <jialin.liu@bytedance.com>
Adding a minimal test suite to catch any regressions from accidentally adding backend imports (e.g. `torch`, `tensorflow`, `horovod`) to the main import path.
**Example:** If I'm running Ray Train with `tensorflow`, I should not be required to have `torch` installed.
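One way such a regression check could look, as a sketch only: import the module in a fresh interpreter and inspect `sys.modules` for backend packages. The helper name `leaked_imports` is hypothetical, and the standard-library module `json` stands in for the real import path so the example runs without Ray installed.

```python
import subprocess
import sys


def leaked_imports(module, forbidden):
    """Import `module` in a fresh interpreter; return the forbidden packages it loaded."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({module!r})\n"
        f"leaked = sorted(set({sorted(forbidden)!r}) & set(sys.modules))\n"
        "print(','.join(leaked))\n"
    )
    out = subprocess.run(
        [sys.executable, "-c", code], check=True, capture_output=True, text=True
    ).stdout.strip()
    return out.split(",") if out else []


# `json` is only a stand-in for the package under test.
backends = ["torch", "tensorflow", "horovod"]
result = leaked_imports("json", backends)
```

Running the import in a subprocess matters: in the test process itself, another test may already have imported a backend, which would make the check fail for the wrong reason.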
This PR adds a comment to build_pipeline.py reminding anyone who makes changes to the test suites to also update the release process doc if necessary.
This is an action item from the Ray 1.10.0 release retrospective.
To use Jobs on a remote cluster, you need to set up port forwarding. When using the cluster launcher, the `ray dashboard` command provides this automatically. This PR adds a how-to to the docs for this feature.
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Adds a test to make sure a failed job runtime env creation doesn't hang the cluster (i.e. tasks can still be scheduled on the job, as long as the tasks' runtime env can be created). Test requested by @rkooo567, good idea!
After GCS restarts, metadata is loaded from Redis. Currently the Redis callback hands over the loaded data as `const &`, which requires a copy. Changing it to `&&` and using `std::move` reduces data copying.
Closes https://github.com/ray-project/ray/issues/22265
This was caused by implicitly inferring the namespace from within the HTTP proxy when calling `get_handle`. This makes me think we really need to simplify the namespace handling logic.
Currently, Serve deployments must store their class or function definitions in the `Deployment` object's `func_or_class` attribute. However, the declarative API must be able to initiate deployments using only their import path. This allows users to separately define their functions or classes, and pull these functions and classes into their clusters via [remote URIs](https://docs.ray.io/en/releases-1.9.2/handling-dependencies.html#remote-uris). With this change, `Deployment` objects can store an import path string as their `func_or_class`. This import path is then used to import the deployment's code definition when the `Deployment`'s replica is created.
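Resolving an import path string into a function or class can be sketched with `importlib`. The helper name `import_attr` is an assumption for illustration; it is not necessarily what Serve uses internally.

```python
import importlib


def import_attr(import_path):
    """Resolve a dotted path such as 'package.module.Name' to the named object."""
    module_name, _, attr_name = import_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'module.attribute', got {import_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Stand-in example using the standard library instead of user deployment code.
dumps = import_attr("json.dumps")
```

Deferring the import until replica creation means the process that defines the deployment only needs the string, not the code itself.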
When logs are not intended for the current driver, skip the warning about too many logs being generated and clear the log rate counters.
Ideally the log subscriber would only subscribe to logs from the current job and to system logs, but that change is risky and can be done in another PR.
Code formatting is disabled in several modules with the explanation
> [The module] ignores yapf because yapf doesn't allow comments right after code blocks,
but we put comments right after code blocks to prevent large white spaces
in the documentation.
Since we no longer use YAPF, it may be possible to re-enable code formatting on
these modules. I've added "FIXME" comments requesting developers to check
whether code formatter appeasements are still necessary.
For consistency and safety, we fix an explicit 6379 port for all default and example configs for Ray on K8s.
Documentation is updated to recommend matching Ray versions in operator and Ray cluster.
After this PR (https://github.com/ray-project/ray/pull/22156), for some reason the driver script contains a string that cannot be encoded as ASCII. Using UTF-8 seems to solve the problem.
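The failure mode is easy to reproduce in isolation. In this sketch, `driver_script` is a made-up script body and `can_encode` is a hypothetical helper; neither comes from the PR.

```python
# Hypothetical driver script containing non-ASCII characters.
driver_script = 'print("naïve ✓")\n'


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ascii_ok = can_encode(driver_script, "ascii")   # False: 'ï' and '✓' are not ASCII
utf8_ok = can_encode(driver_script, "utf-8")    # True: UTF-8 covers all of Unicode
```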
Ensures that we don't log a warning message about an infeasible resource demand when that custom resource is a placement group (since the placement group resource likely just needs to be created by the raylet).
Co-authored-by: Alex Wu <alex@anyscale.com>
Previously it wasn't obvious which working_dir option was recommended, and the size limit for local working_dir didn't appear on the Jobs page. (The user would have had to go to the runtime_env API reference to see the size limit.) This PR makes this information more prominent.
For public SDK APIs, change the import path from
```python
from ray.dashboard.modules.job.common import JobStatus, JobStatusInfo
from ray.dashboard.modules.job.sdk import JobSubmissionClient
```
to
```python
from ray.job_submission import JobStatus, JobSubmissionClient
```
`JobStatus`, `JobStatusInfo` and `JobSubmissionClient` were the only names referenced in the SDK doc so far, but we can add more later as they appear.
## Diff Summary
The current implementation of DAGNode pre-binds inputs, and the signature of `def execute(self)` doesn't take user input yet. This PR extends the interface to take user input and marks DAG entrypoint methods as the first stop for all user requests in a DAG. This is needed to unblock the next step of the Serve pipeline implementation, which is serving user requests.
Closes #22196, #22197
Notable changes:
- Added a `DAG_ENTRY_POINT` flag in the Ray DAG API to annotate DAG entrypoint functions (functions or class methods only). All marked functions receive identical input from the user as the first layer of the DAG.
- Changed implementations of ClassNode and FunctionNode accordingly to handle different execution for a node marked as entrypoint or not.
- Added a `kwargs_to_resolve` kwarg to the `DAGNode` interface to handle args that subclasses need in order to resolve their implementation, without exposing all terms at the parent class level.
- This is particularly important for ClassMethodNode binding, so we can have implementations to track method name, parent ClassNode as well as previous class method call without existing
- Changed implementation of `_copy()` to handle execution of `kwargs_to_resolve`.
- Changed implementation of `_apply_and_replace_all_child_nodes()` to fetch DAGNode type in `kwargs_to_resolve`.
- Added pretty printed lines for `kwargs_to_resolve`
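
The entrypoint idea can be illustrated with a toy graph. This is a simplified sketch and not the real Ray DAG API: `ToyNode` and `is_entry_point` are hypothetical stand-ins for `DAGNode` and the `DAG_ENTRY_POINT` flag, and `kwargs_to_resolve` is not modeled at all.

```python
class ToyNode:
    """Toy stand-in for a DAG node with pre-bound args (not the real DAGNode)."""

    def __init__(self, fn, args=(), is_entry_point=False):
        self.fn = fn
        self.args = args
        self.is_entry_point = is_entry_point

    def execute(self, user_input):
        # Resolve child nodes first, passing the same user input down the graph.
        resolved = [
            arg.execute(user_input) if isinstance(arg, ToyNode) else arg
            for arg in self.args
        ]
        if self.is_entry_point:
            # Entry points receive the user input ahead of their bound args.
            return self.fn(user_input, *resolved)
        return self.fn(*resolved)


# Both entry points see the identical user input; the downstream node does not.
plus_one = ToyNode(lambda x: x + 1, is_entry_point=True)
times_two = ToyNode(lambda x: x * 2, is_entry_point=True)
combine = ToyNode(lambda left, right: left + right, args=(plus_one, times_two))
total = combine.execute(3)  # (3 + 1) + (3 * 2)
```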