Adds a streaming based reading option for Snappy-compressed files. Arrow doesn't support streaming Snappy decompression since the canonical C++ Snappy library doesn't natively support streaming decompression. This PR works around this by doing streaming reads of snappy-compressed files using the streaming decompression API provided in the [python-snappy](https://github.com/andrix/python-snappy) package.
This commit supplies a custom datasource that uses Arrow + [python-snappy](https://github.com/andrix/python-snappy) to read and decompress Snappy-compressed files.
Co-authored-by: siddharth.goel <siddharth.goel@bytedance.com>
Co-authored-by: Chen Shen <scv119@gmail.com>
This PR adds a feature that allows user to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html).
These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.
`Application` stores a group of deployments and can write them to a YAML config. However, this requires the deployments to use import paths as their `func_or_class`. This change make all deployments in an `Application` store only import paths as the `func_or_class`.
This change also adds a utility function to get a deployment's import path. This utility function is used in the DeploymentNode for Pipelines.
Interface for DataParallelTrainer and updates to ScalingConfig definition.
Depends on #22986
Co-authored-by: Eric Liang <ekhliang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
the include of content for md files like our central getting started page didn't render. fixed here.
Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>
Using Ray on SLURM system is documented but missing some pitfalls about network. This PR adds some information about port binding and address binding (I will open a feature request with more and link it here later).
I did not put any real recommendation on this last point since `--address` did not work. I had cannot resolve issue after setting an internal IP although it's reachable.
Currently, when a spill/restore worker fails and the state of it in the worker pool is idle, the worker pool will not clean up the metadata of the worker. Subsequent spill/restore requests will reuse this dead worker and RPC requests cannot succeed. This results in broken object spilling functionality.
This PR addresses the issue by removing disconnected IO workers from `registered_io_workers` and `idle_io_workers`.
`PATH` is easy to be changed in a terminal session. Different `$PATH` values lead to miss of bazel cache. e.g. `pip install python -e` and `bazel build //:all` don't share cache because Python modifies `PATH`.
`LC_ALL`, `LANG`, and python-related environment variables are only used by C++ worker tests, which invokes the `ray start` command when running tests with `bazel test`. Java worker is not affected because we don't use `bazel test` to run Java tests. So these env variables should stay `test_env`, not `action_env`.
This PR can greatly improve the cache hit rate of Bazel build and test.
Previously DatasetPipeline stages were executed by one actor each, which compromised fault tolerance through lineage reconstruction. This centralizes all task submission at the pipeline coordinator to improve fault tolerance. To preserve pipeline parallelism, the stages are executed by a threadpool. To clean up the threadpool, the pipeline coordinator adds any running threads to a global set that is checked by the threads during `ray.wait`.
Note that this will only provide fault tolerance for split pipes if all pipeline consumers stay alive. It will not work if one of the consumers dies and restarts because next_dataset_if_ready is not idempotent.
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Adds the following interfaces (without implementation, for discussion / approval):
- `serve.Application`
- `serve.DeploymentNode`
- `serve.DeploymentMethodNode`, `serve.DAGHandle`, and `serve.drivers.PipelineDriver`
- `serve.run` & `serve.build`
In addition to these Python APIs, we will also support the following CLI commands:
- `serve run [--blocking=true] my_file:my_node_or_app # Uses Ray client, blocking by default.`
- `serve build my_file:my_node output_path.yaml`
- `serve deploy [--blocking=false] # Uses REST API, non-blocking by default.`
- `serve status [--watch=false] # Uses REST API, non-blocking by default.`
Adds an ability for users to specify a custom results post-processing function that will be applied to metrics before they are reported to Tune in XGBoost/LightGBM integration callbacks, allowing for support for xgb.cv/lgbm.cv. Updates example to show it in action and in CI.
Currently, all buildkite runs report per default. Instead, we only want to report when running scheduled builds or when specifically overriding this behavior.
Infra errors are tackled with concurrency groups. Thus we can disable old mitigation methods like automatic infra retry for now.
We keep the script as it does other logic (e.g. checkout local test branch) and infra retry can be enabled via env variable if needed.
This PR improves broken k8s tests.
Use exponential backoff on the unstable HTTP path (getting job status sometimes has broken connection from the server. I couldn't really find the relevant logs to figure out why this is happening, unfortunately).
Fix benchmark tests resource leak check. The existing one was broken because the job submission uses 0.001 node IP resource, which means the cluster_resources can never be the same as available resources. I fixed the issue by not checking node IP resources
K8s infra doesn't support instances < 8 CPUs. I used m5.2xlarge instead of xlarge. It will increase the cost a bit, but it wouldn't be very big.
Fixes potential error if function not found in azure sdk when deploying ray cluster on azure
Adds additional python package needed to deploy ray cluster on azure in docs
Co-authored-by: Scott Graham <scgraham@microsoft.com>