We often need to convert our internal `_TrackedCheckpoint` objects to `ray.air.Checkpoint`s; this conversion should live in a utility method on the `_TrackedCheckpoint` class.
This PR also fixes cases where the saved checkpoint futures haven't been resolved yet.
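A minimal sketch of what such a utility method could look like, assuming a hypothetical `_TrackedCheckpoint` that holds either a directory path, a dict, or an unresolved `ObjectRef` (field and method names are illustrative, not the actual internal API):

```python
import ray
from ray.air import Checkpoint


class _TrackedCheckpoint:
    """Illustrative stand-in for the internal tracked-checkpoint class."""

    def __init__(self, dir_or_data):
        self.dir_or_data = dir_or_data

    def to_air_checkpoint(self) -> Checkpoint:
        data = self.dir_or_data
        # Resolve futures first so callers never see an unresolved ObjectRef.
        if isinstance(data, ray.ObjectRef):
            data = ray.get(data)
        if isinstance(data, str):
            return Checkpoint.from_directory(data)
        return Checkpoint.from_dict(data)
```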
#25575 starts all Serve actors in the `"serve"` namespace. This change updates the Serve documentation to remove now-outdated explanations about namespaces and to specify that all Serve actors start in the `"serve"` namespace.
This is the first step toward enabling parallel tests in Ray Core CI. To reduce noise, this PR only adds the option and does not enable it. Parallel CI can be 40%-60% faster than running the tests one-by-one.
We'll enable it for the Buildkite (bk) jobs one-by-one.
Prototype here: #25612
When objects that are spilled or being spilled are freed, we queue a request to delete them. Previously we only flushed this queue when additional objects were freed, so GC for spilled objects could fall behind when there is a lot of concurrent spilling and deletion (as in Datasets push-based shuffle). This PR fixes this by clearing the deletion queue whenever the backlog exceeds the configured batch size. It also fixes some accounting bugs in how many bytes are counted as pinned vs. pending spill.
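The actual change is in the C++ object manager; the following Python sketch only illustrates the flushing policy, with made-up names and a made-up batch-size default:

```python
class SpilledObjectDeleter:
    """Queues delete requests for spilled objects and flushes them in batches."""

    def __init__(self, batch_size: int = 100):
        self.batch_size = batch_size
        self.queue = []

    def queue_delete(self, object_id: str) -> None:
        self.queue.append(object_id)
        # Previously the queue was only flushed when more objects were freed;
        # now we also flush as soon as the backlog exceeds the batch size.
        if len(self.queue) > self.batch_size:
            self.flush()

    def flush(self) -> None:
        batch, self.queue = self.queue, []
        delete_spilled_objects(batch)  # hypothetical call into the spill backend


def delete_spilled_objects(object_ids) -> None:
    print(f"deleting {len(object_ids)} spilled objects")
```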
## Related issue number
Closes #25467.
## Why are these changes needed?
We currently only record usage stats from drivers. This can lose some information when libraries are imported from workers (e.g., doing an RLlib import inside a trainable).
@jjyao just for future reference.
When GCS restarts, it recovers the placement groups and makes sure no resources are leaked. The protocol is now:
- GCS sends the committed PGs to the raylets.
- Each raylet checks whether any worker is using resources from a PG that is not in this set.
- If there is such a worker, the raylet kills it.
Right now there is a bug that also kills workers whose bundle index is -1 (i.e., not pinned to a specific bundle).
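A rough sketch of the intended check, with illustrative names; the key point is that a bundle index of -1 must not be treated as a leak as long as the group itself is still committed:

```python
def should_kill_worker(worker_pg_id, worker_bundle_index, committed_bundles) -> bool:
    """committed_bundles: mapping of pg_id -> set of committed bundle indexes."""
    if worker_pg_id is None:
        return False  # The worker doesn't use placement group resources at all.
    if worker_pg_id not in committed_bundles:
        return True  # The whole placement group is gone; the worker is leaking.
    if worker_bundle_index == -1:
        # Bug fix: -1 means "any bundle of this group", so the worker must not
        # be killed just because it isn't tied to one specific bundle index.
        return False
    return worker_bundle_index not in committed_bundles[worker_pg_id]
```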
Un-reverting https://github.com/ray-project/ray/pull/24934, which caused `test_cluster` to become flaky. This was due to an oversight: we need to update the `HTTPState` logic to account for the controller not necessarily running on the head node.
This will require using the new `SchedulingPolicy` API, but I'm not quite sure of the best way to do it. Context here: https://github.com/ray-project/ray/issues/25090.
Followup PR to https://github.com/ray-project/ray/pull/20273.
- Hides the cache logic behind a class.
- Adds a `name` field to the runtime env plugin class and makes the existing conda, pip, working_dir, and py_modules handlers inherit from the plugin class (see the sketch after this list).
Future work will unify the codepath for these "base plugins" with the codepath for third-party plugins; currently these are different, and URI support is missing for third-party plugins.
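A minimal sketch of the intended shape, with illustrative class and attribute names (not the actual plugin interface):

```python
class RuntimeEnvPlugin:
    """Base class that each runtime env handler inherits from."""

    # Each plugin declares the runtime_env key it handles.
    name: str = ""

    def create(self, uri: str, runtime_env: dict, context) -> float:
        """Set up the environment and return its size in bytes."""
        raise NotImplementedError


class CondaPlugin(RuntimeEnvPlugin):
    name = "conda"


class PipPlugin(RuntimeEnvPlugin):
    name = "pip"


class WorkingDirPlugin(RuntimeEnvPlugin):
    name = "working_dir"


class PyModulesPlugin(RuntimeEnvPlugin):
    name = "py_modules"
```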
This is a follow-up to the previous PR (GitHub did some funky things when I did a rebase, so I had to create a new one).
On Windows, the `exec_worker` method may fail when arguments that are file paths contain spaces. This PR addresses that issue.
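Not the actual patch, just an illustration of the idea: quote any argument that contains spaces before the command line is assembled (the module name below is hypothetical):

```python
import subprocess


def build_windows_command(args):
    """Join worker arguments into a single command string, quoting any
    argument (such as a file path) that contains spaces."""
    # subprocess.list2cmdline applies the standard Windows quoting rules.
    return subprocess.list2cmdline(args)


cmd = build_windows_command(
    [r"C:\Program Files\Python310\python.exe", "-m", "ray.worker"]
)
print(cmd)  # "C:\Program Files\Python310\python.exe" -m ray.worker
```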
Unfortunately, ray.data.read_parquet() doesn't work with multiple directories, since it uses Arrow's Dataset abstraction under the hood, which doesn't accept multiple directories as a source: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.dataset.html
This PR makes this clear in the docs and, as a drive-by, adds ray.data.read_parquet_bulk() to the API docs.
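For example (paths are illustrative):

```python
import ray

# Works: a single directory, or an explicit list of files.
ds = ray.data.read_parquet("s3://bucket/table_a")

# Does NOT work: multiple directories in one call.
# ds = ray.data.read_parquet(["s3://bucket/table_a", "s3://bucket/table_b"])

# read_parquet_bulk reads an explicit list of *files* instead, bypassing the
# Arrow Dataset abstraction.
ds = ray.data.read_parquet_bulk(
    ["s3://bucket/table_a/part-0.parquet", "s3://bucket/table_b/part-0.parquet"]
)
```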
Push-based shuffle has some extra metadata for the merge and reduce tasks. Previously we were serializing O(n) metadata (n = number of reduce tasks) and sending it to every task, which caused a lot of unnecessary Plasma usage on the head node. This PR splits the metadata into a part that can be kept on the driver and a relatively cheap part that is sent to all tasks.
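Conceptually, and with made-up names purely for illustration, the split looks something like this:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DriverSideShuffleMetadata:
    """Kept on the driver only; its size grows with the number of reduce tasks."""
    reduce_task_to_merge_blocks: Dict[int, List[int]]


@dataclass
class TaskSideShuffleMetadata:
    """Small, fixed-size piece sent along with each map/merge/reduce task."""
    num_merge_tasks: int
    num_reduce_tasks: int
    merge_task_index: int
```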
## Related issue number
One of the issues needed for #24480.
Adds a `_transform_arrow` method to `Preprocessor` that allows subclasses to implement logic for Arrow-based Datasets.
- If only `_transform_arrow` is implemented, the data will be converted to Arrow.
- If only `_transform_pandas` is implemented, the data will be converted to pandas.
- If both are implemented, the method matching the data's format is picked for best performance.
"Implemented" here means the method is overridden in a subclass.
This PR only changes the base `Preprocessor` class; implementations for subclasses will come in the future (a simplified dispatch sketch follows).
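The sketch below uses illustrative method and helper names; the real implementation differs, but the dispatch idea is:

```python
import pandas as pd
import pyarrow as pa


class Preprocessor:
    def _transform_pandas(self, df: pd.DataFrame) -> pd.DataFrame:
        raise NotImplementedError

    def _transform_arrow(self, table: pa.Table) -> pa.Table:
        raise NotImplementedError

    def _is_overridden(self, method_name: str) -> bool:
        # "Implemented" means the subclass overrides the base-class method.
        return getattr(type(self), method_name) is not getattr(Preprocessor, method_name)

    def _transform_batch(self, batch):
        has_pandas = self._is_overridden("_transform_pandas")
        has_arrow = self._is_overridden("_transform_arrow")
        if isinstance(batch, pa.Table):
            if has_arrow:
                return self._transform_arrow(batch)
            # Only the pandas path is implemented: convert Arrow -> pandas.
            return self._transform_pandas(batch.to_pandas())
        if has_pandas:
            return self._transform_pandas(batch)
        # Only the Arrow path is implemented: convert pandas -> Arrow.
        return self._transform_arrow(pa.Table.from_pandas(batch))
```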
This is a temporary fix for #25556. When the dtype from the pandas DataFrame is `object`, we set the dtype to `None` and rely on automatic type inference during the conversion.
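A minimal sketch of the idea with illustrative names (the actual change lives in the conversion path):

```python
import numpy as np
import pandas as pd


def dtype_for_conversion(series: pd.Series):
    """Return the dtype to pass to the converter, or None to let it auto-infer."""
    if series.dtype == np.object_:
        # Temporary workaround: `object` carries no useful type information,
        # so fall back to automatic inference during the conversion.
        return None
    return series.dtype


df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(dtype_for_conversion(df["a"]))  # int64
print(dtype_for_conversion(df["b"]))  # None
```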
## Why are these changes needed?
I wasn't able to build Ray following these instructions: https://docs.ray.io/en/latest/ray-contribute/development.html#building-ray
It fails when running `pip install -e . --verbose` (add `--user` if you see a permission denied error).
I have a local installation of protobuf via Homebrew, and Bazel is using its headers instead of the protobuf that it pulls into its sandbox. This is a known issue with Bazel, and one of the workarounds is to block the local directory so the build doesn't accidentally pick up the headers if someone happens to have protobuf installed locally.
Manually tested by running `bazel build --verbose_failures --sandbox_debug -- //:ray_pkg`.
Without the fix I would get an error similar to https://gist.github.com/clarng/ff7b7bf7672802d1e3e07b4b509e4fc8.
With the fix it builds.