This adds the following options to DatasetConfig, which can be used to enable streaming ingest.
```
# Whether the dataset should be streamed into memory using pipelined reads.
# When enabled, get_dataset_shard() returns DatasetPipeline instead of Dataset.
# The amount of memory to use is controlled by `stream_window_size`.
# False by default for all datasets.
use_stream_api: Optional[bool] = None
# Configure the streaming window size in bytes. A typical value is something like
# 20% of object store memory. If set to -1, then an infinite window size will be
# used (similar to bulk ingest). This only has an effect if use_stream_api is set.
# Set to 1.0 GiB by default.
stream_window_size: Optional[float] = None
# Whether to enable global shuffle (per pipeline window in streaming mode). Note
# that this is an expensive all-to-all operation, and most likely you want to use
# local shuffle instead.
# False by default for all datasets.
global_shuffle: Optional[bool] = None
```
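A minimal sketch of how the documented defaults could be resolved. `StreamingOptions`, `fill_defaults`, and `is_infinite_window` are hypothetical names used only for illustration, not part of Ray; the default values are taken from the comments above.

```python
from dataclasses import dataclass, replace
from typing import Optional

GiB = 1024 ** 3


@dataclass(frozen=True)
class StreamingOptions:
    """Hypothetical stand-in for the three DatasetConfig fields above."""

    use_stream_api: Optional[bool] = None
    stream_window_size: Optional[float] = None
    global_shuffle: Optional[bool] = None


def fill_defaults(opts: StreamingOptions) -> StreamingOptions:
    """Replace unset (None) fields with the defaults stated in the comments."""
    return replace(
        opts,
        use_stream_api=False if opts.use_stream_api is None else opts.use_stream_api,
        stream_window_size=(
            1.0 * GiB if opts.stream_window_size is None else opts.stream_window_size
        ),
        global_shuffle=False if opts.global_shuffle is None else opts.global_shuffle,
    )


def is_infinite_window(opts: StreamingOptions) -> bool:
    """A window size of -1 means an infinite window, and only matters when streaming."""
    return bool(opts.use_stream_api) and opts.stream_window_size == -1
```

For example, `fill_defaults(StreamingOptions(use_stream_api=True))` keeps streaming enabled and fills in a 1 GiB window with global shuffle off.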
When running an experiment in the cloud and syncing to a bucket, for example, the logdir path in the trials will have changed by the time you work with the checkpoints in the bucket. There are some workarounds, but the easier solution is to also add a rel_logdir that contains the relative path to the trials/checkpoints and can therefore handle any change in the location of the experiment results.
As discussed with @Yard1 and @krfricke
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Co-authored-by: Kai Fricke <kai@anyscale.com>
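The rel_logdir idea can be illustrated with a small sketch. The helper names and all paths below are hypothetical and only show why a relative path survives a move of the experiment directory; this is not the actual implementation.

```python
from pathlib import PurePosixPath


def relative_logdir(experiment_root: str, logdir: str) -> str:
    """Return logdir expressed relative to the experiment root."""
    return str(PurePosixPath(logdir).relative_to(PurePosixPath(experiment_root)))


def resolve_logdir(new_experiment_root: str, rel_logdir: str) -> str:
    """Rebuild an absolute logdir under wherever the experiment now lives."""
    return str(PurePosixPath(new_experiment_root) / rel_logdir)


# Hypothetical paths: results were written on a cloud node, then synced elsewhere.
rel = relative_logdir(
    "/home/ray/ray_results/exp", "/home/ray/ray_results/exp/trial_0001"
)
restored = resolve_logdir("/tmp/downloaded/exp", rel)
```

The absolute logdir is only valid on the machine that wrote it, while the relative part can be re-joined to any new root.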
Add visibility into the following to help Ray users and developers debug performance and OOM issues:
Raylet memory usage broken down by USS vs remaining RSS.
Total workers' count, CPU percentage usage, and memory usage.
In the same spirit as #25479, this adds myself and @DmitriGekhtman as code owners of the autoscaler/cluster launcher docs, since we are also the code owners for the code.
Dataset.to_tf and TensorflowPredictor attempt to convert Pandas dataframes to NumPy arrays by calling DataFrame.values. However, DataFrame.values fails if the dataframe contains multidimensional arrays.
This PR solves this problem by introducing a function convert_pandas_to_tf_tensor. The implementation of the function is based on the implementation of convert_pandas_to_torch_tensor.
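The sketch below illustrates the underlying problem with NumPy and pandas only. It is not the code of convert_pandas_to_tf_tensor; the column name and shapes are made up, and it assumes the failure comes from DataFrame.values yielding an object array rather than a numeric one.

```python
import numpy as np
import pandas as pd

# A column whose cells are 2x2 arrays, e.g. small images (hypothetical data).
df = pd.DataFrame({"image": [np.zeros((2, 2)), np.ones((2, 2))]})

# DataFrame.values cannot represent this as a numeric array; it falls back to
# an object array of shape (2, 1) whose cells are themselves arrays.
flat = df.values

# Stacking the cells of the column instead yields a numeric (2, 2, 2) array,
# which a tensor library can consume.
stacked = np.stack(df["image"].tolist())
```

Handling each column separately and stacking its cells is one way to avoid the object-dtype fallback.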
From the message:
```
[ OK ] SyncerTest.TestMToN (13132 ms)
[----------] 5 tests from SyncerTest (43175 ms total)
[----------] Global test environment tear-down
[==========] 8 tests from 2 test suites ran. (43176 ms total)
[ PASSED ] 8 tests.
external/com_github_grpc_grpc/src/core/lib/iomgr/ev_posix.cc:314:19: runtime error: member access within null pointer of type 'const struct grpc_event_engine_vtable'
```
So far this can only be reproduced by running under Bazel test; it does not reproduce under gdb. It seems to be an issue in gRPC, possibly in the reactor API.
Given that the ASAN test, which is supposed to catch this kind of issue, passes, and that considerable time has been spent investigating without progress, skip this test for now.
This PR implements the server side of ray log. It continues #24068.
The PR supports two endpoints:
/api/v0/logs # List the logs of the given node ID, filtered by the given glob.
/api/v0/logs/{[file | stream]}?filename&pid&actor_id&task_id&interval&lines # Stream the requested log file. The filename can be inferred from pid/actor_id/task_id.
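A rough sketch of how a client might build a request URL for the second endpoint. The base address and the `log_url` helper are assumptions for illustration; only the path and the query parameter names come from the description above.

```python
from urllib.parse import urlencode

# Assumed dashboard address; adjust for your cluster.
BASE_URL = "http://127.0.0.1:8265"

# Query parameters named in the endpoint description.
ALLOWED_PARAMS = ("filename", "pid", "actor_id", "task_id", "interval", "lines")


def log_url(mode: str, **params: object) -> str:
    """Build a URL for /api/v0/logs/{file|stream} (hypothetical helper)."""
    if mode not in ("file", "stream"):
        raise ValueError(f"mode must be 'file' or 'stream', got {mode!r}")
    unknown = set(params) - set(ALLOWED_PARAMS)
    if unknown:
        raise ValueError(f"unknown query parameters: {sorted(unknown)}")
    # Drop unset parameters so the server can infer the file from the rest.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    url = f"{BASE_URL}/api/v0/logs/{mode}"
    return f"{url}?{query}" if query else url
```

For instance, `log_url("file", filename="raylet.out", lines=100)` produces a URL ending in `/api/v0/logs/file?filename=raylet.out&lines=100`.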
Some tests need to be rewritten; I will do that soon.
Two follow-up PRs are planned after this one:
A PR to add the actual CLI
A PR to remove the in-memory cached logs and query actor/worker logs on demand