The `schema_to_deployment()` function preserve unset fields with unexpected default argument types. This change excludes unset fields in that function and also changes the dictionaries' default values to empty dicts.
Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storages as effective as spilling locally.
In a test where 512GB of objects were created and spilled, varying just the buffer size while spilling to a S3 bucket resulted in the following runtimes.
Buffer Size | Runtime (s)
-- | --
Default | 3221.865916
256KB | 1758.885839
1MB | 748.226089
10MB | 526.406466
100MB | 494.830513
Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with larger availability, it is good to configure a larger buffer size.
For processes that reach the throughput limits provided by S3, we can remove that bottleneck by supporting more prefixes/buckets. These impacts are less noticeable as the performance gains from using a large buffer prevent us from reaching a bottleneck. The following runtimes were achieved by spilling 512GB with a 1MB buffer and varying prefixes.
Prefixes | Runtime (s)
-- | --
1 | 748.226089
3 | 527.658646
10 | 516.010742
Together these changes enable faster large-scale object spilling.
Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>
Next move of #19220. This pr replace unordered_map to flat_hash_map in most GCS code and some util & common modules.
The placement group part, which exposes user interfaces in Java/Python, is exclusive as it's a little bit complicated.
The follow-up PRs would be migrating in core worker, placement group and others.
This PR is part of resource reporting refactoring. In this PR ray syncer is moved from gcs_resource_manager to gcs_placement_group_scheduler. With this one, gcs_resource_manager is totally decoupled from resource broadcasting.
Mainly the following things:
- This PR deletes the proto cache on RuntimeEnv, ensuring that the user's modification of RuntimeEnv can take effect in the Proto message.
- validate whole runtime env when serialize runtime_env.
- overload method `__setitem__` to parse and validate field when it has to modify.
Separate out the conversion of pandas dataframe to torch tensor in a utility function so that the same logic can be used in other places in Ray ML (for example during inference).
As discussed,
- Removes ConvertibleToTrainable interface and makes as_trainable part of the Trainer interface
- Moves Trainer interface to ray.ml.trainer from ray.ml.train.trainer
This PR supports the job-based file manager and runner. It will be the backbone of k8s migration.
The PR handles edge cases that originally existed in the old e2e.py job-based runners.
The concept of a Serve Application, a data structure containing all information needed to deploy Serve on a Ray cluster, has surfaced during recent design discussions. This change introduces a formal Application data structure and refactors existing code to use it.
This PR reduces the concurrency limit. Based on the back of envelope calculation, the current concurrency limit can easily exceed the service quota.
Given large == 2048 vCPUs, it will use about 20K vCPUs, which is slightly larger than the limit.
This PR exposes the new checkpoint interface, implemented in #22691, to end users. It does this by replacing the old external facing TrialCheckpoint class with a merged class that supports the old TrialCheckpoint API (upload, download, save) as well as the new Checkpoint API.
With this PR, users can use the new Checkpoint interface for downstream processing of their Ray Tune results. In a follow-up PR, the new Checkpoint interface will be used internally within Ray Tune and Train for bookkeeping, however, that is not required to unblock the Ray ML use case.
Horovod updated the attributes of DistributedTrainableCreator and args to create Horovod RayExecutor.
horovod/horovod@a729ba7
The major issue is Horovod deprecated "slot" concept, use "worker" instead, which is more consistent with Generic Ray worker. The issue is currently blocking Uber DL trainers to use raytune.
This commit updates the Horovod RayExecutor init args.
Co-authored-by: Kai Fricke <kai@anyscale.com>