Commit graph

7141 commits

Author SHA1 Message Date
Kai Fricke
e1a7efe148
[tune] Use Checkpoint.to_bytes() for store_to_object (#25805)
We currently use our own serialization to ship checkpoints as objects. Instead we should use the Checkpoint class. This PR also adds support to create results from checkpoints pointing to object references.

Depends on #26351

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-08 18:01:20 +01:00
Antoni Baum
0e259ff844
[tune] Fix SyncerCallback having a size limit (#26371)
#25655 refactored syncing but introduced a regression: previously, directories of any size could be synced, but now only directories below the default limit of 1 GB can be. This PR fixes the regression, allowing directories of any size to be synced again.
2022-07-08 17:58:41 +01:00
Kai Fricke
86b9b4b7a5
[air] Serialize additional files in dict checkpoints turned dir checkpoints (#26351)
With this PR, files added to a directory checkpoint that was converted from a dict checkpoint are serialized and retained when to_dict() is subsequently called. This enables storing additional files, as needed e.g. by Ray Tune.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-08 10:03:16 +01:00
Jiajun Yao
743e2f403a
Set RAY_USAGE_STATS_EXTRA_TAGS for release tests (#26366)
- Record the test name for the usage stats.
- Change the cluster name to indicate if it's smoke test or not.
2022-07-07 21:17:34 -07:00
Cheng Su
4e674b6ad3
[Datasets] Update docs for drop_columns and fix typos (#26317)
We added the drop_columns() API to Datasets in #26200, so this PR updates the documentation (doc/source/data/examples/nyc_taxi_basic_processing.ipynb) to use the new API. It also fixes some minor typos found while proofreading the Datasets documentation.
2022-07-07 17:17:33 -07:00
Antoni Baum
ea94cda1f3
[AIR] Replace train. with session. (#26303)
This PR replaces legacy API calls to `train.` with AIR `session.` in Train code, examples and docs.

Depends on https://github.com/ray-project/ray/pull/25735
2022-07-07 16:29:04 -07:00
Yi Cheng
f2f1086868
[serve] Add healthz endpoint for HttpProxy (#26347) 2022-07-07 14:01:42 -07:00
Antoni Baum
b9a4f64f32
[AIR/train] Use new Train API (#25735)
Uses the new AIR Train API for examples and tests.

The `Result` object gets a new attribute - `log_dir`, pointing to the Trial's `logdir` allowing users to access tensorboard logs and artifacts of other loggers.

This PR only deals with "low hanging fruit": tests that need substantial rewriting and the Train user guide are not touched. Those will be updated in follow-up PRs.

Tests and examples that concern deprecated features or which are duplicated in AIR have been removed or disabled.

Requires https://github.com/ray-project/ray/pull/25943 to be merged in first
2022-07-07 12:28:37 -07:00
Jun Gong
b23642473b
[Datasets] When getting a column's value from a PandasRow, catch ValueError (#26278)
Otherwise, this fails for columns that have an ndarray as the value.
2022-07-07 09:55:03 -07:00
SangBin Cho
2dd5fdfdf1
[Usage stats] Add tags & number of nodes to the report. (#25852)
This PR adds RAY_EXTRA_USAGE_TAGS for attaching additional tag metadata, and adds the number of nodes to the report.
2022-07-07 08:31:04 -07:00
Kai Fricke
9b49417a72
[ci/hotfix] Pin raydp-nightly (#26358)
Alternative to #26356 - here we just pin raydp-nightly and resolve the dependency issues in follow-up PRs.

This is to quickly unblock CI.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-07-07 14:54:01 +01:00
Kai Yang
e31baebc4e
[Core] Fix WaitManager dealing with duplicate objects (#26256)
When calling an actor method with duplicate ObjectRefs, the actor method will never be executed. The root cause is that `WaitRequest::ready` is of type `std::unordered_set` rather than `std::vector`.

b9ade079cb/src/ray/raylet/wait_manager.h (L77)

So the below if conditions won't be true.

b9ade079cb/src/ray/raylet/wait_manager.cc (L45-L48)

b9ade079cb/src/ray/raylet/wait_manager.cc (L103-L105)

The bug was introduced by https://github.com/ray-project/ray/pull/21369, so it exists in Ray 1.11.0+.
2022-07-07 15:14:09 +08:00
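The duplicate-ObjectRef bug fixed in #26256 can be illustrated with a small Python sketch. This is not the raylet's C++ code: the function names are hypothetical, and it assumes the failing conditions compare the number of ready objects against the number of requested objects. A set collapses duplicates, so that count can never match when the request contains the same ID twice.

```python
from typing import Iterable, List, Set


def all_ready_with_set(requested_ids: List[str], available: Iterable[str]) -> bool:
    """Buggy variant: ready objects are kept in a set (like std::unordered_set)."""
    available = set(available)
    ready: Set[str] = set()
    for object_id in requested_ids:
        if object_id in available:
            ready.add(object_id)  # duplicates collapse into a single entry
    # With duplicate IDs in the request, len(ready) stays below len(requested_ids).
    return len(ready) == len(requested_ids)


def all_ready_with_list(requested_ids: List[str], available: Iterable[str]) -> bool:
    """Fixed variant: ready objects are kept in a list (like std::vector)."""
    available = set(available)
    ready = [object_id for object_id in requested_ids if object_id in available]
    return len(ready) == len(requested_ids)
```

With `requested_ids = ["a", "a", "b"]` and both objects available, the set-based check never reports completion, while the list-based check does.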
brucez-anyscale
f76d7b23f2
Revert "Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent"" (#26336) 2022-07-06 19:37:30 -07:00
Siyuan (Ryans) Zhuang
b803792b58
[workflow] Standardize workflow blocking and nonblocking APIs (#26318)
This PR unifies the semantics of some workflow APIs.

These workflow APIs act on workflow tasks, so they can block for a long time. Each therefore has a blocking and a non-blocking version: xxx for the blocking API and xxx_async for the non-blocking one.
2022-07-06 13:35:36 -07:00
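The `xxx` / `xxx_async` naming convention from #26318 can be sketched with the standard library. The names `fetch_result` and `fetch_result_async` are hypothetical stand-ins, not Ray workflow APIs, and a thread pool stands in for the workflow executor.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# A small pool standing in for whatever actually runs the workflow task.
_executor = ThreadPoolExecutor(max_workers=2)


def _compute(workflow_id: str) -> str:
    # Placeholder for a potentially long-running workflow task.
    return f"output-of-{workflow_id}"


def fetch_result_async(workflow_id: str) -> Future:
    """Non-blocking version: returns immediately with a future."""
    return _executor.submit(_compute, workflow_id)


def fetch_result(workflow_id: str) -> str:
    """Blocking version: implemented on top of the non-blocking one."""
    return fetch_result_async(workflow_id).result()
```

Building the blocking call on top of the asynchronous one keeps the two variants semantically identical, which is the point of standardizing them as a pair.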
Yi Cheng
12d147ff1f
Revert "[Dashboard][Serve] Move Serve related endpoints to dashboard agent (#26107)" (#26333)
This reverts commit 84166ccb04.
2022-07-06 13:30:33 -07:00
Peyton Murray
ea47d97a54
[Core] Add HTML reprs for ClientContext and WorkerContext (#25730) 2022-07-06 12:19:19 -07:00
SangBin Cho
079ae9f013
[Test] Fix flaky OSX shuffle (#26158)
Seems like the last RPC is failing after shuffle succeeds. Adding retry to fix the issue.
2022-07-06 11:16:09 -07:00
brucez-anyscale
84166ccb04
[Dashboard][Serve] Move Serve related endpoints to dashboard agent (#26107)
In Ray 2.0, we want to achieve API server HA.
Originally, the Serve endpoints were on the head node.
This PR moves the Serve endpoints to the dashboard agents, so they become HA thanks to the multiple replicas of the dashboard agent.
2022-07-06 10:58:00 -07:00
liuyang-my
a6ad48d778
[Serve] Java Client API and End to End Tests (#22726) 2022-07-05 21:19:18 -07:00
Jiao
89b0b82c13
[Deployment Graph] Move Deployment creation outside to build function (#26129) 2022-07-05 16:38:02 -07:00
Dmitri Gekhtman
34f1b32861
[K8s][Ray Operator] Ignore resource requests when detected container resources. (#26234)
When detecting resource capacities to advertise to Ray, the Ray operator takes into account requests. This doesn't make sense -- taking a min of resources and limits definitely doesn't make sense. Only limits should be considered.
2022-07-05 15:19:16 -07:00
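The rule stated in #26234, that only limits should be considered, can be sketched as follows. This is an illustrative helper, not the Ray operator's code: the function name is hypothetical, and the input is assumed to be a Kubernetes container spec parsed into a plain dict with the usual `resources.requests` and `resources.limits` fields.

```python
from typing import Any, Dict


def detect_container_resources(container: Dict[str, Any]) -> Dict[str, str]:
    """Return the resource quantities to advertise, using limits only."""
    resources = container.get("resources") or {}
    # Requests are deliberately ignored; only limits describe the capacity.
    limits = resources.get("limits") or {}
    return dict(limits)
```

For a container with `requests: {cpu: "1"}` and `limits: {cpu: "4", memory: "8Gi"}`, the helper returns only the limit values.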
Guyang Song
cf7305a2c9
Revert "[Core] Add retry exception allowlist for user-defined filteri… (#26289)
Closes #26287.
2022-07-05 15:17:36 -07:00
xwjiang2010
84279286df
[ci] pin gpustat (#26311) 2022-07-05 15:05:20 -07:00
Simon Mo
88a219c7f2
Revert "Revert "[AIR][Serve] Rename ModelWrapperDeployment -> PredictorDeployment"" (#26231) 2022-07-05 13:26:49 -07:00
xwjiang2010
b08a968b6b
[air] Do not warn of checkpoint_dir if it's coming from us (base_trainer). (#26259)
Currently, the following message is printed even when the user is not directly using a Tune function. This is confusing and not actionable.

```
"`checkpoint_dir` in `func(config, checkpoint_dir)` is "
"being deprecated. "
"To save and load checkpoint in trainable functions, "
"please use the `ray.air.session` API:\n\n"
"from ray.air import session\n\n"
"def train(config):\n"
"    # ...\n"
'    session.report({"metric": metric}, checkpoint=checkpoint)\n\n'
"For more information please see "
"https://docs.ray.io/en/master/ray-air/key-concepts.html#session\n"
```

The new logic checks whether `base_trainer` is in the call stack and only emits the warning when it is not. This logic will be removed once we migrate internally to the `session` API.
2022-07-03 20:29:15 -04:00
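The call-stack check described in #26259 might look roughly like the sketch below. It is an assumption about the approach, not the actual Ray code: the helper names are hypothetical, and the marker is matched against both function names and file names of the frames on the stack.

```python
import inspect
import warnings


def _called_via(marker: str) -> bool:
    """Return True if any frame on the current call stack mentions `marker`."""
    return any(
        marker in frame.function or marker in frame.filename
        for frame in inspect.stack(context=0)
    )


def maybe_warn_checkpoint_dir() -> bool:
    """Emit the deprecation warning unless the call came from base_trainer."""
    if _called_via("base_trainer"):
        return False  # internal caller: the warning would not be actionable
    warnings.warn(
        "`checkpoint_dir` is being deprecated; use the `ray.air.session` API.",
        DeprecationWarning,
    )
    return True


def base_trainer_fit() -> bool:
    # Stand-in for an internal caller whose frame name contains "base_trainer".
    return maybe_warn_checkpoint_dir()
```

Calling `maybe_warn_checkpoint_dir()` directly warns, while going through `base_trainer_fit()` stays silent.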
Cheng Su
11a24d6ef1
[Datasets] Support drop_columns API (#26200) 2022-07-03 14:41:54 -07:00
Cheng Su
7360452d2a
[Datasets] Fix max number of actors for default actor pool strategy (#26266) 2022-07-03 14:40:24 -07:00
Yi Cheng
096c0cd668
[core][gcs] Add storage namespace to redis storage in GCS. (#25994)
To allow one storage to be shared by multiple Ray clusters, a special prefix is added to isolate the data between clusters: "<EXTERNAL_STORAGE_NAMESPACE>@"

The namespace is given by an OS environment variable, `RAY_external_storage_namespace`, set when starting the head node: `RAY_external_storage_namespace=1234 ray start --head`

This flag is very important in an HA GCS environment. For example, in the Ray Serve operator, when the operator tries to bring up a new one, it's hard to just start a new DB, but it's relatively easy to generate a new cluster ID.
Another example: the user might only be able to maintain one HA Redis DB, and the namespace enables the user to start multiple Ray clusters that share the same DB.

This config should be moved to the storage config in the future, once we build that.
2022-07-03 11:16:37 -07:00
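The key-prefixing scheme from #25994 can be sketched in a few lines. This is an illustration only, not the GCS implementation: the helper names and the fallback value `"default"` are assumptions, while the `<namespace>@` prefix and the `RAY_external_storage_namespace` variable come from the commit message.

```python
import os


def storage_namespace(fallback: str = "default") -> str:
    """Read the namespace from the environment (the fallback is an assumption)."""
    return os.environ.get("RAY_external_storage_namespace", fallback)


def namespaced_key(key: str, namespace: str) -> str:
    """Prefix a storage key with '<namespace>@' to isolate clusters."""
    return f"{namespace}@{key}"
```

Two clusters started with namespaces `1234` and `5678` would then write `1234@some-key` and `5678@some-key`, so they can share one Redis instance without colliding.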
Siyuan (Ryans) Zhuang
5a094f1d18
[workflow] Deprecate workflow.create (#26106) 2022-07-02 21:24:05 -07:00
Dmitri Gekhtman
7d3ceb222c
[kuberay][autoscaler] Improve CPU, GPU, and memory detection. (#26219)
This PR improves the autoscaler's resource detection logic
2022-07-02 11:32:05 -07:00
VeronikaPolakova
18439af1bf
[Tune] Fix sort by metric (#25853)
Makes sort-by-metric work with metrics passed via tune.run.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-07-02 06:48:30 -07:00
Clark Zinzow
2a4d22fbd2
[Core] Add retry exception allowlist for user-defined filtering of retryable application-level errors. (#25896)
This PR adds support for specifying an exception allowlist (List[Exception]) as the retry_exceptions argument, such that an application-level exception will only be retried if it is in the allowlist.
2022-07-01 20:06:02 -07:00
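The allowlist semantics from #25896 can be sketched with a plain retry loop. This does not use Ray and is not how Ray implements retries; `call_with_retries` is a hypothetical helper that only mirrors the described behavior: an exception is retried only if its type is in the allowlist.

```python
from typing import Callable, Sequence, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int,
    retry_exceptions: Sequence[Type[BaseException]],
) -> T:
    """Call fn, retrying only on exceptions whose type is in the allowlist."""
    allowlist = tuple(retry_exceptions)
    retries_left = max_retries
    while True:
        try:
            return fn()
        except Exception as exc:
            # Re-raise if the error is not allowlisted or retries are exhausted.
            if retries_left <= 0 or not isinstance(exc, allowlist):
                raise
            retries_left -= 1
```

With `retry_exceptions=[ConnectionError]`, a transient `ConnectionError` is retried up to `max_retries` times, while a `ValueError` propagates on the first attempt.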
Stephanie Wang
68b893369c
[dataset] Support push-based shuffle in groupby operations (#25910)
Adds an option to use push-based shuffle in groupby operations, improving scalability to larger Datasets.
2022-07-01 17:36:58 -07:00
Guyang Song
b9ade079cb
Revert "[runtime env] plugin refactor[2/n]: support json schema validation (#26154)" (#26246)
This reverts commit 122ec5e52f.
2022-07-01 15:48:03 +08:00
Siyuan (Ryans) Zhuang
ab44133fba
[Workflow] Replace StepID with TaskID (#26232) 2022-06-30 16:40:58 -07:00
shrekris-anyscale
010a3566e6
[Serve] Allow and remove trailing slashes in Ray submission address (#26093) 2022-06-30 16:04:53 -07:00
Kai Fricke
ce0cc8ea53
[tune] Improve custom func checkpointing example (#26230)
Avoid using internal constants in this example.
2022-06-30 15:53:12 -07:00
Eric Liang
3b1948ed45
[air] Randomize block order by default to avoid hotspots (#25870)
Enable block order randomization by default to avoid ingest hotspots when running concurrent trials.
2022-06-30 13:38:03 -07:00
xwjiang2010
ac831fded4
[air] update documentation to use session.report (#26051)
Update documentation to use `session.report`.

Next steps:
1. Update our internal caller to use `session.report`. Most importantly, CheckpointManager and DataParallelTrainer.
2. Update `get_trial_resources` to use PGF notions to incorporate the requirement of ResourceChangingScheduler. @Yard1 
3. After 2 is done, change all `tune.get_trial_resources` to `session.get_trial_resources`
4. [internal implementation] remove special checkpoint handling logic from huggingface trainer. Optimize the flow for checkpoint conversion with `session.report`.

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-06-30 10:37:31 -07:00
shrekris-anyscale
20c6c0725a
[Serve] Deprecate deployment's prev_version field (#26217) 2022-06-30 09:59:37 -07:00
xwjiang2010
3ffff53428
[tune] Fix stacktrace (#26220)
Relands the original change, but without changing test_utils, so that other tests are not affected.
2022-06-30 07:38:36 -07:00
ZhuSenlin
c5de057d1d
[Core][Enable gcs scheduler 3/n] integrate placement group with gcs scheduler (#24842)
## Why are these changes needed?
1. Currently, bundle resources are deducted from the cluster resources on the `GCS` side once all Commit requests sent by `GCS` to `Raylet` have returned. The bundle resources should instead be deducted before `GCS` sends `PrepareResources` to `Raylet`, so that `GCS`-based actor scheduling sees fresher resources. Note that deducting before `PrepareResources` or after all `CommitResources` replies has no impact on `Raylet` scheduling.

2. `GcsResourceManager::UpdateResources` and `GcsResourceManager::DeleteResources` can be deleted to simplify `GcsResourceManager`.
   - `GcsResourceManager::UpdateResources` is only used in `GcsPlacementGroupScheduler::CommitAllBundles`. We can update the node resources (commit bundle resources) in `GcsPlacementGroupScheduler` directly, and I think it's unnecessary to put these resources into storage (the resources can be replayed by the placement group).
   - `GcsResourceManager::DeleteResources` is only used in `GcsPlacementGroupScheduler::CancelResourceReserve`, which is invoked by `GcsPlacementGroupScheduler::DestroyPlacementGroupPreparedBundleResources` and `GcsPlacementGroupScheduler::DestroyPlacementGroupCommittedBundleResources`. In fact, `GcsPlacementGroupScheduler::ReturnBundleResources` is called wherever these two functions are used, so I think `GcsResourceManager::DeleteResources` is redundant. Again, I think it's unnecessary to put the resource changes into storage (the resources can be replayed by the placement group).

3. `gcs_table_storage_` is no longer needed, since both `GcsResourceManager::UpdateResources` and `GcsResourceManager::DeleteResources` are removed, so it can be removed too.

4. `ray_gcs_new_resource_creation_latency_ms_sum` can be removed too, since `GcsResourceManager::UpdateResources` is removed.

Co-authored-by: 黑驰 <senlin.zsl@antgroup.com>
2022-06-30 02:04:39 -07:00
Guyang Song
122ec5e52f
[runtime env] plugin refactor[2/n]: support json schema validation (#26154) 2022-06-30 16:09:23 +08:00
Siyuan (Ryans) Zhuang
ddd63aba77
[workflow] Major refactoring - new async workflow executor (#25618)
* major workflow refactoring
2022-06-29 20:31:40 -07:00
Eric Liang
636a9c1291
[data] randomize_block_order() not compatible with stage fusion
Why are these changes needed?
Per the discussion in #26057, fix the stage fusion issue by re-ordering the randomize stage past any 1-1 stages.

Closes #26057
2022-06-29 18:16:03 -07:00
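The re-ordering idea mentioned in the commit above (moving the randomize stage past any 1-1 stages) can be sketched on a toy plan. This is not the Datasets planner: the `(name, kind)` stage representation and the function name are hypothetical. The sketch relies on the assumption that shuffling block order commutes with per-block 1-1 transforms, so the randomize stage can be pushed later, leaving the 1-1 stages adjacent to their upstream stage.

```python
from typing import List, Tuple

# (name, kind); kind is one of "read", "one_to_one", "randomize", "all_to_all".
Stage = Tuple[str, str]


def push_randomize_past_one_to_one(stages: List[Stage]) -> List[Stage]:
    """Swap a randomize stage with each 1-1 stage that directly follows it."""
    result = list(stages)
    i = 0
    while i < len(result) - 1:
        if result[i][1] == "randomize" and result[i + 1][1] == "one_to_one":
            # The randomize stage moves one slot later and is revisited at i + 1.
            result[i], result[i + 1] = result[i + 1], result[i]
        i += 1
    return result
```

For a plan of read, randomize, map, map, the randomize stage ends up last, so the two map stages sit directly after the read stage.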
Stephanie Wang
1a8fd8a72b
Revert "[tune] fix stacktrace. (#26135)" (#26216)
This reverts commit e85247b5dd.
2022-06-29 17:00:31 -07:00
shrekris-anyscale
d1c9aaad33
[Serve] Set num_cpus to 0 in run_graph() task (#26177) 2022-06-29 16:35:33 -07:00
Dmitri Gekhtman
66ea76da1b
[kuberay] Logging-related autoscaler stability improvement.
The autoscaler container writes logs to a directory set up by the Ray container.
This PR moves the logic that sets up autoscaler logging so that it is done after the Ray container is ready.

This PR also changes things so that the autoscaler process exits after hitting 5 total exceptions. Kubernetes will then restart the autoscaler. The idea here is to ensure the autoscaler is able to restart cleanly in long-running deployments of Ray.
2022-06-29 13:18:13 -07:00
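The "exit after 5 total exceptions" behavior described above can be sketched as a loop that counts failures and returns a non-zero exit code. This is a simplified illustration, not the autoscaler's code: the function name, the `max_iterations` parameter (added so the loop can terminate in a demo), and the exit code value are assumptions.

```python
from typing import Callable, Optional


def run_until_too_many_failures(
    update: Callable[[], None],
    max_total_exceptions: int = 5,
    max_iterations: Optional[int] = None,
) -> int:
    """Run update() repeatedly; return exit code 1 after too many exceptions."""
    failures = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            update()
        except Exception:
            failures += 1  # the count is cumulative, not consecutive
            if failures >= max_total_exceptions:
                return 1  # let the supervisor (e.g. Kubernetes) restart us
    return 0
```

A process entry point could pass the return value to `sys.exit`, relying on the container restart policy for a clean restart.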
xwjiang2010
e85247b5dd
[tune] fix stacktrace. (#26135)
Explicitly pass `exc_info` to `logger.exception` when it is called outside of a try-except block.
2022-06-29 11:06:43 -07:00
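The fix in #26135 relies on standard `logging` behavior: outside an `except` block there is no active exception, so `logger.exception` has no traceback to print unless one is passed explicitly. A minimal sketch, with hypothetical helper names:

```python
import logging
from typing import Callable, Optional


def capture_error(fn: Callable[[], object]) -> Optional[BaseException]:
    """Run fn and return the exception it raised, if any, for later logging."""
    try:
        fn()
    except Exception as exc:
        return exc
    return None


def log_outside_except(logger: logging.Logger, exc: BaseException) -> None:
    # We are no longer inside the except block, so pass the stored exception
    # explicitly; logging then formats its type, message, and traceback.
    logger.exception("Trial failed", exc_info=exc)
```

Without the explicit `exc_info=exc`, the default `exc_info=True` would look up the currently handled exception, which is absent at that point.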
Philipp Moritz
224ec2e45a
Add typing_extensions requirement to core requirements (#26169)
Since https://github.com/ray-project/ray/pull/25999 we need typing_extensions. It is a very light requirement (no transitive dependencies and small package) so that should be ok.

Considered alternative: Make it optional -- but that would make the typing code more brittle, and prevent us from using more typing in the future.
2022-06-29 09:37:02 -07:00