Commit graph

14014 commits

Author SHA1 Message Date
Antoni Baum
d449f8db27
[CI] Update upstream requirements for XGB/LGBM-Ray (#27908)
To include these in the latest docker images (and get rid of deprecation warnings), bump in requirements_upstream.txt.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-08-17 10:55:02 +02:00
Ricky Xu
68b5d4302c
[Core] Suppress gRPC server alerting on too many keep-alive pings (#27769)
# Why are these changes needed?
(map pid=516, ip=172.31.64.223) E0526 12:32:19.203322360     675 chttp2_transport.cc:1103]   Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings". See [this](https://github.com/ray-project/ray/issues/25367#issuecomment-1189421372) for more details. 
We currently see this in many of the large nightly tests.

# Root Cause
The root cause (with pretty high confidence level) has been some misconfigs between gRPC server/clients. Essentially the client is pinging the server too frequently for keep-alive heartbeats.

# Mitigation
This PR is merely a mitigation step. I will keep looking into the exact client/server pair later, but probably don't have bandwidth for now largely because the test iteration takes quite a while and verbose logging with gRPC and ray backend have not revealed much useful info. This only kicks in at the end of a long running map phase, and verbose logging doesn't tell me which client is sending the pings.
2022-08-17 01:53:47 -07:00
Charles Sun
9330d8f244
[RLlib] Add DTTorchPolicy (#27889) 2022-08-17 00:28:00 -07:00
Nikita Vemuri
4692e8d802
[core] Don't override external dashboard URL in internal KV store (#27901)
Fix 2.0.0 release blocker bug where Ray State API and Jobs not accessible if the override URL doesn't support adding additional subpaths. This PR keeps the localhost dashboard URL in the internal KV store and only overrides in values printed or returned to the user.
images.githubusercontent.com/6900234/184809934-8d150874-90fe-4b45-a13d-bce1807047de.png">
2022-08-16 22:48:05 -07:00
Cheng Su
4ad1b4c712
Fix nyc_taxi_basic_processing.ipynb end-to-end (#27927)
Signed-off-by: Cheng Su <scnju13@gmail.com>
This is to run ray 2.0.0rc0 on https://docs.ray.io/en/master/data/examples/nyc_taxi_basic_processing.html and fix the notebook end-to-end, make sure the output and wording is matched.

The page after this PR - https://ray--27927.org.readthedocs.build/en/27927/data/examples/nyc_taxi_basic_processing.html .
2022-08-16 21:30:19 -07:00
Christy Bergman
3f313d74ad
Replace robot image with emoji and replace word Trainer with Algorithm (#27928) 2022-08-16 21:27:21 -07:00
Edward Oakes
65f92a44e3
[serve][docs] Consolidate production guides, add kuberay docs to it (#27747)
- Adds KubeRay information to the production guide.
- Consolidates the two user guides we had related to production deployment.
- Adds information about experimental GCS HA feature.
2022-08-16 21:29:56 -05:00
Yi Cheng
2262ac02f3
[workflow][doc] First pass of workflow doc. (#27331)
Signed-off-by: Yi Cheng 74173148+iycheng@users.noreply.github.com

Why are these changes needed?
This PR update workflow doc to reflect the recent change.
Focusing on position change and others.
2022-08-16 18:48:05 -07:00
Charles Sun
61880591e9
[RLlib] Add DTTorchModel (#27872) 2022-08-16 18:18:29 -07:00
Yi Cheng
87ce8480ff
[core] Add stats for the gcs backend for telemetry. (#27876)
## Why are these changes needed?

To get better understanding of how GCS FT is used, adding this metrics.

Test:
```
cat /tmp/ray/session_latest/usage_stats.json
{"usage_stats": {"ray_version": "3.0.0.dev0", "python_version": "3.9.12", "schema_version": "0.1", "source": "OSS", "session_id": "70d3ecd3-5b16-40c3-9301-fd05404ea92a", "git_commit": "{{RAY_COMMIT_SHA}}", "os": "linux", "collect_timestamp_ms": 1660587366806, "session_start_timestamp_ms": 1660587351586, "cloud_provider": null, "min_workers": null, "max_workers": null, "head_node_instance_type": null, "worker_node_instance_types": null, "total_num_cpus": 16, "total_num_gpus": null, "total_memory_gb": 16.10752945020795, "total_object_store_memory_gb": 8.053764724172652, "library_usages": ["serve"], "total_success": 0, "total_failed": 13, "seq_number": 13, "extra_usage_tags": {"serve_api_version": "v1", "gcs_storage": "redis", "serve_num_deployments": "1"}, "total_num_nodes": 2, "total_num_running_jobs": 2}}
```
2022-08-16 17:02:04 -07:00
Antoni Baum
7ff914b06e
[AIR][Docs] Set logging_strategy="epoch" for HF (#27917) 2022-08-16 16:45:46 -07:00
Charles Sun
753fad9cad
[RLlib] Add Segmentation Buffer for DT (#27829) 2022-08-16 15:20:41 -07:00
Eric Liang
8a7be15b72
[docs] Simplify Ray start guide and move PI tutorial to examples page (#27885) 2022-08-16 14:28:45 -07:00
SangBin Cho
75051278d7
Fix the undocumented ray log error (#27887)
Looks like hidden=True commands cannot be documented on sphinx. I removed the add_alias and use the standard click API to rename the API from the name of the method
2022-08-16 14:28:09 -07:00
Richard Liaw
759fbd9502
[air][minor] Use drop_columns in docs (#27852) 2022-08-16 14:01:25 -07:00
Zoltan Fedor
78648e3583
[Serve][Docs] Mark metrics served for HTTP vs Python calls (#27858)
Different metrics are collected in Ray Serve when the deployments are called from HTTP vs Python. This needs to be mentioned in the documentation and each metric marked accordingly.
2022-08-16 15:23:29 -05:00
Ian Rodney
24508db920
[Docs][GCP] Configuring ServiceAccounts for worker (#27915)
Enables better usage with GCP.

The default behavior is that the head runs with the ray-autoscaler-sa-v1 service Account, but workers do not. Workers can run with this service account by copying & uncommenting L114->L117 from example-full


Signed-off-by: Ian <ian.rodney@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-16 13:13:27 -07:00
Jiajun Yao
c5a4605030
Fix grammer of error message (#27900)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-16 11:26:03 -07:00
Antoni Baum
5757909cd2
[AIR] load_best_model_at_end validation for HF (#27875)
Adds validation for TrainingArguments.load_best_model_at_end (will throw an error down the line if set to True), fixes validation for *_steps, adds test.

Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
2022-08-16 10:52:05 -07:00
Kai Fricke
b91246a093
[air/benchmarks] Measure local training time in torch/tf benchmarks (#27902)
We currently measure end-to-end training time in our benchmarks, which includes setup overhead. This is an unequal comparison, as setup overhead for vanilla training cannot be accurately expressed and was instead just disregarded.
By comparing the raw training times in the actual training loop, we will get a more accurate expression of any potential overhead or benefit in using Ray vs. vanilla tensorflow/torch.

Signed-off-by: Kai Fricke <kai@anyscale.com>
2022-08-16 19:16:08 +02:00
Simon Mo
b9a2fb79b6
[AIR][Docs] Remove the excessive printing from Torch examples (#27903) 2022-08-16 09:09:54 -07:00
Peyton Murray
4d19c0222b
[AIR] Add rich notebook repr for DataParallelTrainer (#26335) 2022-08-16 08:51:14 -07:00
Dmitri Gekhtman
bceef503b2
[Kubernetes][docs] Restore legacy Ray operator migration discussion (#27841)
This PR restores notes for migration from the legacy Ray operator to the new KubeRay operator.

To avoid disrupting the flow of the Ray documentation, these notes are placed in a README accompanying the old operator's code.

These notes are linked from the new docs.

Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
2022-08-16 08:46:31 -07:00
xwjiang2010
91f506304d
[air] [checkpoint manager] handle nested metrics properly as scoring attribute. (#27715)
handle nested metrics properly as scoring attribute

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-16 17:43:58 +02:00
Sven Mika
436c89ba1a
[RLlib] Eval workers use async req manager. (#27390) 2022-08-16 12:05:55 +02:00
liuyang-my
1c4b3879a1
[Serve]Fix classloader bug in Java Deployment (#27899)
We have encountered `java.lang.ClassNotFoundException` when deploying Java Ray Serve deployments. The property `ray.job.code-search-path` which specifies the search path of user's classes is not working. The reason is that `ray.job.code-search-path` is loaded in an independent classloader in Ray context, but Serve Replica initialized user class with `AppClassLoader`. We need to change the classloader used to construct user classes to the one in Ray context.
2022-08-16 15:22:00 +08:00
Alex Wu
c2abfdb2f7
[autoscaler][observability] Observability into when/why nodes fail to launch (#27697)
This change adds launch failures to the recent failures section of ray status when a node provider provides structured error information. For node providers which don't provide this optional information, there is now change in behavior.

For reference, when trying to launch a node type with a quota issue, it looks like the following. InsufficientInstanceCapacity is the standard term for this issue..

```
======== Autoscaler status: 2022-08-11 22:22:10.735647 ========
Node status
---------------------------------------------------------------
Healthy:
 1 cpu_4_ondemand
Pending:
 quota, 1 launching
Recent failures:
 quota: InsufficientInstanceCapacity (last_attempt: 22:22:00)

Resources
---------------------------------------------------------------
Usage:
 0.0/4.0 CPU
 0.00/9.079 GiB memory
 0.00/4.539 GiB object_store_memory

Demands:
 (no resource demands)
```

```
available_node_types:
    cpu_4_ondemand:
        node_config:
            InstanceType: m4.xlarge
            ImageId: latest_dlami
        resources: {}
        min_workers: 0
        max_workers: 0
    quota:
        node_config:
            InstanceType: p4d.24xlarge
            ImageId: latest_dlami
        resources: {}
        min_workers: 1
        max_workers: 1
```
Co-authored-by: Alex <alex@anyscale.com>
2022-08-15 18:14:29 -07:00
Chen Shen
f05c744a65
[Doc] minor fix on accessing AWS/S3
update the doc.
2022-08-15 16:53:31 -07:00
xwjiang2010
a3236b6225
[air] fix ptl release test (#27773)
Signed-off-by: xwjiang2010 xwjiang2010@gmail.com
2022-08-15 14:47:33 -07:00
Yuan-Chi Chang
34c494260f
[workflow] Documentation of http events (#27166)
Documentation updates for the newly introduced HTTPEventProvider and HTTPListener in Ray 2.0.
2022-08-15 14:23:04 -07:00
Jiajun Yao
06ef4ab94e
Fix broken links in the code (#27873)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-15 13:11:42 -07:00
xwjiang2010
68cc544da6
[release test] increase air tf gpu benchmark non smoke test timeout from 3600 to 4800. (#27869)
Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-15 19:03:40 +02:00
Archit Kulkarni
058c239cf1
[runtime env] Test common failure scenarios (#25977)
Tests the following failure scenarios:
- Fail to upload data in `ray.init()` (`working_dir`, `py_modules`)
- Eager install fails in `ray.init()` for some other reason (bad `pip` package)
- Fail to download data from GCS (`working_dir`)

Improves the following error message cases:
- Return RuntimeEnvSetupError on failure to upload working_dir or py_modules
- Return RuntimeEnvSetupError on failure to download files from GCS during runtime env setup

Not covered in this PR:
- RPC to agent fails (This is extremely rare because the Raylet and agent are on the same node.)
- Agent is not started or dead (We don't need to worry about this because the Raylet fate shares with the agent.)

The approach is to use environment variables to induce failures in various places.  The alternative would be to refactor the packaging code to use dependency injection for the Internal KV client so that we can pass in a fake. I'm not sure how much of an improvement this would be.  I think we'd still have to set an environment variable to pass in the fake client, because these are essentially e2e tests of `ray.init()` and we don't have an API to pass it in.
2022-08-15 11:35:56 -05:00
Jiajun Yao
eb37bb857c
Revamp ray core design patterns doc [1/n]: generators (#27823)
- Move the code snippet to doc_code folder
- Move patterns to an upper level.

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
2022-08-15 09:24:34 -07:00
Myeongju Kim
52440f1489
[Docs] Fix a typo in index.md (#27859)
Signed-off-by: myeongjukim <ming3772@gmail.com>

Signed-off-by: myeongjukim <ming3772@gmail.com>
2022-08-15 08:26:40 -07:00
xwjiang2010
f77ec350fa
[release test] remove dask/modin_xgboost test completely. (#27865)
The original script was removed in https://github.com/ray-project/ray/pull/27816
This is just to clean up some remainings.

Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>
2022-08-15 16:55:33 +02:00
SangBin Cho
d654636bfc
[Test] Fix broken test_base_trainer (#27855)
The test was written incorrectly. This root cause was that the trainer & worker both requires 1 CPU, meaning pg requires {CPU: 1} * 2 resources.

And when the max fraction is 0.001, we only allow up to 1 CPU for pg, so we cannot schedule the requested pgs in any case.
2022-08-15 07:50:18 -07:00
Cheng Su
a2c168cd6d
[Datasets][docs] Minor fix for nyc_taxi_basic_processing.ipynb (#27828)
Went through https://docs.ray.io/en/master/data/examples/nyc_taxi_basic_processing.html, and doing some minor fix here.

Fix the size_bytes() result (before this PR it was using Parquet sampling, but we disasble it later)
Change one size_bytes() call to count() call as it was meant to use count() with followed wording That’s a lot of rows in doc.
Changed places are as followed in screenshots:
2022-08-14 12:34:33 -07:00
SangBin Cho
9ece110d27
[State Observability] Promote the API to alpha (#27788)
# Why are these changes needed?

- Promote APIs to PublicAPI(alpha)
- Change pre-alpha -> alpha
- Fix a bug ray_logs is displayed to ray --help

Release test result: #26610
Some APIs are subject to change at the beta stage (e.g., ray list jobs or ray logs).
2022-08-13 23:43:01 -07:00
Eric Liang
1bae3c905d
Improve text around ecosystem map (#27839)
Signed-off-by: Eric Liang <ekhliang@gmail.com>

Signed-off-by: Eric Liang <ekhliang@gmail.com>
2022-08-12 18:59:09 -07:00
SangBin Cho
999715ebec
[Core][Placement Group] Handling edge cases of max_cpu_fraction argument (#27035)
Why are these changes needed?
This PR fixes the edge cases when the max_cpu_fraction argument is used by the placement group. There was specifically an edge case where the placement group cannot be scheduled when a task or actor is scheduled and occupies the resource.

The original logic to decide if the bundle scheduling exceed CPU fraction was as follow.

Calculate max_reservable_cpus of the node.
Calculate currently_used_cpus + bundle_cpu_request (per bundle) == total_allocation of the node.
Don't schedule if total_allocation > max_reservable_cpus for the node.
However, the following scenarios caused issues because currently_used_cpus can include resources that are not allocated by placement groups (e.g., actors). As a result, when the actor was already occupying the resource, the total_allocation was incorrect. For example,

4 CPUs
0.999 max fraction (so it can reserve up to 3 CPUs)
1 Actor already created (1 CPU)
PG with CPU: 3
Now pg cannot be scheduled because total_allocation == 1 actor (1 CPU) + 3 bundles (3 CPUs) == 4 CPUs > 3 CPUs (max frac CPUs)
However, this should work because the pg can use up to 3 CPUs, and we have enough resources.
The root cause is that when we calculate the max_fraction, we should only take into account of resources allocated by bundles. To fix this, I change the logic as follow.

Calculate max_reservable_cpus of the node.
Calculate **currently_used_cpus_by_pg_bundles** + **bundle_cpu_request (sum of all bundles)** == total_allocation_from_pgs_and_bundles of the node.
Don't schedule if total_allocation_from_pgs_and_bundles > max_reservable_cpus for the node.
2022-08-12 17:40:11 -07:00
Sihan Wang
7e7c93f6ba
[Serve] Fix memory leak issue in serve inference (#27815) 2022-08-12 17:11:37 -07:00
shrekris-anyscale
f9d5d6df12
[Serve] [Docs] Revise Java API documentation (#27831) 2022-08-12 17:09:40 -07:00
shrekris-anyscale
0a3c1de08b
[Serve] [Docs] Replace references to dag.execute() with handle.predict.remote() (#27784) 2022-08-12 17:09:28 -07:00
zcin
8cb09a9fc5
Revert "Revert "[serve] Integrate and Document Bring-Your-Own Gradio Applications"" (#27662) 2022-08-12 15:12:20 -07:00
Balaji Veeramani
55e57d4f92
[AIR] [Docs] Revise "Which preprocessor should you use?" (#27835)
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
2022-08-12 14:43:36 -07:00
zcin
fa37ddc584
[Serve][docs] Add type annotations to code samples (#27795) 2022-08-12 13:41:08 -07:00
Simon Mo
192d92bb77
[Serve] [Doc] Update intro page (#27735) 2022-08-12 11:37:18 -07:00
Archit Kulkarni
ba36365c32
[Doc] [Serve] Fix LINT by fixing outdated Ray Client doc link (#27826) 2022-08-12 11:35:15 -07:00
Clark Zinzow
f0404e00cd
[Core] [Hotfix] Change "task failed with unretryable exception" log statement to debug-level. (#27714)
Serve relies on being able to do quiet application-level retries, and this info-level logging is resulting in log spam hitting users. This PR demotes this log statement to debug-level to prevent this log spam.

Co-authored-by: simon-mo <simon.mo@hey.com>
2022-08-12 11:28:49 -07:00