Take out the CLI reference from the core API subsection. It follows the same CLI reference pattern as other libraries (e.g., Serve has Serve CLI under the Serve API section).
There is a risk of using too much memory in StatsActor, because its lifetime is the same as the cluster's lifetime.
This PR puts a cap on how many stats to keep and purges the stats in FIFO order if the cap is exceeded.
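A minimal sketch of the idea, with a hypothetical cap value and names (not the actual StatsActor code):
```python
from collections import OrderedDict

class BoundedStats:
    """Keeps at most `max_entries` stats entries, purging the oldest first."""

    def __init__(self, max_entries: int = 1000):  # cap value is hypothetical
        self._max_entries = max_entries
        self._stats = OrderedDict()

    def record(self, dataset_id, stats):
        self._stats[dataset_id] = stats
        # Purge in FIFO order once the cap is exceeded.
        while len(self._stats) > self._max_entries:
            self._stats.popitem(last=False)
```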
This PR adds a customized serializer for Arrow JSON ParseOptions used by read_json. We found that users wanted to read JSON files with ParseOptions, but it currently does not work due to a pickle issue (details in the referenced post). So here we add a customized serializer for ParseOptions as a workaround for now, similar to #25821.
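Roughly, the workaround registers a custom serializer/deserializer pair that rebuilds `ParseOptions` from its fields instead of pickling the Arrow object directly; a hedged sketch (field list assumed, not the exact code):
```python
import pyarrow.json as pajson

import ray
from ray.util import register_serializer

ray.init()

def _serialize_parse_options(opts: pajson.ParseOptions) -> dict:
    # Capture the (assumed) fields as plain, picklable values.
    return {
        "explicit_schema": opts.explicit_schema,
        "newlines_in_values": opts.newlines_in_values,
        "unexpected_field_behavior": opts.unexpected_field_behavior,
    }

def _deserialize_parse_options(fields: dict) -> pajson.ParseOptions:
    return pajson.ParseOptions(**fields)

register_serializer(
    pajson.ParseOptions,
    serializer=_serialize_parse_options,
    deserializer=_deserialize_parse_options,
)
```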
Signed-off-by: Cheng Su <scnju13@gmail.com>
Move the code to doc_code
Fix the code example so that batching is faster than the serial run.
Related issue number
#27048
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
An attempt at making the docs shorter and sweeter, including various small cleanup items.
- Reorder the TOC on the sidebar for the user guides to be more linear based on a user's journey.
- Put the batching content under the performance guide.
- Remove the AIR guide (AIR users already have a serving guide).
- Combine the `ServeHandle` and model composition pages into a single guide. We may want to revisit this in the future but for now better to have it in a single place instead of duplicated (with links going to both).
- Fix the index page for the user guides to match the TOC sidebar.
- Rename a few pages for clarity & consistency.
- Remove some now-redundant content (old ML models user guide).
For the following script, the `groupby().map_groups()` step used to take 75-90 minutes to finish; with this PR it finishes in less than 10 seconds.
The slowness came from the `get_boundaries` routine, which linearly looped over each row of the pandas DataFrame (note: there is just one block in the script below, with several million rows). We make it 1) operate on NumPy arrays, 2) use binary search, and 3) use NumPy's native binary-search implementation.
```
import time

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

import ray

def transform_batch(df: pd.DataFrame):
    # Parse timestamps, derive the trip duration, and fill missing location IDs.
    df['pickup_at'] = pd.to_datetime(df['pickup_at'], format='%Y-%m-%d %H:%M:%S')
    df['dropoff_at'] = pd.to_datetime(df['dropoff_at'], format='%Y-%m-%d %H:%M:%S')
    df['trip_duration'] = (df['dropoff_at'] - df['pickup_at']).dt.seconds
    df['pickup_location_id'].fillna(-1, inplace=True)
    df['dropoff_location_id'].fillna(-1, inplace=True)
    return df

def train_test(rows):
    # If the group is too small, it cannot be split for train/test.
    if len(rows.index) < 4:
        print(f"Dataframe for LocID: {rows.index} is empty")
        # Return an empty frame with the same columns to keep the output format consistent.
        return pd.DataFrame(columns=["coef", "intercept", "error"])
    train, test = train_test_split(rows)
    train_X = train[["dropoff_location_id"]]
    train_y = train[['trip_duration']]
    test_X = test[["dropoff_location_id"]]
    test_y = test[['trip_duration']]
    reg = LinearRegression().fit(train_X, train_y)
    reg.score(train_X, train_y)
    pred_y = reg.predict(test_X)
    reg.score(test_X, test_y)
    error = np.mean(pred_y - test_y)
    # Format output as a dataframe (the same format as the input).
    data = [[reg.coef_, reg.intercept_, error]]
    return pd.DataFrame(data, columns=["coef", "intercept", "error"])

start = time.time()
rds = ray.data.read_parquet(
    "s3://ursa-labs-taxi-data/2019/01/",
    columns=['pickup_at', 'dropoff_at', "pickup_location_id", "dropoff_location_id"],
)
rds = rds.map_batches(transform_batch, batch_format="pandas")
grouped_ds = rds.groupby("pickup_location_id")
results = grouped_ds.map_groups(train_test)
taken = time.time() - start
```
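For illustration, a hedged sketch of the kind of change (helper names are hypothetical, not the actual Ray Data internals): instead of a Python-level loop over every row to find where each sorted group starts, the boundaries can be computed with NumPy's vectorized binary search.
```python
import numpy as np

def get_boundaries_slow(keys):
    # Python-level loop: interpreter overhead for every row.
    boundaries = []
    prev = object()
    for i, key in enumerate(keys):
        if key != prev:
            boundaries.append(i)
            prev = key
    return boundaries

def get_boundaries_fast(keys):
    # Vectorized: np.unique yields the sorted distinct keys and
    # np.searchsorted binary-searches for each one in native code.
    keys = np.asarray(keys)
    uniques = np.unique(keys)
    return np.searchsorted(keys, uniques, side="left").tolist()

keys = [1, 1, 1, 2, 2, 5, 5, 5, 5]  # already sorted by group key
assert get_boundaries_slow(keys) == get_boundaries_fast(keys) == [0, 3, 5]
```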
Why are these changes needed?
Adding support for deploying multiple clusters into the same azure resource group
Changes:
- Added a `unique_id` field to the provider section of the YAML. If not provided, one is created by hashing the resource group and cluster name (a sketch of the derivation follows this list). The suffix is appended to the name of all resources deployed to Azure so they can co-exist in the same resource group (provided the cluster name is changed).
- Pulled in changes from "[autoscaler] Enable creating multiple clusters in one resource group" (#22997) to use the cluster name when filtering VMs, so only nodes in the current cluster are retrieved.
- Added an option to explicitly specify the subnet mask; otherwise the resource group and cluster name are used as a seed to randomly choose a subnet for the vnet (to avoid collisions with other vnets).
- Updated the example YAML files with the new provider values and explanations.
- Pulled `resource_ids` from the initial azure-config-template deployment and passed them into the VM deployment, to avoid matching hard-coded resource names across templates.
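For illustration, the default suffix could be derived along these lines (a hedged sketch; the actual hashing scheme in the provider may differ):
```python
import hashlib

def default_unique_id(resource_group: str, cluster_name: str, length: int = 8) -> str:
    # Deterministic suffix derived from the resource group and cluster name,
    # appended to every Azure resource name so clusters can coexist.
    digest = hashlib.sha256(f"{resource_group}/{cluster_name}".encode()).hexdigest()
    return digest[:length]

print(default_unique_id("my-resource-group", "ray-cluster"))  # 8-char hex suffix
```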
Related issue number
Closes #22996
Supersedes #22997
Signed-off-by: Scott Graham <scgraham@microsoft.com>
Co-authored-by: Scott Graham <scgraham@microsoft.com>
To include these in the latest Docker images (and get rid of deprecation warnings), bump them in requirements_upstream.txt.
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
# Why are these changes needed?
(map pid=516, ip=172.31.64.223) E0526 12:32:19.203322360 675 chttp2_transport.cc:1103] Received a GOAWAY with error code ENHANCE_YOUR_CALM and debug data equal to "too_many_pings". See [this](https://github.com/ray-project/ray/issues/25367#issuecomment-1189421372) for more details.
We currently see this in many of the large nightly tests.
# Root Cause
The root cause (with a pretty high confidence level) is a misconfiguration between the gRPC server and clients: essentially, the client is pinging the server too frequently for keep-alive heartbeats.
# Mitigation
This PR is merely a mitigation step. I will keep looking into the exact client/server pair later, but I probably don't have the bandwidth for now, largely because each test iteration takes quite a while and verbose logging from gRPC and the Ray backend has not revealed much useful information. The error only kicks in at the end of a long-running map phase, and verbose logging doesn't tell me which client is sending the pings.
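For context, these are the kinds of gRPC keep-alive knobs involved (an illustrative sketch with hypothetical values, not the actual settings changed in this PR):
```python
from concurrent import futures

import grpc

# A server replies with GOAWAY/ENHANCE_YOUR_CALM ("too_many_pings") when
# clients ping more often than its keep-alive policy allows.
server = grpc.server(
    futures.ThreadPoolExecutor(max_workers=4),
    options=[
        # Allow keep-alive pings even when no RPCs are in flight.
        ("grpc.http2.max_pings_without_data", 0),
        # Minimum interval the server tolerates between client pings.
        ("grpc.http2.min_ping_interval_without_data_ms", 30_000),
    ],
)

channel = grpc.insecure_channel(
    "localhost:50051",
    options=[
        # The client should ping no more often than the server tolerates.
        ("grpc.keepalive_time_ms", 60_000),
        ("grpc.keepalive_timeout_ms", 20_000),
    ],
)
```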
Fix a 2.0.0 release blocker bug where the Ray State API and Jobs were not accessible if the override URL doesn't support adding additional subpaths. This PR keeps the localhost dashboard URL in the internal KV store and only applies the override to values printed or returned to the user.
- Adds KubeRay information to the production guide.
- Consolidates the two user guides we had related to production deployment.
- Adds information about experimental GCS HA feature.
Signed-off-by: Yi Cheng <74173148+iycheng@users.noreply.github.com>
Why are these changes needed?
This PR updates the workflow doc to reflect recent changes, focusing on position changes among other things.
It looks like hidden=True commands cannot be documented with Sphinx, so I removed `add_alias` and used the standard click API to give the command a name different from the method name.
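For reference, the standard click pattern for exposing a command under a name different from its Python function (an illustrative sketch, not the exact Ray CLI code):
```python
import click

@click.group()
def cli():
    """Toy CLI used only to illustrate the renaming pattern."""

# `name=` renames the command, so no hidden alias command is needed.
@cli.command(name="status")
def status_command():
    """Print a toy status message."""
    click.echo("OK")

if __name__ == "__main__":
    cli()
```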
Different metrics are collected in Ray Serve when deployments are called via HTTP versus Python. This needs to be mentioned in the documentation, and each metric should be marked accordingly.
Enables better usage with GCP.
The default behavior is that the head node runs with the ray-autoscaler-sa-v1 service account, but workers do not. Workers can run with this service account by copying and uncommenting L114-L117 from example-full.
Signed-off-by: Ian <ian.rodney@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Adds validation for `TrainingArguments.load_best_model_at_end` (it will throw an error down the line if set to True), fixes validation for `*_steps`, and adds a test.
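Roughly, the validation looks like the following (an illustrative sketch; the exact checks and messages differ):
```python
from transformers import TrainingArguments

def validate_training_args(args: TrainingArguments) -> None:
    # Fail fast: this setting would otherwise only error much later in training.
    if args.load_best_model_at_end:
        raise ValueError("`load_best_model_at_end` is not supported; set it to False.")
    # The step-based arguments (names assumed) must be positive to be meaningful.
    for field in ("logging_steps", "save_steps", "eval_steps"):
        value = getattr(args, field, None)
        if value is not None and value <= 0:
            raise ValueError(f"`{field}` must be positive, got {value}.")
```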
Signed-off-by: Antoni Baum <antoni.baum@protonmail.com>
We currently measure end-to-end training time in our benchmarks, which includes setup overhead. This is an unequal comparison, because the setup overhead for vanilla training cannot be measured accurately and was instead simply disregarded.
By comparing the raw training times of the actual training loop, we get a more accurate picture of any overhead or benefit of using Ray vs. vanilla TensorFlow/Torch.
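A minimal sketch of the measurement change (hypothetical function names; the real benchmark scripts differ):
```python
import time

def run_benchmark(setup_fn, train_fn):
    start_total = time.monotonic()
    setup_fn()                        # cluster / data / model setup overhead
    start_train = time.monotonic()
    train_fn()                        # the actual training loop
    end = time.monotonic()
    return {
        "end_to_end_s": end - start_total,      # old metric, includes setup
        "training_loop_s": end - start_train,   # new metric, comparable to vanilla
    }
```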
Signed-off-by: Kai Fricke <kai@anyscale.com>
This PR restores notes for migration from the legacy Ray operator to the new KubeRay operator.
To avoid disrupting the flow of the Ray documentation, these notes are placed in a README accompanying the old operator's code.
These notes are linked from the new docs.
Signed-off-by: Dmitri Gekhtman <dmitri.m.gekhtman@gmail.com>
We have encountered `java.lang.ClassNotFoundException` when deploying Java Ray Serve deployments. The property `ray.job.code-search-path`, which specifies the search path for the user's classes, is not working. The reason is that `ray.job.code-search-path` is loaded in an independent classloader in the Ray context, but the Serve replica initializes the user class with `AppClassLoader`. We need to change the classloader used to construct user classes to the one in the Ray context.
This change adds launch failures to the recent failures section of `ray status` when a node provider provides structured error information. For node providers that don't provide this optional information, there is no change in behavior.
For reference, when trying to launch a node type with a quota issue, it looks like the following (status output first, then the cluster config used). InsufficientInstanceCapacity is the standard term for this issue.
```
======== Autoscaler status: 2022-08-11 22:22:10.735647 ========
Node status
---------------------------------------------------------------
Healthy:
1 cpu_4_ondemand
Pending:
quota, 1 launching
Recent failures:
quota: InsufficientInstanceCapacity (last_attempt: 22:22:00)
Resources
---------------------------------------------------------------
Usage:
0.0/4.0 CPU
0.00/9.079 GiB memory
0.00/4.539 GiB object_store_memory
Demands:
(no resource demands)
```
```
available_node_types:
  cpu_4_ondemand:
    node_config:
      InstanceType: m4.xlarge
      ImageId: latest_dlami
    resources: {}
    min_workers: 0
    max_workers: 0
  quota:
    node_config:
      InstanceType: p4d.24xlarge
      ImageId: latest_dlami
    resources: {}
    min_workers: 1
    max_workers: 1
```
Co-authored-by: Alex <alex@anyscale.com>