hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Kai Fricke	b91246a093	[air/benchmarks] Measure local training time in torch/tf benchmarks (#27902 ) We currently measure end-to-end training time in our benchmarks, which includes setup overhead. This is an unequal comparison, as setup overhead for vanilla training cannot be accurately expressed and was instead just disregarded. By comparing the raw training times in the actual training loop, we will get a more accurate expression of any potential overhead or benefit in using Ray vs. vanilla tensorflow/torch. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-16 19:16:08 +02:00
xwjiang2010	a3236b6225	[air] fix ptl release test (#27773 ) Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-15 14:47:33 -07:00
xwjiang2010	68cc544da6	[release test] increase air tf gpu benchmark non smoke test timeout from 3600 to 4800. (#27869 ) Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-15 19:03:40 +02:00
xwjiang2010	f77ec350fa	[release test] remove dask/modin_xgboost test completely. (#27865 ) The original script was removed in https://github.com/ray-project/ray/pull/27816. This just cleans up some leftovers. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-15 16:55:33 +02:00
Jian Xiao	5a18b1fc45	Spread the actors in data ingest benchmark, which doubles the throughput (#27620 ) The consuming actors were not spread; this PR fixes that, which improved throughput by 2x.	2022-08-11 11:47:54 -07:00
Ricky Xu	5ea4747448	[Core][State Observability] Nightly release test for state API (#26610 ) * Initial * Correctness test skeleton * Added limit for listing * Updated grpc config * no more waiting * metrics * Updated constant and add test * renamed * actors * actors * actors * dada * actor dead? * Script * correct test name * limit * Added timeout * release test /2 * Merged * format+doc * wip Signed-off-by: rickyyx <ricky@anyscale.com> * revert packag-lock Signed-off-by: rickyyx <rickyx@anyscale.com> * wip * results Signed-off-by: rickyx <rickyx@anyscale.com> Signed-off-by: rickyyx <rickyx@anyscale.com> Signed-off-by: rickyyx <ricky@anyscale.com> Signed-off-by: rickyx <rickyx@anyscale.com> Co-authored-by: rickyyx <ricky@anyscale.com>	2022-08-11 07:01:01 -07:00
Artur Niederfahrenhorst	0dceddb912	[RLlib] Move learning_starts logic from buffers into `training_step()`. (#26032 )	2022-08-11 13:07:30 +02:00
matthewdeng	8eca6ae852	[rllib][release] mark long_running_many_ppo as unstable (#26874 ) Per #26718 (comment)	2022-08-10 17:58:33 -07:00
Avnish Narayan	aee008ab49	[RLlib] PPO release tests tuned and re-enabled. (#27564 )	2022-08-08 21:04:19 +02:00
Jian Xiao	30cf449807	Add data ingest benchmark (#27533 ) Make sure Dataset/DatasetPipeline work performantly for data ingestion.	2022-08-05 12:31:06 -07:00
Avnish Narayan	6a31b61580	[RLlib] CQL change hparams and data reading strategy (#27451 )	2022-08-04 18:55:32 -07:00
Avnish Narayan	55209692ee	[RLlib] Deflake MARWIL and BC and remove memory leak from torch MARWIL policy (#27406 )	2022-08-03 16:53:12 -07:00
Jimmy Yao	1c1cca2736	[release/ray-lightning] adjust the release test of ray lightning master First of all, sorry, I messed up the previous PR when syncing with master (#27374). This PR duplicates the previous PR until we update the changes (change: adding a version check for ray_lightning for compatibility). Also, apologies for the massive review requests on the previous PR.	2022-08-03 16:01:32 +01:00
Simon Mo	8ac6d02502	[Serve][Nightly] Environment for Nightly K8s Tests (#27126 )	2022-08-02 23:05:47 -07:00
kourosh hakhamaneshi	bda5026428	[RLlib] Fix A2C release tests (#27314 )	2022-08-02 10:44:52 -07:00
Kai Fricke	d527c7b335	[air/benchmarks] Drop OMP_NUM_THREADS in vanilla torch/tf training (#27256 ) Ray automatically sets OMP_NUM_THREADS=1, potentially limiting multithreading in native pytorch/tensorflow. If this leads to performance differences, we should address this either in Ray Train or in Ray core. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-02 13:38:01 +01:00
Kai Fricke	149c031c4b	[tune/release] Do not use spot instances in k8s tests (#27250 ) Spot instances are not being booted up, so let's go without them. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-08-02 11:30:41 +01:00
xwjiang2010	c9579fea1c	[air] update pytorch_training_e2e.py to use iter_torch_batches. (#27241 ) update pytorch_training_e2e.py to use iter_torch_batches. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-08-01 19:23:01 +01:00
Dmitri Gekhtman	8bdeb30510	[docs][ml][kuberay] Add a --disable-check flag to the XGBoost benchmark. (#27277 ) This PR adds a flag --disable-check to the XGBoost benchmark script which disables the RuntimeError that comes up if training or prediction took too long. This is meant for non-CI exploratory use-cases. Specifically, the reason is this: We will include the XGBoost benchmark as an example workload for the KubeRay documentation. The actual performance of the workload is highly sensitive to infrastructure environment, so we won't want to raise an alarming RuntimeError if the workload took too long on the user's infrastructure. (When I tried the 100Gb benchmark on KubeRay, training ran just a couple of minutes longer than the 1000 second cutoff.)	2022-07-29 14:31:10 -07:00
Jun Gong	e6e10ce4cf	[RLlib] Revert `41c9ef70`. (#27243 ) Also: Add validation to make sure multi-GPU and micro-batching are not used together. Update the A2C learning test to hit the microbatching branch. Minor comment updates.	2022-07-29 11:05:15 -07:00
Kai Fricke	ee05fc94fe	[tune] Increase volume size for long running pbt failure (#27163 ) Currently running into an issue: Cluster startup Failed. Error: RuntimeError: botocore.exceptions.ClientError: An error occurred (InvalidBlockDeviceMapping) when calling the RunInstances operation: Volume of size 202GB is smaller than snapshot 'snap-02c4e6a0ad06cf3d6', expect size >= 400GB	2022-07-28 22:57:26 -07:00
Clark Zinzow	3730ec8cc9	[AIR - Datasets] Fix AIR release tests dealing with tensor columns. (#27221 ) This PR fixes some AIR release tests that deal with tensor columns.	2022-07-28 14:34:11 -07:00
Simon Mo	8beb887bbe	[Serve] Remove release tests for checkpoint_path (#27194 )	2022-07-28 12:30:30 -07:00
Kai Fricke	3cd9a0446b	[tune/rllib/release] Load correct metadata file in rllib cloud tests (#27164 ) Currently this tries to load a stale metadata file that doesn't exist anymore after internal refactoring. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-28 15:51:09 +01:00
Kai Fricke	1d3c167bfe	[rllib/release] Fix rllib connect test with Tuner() API (#27155 ) Currently failing because the Tune framework example does not return fitting results. Signed-off-by: Kai Fricke <kai@anyscale.com>	2022-07-28 11:08:02 +01:00
matthewdeng	0319dcd889	[air] fix xgboost_benchmark script by passing in args (#27146 )	2022-07-27 19:08:15 -07:00
xwjiang2010	eb69c1ca28	[air] Add annotation for Tune module. (#27060 ) Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-27 13:53:46 -07:00
Malinda	1d789aee63	[RLlib/Serve/Release tests] A few code refactorings to make better use of efficient NumPy functions. (#26284 )	2022-07-27 22:38:35 +02:00
Simon Mo	e5a8b1dd55	[Serve] Add API Annotations And Move to _private (#27058 )	2022-07-27 09:08:26 -07:00
SangBin Cho	a6fe2c1e87	[Release test] Add a memory monitor to nightly test long running actor death (#27083 ) Add a memory monitor to the nightly long running actor death test. It will be used to detect memory leaks in the test.	2022-07-27 07:32:10 -07:00
Amog Kamsetty	862d10c162	[AIR] Remove ML code from `ray.util` (#27005 ) Removes all ML related code from `ray.util` Removes: - `ray.util.xgboost` - `ray.util.lightgbm` - `ray.util.horovod` - `ray.util.ray_lightning` Moves `ray.util.ml_utils` to other locations Closes #23900 Signed-off-by: Amog Kamsetty <amogkamsetty@yahoo.com> Signed-off-by: Kai Fricke <kai@anyscale.com> Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-07-27 14:24:19 +01:00
xwjiang2010	4c30325172	[air] update xgboost test (catch test failures properly). (#27023 ) - Update xgboost test (catch test failures properly) - Remove `path` from `from_model` for XGBoostCheckpoint and LightGbmCheckpoint. Signed-off-by: xwjiang2010 <xwjiang2010@gmail.com>	2022-07-27 12:18:51 +01:00
Kai Fricke	ce5c5d858b	[ci/release/RLlib] Fix IMPALA long running release test. (#27086 )	2022-07-27 12:38:32 +02:00
Avnish Narayan	f5a9a44b9c	[RLlib] Revert Revert Fix apex long running test (#26928 )	2022-07-26 15:10:25 -07:00
Balaji Veeramani	89f7f2a567	[Datasets] Add `size` parameter to `ImageFolderDatasource` (#26975 ) If you read a folder with differently-sized images, `ImageFolderDatasource` errors. This PR fixes the issue by resizing images to a user-specified size.	2022-07-26 14:57:38 -07:00
matthewdeng	1bb7651e95	[air] add smoke-test flag to tensorflow_benchmark (#26999 ) Increase ratio from 1.15 to 1.2 Signed-off-by: Matthew Deng <matt@anyscale.com>	2022-07-26 15:47:37 +01:00
Sihan Wang	8ecd928c34	[Serve] Make the checkpoint and recover only from GCS (#26753 )	2022-07-25 14:24:53 -07:00
Chen Shen	acbab51d3e	[Nightly] fix microbenchmark scripts (#26947 ) Signed-off-by: scv119 <scv119@gmail.com> Why are these changes needed? The microbenchmarks failed at `raise ValueError(f"Malformed address: {address}")` with the error `ValueError: Malformed address:`. This is due to `55a0f7b`; the fix is to set RAY_ADDRESS="local".	2022-07-24 14:16:43 -07:00
Avnish Narayan	a50a81a13a	Revert "[RLlib] Fix apex breakout release test performance. (#26867 )" (#26927 )	2022-07-23 17:27:50 +02:00
Avnish Narayan	2cfd6c2e97	[RLlib] Fix apex breakout release test performance. (#26867 )	2022-07-23 13:53:03 +02:00
Richard Liaw	96e8027c7e	[air] large tune/torch benchmark (#26763 ) Co-authored-by: Kai Fricke <krfricke@users.noreply.github.com>	2022-07-23 01:17:25 -07:00
Jiao	840b0478aa	[AIR CUJ] Add wait_for_nodes for 4x4 gpu test	2022-07-22 16:04:54 -07:00
Steven Morad	259429bdc3	Bump gym dep to 0.24 (#26190 ) Co-authored-by: Steven Morad <smorad@anyscale.com> Co-authored-by: Avnish <avnishnarayan@gmail.com> Co-authored-by: Avnish Narayan <38871737+avnishn@users.noreply.github.com>	2022-07-22 12:37:16 -07:00
Avnish Narayan	82395c4646	[RLlib] Put learning test into own folders (#26862 ) Co-authored-by: Artur Niederfahrenhorst <artur@anyscale.com>	2022-07-22 11:20:47 -07:00
Avnish Narayan	67c0a69643	[Rllib] Fix broken cluster env launcher gym pinning (#26865 )	2022-07-21 20:45:16 -07:00
matthewdeng	14e2b2548c	[air] update remaining dict scaling_configs (#26856 )	2022-07-21 18:55:21 -07:00
Balaji Veeramani	ac1d21027d	[AIR] Add framework-specific checkpoints (#26777 )	2022-07-20 19:33:27 -07:00
Archit Kulkarni	e043f49957	[Serve] [CI] Increase instance size and add debug log for `autoscaling_multi_deployment` release test (#26732 )	2022-07-20 16:13:36 -07:00
Kai Fricke	2e35d47bd2	[air/train/benchmark] Add TF GPU 4x4 benchmark (#26776 )	2022-07-20 14:07:51 -07:00
Avnish Narayan	5433c11650	[RLlib] Pin gym to 0.23.1 (#26752 )	2022-07-20 11:49:01 -07:00

751 commits