hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
shrekris-anyscale	665bdbff47	[serve] Exclude unset fields from Ray actor options (#23059 ) The `schema_to_deployment()` function preserves unset fields with unexpected default argument types. This change excludes unset fields in that function and also changes the dictionaries' default values to empty dicts.	2022-03-11 10:45:21 -06:00
Kai Fricke	a8bed94ed6	[ci/release] Always use full cluster address (#23067 ) Not using the full cluster address is deprecated and breaks Job usage for uploads/downloads: https://buildkite.com/ray-project/release-tests-branch/builds/135#2a03e47b-6a9a-42ff-9346-905725eb8d09	2022-03-11 16:31:21 +00:00
Kenneth	07372927cc	Enable buffering and spilling to multiple remote storages (#22798 ) Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storages as effective as spilling locally. In a test where 512GB of objects were created and spilled, varying just the buffer size while spilling to an S3 bucket resulted in the following runtimes. Buffer size: runtime (s). Default: 3221.865916; 256KB: 1758.885839; 1MB: 748.226089; 10MB: 526.406466; 100MB: 494.830513. Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with larger availability, it is good to configure a larger buffer size. For processes that reach the throughput limits provided by S3, we can remove that bottleneck by supporting more prefixes/buckets. These impacts are less noticeable as the performance gains from using a large buffer prevent us from reaching a bottleneck. The following runtimes were achieved by spilling 512GB with a 1MB buffer and varying prefixes. Prefixes: runtime (s). 1: 748.226089; 3: 527.658646; 10: 516.010742. Together these changes enable faster large-scale object spilling. Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>	2022-03-11 11:27:02 -05:00
Kai Fricke	61295f8b58	[ml/checkpoint] Fix checkpoint location on remote node (#23068 ) Currently breaks tests where the checkpoint is stored on a remote node (e.g. via Ray client), e.g.: https://buildkite.com/ray-project/release-tests-branch/builds/132#6a4936a8-41dd-4fd2-9f02-976855cbd9b7 Instead, we can set the properties manually. In the future, we need a story on how to refer to checkpoints kept on remote nodes.	2022-03-11 15:38:21 +00:00
Jialing He	0cbbb8c1d0	[runtime env][core] Use Proto message `RuntimeEnvInfo` between user code and core_worker (#22856 )	2022-03-11 22:14:18 +08:00
SangBin Cho	965d609627	[Nightly test] Fix a minor syntax issue for core nightly tests (#23069 ) Add frequency to smoke tests Remove unnecessary alerts	2022-03-11 04:58:40 -08:00
Kai Fricke	5b2d58674b	[ci/release] Migrate horovod tests (#22951 ) Migrating horovod tests to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/125	2022-03-11 09:53:29 +00:00
Kai Fricke	aed17dd346	Revert "Revert "[ml/tune] Expose new checkpoint interface to users (#22741 )" (#23006 )" (#23009 ) This reverts commit `85598d9d10`. Test breakage was unrelated.	2022-03-11 09:51:41 +00:00
Tao Wang	10c03cb126	Migrating to flat hash map [GCS&util&common] (#22932 ) Next step of #19220. This PR replaces unordered_map with flat_hash_map in most GCS code and some util & common modules. The placement group part, which exposes user interfaces in Java/Python, is excluded as it's a little bit complicated. Follow-up PRs will migrate the core worker, placement group, and others.	2022-03-11 18:35:06 +09:00
Yi Cheng	ec88eb7d1d	[4][resource reporting] Remove ray syncer from gcs_resource_manager (#22832 ) This PR is part of resource reporting refactoring. In this PR ray syncer is moved from gcs_resource_manager to gcs_placement_group_scheduler. With this one, gcs_resource_manager is totally decoupled from resource broadcasting.	2022-03-11 01:15:25 -08:00
Jialing He	0c5440ee72	[runtime env] Deletes the proto cache on RuntimeEnv (#22944 ) Main changes: - Deletes the proto cache on RuntimeEnv, ensuring that the user's modifications of RuntimeEnv take effect in the Proto message. - Validates the whole runtime env when serializing runtime_env. - Overloads the `__setitem__` method to parse and validate a field when it is modified.	2022-03-11 15:37:18 +08:00
matthewdeng	3a3a7b4be4	[test] add back deleted datasets train test file (#23051 )	2022-03-10 21:46:07 -08:00
Amog Kamsetty	f80602b7d2	[Datasets] Separate pandas to torch conversion in `to_torch` (#22939 ) Separate out the conversion of pandas dataframe to torch tensor in a utility function so that the same logic can be used in other places in Ray ML (for example during inference).	2022-03-10 20:40:01 -08:00
xwjiang2010	4b28bc3f09	[Tuner part1] Add Tuner interface. (#22975 )	2022-03-10 19:55:59 -08:00
Siyuan (Ryans) Zhuang	0a9f966e63	[workflow] Convert DAG to workflow (#22925 ) * convert DAG to a workflow * deduplicate * check duplication of steps * add test for object refs	2022-03-10 19:40:14 -08:00
Eric Liang	148eaeac2e	[minor] Leave a bit of wiggle room when calculating shared memory max (#23034 )	2022-03-10 17:37:26 -08:00
Amog Kamsetty	9bd00f3e1a	[ml/train] Remove `ConvertibleToTrainable` and move `Trainer` to `ray.ml.trainer` (#23030 ) As discussed, - Removes ConvertibleToTrainable interface and makes as_trainable part of the Trainer interface - Moves Trainer interface to ray.ml.trainer from ray.ml.train.trainer	2022-03-10 15:24:58 -08:00
SangBin Cho	ebac18d163	[Nightly test] Support Job based file manager + runner (#22860 ) This PR supports the job-based file manager and runner. It will be the backbone of k8s migration. The PR handles edge cases that originally existed in the old e2e.py job-based runners.	2022-03-10 15:03:50 -08:00
Edward Oakes	5a18802ad7	[serve] Remove runtime-env arg from serve start (#23017 )	2022-03-10 15:15:59 -06:00
Archit Kulkarni	52a722ffe7	[jobs] Make local pip/conda requirements files work with jobs (#22849 )	2022-03-10 15:15:16 -06:00
Amog Kamsetty	a5f41b2c9f	[ml/train] Training Interfaces [1/4]: Ray AIR `Trainer` interface (#22980 )	2022-03-10 13:12:44 -08:00
Guyang Song	3d9f214833	[runtime env] Fix import in subprocess when using pip in runtime_env (#22983 ) Fix the issue https://github.com/ray-project/ray/issues/22968	2022-03-10 15:11:41 -06:00
Max Pumperla	2b8faae40c	[docs] re/move old core examples (#22802 )	2022-03-10 12:17:00 -08:00
xwjiang2010	b1496d235f	[tune] fix error handling for fail_fast case. (#22982 )	2022-03-10 20:10:05 +00:00
Simon Mo	832354ce3f	[Serve] Compatibility bridge between model wrappers and pipeline (#22995 )	2022-03-10 11:52:03 -08:00
Chen Shen	3ebc4ae289	fix comments and typo (#23008 ) Fix comments and typos for scheduler code.	2022-03-10 11:40:31 -08:00
Max Pumperla	11c40e363d	[docs] external promo content (#22823 )	2022-03-10 11:39:44 -08:00
Yi Cheng	9f275c9bb8	[3][resource reporting] Use GCS to report the placement group creation information instead of reporting by raylet (#22597 )	2022-03-10 11:08:21 -08:00
qicosmos	e4a9517739	[C++ Worker]Python call cpp worker (#22820 )	2022-03-10 11:06:14 -08:00
Yi Cheng	bb5fa6b851	Remove redis in setup.py (#22979 )	2022-03-10 11:05:03 -08:00
Archit Kulkarni	c78bd809ce	[job submission] Support local py_modules in jobs (#22843 )	2022-03-10 11:42:25 -06:00
Stephanie Wang	85598d9d10	Revert "[ml/tune] Expose new checkpoint interface to users (#22741 )" (#23006 ) This reverts commit `e9692a2a80`.	2022-03-10 17:07:44 +00:00
SangBin Cho	92b50ff5da	Migrate multi nightly tests (#23005 )	2022-03-11 01:32:10 +09:00
shrekris-anyscale	1100c98222	[serve] Implement Serve Application object (#22917 ) The concept of a Serve Application, a data structure containing all information needed to deploy Serve on a Ray cluster, has surfaced during recent design discussions. This change introduces a formal Application data structure and refactors existing code to use it.	2022-03-10 10:28:29 -06:00
Max Pumperla	d8e862eaba	[docs] templates and contribution guide (fixes #21753 ) (#23003 ) Adding an explicit contributor guide and example templates for our users to help with docs. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-10 15:28:07 +00:00
Jiajun Yao	2e828cc9e1	Delete dead test_setup_worker.py (#22970 ) The tested code is dead so we can remove the code and the test.	2022-03-10 07:20:41 -08:00
SangBin Cho	d192ec30fd	[Nightly Tests] Readjust the concurrency limit. (#23002 ) This PR reduces the concurrency limit. Based on a back-of-the-envelope calculation, the current concurrency limit can easily exceed the service quota. Given large == 2048 vCPUs, it will use about 20K vCPUs, which is slightly larger than the limit.	2022-03-10 07:19:38 -08:00
SangBin Cho	4fa294ca49	[Nightly tests] Stop running broken tests (#22993 )	2022-03-10 06:59:51 -08:00
SangBin Cho	e88abe4c8e	[Nightly tests] migrated most of daily tests (#22960 ) * migrated most of daily tests * Addressed code review.	2022-03-10 05:49:16 -08:00
Antoni Baum	bf49d37176	[tune] Add `Trainable.postprocess_checkpoint` (#22973 ) Adds postprocess_checkpoint method to Trainable to facilitate the checkpointing of preprocessors in AIR.	2022-03-10 12:14:39 +00:00
Tao Wang	bc14512471	[Hotfix]Fix test_actor failure caused by interface change (#23000 )	2022-03-10 19:34:12 +08:00
Kai Fricke	007cf03d7a	[ci/release] Migrate RLLib tests (#22967 ) Migrate to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/111	2022-03-10 10:26:03 +00:00
Kai Fricke	fee4065daf	[ci/release] Migrate SGD tests (#22966 ) Migrate to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/110	2022-03-10 10:23:50 +00:00
Kai Fricke	614dc6b511	[ci/release] Migrate Serve tests (#22965 ) Migrate to new release package. https://buildkite.com/ray-project/release-tests-branch/builds/109	2022-03-10 10:23:25 +00:00
Kai Fricke	ccda1555cc	[ci/release] Migrate Runtime Env tests (#22963 ) Migrating to new release test package. https://buildkite.com/ray-project/release-tests-branch/builds/108	2022-03-10 10:22:57 +00:00
Kai Fricke	e9692a2a80	[ml/tune] Expose new checkpoint interface to users (#22741 ) This PR exposes the new checkpoint interface, implemented in #22691, to end users. It does this by replacing the old external facing TrialCheckpoint class with a merged class that supports the old TrialCheckpoint API (upload, download, save) as well as the new Checkpoint API. With this PR, users can use the new Checkpoint interface for downstream processing of their Ray Tune results. In a follow-up PR, the new Checkpoint interface will be used internally within Ray Tune and Train for bookkeeping, however, that is not required to unblock the Ray ML use case.	2022-03-10 10:20:24 +00:00
kyle-chen-uber	592656ca28	[horovod] remove deprecated slot concept, use worker instead (#22708 ) Horovod updated the attributes of DistributedTrainableCreator and the args used to create the Horovod RayExecutor (horovod/horovod@a729ba7). The major change is that Horovod deprecated the "slot" concept in favor of "worker", which is more consistent with the generic Ray worker. This issue is currently blocking Uber DL trainers from using raytune. This commit updates the Horovod RayExecutor init args. Co-authored-by: Kai Fricke <kai@anyscale.com>	2022-03-10 08:16:42 +00:00
Kai Fricke	18d535f290	[ci/release] Migrate LightGBM tests (#22952 ) Note that LightGBM release tests were previously not enabled. https://buildkite.com/ray-project/release-tests-branch/builds/113 https://buildkite.com/ray-project/release-tests-branch/builds/114 Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-03-10 08:14:31 +00:00
Edward Oakes	22e698d0ff	[serve][release tests] Add smoke test to CI for remaining tests (#22962 )	2022-03-09 23:36:32 -06:00
shrekris-anyscale	bc82e2d5c4	[serve] Restore "[serve] Support working_dir in serve run (#22760 )" (#22971 )	2022-03-09 21:31:23 -08:00
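
The runtimes quoted in the message of commit 07372927cc (#22798) are easier to compare as speedup factors. The following is a minimal sketch that only redoes the arithmetic on the figures copied from that message; the `speedups` helper and the dictionary names are hypothetical and are not part of Ray.

```python
# Runtimes in seconds, copied from the message of commit 07372927cc (#22798):
# 512GB of objects spilled to an S3 bucket.
BUFFER_RUNTIMES = {
    "Default": 3221.865916,
    "256KB": 1758.885839,
    "1MB": 748.226089,
    "10MB": 526.406466,
    "100MB": 494.830513,
}
# Same workload with a 1MB buffer and a varying number of prefixes.
PREFIX_RUNTIMES = {1: 748.226089, 3: 527.658646, 10: 516.010742}


def speedups(runtimes, baseline_key):
    """Hypothetical helper: baseline runtime divided by each runtime."""
    baseline = runtimes[baseline_key]
    return {key: baseline / value for key, value in runtimes.items()}


buffer_speedups = speedups(BUFFER_RUNTIMES, "Default")
prefix_speedups = speedups(PREFIX_RUNTIMES, 1)

for size, factor in buffer_speedups.items():
    print(f"buffer {size:>7}: {factor:.2f}x vs. default")
for count, factor in prefix_speedups.items():
    print(f"{count:>2} prefix(es), 1MB buffer: {factor:.2f}x vs. 1 prefix")
```

Read this way, the numbers match the commit's own conclusion: most of the gain comes from the buffer size, and adding prefixes helps less once a buffer is in place.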

11734 commits