hiro/ray - Forgejo: Beyond coding. We Forge.

hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 10:31:39 -05:00

Author	SHA1	Message	Date
Balaji Veeramani	c694ed4594	[Train] Add `enable_reproducibility` (#22851 ) This PR adds a feature that allows user to make their training runs more reproducible. I've implemented this feature by following PyTorch's guide on how to limit sources of randomness (https://pytorch.org/docs/stable/notes/randomness.html). These changes will make it easier for us to benchmark Ray Train, and also make it easier for users to reproduce their experiments.	2022-03-15 11:07:34 -07:00
Amog Kamsetty	e1f24a244b	[ml/train] Training Interfaces [3/4]: `DataParallelTrainer` interface (#22988 ) Interface for DataParallelTrainer and updates to ScalingConfig definition. Depends on #22986 Co-authored-by: Eric Liang <ekhliang@gmail.com> Co-authored-by: Richard Liaw <rliaw@berkeley.edu> Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>	2022-03-15 08:11:05 -07:00
Max Pumperla	ad30123339	[docs] fix includes for md files (#23180 ) the include of content for md files like our central getting started page didn't render. fixed here. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-15 11:09:18 +00:00
Pamphile Roy	81b17669a4	[core][docs] Document port/IP binding and slurm concerns (#22663 ) Using Ray on SLURM system is documented but missing some pitfalls about network. This PR adds some information about port binding and address binding (I will open a feature request with more and link it here later). I did not put any real recommendation on this last point since `--address` did not work. I had cannot resolve issue after setting an internal IP although it's reachable.	2022-03-15 01:43:46 -07:00
Jules S. Damji	0246f3532e	[DOC] Added a full example how to access deployments (#22401 )	2022-03-14 21:15:52 -05:00
Jialing He	39a6c054d3	[runtime env][feature] introduce pip_check_enable and pip_version (#22826 )	2022-03-14 23:41:19 +08:00
Jiaxin Shan	8823ca48b4	[Workflow] Improve workflow docs (#23114 ) * [Workflow] Improve workflow docs * Update doc/source/workflows/concepts.rst Co-authored-by: Siyuan (Ryans) Zhuang <suquark@gmail.com>	2022-03-13 18:55:45 -07:00
Scott Graham	f673acb0ad	Scgraham/azure docs (#22296 ) Fixes potential error if function not found in azure sdk when deploying ray cluster on azure Adds additional python package needed to deploy ray cluster on azure in docs Co-authored-by: Scott Graham <scgraham@microsoft.com>	2022-03-13 18:08:08 -07:00
Kenneth	07372927cc	Enable buffering and spilling to multiple remote storages (#22798 ) Buffering writes to AWS S3 is highly recommended to maximize throughput. Reducing the number of remote I/O requests can make spilling to remote storages as effective as spilling locally. In a test where 512GB of objects were created and spilled, varying just the buffer size while spilling to a S3 bucket resulted in the following runtimes. Buffer Size \| Runtime (s) -- \| -- Default \| 3221.865916 256KB \| 1758.885839 1MB \| 748.226089 10MB \| 526.406466 100MB \| 494.830513 Based on these results, a default buffer size of 1MB has been added. This is the minimum buffer size used by AWS Kinesis Firehose, a streaming service for S3. On systems with larger availability, it is good to configure a larger buffer size. For processes that reach the throughput limits provided by S3, we can remove that bottleneck by supporting more prefixes/buckets. These impacts are less noticeable as the performance gains from using a large buffer prevent us from reaching a bottleneck. The following runtimes were achieved by spilling 512GB with a 1MB buffer and varying prefixes. Prefixes \| Runtime (s) -- \| -- 1 \| 748.226089 3 \| 527.658646 10 \| 516.010742 Together these changes enable faster large-scale object spilling. Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>	2022-03-11 11:27:02 -05:00
matthewdeng	3a3a7b4be4	[test] add back deleted datasets train test file (#23051 )	2022-03-10 21:46:07 -08:00
Archit Kulkarni	52a722ffe7	[jobs] Make local pip/conda requirements files work with jobs (#22849 )	2022-03-10 15:15:16 -06:00
Max Pumperla	2b8faae40c	[docs] re/move old core examples (#22802 )	2022-03-10 12:17:00 -08:00
Max Pumperla	11c40e363d	[docs] external promo content (#22823 )	2022-03-10 11:39:44 -08:00
qicosmos	e4a9517739	[C++ Worker]Python call cpp worker (#22820 )	2022-03-10 11:06:14 -08:00
Max Pumperla	d8e862eaba	[docs] templates and contribution guide (fixes #21753 ) (#23003 ) Adding an explicit contributor guide and example templates for our users to help with docs. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-10 15:28:07 +00:00
Dmitri Gekhtman	413fe08f87	Move KubeRay autoscaler files into Ray autoscaler directory, add an entry-point. (#22847 ) This PR consists of the following clean-up items for KubeRay autoscaler integration: Remove the docker/kuberay directory Move the Python files formerly in docker/kuberay to the autoscaler directory. Use a rayproject/ray image for the autoscaler. Add an entry point for the kuberay autoscaler to scripts.py. Use the entry point in the example config. Slightly simplify the code that starts the autoscaler. Ray versions are updated to Ray 1.11.0, which will be officially released within the next couple of days. By default, Ray >= 1.11.0 runs without Redis. References to Redis are removed from the example config. Add the autoscaler configuration test to the CI. Update development documentation to reflect the changes in this PR.	2022-03-09 18:26:57 -08:00
Alex Wu	b84aaef38a	Promote python 3.9 support to stable (#22923 ) Remove the experimental note from python 3.9 since it and its core dependencies have been stable for quite some time now. Co-authored-by: Alex Wu <alex@anyscale.com>	2022-03-08 17:24:54 -08:00
Eric Liang	52491c87e2	Make a pass fixing Dataset API issues (#22886 )	2022-03-08 13:07:55 -08:00
Max Pumperla	d6bff736f3	[docs] test ray.io snippets (#22822 ) Tests all snippets we have on ray.io. There were some minor issues, which I'll fix upstream. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-08 15:50:57 +00:00
Stephanie Wang	cb218d03b9	[core] Enable lineage reconstruction by default (#22816 ) Enables lineage reconstruction, which allows automatic recovery of task outputs, by default. Also adds an info message to the driver whenever objects need to be reconstructed (not including recursive reconstruction).	2022-03-07 17:40:30 -05:00
Max Pumperla	b609bdf898	[docs] Improve connection between library references and their APIs (#22800 ) Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-04 16:48:03 +01:00
Antoni Baum	283666fe02	[docs] Update XGBoost/LightGBM-Ray docs (#22783 ) Brings the docs up to date with XGBoost/LightGBM-Ray readmes.	2022-03-03 18:02:43 +01:00
Archit Kulkarni	e937f1a3c4	[runtime env] [Doc] add more details about runtime env logs (#22480 ) Clarifies the logging behavior for runtime envs, and adds the runtime env logs fileto the list of log files in the main logging page.	2022-03-02 14:27:28 -08:00
Max Pumperla	d53d0e0f50	[docs] Typo - fixes #22761 (#22763 ) Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-02 10:34:46 +01:00
Max Pumperla	7d4296c72f	run code in browser (#22727 ) Example for running notebooks on our docs directly in the browser by connecting to a binder instance launched on demand. If this seems useful we can extend this to other examples gradually. Signed-off-by: Max Pumperla <max.pumperla@googlemail.com>	2022-03-02 10:27:00 +01:00
Archit Kulkarni	1752f17c6d	[Job submission] Add `list_jobs` API (#22679 ) Adds an API to the REST server, the SDK, and the CLI for listing all jobs that have been submitted, along with their information. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>	2022-03-01 21:27:09 -06:00
Eric Liang	5a0b7a7ee0	Document Dataset pipeline stage fusion (#22737 )	2022-03-01 14:38:09 -08:00
Eric Liang	e228544d39	Undo revert of windowing dataset by bytes (#22735 )	2022-03-01 12:24:04 -08:00
Kenneth	9b67cb5a6f	Add buffering to object spilling (#22618 ) This change is needed for object fusing to see performance increases on HDD. Currently, smaller object writes are slow even with fusing since the writes are not buffered (negating the point of fusing). Benchmarks show that while the default is sufficient for fast SSDs, on a slow HDD, increasing the buffer size reduces write times by several magnitudes. ### Performance Changes A microbenchmark where 500KB objects were produced (then spilled) and consumed to observe changes in object fusing/spilling. \| Run \| Produce (s) \| Consume (s) \| Total (s) \| \| -- \| -- \| -- \| -- \| \| Baseline (original) \| 347.332281 \| 355.611272 \| 705.560750 \| \| Baseline (w/ fix) \| 181.815852 \| 347.692850 \| 532.847759 \| \| No fusing (original) \| 453.574554 \| 525.047998 \| 981.620108 \| \| No fusing (w/ fix) \| 452.614848\| 519.787698 \| 975.412639 \| The baseline runs should be notably faster due to object fusing reducing I/O requests. With the fix, Ray's defaults allow this microbenchmark to have a 48% time reduction with negligible impact on runtime when fusing is disabled. See [this followup](https://github.com/ray-project/ray/pull/22618#issuecomment-1054838715) for information on the differences between SSD and HDD performance with different buffer sizes. Co-authored-by: Ubuntu <ubuntu@ip-172-31-54-240.us-west-2.compute.internal>	2022-03-01 10:13:10 -08:00
Stephanie Wang	73f078236f	[doc] Update docs about actor garbage collection (#20763 ) Update outdated actor docs about when actors are GCed.	2022-02-28 18:45:29 -08:00
Jiaxin Shan	32829ff9ad	[KubeRay] Provide a new Dockerfile for fast build (#22689 ) Adds a new Dockerfile for fast build and development of KubeRay.	2022-02-28 17:09:16 -08:00
Archit Kulkarni	85657b1377	[Doc] [Jobs] add CLI and SDK reference to docs (#22680 )	2022-02-28 17:57:46 -06:00
SangBin Cho	ba4f1423c7	Revert "Support creating a DatasetPipeline windowed by bytes (#22577 )" (#22695 ) This reverts commit `b5b4460932`.	2022-02-28 11:56:12 -08:00
Jialing He	98a69cbd90	[runtime env][strong-typed API] Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv` (#22522 ) Combine `ParsedRuntimeEnv` and `RuntimeEnv` into `ray.runtime.RuntimeEnv`, details: #21495 - The `new RuntimeEnv` includes all external interfaces of `ParsedRuntimeEnv` and `old RuntimeEnv`. - The `new RuntimeEnv` will be exposed directly to the user. - example: ```python runtime_env = ray.runtime_env.RuntimeEnv(working_dir="s3://workding_dir.zip", pip=["requests"], java_jars=["s3://jar1.zip"], java_jvm_options=["-Dxxx=xxx"]) ```	2022-02-28 16:18:10 +08:00
Max Pumperla	372c620f58	[docs] Tune overhaul part II (#22656 ) Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-02-26 23:07:34 -08:00
Eric Liang	b5b4460932	Support creating a DatasetPipeline windowed by bytes (#22577 )	2022-02-25 23:31:10 -08:00
Antoni Baum	d5284a740c	[tune] Remove `Trainable.update_resources` (#22471 )	2022-02-25 08:38:34 -08:00
xwjiang2010	d4a1bc7bc7	Revert "[runtime env] runtime env inheritance refactor (#22244 )" (#22626 ) Breaks train_torch_linear_test.py.	2022-02-25 08:42:30 -06:00
Eric Liang	533a0440a6	Improve actor pool support in Datasets (#22574 )	2022-02-24 12:01:36 -08:00
jon-chuang	11500dc12c	[docs] include ray status and ray monitor into ray command line api docs (#22614 ) Fixes: https://github.com/ray-project/ray/issues/18527	2022-02-23 20:09:45 -08:00
Amog Kamsetty	80e0d9cea4	[Train] Update docs for ray.train.torch import (#22555 ) Update more examples to include the ray.train.torch import line. Follow up to #21969	2022-02-23 19:22:27 -08:00
Edward Oakes	5a21289a34	[runtime_env] Remove get_current_runtime_env from docs (#22594 ) We should just encourage people to use the existing `get_runtime_context` API instead of introducing a new one here. Just removing the docs for now while we discuss this.	2022-02-23 16:53:52 -06:00
Archit Kulkarni	87f7bfe4cd	[doc] [job submission] Add k8s instructions and a comment about ports (#22598 )	2022-02-23 16:32:37 -06:00
Sven Mika	8e00537b65	[RLlib] SlateQ: framework=tf fixes and SlateQ documentation update (#22543 )	2022-02-23 13:03:45 +01:00
mwtian	9a157dfe82	[GCS-Ray] update doc and error message for GCS-Ray (#22528 ) Update documentation to reflect that Ray no longer starts Redis by default.	2022-02-22 17:56:30 -08:00
Dmitri Gekhtman	a402e956a4	[KubeRay] Format autoscaling config based on RayCluster CR (#22348 ) Closes #21655. At the start of each autoscaler iteration, we read the Ray Cluster CR from K8s and use it to extract the autoscaling config.	2022-02-22 11:06:37 -08:00
Antoni Baum	4a15c6f8f3	[tune] Preparation for deadline schedulers (#22006 )	2022-02-22 11:05:28 -08:00
Guyang Song	5783cdb254	[runtime env] runtime env inheritance refactor (#22244 ) Runtime Environments is already GA in Ray 1.6.0. The latest doc is [here](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#runtime-environments). And now, we already supported a [inheritance](https://docs.ray.io/en/master/ray-core/handling-dependencies.html#inheritance) behavior as follows (copied from the doc): - The runtime_env["env_vars"] field will be merged with the runtime_env["env_vars"] field of the parent. This allows for environment variables set in the parent’s runtime environment to be automatically propagated to the child, even if new environment variables are set in the child’s runtime environment. - Every other field in the runtime_env will be overridden by the child, not merged. For example, if runtime_env["py_modules"] is specified, it will replace the runtime_env["py_modules"] field of the parent. We think this runtime env merging logic is so complex and confusing to users because users can't know the final runtime env before the jobs are run. Current PR tries to do a refactor and change the behavior of Runtime Environments inheritance. Here is the new behavior: - If there is no runtime env option when we create actor, inherit the parent runtime env. - Otherwise, use the optional runtime env directly and don't do the merging. Add a new API named `ray.runtime_env.get_current_runtime_env()` to get the parent runtime env and modify this dict by yourself. Like: ```Actor.options(runtime_env=ray.runtime_env.get_current_runtime_env().update({"X": "Y"}))``` This new API also can be used in ray client.	2022-02-21 18:13:22 +08:00
Max Pumperla	29d94a2211	[docs] sphinx gallery removal, migrate to ipynb (#22467 )	2022-02-19 01:19:07 -08:00
Archit Kulkarni	8c12e30f11	[Doc] Add actor max restarts default value to fault tolerance doc (#22481 )	2022-02-18 17:48:22 -06:00

1 2 3 4 5 ...

1920 commits