hiro/ray

mirror of https://github.com/vale981/ray synced 2025-03-06 02:21:39 -05:00

Author	SHA1	Message	Date
Dmitri Gekhtman	1fee0159b4	[test][k8s] Minor adjustment to manual K8s tests (#21924 ) This PR is a minor adjustment to the K8s release tests: replace tasks with actors in the scale test for reduced flakiness, and use an up-to-date Ray client API.	2022-01-27 20:07:14 -08:00
iasoon	b0700e676b	[serve] add root_path setting (#21090 ) Support hosting a serve instance under a path prefix. Some clean-up should still be done for the different overlapping HttpOptions that now exist (host, port, root_path, root_url).	2022-01-27 16:36:22 -06:00
Sriram Sankar	b7391a1c39	[autoscaler] Optimize finding the node id (#21885 ) This is a simple refactoring change and my first PR in ray-project. This change moves an if statement outside of a loop. This way the check is not repeated for each iteration.	2022-01-27 10:51:59 -08:00
Victor Yap	8be5f016af	Add NVIDIA_TESLA_A100 to accelerator types (#21558 ) Adds Nvidia's A100 to the list of accelerator types. AWS offers this in the p4d.24xlarge instance type.	2022-01-27 10:47:09 -08:00
Kai Fricke	8dcd4a99ef	[tune/wandb] Use `resume=False` per default (#21892 ) The WandbLoggingCallback is run on the driver side, with the experiment directory as the cwd. Using resume=True will pick up state from other trials (as the file name is global), and thus lead to warning messages. Thus, we should default to resume=False when using the callback. This PR also incorporates changes from #20966. Co-authored by: Queimo <queimo@gmx.net> Co-authored by: Karim <karim.ben.hicham@rwth-aachen.de>	2022-01-27 07:58:01 +00:00
Yi Cheng	e6bbafc17a	[function table] Make sure FunctionsToRun are executed properly on all workers (#21867 ) This PR fixes the issue that FunctionsToRun is sometimes not executed. We isolated the Functions/Actors in the function table, but not the FunctionsToRun, so some functions were sometimes missed during import.	2022-01-26 21:58:43 -08:00
SangBin Cho	d363c37078	[Core] Stop Ray stop from killing redis that's not started by Ray (#21805 ) Currently, the `ray stop` logic is fragile and kills Redis servers that were not started by Ray. This PR fixes the issue by checking the executable name of redis-server more carefully (if the redis-server was created by Ray, its path should contain the Ray-specific path copied while wheels are built). I originally tried to obtain the ppid and kill a redis-server only when it was created from the same parent, but it turns out all processes started by ray start have no ppid. While the best solution is to have some "process manager" that lets us detect a Redis server started by us, I think there's no need to put much effort here right now since Redis will be removed soon. We will eventually move to a better direction (process manager) to handle this sort of issue.	2022-01-26 18:12:38 -08:00
Dmitri Gekhtman	757b5a88ea	[autoscaler] Cap min and max workers for manually managed on-prem clusters. (#21710 ) Closes https://github.com/ray-project/ray/issues/19636 by capping min and max workers for manually managed on-prem clusters to the number of user-specified worker ips. See https://github.com/ray-project/ray/issues/19636#issuecomment-1016664169 for additional context.	2022-01-26 18:03:55 -08:00
Simon Mo	ac6709f0ba	[Serve] Fix uvicorn duplicate header issue (#21884 )	2022-01-26 14:43:18 -08:00
xwjiang2010	80af046b54	[tune] deflake testBadParams5. (#21898 ) The test is timing out during actor creation and ends up not testing the code which is only triggered after a training result is returned back to driver. Change to use a simpler Trainable.	2022-01-26 19:38:15 +00:00
SangBin Cho	e62c0052a0	[Dashboard] Agent in minimal ray installation (#21817 ) This is the second part of https://docs.google.com/document/d/12qP3x5uaqZSKS-A_kK0ylPOp0E02_l-deAbmm8YtdFw/edit#. After this PR, dashboard agents will fully work with minimal ray installation. Note that this PR requires introducing "aioredis", "frozenlist", and "aiosignal" to the minimal installation. These dependencies are very small (or will be removed soon), and including them in the minimal installation makes things very easy. Please see the below for the reasoning.	2022-01-26 04:03:54 -08:00
Alex Wu	7a45f60dbc	[autoscaler] Fix ray.autoscaler.sdk import issue (#21795 ) This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. Note that doing this naively created circular dependencies, because the ray core now uses constants that were defined in the autoscaler for internal kv operations (and the autoscaler similarly calls into the ray core). The solution was to move those internal kv keys into ray core constants so the imports flow (more) one way. Co-authored-by: Alex Wu <alex@anyscale.com>	2022-01-25 14:43:24 -08:00
Wilson Wang	30a4761592	Two issues fix for GCS connecting logic in monitor.py and log_monitor.py (#21790 ) This patch fixes two issues. 1. log_monitor.py can crash when GCS is temporarily unavailable; this adds retry logic in gcs_pubsub.py. 2. The signal handler can raise another exception during exception handling.	2022-01-25 14:07:26 -08:00
Ian Rodney	257bd2d1e7	[Cleanup] Use `mkstemp` (#21676 ) `tempfile.mktemp` is technically deprecated in favor of `tempfile.mkstemp`. Ref: https://docs.python.org/3/library/tempfile.html#deprecated-functions-and-variables.	2022-01-25 13:42:12 -08:00
Dhruv Nair	3d79815cd0	Comet Integration (#20766 ) This PR adds a `CometLoggerCallback` to the Tune Integrations, allowing users to log runs from Ray to [Comet](https://www.comet.ml/site/). Co-authored-by: Michael Cullan <mjcullan@gmail.com> Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-01-25 11:42:00 -08:00
Clark Zinzow	1971a08b7d	[RFC] [Core] Support disabling log redirection via `RAY_LOG_TO_STDERR` environment variable. (#21767 )	2022-01-25 10:52:53 -08:00
Gagandeep Singh	395297a9bd	Unskip tests for Windows in `test_output` (#21775 )	2022-01-25 09:25:01 -08:00
Matti Picus	d3d1e8559c	enable passing metric tests on windows (#21755 ) Resubmitting #21705 which was merged then reverted. It seems somehow sphinx building broke in the meantime, not clear how it is connected to this PR. Here is the original description: >Part of the effort to enable tests on windows, this enables test_metrics and test_metric_agents, which pass locally.	2022-01-25 09:20:16 -08:00
SangBin Cho	b2cd123522	[Runtime Env] Suppress the log messages when RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0 (#21806 ) There was a user request to disable runtime env logs. This is the first PR that allows users to disable runtime env logs through an env var. Basically if users specify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0`, this will disable runtime env logs. Note that in the log monitor RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=1 by default. This is temporary, and I'd like to make this 0 by default after improving runtime error failure messages. Once we disable log msgs by default, we can unify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED` and `RAY_RUNTIME_ENV_LOCAL_DEV_MODE`	2022-01-25 00:42:52 -08:00
Gagandeep Singh	290f3172ad	Unskipped tests for Windows in `test_client.py` (#21824 ) All the tests in `test_client.py` pass on Windows without issues, so unskipping them here.	2022-01-24 22:51:54 -08:00
Lixin Wei	bc55a958c4	[Core] Support UTF-8 Actor Creation Exceptions (#21807 ) Currently, if an actor throws an exception containing non-ASCII characters, the actor does not die and stays alive. This is because the following exception occurs while handling the user's exception: ``` File "python/ray/_raylet.pyx", line 587, in ray._raylet.task_execution_handler File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task File "python/ray/_raylet.pyx", line 551, in ray._raylet.execute_task File "/home/admin/.local/lib/python3.6/site-packages/ray/utils.py", line 96, in push_error_to_driver worker.core_worker.push_error(job_id, error_type, message, time.time()) File "python/ray/_raylet.pyx", line 1636, in ray._raylet.CoreWorker.push_error UnicodeEncodeError: 'ascii' codec can't encode characters in position 2597-2600: ordinal not in range(128) An unexpected internal error occurred while the worker was executing a task. ``` This PR fixes this issue.	2022-01-24 20:27:43 -08:00
Andrew A. Naguib	f026376556	[Tune] PTL replace deprecated `running_sanity_check` with `sanity_checking` (#21831 ) `running_sanity_check` was deprecated and removed in https://github.com/PyTorchLightning/pytorch-lightning/pull/9209 in favor of `sanity_checking`	2022-01-24 16:14:05 -08:00
Siyuan (Ryans) Zhuang	99b287d236	[workflow] Fix workflow recovery issue due to a bug of dynamic output (#21571 ) * Fix workflow recovery issue due to a bug of dynamic output * add tests	2022-01-24 15:34:57 -08:00
DK.Pino	c2199a50e3	[Placement Group] Fix remove pg flaky when worker startup slow (#20474 ) Currently, when we destroy a created placement group, we kill all workers related to it. However, we only kill the workers that are running at that time. If a worker starts up very slowly and its placement group is destroyed before the worker finishes starting, the worker leaks.	2022-01-24 15:30:04 -08:00
mwtian	a10d05ce27	[Bootstrap] fix log format (#21826 )	2022-01-24 15:06:41 -08:00
Yi Cheng	57afb2f75a	[gcs/ha] Skip raydb test when it's gcs bootstrap mode (#21771 ) RayDP needs to be updated to work with redisless ray. More specifically, this [line](`c08a786770/python/raydp/spark/ray_cluster_master.py (L146)` ) needs to be updated to use `node.address`. We should update this after the release with the feature turned on by default.	2022-01-24 14:43:31 -08:00
shrekris-anyscale	03d93ba7ee	Add a new End-to-End tutorial in Serve that walks users through deploying a model (#20765 ) Currently, the docs have an [end-to-end tutorial](https://web.archive.org/web/20211122152843/https://docs.ray.io/en/latest/serve/tutorial.html) walking users through deploying a `Counter` function on Serve. This PR adds an end-to-end tutorial walking users through deploying an entire Hugging Face model using Serve, providing a better understanding of how to deploy an actual model via Serve. Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Simon Mo <simon.mo@hey.com>	2022-01-24 16:36:04 -06:00
Sven Mika	c288b97e5f	[RLlib] Issue 21629: Video recorder env wrapper not working. Added test case. (#21670 )	2022-01-24 19:38:21 +01:00
Antoni Baum	850eb88cde	[tune] Fix analysis without registered trainable (#21475 ) This PR fixes issues with loading ExperimentAnalysis from path or pickle if the trainable used in the trials is not registered. Chiefly, it ensures that the stub attribute set in load_trials_from_experiment_checkpoint doesn't get overridden by the state of the loaded trial, and that when pickling, all trials in ExperimentAnalysis are turned into stubs if they aren't already. A test has also been added.	2022-01-24 08:27:08 -08:00
Guyang Song	f8e41215b3	[1/n][cross-language runtime env] runtime env protobuf refactor (#21551 ) We need to support runtime env for Java, C++, and cross-language calls. This PR only refactors the protobuf. Related issue #21731	2022-01-24 19:24:59 +08:00
Chen Shen	a60251f47a	[Core] Fix 16GB mac perf issue by limit the plasma store size to 2GB (#21224 ) * add changes * as title * fix * max to min * fix tests	2022-01-24 01:52:59 -08:00
Lingxuan Zuo	ec62d7f510	[Streaming]Farewell : remove all of streaming related from ray repo. (#21770 ) New repo url is https://github.com/ray-project/mobius Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>	2022-01-23 17:53:41 +08:00
Gagandeep Singh	2da2ac52ce	Unskipped test_worker_stdout (#21708 )	2022-01-22 02:43:03 -08:00
Qing Wang	a37d9a2ec2	[Core] Support default actor lifetime. (#21283 ) Support specifying a default lifetime for actors that do not specify a lifetime at creation. This is a job-level configuration item. #### API Change The Python API looks like: ```python ray.init(job_config=JobConfig(default_actor_lifetime="detached")) ``` Java API looks like: ```java System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name()); Ray.init(); ``` One example usage is: ```python ray.init(job_config=JobConfig(default_actor_lifetime="detached")) a1 = A.options(lifetime="non_detached").remote() # a1 is a non-detached actor. a2 = A.remote() # a2 is a detached actor. ``` Co-authored-by: Kai Yang <kfstorm@outlook.com> Co-authored-by: Qing Wang <jovany.wq@antgroup.com>	2022-01-22 12:26:08 +08:00
Gagandeep Singh	b00385f9a2	Using a deterministic approach to check connections in `test_job_timestamps` (#21693 )	2022-01-21 20:18:05 -08:00
xwjiang2010	0abcd5eea5	[tune] only sync up and sync down checkpoint folder for cloud checkpoint. (#21658 ) By default, sync only ~/ray_results/exp_name/trial_name/checkpoint_name instead of the whole trial directory (~/ray_results/exp_name/trial_name/). Files such as progress.csv, result.json, params.pkl, params.json, events.out etc. come from the driver process. This could also enable us to decouple sync up and delete, since they don't have to wait for each other to finish.	2022-01-21 17:56:05 -08:00
matthewdeng	8119b62640	[train] refactor callback logdir and results preprocessors (#21468 ) * [train] Add TorchTensorboardProfilerCallback and introduce ResultsPreprocessors * simplify profiler * read on get_and_clear_profile_traces * refactor callbacks * remove var * Update python/ray/train/callbacks/logging.py Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> * Update python/ray/train/callbacks/results_prepocessors/keys.py Co-authored-by: Antoni Baum <antoni.baum@protonmail.com> * address comments; add tests * fix test * address comments * docs * address comments' * fix test Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>	2022-01-21 17:23:34 -08:00
Junwen Yao	216c4bf9a6	[Serve] warn when serve.start() with different options (#21562 )	2022-01-21 15:51:26 -08:00
shrekris-anyscale	45eebdd6e3	[Serve] Handle name collisions when unzipping directories in unzip_package (#21723 ) Currently, the `unzip_package` function relies on `extract_file_and_remove_top_level_dir` to unzip and remove the top-level directory from archive working directories. However, `extract_file_and_remove_top_level_dir` uses `os.rename()` to remove the tld by manually unzipping each file from a zip file and moving it to the tld's parent. When the tld contains directories or files with the same name as the tld, `os.rename()` fails to move these files to the tld's parent because of the name collision between the file and the tld. This change replaces `extract_file_and_remove_top_level_dir` with `remove_dir_from_filepaths`. Now, `unzip_package` unzips the entire zip file before `remove_dir_from_filepaths` moves all the tld's children to the tld's parent using `os.rename()`. This edge case is tested in the new unit test `test_unzip_with_matching_subdirectory_names`. Additionally, `extract_file_and_remove_top_level_dir`'s unit test is replaced with `TestRemoveDirFromFilepaths`, which tests the new `remove_dir_from_filepaths` function.	2022-01-21 15:27:28 -06:00
shrekris-anyscale	75b3080834	[Serve] Serve Autoscaling Release tests (#21208 )	2022-01-21 12:08:25 -08:00
mwtian	c85546a884	[Test] increase timeout for `test_traceback.py` (#21765 ) `test_traceback.py` was taking ~55s to finish recently, and since today it starts to time out at 60s more frequently. All test cases do succeed so increase its test time out for now. We will look into if there is any performance regression separately.	2022-01-20 23:57:49 -08:00
SangBin Cho	5514711a35	[Part 5] Set actor died error message in ActorDiedError (#20903 ) This is the second-to-last PR to improve the `ActorDiedError` exception. This propagates actor death cause metadata to the ray error object. In this way, we can raise a better actor died error exception. After this PR is merged, I will add more metadata to each error message and write documentation that explains when each error happens. TODO - [x] Fix test failures - [x] Add unit tests - [x] Fix Java/cpp cases Follow up PRs - Not allowing nullptr for RayErrorInfo input.	2022-01-20 22:11:11 -08:00
mwtian	0dbe4b3a56	[Pubsub] fix driver warning for not keeping up with worker logs (#21717 ) GCS pubsub uses long polling, so the subscriber waits instead of returning None from polling when there is no buffered log. It needs a different heuristic to decide if the driver is not keeping up with logs from the worker.	2022-01-20 16:32:42 -08:00
xwjiang2010	9af8f11191	Revert "[docs] Clean up doc structure (first part) (#21667 )" (#21763 ) This reverts commit `38e46c9fb3`.	2022-01-20 15:30:56 -08:00
Yi Cheng	3c63a8410d	[gcs/ha] Fix java related error when enable redisless ray (#21692 ) This PR enables ray java to be able to run without redis. It also fixes java related tests and updated the pipeline.	2022-01-20 13:56:25 -08:00
Jiajun Yao	fa9feb5033	Fix replace_symlinks_with_junctions for windows (#21720 ) Windows cmd.exe doesn't interpret single quote correctly. See https://github.com/conda-forge/ray-packages-feedstock/pull/43	2022-01-20 12:38:56 -08:00
matthewdeng	976ba5dbfe	[train] fix fashion mnist example (#21689 ) Making some minor fixes. 1. Update input `batch_size` to be global batch size. Introduce `worker_batch_size` so each iteration trains same global batch size. 2. Update dataset `size` calculation to only refer to the fraction of the data that is trained on each worker. This allows calculations (e.g. training progress, accuracy) to be correct. 3. Add `model.train()` for generality. 4. Remove `smoke-test` flag since it's not really being used.	2022-01-20 12:26:02 -08:00
SangBin Cho	b6d3e01e0b	Revert "WINDOWS: enable passing metric tests (#21705 )" (#21738 ) This reverts commit `8104fd5c76`.	2022-01-20 07:27:49 -08:00
Max Pumperla	38e46c9fb3	[docs] Clean up doc structure (first part) (#21667 )	2022-01-20 16:19:04 +01:00
mwtian	a4581e58ee	[Pubsub] improve error handling for GCS AIO subscribers in dashboard (#21712 ) - Tolerate GRPC deadline exceeded and transient failures in Python GCS AIO subscribers, which becomes consistent with Python GCS synchronous subscribers. - Tolerate any exception in dashboard for subscribing to logs and error info, which becomes consistent with how dashboard handles GRPC errors for obtaining node stats.	2022-01-20 07:04:54 -08:00
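
The `mkstemp` cleanup in 257bd2d1e7 above swaps `tempfile.mktemp` for `tempfile.mkstemp`. A minimal sketch of that pattern follows. It is not the code from the commit, and the helper name `write_temp_text` is hypothetical. `mkstemp` creates the file itself and returns an open descriptor together with the path, so the caller never opens a name that another process could have claimed in between:

```python
import os
import tempfile


def write_temp_text(text: str, suffix: str = ".txt") -> str:
    """Write text to a newly created temp file and return its path.

    Hypothetical helper for illustration; not taken from the Ray codebase.
    """
    # mkstemp creates the file and returns (OS-level file descriptor, path).
    # The deprecated mktemp only returned a name, leaving a window in which
    # another process could create a file with that name first.
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        # os.fdopen wraps the descriptor; leaving the with-block closes it.
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
    except BaseException:
        # Do not leave a half-written file behind on failure.
        os.remove(path)
        raise
    return path


path = write_temp_text("hello")
with open(path, encoding="utf-8") as handle:
    content = handle.read()
# The caller owns the file and is responsible for deleting it.
os.remove(path)
```

Unlike `mktemp`, `mkstemp` leaves cleanup to the caller, so the sketch removes the file explicitly once it has been read.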

6141 commits