Resubmitting #21705, which was merged and then reverted. It seems the Sphinx build broke in the meantime; it's unclear how that is connected to this PR.
Here is the original description:
>Part of the effort to enable tests on Windows, this enables `test_metrics` and `test_metric_agents`, which pass locally.
There was a user request to disable runtime env logs. This is the first PR that allows users to disable runtime env logs through an env var. Basically, if users set `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0`, runtime env logs are disabled.
Note that in the log monitor, `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=1` by default. This is temporary, and I'd like to make it 0 by default after improving runtime env failure messages.
Once we disable these log messages by default, we can unify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED` and `RAY_RUNTIME_ENV_LOCAL_DEV_MODE`.
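As a sketch, a user would disable the logs like this (assuming the variable is read at startup, so it must be set before initializing Ray):
```python
import os

# Assumption for illustration: set the variable before Ray starts so the
# log monitor picks it up.
os.environ["RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED"] = "0"

import ray

ray.init()  # runtime env setup logs are no longer streamed to the driver
```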
Currently, if an actor throws an exception containing non-ASCII characters, the actor won't die and will stay alive.
This is because the following exception occurs while handling the user's exception:
```
File "python/ray/_raylet.pyx", line 587, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 551, in ray._raylet.execute_task
File "/home/admin/.local/lib/python3.6/site-packages/ray/utils.py", line 96, in push_error_to_driver
worker.core_worker.push_error(job_id, error_type, message, time.time())
File "python/ray/_raylet.pyx", line 1636, in ray._raylet.CoreWorker.push_error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2597-2600: ordinal not in range(128)
An unexpected internal error occurred while the worker was executing a task.
```
This PR fixes this issue.
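For illustration, a minimal hypothetical repro (one way to hit this path is an exception raised in an actor constructor, whose failure should kill the actor):
```python
import ray

ray.init()

@ray.remote
class Worker:
    def __init__(self):
        # A non-ASCII message here previously hit a UnicodeEncodeError inside
        # push_error_to_driver, and the failed actor stayed alive.
        raise ValueError("初始化失败: неверная конфигурация")

    def ping(self):
        return "pong"

w = Worker.remote()
try:
    ray.get(w.ping.remote())  # with the fix, the constructor failure surfaces
except ray.exceptions.RayActorError:
    pass  # the actor died during __init__, as expected
```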
Currently, when we destroy a placement group, we kill all workers related to it. However, we only kill the workers that are already running; if a worker starts up very slowly and the related placement group is destroyed before the worker finishes starting, that worker is leaked.
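A sketch of the triggering scenario, using the placement group API of that era (the timing is what matters; the task itself is illustrative):
```python
import ray
from ray.util.placement_group import placement_group, remove_placement_group

ray.init()

@ray.remote(num_cpus=1)
def task():
    return "ok"

pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

# Scheduling a task starts a worker; if that worker is slow to start and the
# group is removed before startup finishes, the worker used to be leaked.
ref = task.options(placement_group=pg).remote()
remove_placement_group(pg)
```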
Long-running tests are cheap and low overhead (they use a small number of nodes). We should just promote them to run every day so we can catch regressions quickly.
Fix dash on the Ray large-scale test on K8s. Basically, `chmod` requires root access, which we don't have by default in the K8s cluster. I don't think we need the `chmod` (I verified the test passes without it).
RayDP needs to be updated to work with redisless Ray.
To be more specific, this [line](c08a786770/python/raydp/spark/ray_cluster_master.py (L146)) needs to be updated to use `node.address`.
We should update this after the release with the feature turned on by default.
Currently, the docs have an [end-to-end tutorial](https://web.archive.org/web/20211122152843/https://docs.ray.io/en/latest/serve/tutorial.html) that walks users through deploying a `Counter` function on Serve. This PR adds an end-to-end tutorial that walks users through deploying an entire Hugging Face model with Serve, giving users a better understanding of how to deploy an actual model via Serve.
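For flavor, a Serve deployment of this shape might look like the following sketch (the model choice and endpoint behavior are illustrative, sketched against the `serve.run` API, and are not the tutorial's exact code):
```python
import ray
from ray import serve
from transformers import pipeline

@serve.deployment
class Summarizer:
    def __init__(self):
        # Model choice is an assumption for illustration.
        self.model = pipeline("summarization", model="t5-small")

    async def __call__(self, request):
        text = await request.json()
        return self.model(text, max_length=60)[0]["summary_text"]

serve.run(Summarizer.bind())
```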
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
Currently `wait_until_succeeded_without_exception` is used in the dashboard, and it returns True/False. Unfortunately, there is lots of code that doesn't assert on its return value (which means those code paths are not actually tested).
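A self-contained sketch of the anti-pattern; the helper below mimics the dashboard utility, and its exact signature is an assumption:
```python
import time

def wait_until_succeeded_without_exception(fn, exceptions, timeout_s=5.0):
    """Mimic of the dashboard helper: returns True/False, never raises."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            fn()
            return True
        except exceptions:
            time.sleep(0.1)
    return False

def never_ready():
    raise RuntimeError("not ready")

def ready():
    return None

# Bug pattern: the boolean result is dropped, so this line "passes" silently
# even though the condition never became true.
wait_until_succeeded_without_exception(never_ready, (RuntimeError,), timeout_s=0.3)

# Fixed pattern: assert on the result so a timeout actually fails the test.
assert wait_until_succeeded_without_exception(ready, (RuntimeError,), timeout_s=0.3)
```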
This PR fixes issues with loading `ExperimentAnalysis` from a path or pickle when the trainable used in the trials is not registered. Chiefly, it ensures that the `stub` attribute set in `load_trials_from_experiment_checkpoint` doesn't get overridden by the state of the loaded trial, and that when pickling, all trials in `ExperimentAnalysis` are turned into stubs if they aren't already. A test has also been added.
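A hypothetical sketch of the stubbing idea (Tune's real `Trial` class is far more involved; the attributes here are illustrative):
```python
import pickle

class Trial:
    def __init__(self, trainable):
        self.trainable = trainable  # may be unregistered and unpicklable
        self.stub = False

    def __getstate__(self):
        # Drop the trainable reference and mark the trial as a stub, so the
        # pickle round-trips even when the trainable isn't registered.
        state = self.__dict__.copy()
        state["trainable"] = None
        state["stub"] = True
        return state

trial = Trial(trainable=lambda config: None)
restored = pickle.loads(pickle.dumps(trial))
assert restored.stub and restored.trainable is None
```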
For full context, see https://github.com/ray-project/ray/issues/21791.
pytest works for "some" environments for this test and on CI master, but this decorator is still unnecessary and was introduced by mistake. So just remove it and see what happens with the original issue.
Support specifying a default lifetime for actors whose lifetime is not given at creation time. This is a job-level configuration item.
#### API Change
The Python API looks like:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```
The Java API looks like:
```java
System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
Ray.init();
```
One example usage is:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
# Assuming A is an actor class defined with @ray.remote.
a1 = A.options(lifetime="non_detached").remote()  # a1 is a non-detached actor.
a2 = A.remote()  # a2 is a detached actor (it inherits the job default).
```
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
By default, sync only the checkpoint directory (~/ray_results/exp_name/trial_name/checkpoint_name) instead of the whole trial directory (~/ray_results/exp_name/trial_name/).
Files like progress.csv, result.json, params.pkl, params.json, and events.out come from the driver process.
This could also enable us to decouple sync-up and delete - they don't have to wait for each other to finish.
Currently, the GCS KV client only has a blocking API. Calling it from the dashboard event loop can block other operations for many seconds, leading to failures such as taking too long (> 2 min) to submit a job and making nightly tests fail (#21699). This PR offloads the blocking work to a separate thread. Implementing an async GCS KV API will be done in the future.
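A minimal sketch of the offloading pattern, with a stand-in for the blocking KV call (not the dashboard's actual code):
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

_kv_executor = ThreadPoolExecutor(max_workers=1, thread_name_prefix="gcs_kv")
_fake_store = {"job:1": b"metadata"}  # stand-in for the real GCS KV backend

def blocking_kv_get(key: str) -> bytes:
    # Stand-in for the blocking GCS KV client call (can take seconds).
    return _fake_store[key]

async def kv_get(key: str) -> bytes:
    # Offload the blocking call so the dashboard event loop keeps serving
    # other requests while the KV operation is in flight.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(_kv_executor, blocking_kv_get, key)

print(asyncio.run(kv_get("job:1")))
```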
This PR consolidates both #21667 and #21759 (look there for features), but improves on them in the following ways:
- [x] we reverted renaming of existing projects `tune`, `rllib`, `train`, `cluster`, `serve`, `raysgd` and `data` so that links won't break. I think my consolidation efforts with the `ray-` prefix were a little overeager in that regard. It's better like this. Only the creation of `ray-core` was a necessity, and some files moved into the `rllib` folder, so that should be relatively benign.
- [x] Additionally, we added Algolia `docsearch`, screenshot below. This is _much_ better than our current search. Caveat: there's a sphinx dependency that needs to be replaced (`sphinx-tabs`) by another, newer one (`sphinx-panels`), as the former prevents loading of the `algolia.js` library. Will follow-up in the next PR (hoping this one doesn't get re-re-re-re-reverted).
Currently, the `unzip_package` function relies on `extract_file_and_remove_top_level_dir` to unzip archive working directories and remove their top-level directory (tld). However, `extract_file_and_remove_top_level_dir` uses `os.rename()` to remove the tld by manually unzipping each file from the zip file and moving it to the tld's parent. When the tld contains directories or files with the same name as the tld, `os.rename()` fails to move these files to the tld's parent because of the name collision between the file and the tld.
This change replaces `extract_file_and_remove_top_level_dir` with `remove_dir_from_filepaths`. Now, `unzip_package` unzips the entire zip file before `remove_dir_from_filepaths` moves all the tld's children to the tld's parent using `os.rename()`.
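A rough sketch of the extract-first approach, assuming a zip with a single top-level directory (the PR's actual helpers, `unzip_package` and `remove_dir_from_filepaths`, differ in their details):
```python
import os
import uuid
import zipfile

def unzip_and_remove_top_level_dir(zip_path: str, dest: str) -> None:
    # Extract everything first, rather than moving files one at a time.
    with zipfile.ZipFile(zip_path) as zf:
        top_level = zf.namelist()[0].split("/")[0]
        zf.extractall(dest)
    # Rename the tld to a unique temporary name first, so a child that shares
    # the tld's name no longer collides with it when moved up a level.
    tmp = os.path.join(dest, uuid.uuid4().hex)
    os.rename(os.path.join(dest, top_level), tmp)
    for child in os.listdir(tmp):
        os.rename(os.path.join(tmp, child), os.path.join(dest, child))
    os.rmdir(tmp)
```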
This edge case is tested in the new unit test `test_unzip_with_matching_subdirectory_names`. Additionally, `extract_file_and_remove_top_level_dir`'s unit test is replaced with `TestRemoveDirFromFilepaths`, which tests the new `remove_dir_from_filepaths` function.
`test_traceback.py` was taking ~55s to finish recently, and since today it has started to time out at 60s more frequently. All test cases do succeed, so increase its test timeout for now. We will separately look into whether there is any performance regression.
This is the second-to-last PR in the effort to improve the `ActorDiedError` exception.
This propagates actor death cause metadata to the Ray error object. In this way, we can raise a better actor-died error exception.
After this PR is merged, I will add more metadata to each error message and write documentation that explains when each error happens.
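As a hypothetical illustration of what propagating the death cause enables at the API surface (the actor and exit mechanism are made up for the example):
```python
import os

import ray

@ray.remote
class Dies:
    def crash(self):
        os._exit(1)  # simulate an unexpected worker process death

ray.init()
actor = Dies.remote()
try:
    ray.get(actor.crash.remote())
except ray.exceptions.RayActorError as e:
    # With death cause metadata propagated, the message can say *why* the
    # actor died (e.g., the worker process exited) instead of a generic
    # "actor died" error.
    print(e)
```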
TODO
- [x] Fix test failures
- [x] Add unit tests
- [x] Fix Java/cpp cases
Follow up PRs
- Not allowing nullptr for RayErrorInfo input.
This fixes the event stats segfault. I made a consistent repro script that produces 3~5 segfaults per run of `placement_group_mini_integration_test` and verified this fixes the issue by running it more than 5 times.
Note that when GCS HA is on, the same repro uncovered a different shutdown bug in pubsub (resubscription seems to happen, and idempotency is currently not supported). I will file a separate issue so that @mwtian can handle it.
## Root cause
The problem was that we were calling `gcs_client_->Disconnect()` from the task execution event loop. Note that `gcs_client_` uses `io_service` as its main event loop, meaning `Disconnect` was being called from a different thread.
If you look at the `Disconnect` method, it resets lots of pointers. Resetting pointers to attributes that can be used by other threads -> segfault.
This PR fixes the issue by not resetting the pointers there. We do it only after joining the io service thread, because at that point we can guarantee the GCS client won't be used by other threads (resetting `gcs_client_` is probably not necessary, but I added it as a safety check).
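A minimal sketch of the ordering fix, transposed to Python threading for illustration (the real code is C++):
```python
import threading

class Client:
    """Sketch: tear down shared state only after joining the thread using it."""

    def __init__(self):
        self.conn = object()  # shared state the io thread dereferences
        self._stop = threading.Event()
        self._io_thread = threading.Thread(target=self._io_loop)
        self._io_thread.start()

    def _io_loop(self):
        while not self._stop.is_set():
            _ = self.conn  # analogous to gcs_client_ use on the io thread

    def shutdown(self):
        self._stop.set()
        self._io_thread.join()  # the buggy version reset self.conn before this
        self.conn = None        # safe now: no other thread can touch it

Client().shutdown()
```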
Note that the GCS client is currently written in a very complicated way, and we need some refactoring to make the thread-safety and code structure cleaner. I will take on that task once I have bandwidth.
To let other systems or internal projects reuse Ray's deps Bazel functions, we need to change this local access style to global access under the ray-project namespace.
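As a rough Bazel/Starlark sketch of the difference (the repository name `@com_github_ray_project_ray` and the `ray_deps_setup.bzl` path follow common conventions but are assumptions here):
```
# Local, repo-relative load: only resolves from inside the Ray repo itself.
load("//bazel:ray_deps_setup.bzl", "ray_deps_setup")

# Namespaced load: resolves for any project that declares the Ray repository
# (e.g., as @com_github_ray_project_ray) in its WORKSPACE.
load("@com_github_ray_project_ray//bazel:ray_deps_setup.bzl", "ray_deps_setup")
```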
Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>