Reformatted cherry-pick of 443416907e.
This PR fixes our {NumPy, Pandas} <--> Arrow interop for boolean tensor columns. NumPy and Pandas represent boolean arrays with a byte per boolean, while Arrow bit-packs booleans with 8 booleans per byte. Previously, when casting NumPy arrays to tensor columns, we were interpreting NumPy's boolean array buffers as being bit-packed when they were not. This PR completes support by packing and unpacking bits for boolean arrays when creating a boolean tensor column from an ndarray and when creating an ndarray from a boolean tensor column, respectively.
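As a hedged sketch (not the exact code in this PR), the round trip looks roughly like the following, using NumPy's `packbits`/`unpackbits` with the little-endian bit order Arrow expects:

```python
# Hedged sketch, not the exact PR code: NumPy uses one byte per boolean,
# while Arrow packs 8 booleans per byte (LSB first), so buffers must be
# converted explicitly in both directions.
import numpy as np
import pyarrow as pa

arr = np.array([True, False, True, True, False], dtype=np.bool_)

# ndarray -> Arrow: bit-pack the byte-per-boolean buffer.
packed = np.packbits(arr, bitorder="little")
pa_arr = pa.Array.from_buffers(pa.bool_(), len(arr), [None, pa.py_buffer(packed)])

# Arrow -> ndarray: unpack bits back into one byte per boolean.
unpacked = np.unpackbits(
    np.frombuffer(pa_arr.buffers()[1], dtype=np.uint8),
    count=len(pa_arr),
    bitorder="little",
).astype(np.bool_)
assert np.array_equal(arr, unpacked)
```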
Closes https://github.com/ray-project/ray/issues/22265
This was caused by implicitly inferring the namespace from within the HTTP proxy when calling `get_handle`. This makes me think we really need to simplify the namespace handling logic.
* [runtime env] Fix bug where options (e.g. `--extra-index-url`) could not be specified in `requirements.txt` (#22065)
In https://github.com/ray-project/ray/pull/20341, the behavior of `pip` was changed to install the specified packages in the existing environment rather than in a new environment. This posed a problem when specifying Ray libraries like "ray[serve]" in the `pip` field, because the installer would install Ray at runtime and this new Ray would take precedence over the Ray already on the cluster, which could cause version mismatch issues. Skipping some details, the approach taken in that PR was essentially to parse the `pip` list and remove Ray.
However not every line in a `pip` `requirements.txt` file is a requirements specifier; a line can also just specify options, like `--extra-index-url my-index-url.com`.
Such option lines caused the parsing library to raise an exception. This PR fixes the issue by catching the exception and skipping the line in this case, since a line that fails to parse cannot be a `ray` specifier, and that's all we're looking for when parsing.
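A hedged sketch of the skip-on-parse-error idea, assuming the `packaging` library does the per-line parsing; the helper name is illustrative, not the exact Ray internals:

```python
from packaging.requirements import InvalidRequirement, Requirement

def remove_ray_from_pip_list(pip_lines):
    """Illustrative helper: drop `ray` specifiers, keep everything else."""
    kept = []
    for line in pip_lines:
        try:
            req = Requirement(line)
        except InvalidRequirement:
            # Option lines such as "--extra-index-url my-index-url.com"
            # are not requirement specifiers; keep them as-is.
            kept.append(line)
            continue
        if req.name.lower() != "ray":
            kept.append(line)
    return kept
```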
* lint using old linter from pre-1.11.0-branch-cut
This PR moves the SDK to its own folder, then includes everything from `import ray.autoscaler.sdk` in Ray's import path.
Note that naively doing this created circular dependencies, because Ray core now uses constants that were defined in the autoscaler for internal KV operations (and the autoscaler similarly calls into Ray core). The solution was to move those internal KV keys into Ray core constants so the imports flow (more) one way.
Co-authored-by: Alex Wu <alex@anyscale.com>
This patch fixes two issues.
1. log_monitor.py can crash when GCS is temporarily unavailable. Added retry logic in gcs_pubsub.py (see the sketch after this list).
2. The signal handler can raise another exception during exception handling.
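For reference, a hedged sketch of the retry idea for issue 1 (not the exact `gcs_pubsub.py` code):

```python
import time
import grpc

def poll_with_retry(poll_once, max_retries=5, delay_s=1.0):
    """Illustrative helper: retry a GCS poll across transient failures."""
    for _ in range(max_retries):
        try:
            return poll_once()
        except grpc.RpcError:
            # GCS may be temporarily unavailable; back off and retry.
            time.sleep(delay_s)
    raise ConnectionError("GCS still unreachable after retries")
```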
This PR adds a `CometLoggerCallback` to the Tune Integrations, allowing users to log runs from Ray to [Comet](https://www.comet.ml/site/).
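A hedged usage sketch; the import path reflects the Tune integrations layout at the time this callback was added and may have moved since:

```python
from ray import tune
from ray.tune.integration.comet import CometLoggerCallback

def trainable(config):
    # Toy trainable that reports a metric Comet can track.
    for step in range(3):
        tune.report(loss=config["lr"] * step)

tune.run(
    trainable,
    config={"lr": 0.01},
    callbacks=[CometLoggerCallback(project_name="ray-demo", tags=["tune"])],
)
```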
Co-authored-by: Michael Cullan <mjcullan@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
Resubmitting #21705, which was merged and then reverted. It seems the Sphinx build broke in the meantime; it's not clear how that is connected to this PR.
Here is the original description:
>Part of the effort to enable tests on windows, this enables test_metrics and test_metric_agents, which pass locally.
There was a user request to disable runtime env logs. This is the first PR toward that: it allows users to disable runtime env logs through an environment variable. If users set `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0`, runtime env logs are disabled.
Note that in the log monitor, `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=1` by default. This is temporary; I'd like to make it 0 by default after improving runtime env failure messages.
Once we disable these log messages by default, we can unify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED` and `RAY_RUNTIME_ENV_LOCAL_DEV_MODE`.
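A heavily hedged usage sketch; depending on the deployment, the variable may need to be set in the environment of the Ray processes rather than just the driver:

```python
# Illustrative only: disable runtime env logs via the env var named above.
import os
os.environ["RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED"] = "0"

import ray
ray.init()
```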
Currently, if an actor throws an exception containing non-ASCII characters, the actor doesn't die as expected; it stays alive.
This is because the following exception occurs while handling the user's exception:
```
File "python/ray/_raylet.pyx", line 587, in ray._raylet.task_execution_handler
File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
File "python/ray/_raylet.pyx", line 551, in ray._raylet.execute_task
File "/home/admin/.local/lib/python3.6/site-packages/ray/utils.py", line 96, in push_error_to_driver
worker.core_worker.push_error(job_id, error_type, message, time.time())
File "python/ray/_raylet.pyx", line 1636, in ray._raylet.CoreWorker.push_error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2597-2600: ordinal not in range(128)
An unexpected internal error occurred while the worker was executing a task.
```
This PR fixes this issue.
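A minimal reproduction sketch of the scenario this fixes (the actor and message are illustrative):

```python
import ray

@ray.remote
class A:
    def f(self):
        # Exception message containing non-ASCII characters.
        raise ValueError("пример 例 ✓")

ray.init()
a = A.remote()
ray.get(a.f.remote())  # previously hit UnicodeEncodeError internally
```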
Currently, when we destroy a placement group, we kill all workers related to it. However, we only kill the workers that are already running; if a worker starts up very slowly and its placement group is destroyed before the startup completes, that worker is leaked.
RayDP needs to be updated to work with redisless ray.
To be more specific, this [line](c08a786770/python/raydp/spark/ray_cluster_master.py (L146)) needs to be updated to use `node.address`. We should update this after the release in which the feature is turned on by default.
Currently, the docs have an [end-to-end tutorial](https://web.archive.org/web/20211122152843/https://docs.ray.io/en/latest/serve/tutorial.html) walking users through deploying a `Counter` function on Serve. This PR adds an end-to-end tutorial walking users through deploying an entire Hugging Face model using Serve, providing a better understanding of how to deploy an actual model via Serve.
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
This PR fixes issues with loading `ExperimentAnalysis` from a path or pickle when the trainable used in the trials is not registered. Chiefly, it ensures that the `stub` attribute set in `load_trials_from_experiment_checkpoint` doesn't get overridden by the state of the loaded trial, and that when pickling, all trials in `ExperimentAnalysis` are turned into stubs if they aren't already. A test has also been added.
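A hedged usage sketch; the checkpoint path and exact constructor argument are illustrative:

```python
# Illustrative: pickling analysis results in a session where the
# trainable is not registered.
import pickle
from ray.tune import ExperimentAnalysis

analysis = ExperimentAnalysis("~/ray_results/my_experiment")
blob = pickle.dumps(analysis)   # trials are turned into stubs if needed
restored = pickle.loads(blob)
```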
Support specifying a default lifetime for actors whose lifetime is not specified at creation time. This is a job-level configuration option.
#### API Change
The Python API looks like:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```
Java API looks like:
```java
System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
Ray.init();
```
One example usage is:
```python
ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
a1 = A.options(lifetime="non_detached").remote() # a1 is a non-detached actor.
a2 = A.remote() # a2 is a detached actor, inheriting the job-level default.
```
Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
By default, sync and delete only ~/ray_results/exp_name/trial_name/checkpoint_name, instead of the whole trial checkpoint directory (~/ray_results/exp_name/trial_name/). Files like progress.csv, result.json, params.pkl, params.json, and events.out come from the driver process.
This also lets us decouple sync-up and delete: they no longer have to wait for each other to finish.
Currently, the `unzip_package` function relies on `extract_file_and_remove_top_level_dir` to unzip and remove the top-level directory from archive working directories. However, `extract_file_and_remove_top_level_dir` uses `os.rename()` to remove the tld by manually unzipping each file from a zip file and moving it to the tld's parent. When the tld contains directories or files with the same name as the tld, `os.rename()` fails to move these files to the tld's parent because of the name collision between the file and the tld.
This change replaces `extract_file_and_remove_top_level_dir` with `remove_dir_from_filepaths`. Now, `unzip_package` unzips the entire zip file before `remove_dir_from_filepaths` moves all the tld's children to the tld's parent using `os.rename()`.
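A hedged sketch of the new approach; names follow the PR description rather than the exact Ray source. The key idea is that renaming the top-level directory out of the way first means moving its children up can never collide with it:

```python
import os
import tempfile

def remove_dir_from_filepaths(base_dir: str, tld_name: str) -> None:
    # Move the tld aside within the same filesystem, so os.rename is safe.
    tmp_dir = tempfile.mkdtemp(dir=base_dir)
    tmp_tld = os.path.join(tmp_dir, tld_name)
    os.rename(os.path.join(base_dir, tld_name), tmp_tld)
    # Shift the tld's children into base_dir; a child also named
    # `tld_name` no longer collides because the original tld is gone.
    for child in os.listdir(tmp_tld):
        os.rename(os.path.join(tmp_tld, child), os.path.join(base_dir, child))
    os.rmdir(tmp_tld)
    os.rmdir(tmp_dir)
```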
This edge case is tested in the new unit test `test_unzip_with_matching_subdirectory_names`. Additionally, `extract_file_and_remove_top_level_dir`'s unit test is replaced with `TestRemoveDirFromFilepaths`, which tests the new `remove_dir_from_filepaths` function.
`test_traceback.py` was recently taking ~55s to finish, and since today it has started to time out at 60s more frequently. All test cases do succeed, so increase its test timeout for now. We will separately look into whether there is a performance regression.
This is the second-to-last PR to improve the `ActorDiedError` exception.
It propagates the actor death cause metadata to the Ray error object, so we can raise a more informative actor-died exception.
After this PR is merged, I will add more metadata to each error message and write documentation explaining when each error happens.
TODO
- [x] Fix test failures
- [x] Add unit tests
- [x] Fix Java/cpp cases
Follow up PRs
- Disallow `nullptr` for `RayErrorInfo` input.
GCS pubsub uses long polling, so the subscriber waits instead of returning None from polling when there is no buffered log. It needs a different heuristic to decide if the driver is not keeping up with logs from the worker.
Making some minor fixes.
1. Update the input `batch_size` to be the global batch size and introduce `worker_batch_size`, so that each iteration trains the same global batch size (see the sketch after this list).
2. Update the dataset `size` calculation to refer only to the fraction of the data trained on each worker. This keeps calculations (e.g. training progress, accuracy) correct.
3. Add `model.train()` for generality.
4. Remove the `smoke-test` flag since it's not really being used.
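The batch-size split in fix 1, as a minimal hedged sketch assuming Ray Train's `world_size()` API and a user-facing global `batch_size` in the config:

```python
from ray import train

def train_func(config):
    global_batch_size = config["batch_size"]
    # Each worker trains an equal shard, so one iteration across all
    # workers still covers the full global batch.
    worker_batch_size = global_batch_size // train.world_size()
    return worker_batch_size
```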
- Tolerate gRPC deadline-exceeded and transient failures in Python GCS AIO subscribers, making them consistent with Python GCS synchronous subscribers.
- Tolerate any exception in the dashboard when subscribing to logs and error info, consistent with how the dashboard handles gRPC errors when obtaining node stats.
When a node is dead, the reference table should remove the locations of objects on that node. Otherwise, locality-aware scheduling will schedule tasks onto the dead node.
This is a minimum viable product for Ray Autoscaler integration with Kuberay. It is not ready for prime time/general use, but should be enough for interested parties to get started (see the documentation in kuberay.md).
When GCS runs as an individual component, its crashes can cause other components to fail.
Here are two main cases covered in this patch:
1. monitor.py will raise an exception when disconnected from GCS.
2. When GCS becomes available later than other components, the missing KV of GCS address can cause other components to fail to start.
In this patch, we fixed these two issues and also increased the Redis connection timeout, which was too small.
Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
Prior to this PR, sort, groupby, and aggregate each defined a separate key type for extracting values from Dataset records. This was confusing, since the user had to understand the differences between key types that were essentially identical.
This PR defines a common key type, `KeyFn`, which is simply `Union[None, str, Callable[[T], Any]]`. It is used as `sort(KeyFn...)`, `aggregate(Agg(KeyFn)...)`, and `groupby(KeyFn).agg(Agg(KeyFn), ...)`.
It also unifies the error-generation paths into a common `_validate_key_fn` utility. This improves the errors generated when passing explicit `AggregateFn` classes, which previously failed in the workers if invalid.
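A hedged illustration of the unified key type (`T` here stands for the record type; the usage lines are illustrative):

```python
from typing import Any, Callable, TypeVar, Union

T = TypeVar("T")
# None, a column name, or a callable extracting a key from a record.
KeyFn = Union[None, str, Callable[[T], Any]]

# The same key form now works across the APIs, e.g.:
#   ds.sort(lambda r: r % 10)
#   ds.groupby("col").agg(Max("col"))
```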