Commit graph

5935 commits

Author SHA1 Message Date
Clark Zinzow
d51df512eb
[1.11.0] [Cherry-pick] [Datasets] Fix boolean tensor column representation and slicing. (#22358)
Reformatted cherry-pick of 443416907e.

This PR fixes our {NumPy, Pandas} <--> Arrow interop for boolean tensor columns. NumPy and Pandas represent boolean arrays with a byte per boolean, while Arrow bit-packs booleans with 8 booleans per byte. Previously, when casting NumPy arrays to tensor columns, we were interpreting NumPy's boolean array buffers as being bit-packed when they were not. This PR completes support by packing and unpacking bits for boolean arrays when creating a boolean tensor column from an ndarray and when creating an ndarray from a boolean tensor column, respectively.
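
To illustrate the two layouts described above, here is a minimal pure-Python sketch of packing byte-per-boolean values into a bit-packed buffer and back. It is not the code from this PR; the helper names `pack_bits` and `unpack_bits` are hypothetical, and the least-significant-bit-first order is chosen to match Arrow's bitmap convention.

```python
from typing import List, Sequence


def pack_bits(flags: Sequence[bool]) -> bytes:
    """Pack one-boolean-per-element input into 8 booleans per byte (LSB first)."""
    packed = bytearray((len(flags) + 7) // 8)
    for i, flag in enumerate(flags):
        if flag:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def unpack_bits(packed: bytes, length: int) -> List[bool]:
    """Expand a bit-packed buffer back into `length` booleans."""
    return [bool(packed[i // 8] & (1 << (i % 8))) for i in range(length)]


flags = [True, False, True, True, False, False, False, False, True]
packed = pack_bits(flags)          # 9 booleans fit in 2 bytes
restored = unpack_bits(packed, len(flags))
```

The length has to be carried separately, because the padding bits in the last byte are indistinguishable from `False` values.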
2022-02-14 11:45:50 -08:00
Edward Oakes
c48ad5cf13
[serve] Fix HTTP proxy controller namespace bug (#22287) (#22355)
Closes https://github.com/ray-project/ray/issues/22265

This was caused by implicitly inferring the namespace from within the HTTP proxy when calling `get_handle`. This makes me think we really need to simplify the namespace handling logic.
2022-02-14 11:33:06 -08:00
Mingwei Tian
fee8947c23
[Release branch] Update Python version to 1.11.0rc0 2022-02-14 10:05:53 -08:00
Chen Shen
a847fa3643
[Dataset] avoid pyarrow 7.0.0 for dataset (#22253) (#22330) 2022-02-14 08:06:11 -08:00
Archit Kulkarni
789274c179
[runtime env] [1.11.0 release cherry-pick] fix bug where pip options don't work in requirements.txt (#22127)
* [runtime env] Fix bug where options (e.g. `--extra-index-url`) could not be specified in `requirements.txt` (#22065)

In https://github.com/ray-project/ray/pull/20341 the behavior of `pip` was changed to install the specified packages in the existing environment rather than in a new environment. This posed a problem when specifying Ray libraries like "ray[serve]" in the `pip` field, because the installer would install Ray at runtime and this new Ray would take precedence over the Ray existing on the cluster. This could cause version mismatch issues. Skipping some details, the approach taken in that PR was essentially to parse the `pip` list and remove Ray.

However, not every line in a `pip` `requirements.txt` file is a requirement specifier; a line can also just specify options, like `--extra-index-url my-index-url.com`. This caused the parsing library to raise an exception when trying to parse the line. This PR fixes this by catching the exception and skipping the line in this case, since it's not a line that specifies `ray` and that's all we're looking for when parsing.
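
A rough sketch of the catch-and-skip idea, assuming a toy parser in place of the real parsing library. `parse_requirement_name` and `drop_ray` are hypothetical names, and the regex is a simplification of real requirement syntax.

```python
import re
from typing import List

# Hypothetical stand-in for a real requirement parser: it accepts lines that
# start with a package name and raises ValueError for anything else.
_NAME_RE = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def parse_requirement_name(line: str) -> str:
    match = _NAME_RE.match(line)
    if match is None:
        raise ValueError(f"not a requirement specifier: {line!r}")
    return match.group(1).lower()


def drop_ray(lines: List[str]) -> List[str]:
    """Return the lines with any `ray` requirement removed."""
    kept = []
    for line in lines:
        try:
            name = parse_requirement_name(line)
        except ValueError:
            # Option lines such as `--extra-index-url ...` are not
            # requirements, so they cannot name `ray`; keep them untouched.
            kept.append(line)
            continue
        if name != "ray":
            kept.append(line)
    return kept


cleaned = drop_ray(
    ["--extra-index-url my-index-url.com", "ray[serve]", "requests==2.27.1"]
)
```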

* lint using old linter from pre-1.11.0-branch-cut
2022-02-14 07:13:37 -08:00
Alex Wu
7a45f60dbc
[autoscaler] Fix ray.autoscaler.sdk import issue (#21795)
This PR moves the sdk to its own folder, then includes everything in `import ray.autoscaler.sdk` in ray's import path. 

Note that naively doing this created circular dependencies, because Ray core now uses constants that were defined in the autoscaler for internal KV operations (and the autoscaler similarly calls into Ray core). The solution was to move those internal KV keys into Ray core constants so the imports flow (more) one way.

Co-authored-by: Alex Wu <alex@anyscale.com>
2022-01-25 14:43:24 -08:00
Wilson Wang
30a4761592
Two issues fix for GCS connecting logic in monitor.py and log_monitor.py (#21790)
This patch fixed two issues.

1. log_monitor.py can crash when GCS is temporarily unavailable. Added retry logic in gcs_pubsub.py.
2. The signal handler can raise another exception during exception handling.
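
For readers unfamiliar with the pattern, retry logic of the kind mentioned in item 1 might look like the generic sketch below. This is not the code added to gcs_pubsub.py; `call_with_retries`, its parameters, and the choice of `ConnectionError` as the retryable error are all assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T], attempts: int = 5, base_delay_s: float = 0.1
) -> T:
    """Call fn, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay_s * (2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

A caller would wrap the connection or polling call, for example `call_with_retries(lambda: connect())`, where `connect` is whatever function may fail transiently.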
2022-01-25 14:07:26 -08:00
Ian Rodney
257bd2d1e7
[Cleanup] Use mkstemp (#21676)
`tempfile.mktemp` is technically deprecated in favor of `tempfile.mkstemp`. 
Ref: https://docs.python.org/3/library/tempfile.html#deprecated-functions-and-variables.
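
A short sketch of the `mkstemp` pattern, for reference. The suffix and file contents are arbitrary; this is not taken from the changed files.

```python
import os
import tempfile

# mkstemp creates the file atomically and returns an open OS-level file
# descriptor plus the path, so no other process can claim the name first.
fd, path = tempfile.mkstemp(suffix=".txt")
try:
    with os.fdopen(fd, "w") as f:  # closing f also closes fd
        f.write("scratch data")
    with open(path) as f:
        content = f.read()
finally:
    os.remove(path)  # mkstemp never deletes the file for us
```

Unlike `mktemp`, which only returns a name, the caller here is responsible for both closing the descriptor and removing the file.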
2022-01-25 13:42:12 -08:00
Dhruv Nair
3d79815cd0
Comet Integration (#20766)
This PR adds a `CometLoggerCallback` to the Tune Integrations, allowing users to log runs from Ray to [Comet](https://www.comet.ml/site/).

Co-authored-by: Michael Cullan <mjcullan@gmail.com>
Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-01-25 11:42:00 -08:00
Clark Zinzow
1971a08b7d
[RFC] [Core] Support disabling log redirection via RAY_LOG_TO_STDERR environment variable. (#21767) 2022-01-25 10:52:53 -08:00
Gagandeep Singh
395297a9bd
Unskip tests for Windows in test_output (#21775) 2022-01-25 09:25:01 -08:00
Matti Picus
d3d1e8559c
enable passing metric tests on windows (#21755)
Resubmitting #21705, which was merged and then reverted. Somehow the Sphinx build broke in the meantime; it is not clear how that is connected to this PR.

Here is the original description:
>Part of the effort to enable tests on windows, this enables test_metrics and test_metric_agents, which pass locally.
2022-01-25 09:20:16 -08:00
SangBin Cho
b2cd123522
[Runtime Env] Suppress the log messages when RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0 (#21806)
There was a user request to disable runtime env logs. This is the first PR that allows users to disable runtime env logs through an env var. If users specify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=0`, runtime env logs are disabled.

Note that in the log monitor RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED=1 by default. This is temporary, and I'd like to make this 0 by default after improving runtime error failure messages. 

Once we disable log msgs by default, we can unify `RAY_RUNTIME_ENV_LOG_TO_DRIVER_ENABLED` and `RAY_RUNTIME_ENV_LOCAL_DEV_MODE`
2022-01-25 00:42:52 -08:00
Gagandeep Singh
290f3172ad
Unskipped tests for Windows in test_client.py (#21824)
All the tests in `test_client.py` pass on Windows without issues, so unskipping them here.
2022-01-24 22:51:54 -08:00
Lixin Wei
bc55a958c4
[Core] Support UTF-8 Actor Creation Exceptions (#21807)
Currently, if an actor throws an exception containing non-ASCII characters, the actor does not die and stays alive.

This is because the following exception occurs while handling the user's exception:
```
  File "python/ray/_raylet.pyx", line 587, in ray._raylet.task_execution_handler
  File "python/ray/_raylet.pyx", line 434, in ray._raylet.execute_task
  File "python/ray/_raylet.pyx", line 551, in ray._raylet.execute_task
  File "/home/admin/.local/lib/python3.6/site-packages/ray/utils.py", line 96, in push_error_to_driver
    worker.core_worker.push_error(job_id, error_type, message, time.time())
  File "python/ray/_raylet.pyx", line 1636, in ray._raylet.CoreWorker.push_error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2597-2600: ordinal not in range(128)
An unexpected internal error occurred while the worker was executing a task.
```
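
The failure class in the traceback above can be reproduced in plain Python, independent of Ray. The message text below is made up for illustration.

```python
message = "actor creation failed: 初始化错误"

try:
    message.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # Same exception type as in the traceback: the ASCII codec cannot
    # represent code points above 127.
    ascii_ok = False

# UTF-8 can represent every Unicode code point, so this round-trips.
utf8_bytes = message.encode("utf-8")
round_tripped = utf8_bytes.decode("utf-8")
```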

This PR fixes this issue.
2022-01-24 20:27:43 -08:00
Andrew A. Naguib
f026376556
[Tune] PTL replace deprecated running_sanity_check with sanity_checking (#21831)
`running_sanity_check` was deprecated and removed in https://github.com/PyTorchLightning/pytorch-lightning/pull/9209 in favor of `sanity_checking`
2022-01-24 16:14:05 -08:00
Siyuan (Ryans) Zhuang
99b287d236
[workflow] Fix workflow recovery issue due to a bug of dynamic output (#21571)
* Fix workflow recovery issue due to a bug of dynamic output

* add tests
2022-01-24 15:34:57 -08:00
DK.Pino
c2199a50e3
[Placement Group] Fix remove pg flaky when worker startup slow (#20474)
Currently, when we destroy a created placement group, we kill all workers related to it. However, we only kill the workers that are running at that time. If a worker starts up very slowly and its placement group is destroyed before the worker finishes starting, the worker leaks.
2022-01-24 15:30:04 -08:00
mwtian
a10d05ce27
[Bootstrap] fix log format (#21826) 2022-01-24 15:06:41 -08:00
Yi Cheng
57afb2f75a
[gcs/ha] Skip raydb test when it's gcs bootstrap mode (#21771)
RayDP needs to be updated to work with redisless Ray.
To be more specific, this [line](c08a786770/python/raydp/spark/ray_cluster_master.py (L146)) needs to be updated to use `node.address`.

We should update this after the release with the feature being turned on by default.
2022-01-24 14:43:31 -08:00
shrekris-anyscale
03d93ba7ee
Add a new End-to-End tutorial in Serve that walks users through deploying a model (#20765)
Currently, the docs have an [end-to-end tutorial](https://web.archive.org/web/20211122152843/https://docs.ray.io/en/latest/serve/tutorial.html) walking users through deploying a `Counter` function on Serve. This PR adds an end-to-end tutorial walking users through deploying an entire Hugging Face model using Serve, providing a better understanding of how to deploy an actual model via Serve.

Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
2022-01-24 16:36:04 -06:00
Sven Mika
c288b97e5f
[RLlib] Issue 21629: Video recorder env wrapper not working. Added test case. (#21670) 2022-01-24 19:38:21 +01:00
Antoni Baum
850eb88cde
[tune] Fix analysis without registered trainable (#21475)
This PR fixes issues with loading ExperimentAnalysis from path or pickle if the trainable used in the trials is not registered. Chiefly, it ensures that the stub attribute set in load_trials_from_experiment_checkpoint doesn't get overridden by the state of the loaded trial, and that when pickling, all trials in ExperimentAnalysis are turned into stubs if they aren't already. A test has also been added.
2022-01-24 08:27:08 -08:00
Guyang Song
f8e41215b3
[1/n][cross-language runtime env] runtime env protobuf refactor (#21551)
We need to support runtime env for Java, C++, and cross-language use. This PR only refactors the protobuf.
Related issue #21731
2022-01-24 19:24:59 +08:00
Chen Shen
a60251f47a
[Core] Fix 16GB mac perf issue by limit the plasma store size to 2GB (#21224)
* add changes

* as title

* fix

* max to min

* fix tests
2022-01-24 01:52:59 -08:00
Lingxuan Zuo
ec62d7f510
[Streaming]Farewell : remove all of streaming related from ray repo. (#21770)
New repo url is https://github.com/ray-project/mobius

Co-authored-by: 林濯 <lingxuzn.zlx@antgroup.com>
2022-01-23 17:53:41 +08:00
Gagandeep Singh
2da2ac52ce
Unskipped test_worker_stdout (#21708) 2022-01-22 02:43:03 -08:00
Qing Wang
a37d9a2ec2
[Core] Support default actor lifetime. (#21283)
Support specifying a default lifetime for actors that are created without an explicit lifetime. This is a job-level configuration item.
#### API Change
The Python API looks like:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
```

Java API looks like:
```java
  System.setProperty("ray.job.default-actor-lifetime", defaultActorLifetime.name());
  Ray.init();
```

One example usage is:
```python
  ray.init(job_config=JobConfig(default_actor_lifetime="detached"))
  a1 = A.options(lifetime="non_detached").remote()   # a1 is a non-detached actor.
  a2 = A.remote()  # a2 is a detached actor (the job-level default applies).
```
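
The precedence implied by the example above (an explicit `lifetime` option wins, otherwise the job-level default applies) can be sketched without Ray as follows. `resolve_lifetime` is a hypothetical helper, not part of the Ray API, and the `"non_detached"` fallback is an assumption about the behavior when no job default is configured.

```python
from typing import Optional

VALID_LIFETIMES = ("detached", "non_detached")


def resolve_lifetime(explicit: Optional[str], job_default: str = "non_detached") -> str:
    """Pick an actor's lifetime: an explicit option wins over the job default."""
    chosen = explicit if explicit is not None else job_default
    if chosen not in VALID_LIFETIMES:
        raise ValueError(f"unknown lifetime: {chosen!r}")
    return chosen


a1_lifetime = resolve_lifetime("non_detached", job_default="detached")
a2_lifetime = resolve_lifetime(None, job_default="detached")
```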

Co-authored-by: Kai Yang <kfstorm@outlook.com>
Co-authored-by: Qing Wang <jovany.wq@antgroup.com>
2022-01-22 12:26:08 +08:00
Gagandeep Singh
b00385f9a2
Using a deterministic approach to check connections in test_job_timestamps (#21693) 2022-01-21 20:18:05 -08:00
xwjiang2010
0abcd5eea5
[tune] only sync up and sync down checkpoint folder for cloud checkpoint. (#21658)
By default, only the checkpoint folder (~/ray_results/exp_name/trial_name/checkpoint_name) is synced, instead of the whole trial directory (~/ray_results/exp_name/trial_name/).
Files like progress.csv, result.json, params.pkl, params.json, events.out, etc. come from the driver process.
This could also enable us to decouple sync up and delete, since they don't have to wait for each other to finish.
2022-01-21 17:56:05 -08:00
matthewdeng
8119b62640
[train] refactor callback logdir and results preprocessors (#21468)
* [train] Add TorchTensorboardProfilerCallback and introduce ResultsPreprocessors

* simplify profiler

* read on get_and_clear_profile_traces

* refactor callbacks

* remove var

* Update python/ray/train/callbacks/logging.py

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* Update python/ray/train/callbacks/results_prepocessors/keys.py

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>

* address comments; add tests

* fix test

* address comments

* docs

* address comments

* fix test

Co-authored-by: Antoni Baum <antoni.baum@protonmail.com>
2022-01-21 17:23:34 -08:00
Junwen Yao
216c4bf9a6
[Serve] warn when serve.start() with different options (#21562) 2022-01-21 15:51:26 -08:00
shrekris-anyscale
45eebdd6e3
[Serve] Handle name collisions when unzipping directories in unzip_package (#21723)
Currently, the `unzip_package` function relies on `extract_file_and_remove_top_level_dir` to unzip and remove the top-level directory from archive working directories. However, `extract_file_and_remove_top_level_dir` uses `os.rename()` to remove the tld by manually unzipping each file from a zip file and moving it to the tld's parent. When the tld contains directories or files with the same name as the tld, `os.rename()` fails to move these files to the tld's parent because of the name collision between the file and the tld.

This change replaces `extract_file_and_remove_top_level_dir` with `remove_dir_from_filepaths`. Now, `unzip_package` unzips the entire zip file before `remove_dir_from_filepaths` moves all the tld's children to the tld's parent using `os.rename()`.

This edge case is tested in the new unit test `test_unzip_with_matching_subdirectory_names`. Additionally, `extract_file_and_remove_top_level_dir`'s unit test is replaced with `TestRemoveDirFromFilepaths`, which tests the new `remove_dir_from_filepaths` function.
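
To make the name collision concrete, here is one possible way to move a top-level directory's children up one level with `os.rename()` while tolerating a child that shares the directory's name. This is a sketch, not the `remove_dir_from_filepaths` implementation from this PR; `hoist_children` and the staging-directory approach are assumptions.

```python
import os
import tempfile


def hoist_children(base_dir: str, top_level_dir: str) -> None:
    """Move every child of base_dir/top_level_dir up into base_dir."""
    # Park the top-level dir under a unique staging dir inside base_dir, so a
    # child named like the top-level dir cannot collide with it. Staying
    # inside base_dir keeps os.rename on a single filesystem.
    staging = tempfile.mkdtemp(dir=base_dir)
    parked = os.path.join(staging, top_level_dir)
    os.rename(os.path.join(base_dir, top_level_dir), parked)
    for child in os.listdir(parked):
        os.rename(os.path.join(parked, child), os.path.join(base_dir, child))
    os.rmdir(parked)
    os.rmdir(staging)
```

With a layout like `base/pkg/pkg/a.txt`, calling `hoist_children(base, "pkg")` leaves `base/pkg/a.txt`, which a direct rename of `base/pkg/pkg` onto `base/pkg` could not do.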
2022-01-21 15:27:28 -06:00
shrekris-anyscale
75b3080834
[Serve] Serve Autoscaling Release tests (#21208) 2022-01-21 12:08:25 -08:00
mwtian
c85546a884
[Test] increase timeout for test_traceback.py (#21765)
`test_traceback.py` was taking ~55s to finish recently, and as of today it has started to time out at 60s more frequently. All test cases do succeed, so increase its timeout for now. We will look into whether there is any performance regression separately.
2022-01-20 23:57:49 -08:00
SangBin Cho
5514711a35
[Part 5] Set actor died error message in ActorDiedError (#20903)
This is the second-to-last PR to improve the `ActorDiedError` exception.

This propagates actor death cause metadata to the Ray error object. This way, we can raise a more informative `ActorDiedError`.

After this PR is merged, I will add more metadata to each error message and write documentation that explains when each error happens.

TODO
- [x] Fix test failures
- [x] Add unit tests
- [x] Fix Java/cpp cases

Follow up PRs
- Not allowing nullptr for RayErrorInfo input.
2022-01-20 22:11:11 -08:00
mwtian
0dbe4b3a56
[Pubsub] fix driver warning for not keeping up with worker logs (#21717)
GCS pubsub uses long polling, so the subscriber waits instead of returning None from polling when there is no buffered log. It needs a different heuristic to decide if the driver is not keeping up with logs from the worker.
2022-01-20 16:32:42 -08:00
xwjiang2010
9af8f11191
Revert "[docs] Clean up doc structure (first part) (#21667)" (#21763)
This reverts commit 38e46c9fb3.
2022-01-20 15:30:56 -08:00
Yi Cheng
3c63a8410d
[gcs/ha] Fix java related error when enable redisless ray (#21692)
This PR enables Ray Java to run without Redis. It also fixes Java-related tests and updates the pipeline.
2022-01-20 13:56:25 -08:00
Jiajun Yao
fa9feb5033
Fix replace_symlinks_with_junctions for windows (#21720)
Windows cmd.exe doesn't interpret single quotes correctly. See https://github.com/conda-forge/ray-packages-feedstock/pull/43
2022-01-20 12:38:56 -08:00
matthewdeng
976ba5dbfe
[train] fix fashion mnist example (#21689)
Making some minor fixes.

1. Update input `batch_size` to be the global batch size. Introduce `worker_batch_size` so each iteration trains on the same global batch size.
2. Update dataset `size` calculation to only refer to the fraction of the data that is trained on each worker. This allows calculations (e.g. training progress, accuracy) to be correct.
3. Add `model.train()` for generality.
4. Remove `smoke-test` flag since it's not really being used.
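
The arithmetic behind items 1 and 2 can be sketched as below. The function names are hypothetical, and requiring an even split is an assumption; the example itself may handle remainders differently.

```python
def worker_batch_size(global_batch_size: int, num_workers: int) -> int:
    """Per-worker batch size such that all workers together see one global batch."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if global_batch_size % num_workers != 0:
        raise ValueError("global batch size must divide evenly across workers")
    return global_batch_size // num_workers


def worker_dataset_size(dataset_size: int, num_workers: int) -> int:
    """Number of samples each worker trains on when the data is sharded evenly."""
    return dataset_size // num_workers


per_worker_batch = worker_batch_size(64, 4)
per_worker_samples = worker_dataset_size(60000, 4)
```

Progress and accuracy on a worker are then computed against `per_worker_samples` rather than the full dataset size.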
2022-01-20 12:26:02 -08:00
SangBin Cho
b6d3e01e0b
Revert "WINDOWS: enable passing metric tests (#21705)" (#21738)
This reverts commit 8104fd5c76.
2022-01-20 07:27:49 -08:00
Max Pumperla
38e46c9fb3
[docs] Clean up doc structure (first part) (#21667) 2022-01-20 16:19:04 +01:00
mwtian
a4581e58ee
[Pubsub] improve error handling for GCS AIO subscribers in dashboard (#21712)
- Tolerate gRPC deadline exceeded and transient failures in Python GCS AIO subscribers, making them consistent with Python GCS synchronous subscribers.
- Tolerate any exception in the dashboard when subscribing to logs and error info, making this consistent with how the dashboard handles gRPC errors when obtaining node stats.
2022-01-20 07:04:54 -08:00
Hao Chen
8dcc07ec9c
[Fix][Locality] ref count should remove object locations for dead nodes (#21548)
When a node dies, the reference table should remove that node from the locations of the objects stored on it. Otherwise, locality-aware scheduling will schedule tasks to the dead node.
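
As a toy model of the idea (not the C++ reference counter changed in this PR), the sketch below keeps a map from object to node locations and prunes a node when it dies. The class name, method names, and the pick-the-smallest-node rule are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class ObjectLocations:
    """Tracks which nodes hold a copy of each object."""

    def __init__(self) -> None:
        self._locations: Dict[str, Set[str]] = defaultdict(set)

    def add(self, object_id: str, node_id: str) -> None:
        self._locations[object_id].add(node_id)

    def on_node_dead(self, node_id: str) -> None:
        # Drop the dead node everywhere so it is never chosen for scheduling.
        for nodes in self._locations.values():
            nodes.discard(node_id)

    def best_node(self, object_id: str) -> Optional[str]:
        """Pick a node that holds the object, or None if no live copy is known."""
        nodes = self._locations.get(object_id)
        return min(nodes) if nodes else None


table = ObjectLocations()
table.add("obj-1", "node-a")
table.add("obj-1", "node-b")
table.on_node_dead("node-a")
chosen = table.best_node("obj-1")
```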
2022-01-20 11:58:52 +08:00
Philipp Moritz
fbc51d6d0e
[Kuberay] Ray Autoscaler integration with Kuberay (MVP) (#21086)
This is a minimum viable product for Ray Autoscaler integration with Kuberay. It is not ready for prime time/general use, but should be enough for interested parties to get started (see the documentation in kuberay.md).
2022-01-19 19:42:17 -08:00
Wilson Wang
2626c64060
Fix monitor.py exceptions. Enable fetching GCS address from Redis with retries. (#21533)
GCS, when running as an individual component, can cause other components to fail in case of crashes. 

Here are two main cases covered in this patch:

1. monitor.py will raise an exception when disconnected from GCS.
2. When GCS becomes available later than other components, the missing KV of GCS address can cause other components to fail to start.


In our patch, we fixed these two issues as well as increased the timeout for redis connection which was too small.

Co-authored-by: Mingwei Tian <mwtian@anyscale.com>
2022-01-19 18:48:03 -08:00
Matti Picus
8104fd5c76
WINDOWS: enable passing metric tests (#21705) 2022-01-19 17:09:34 -08:00
Eric Liang
88143cdc35
[data] Unify key function type and error handling across sort, groupby, and agg (#21627)
Prior to this PR, sort, groupby, and aggregate defined separate types for extracting values from Dataset records. This was confusing since the user had to understand the differences between the different key types (which were basically exactly the same).

This PR defines a common key type: KeyFn, which is simply Union[None, str, Callable[[T], Any]]. This is used as sort(KeyFn...), aggregate(Agg(KeyFn)...), groupby(KeyFn).agg(Agg(KeyFn), ...).

It also unifies the error generation paths to a common _validate_key_fn utility. This also improves the errors generated when passing explicit AggregateFn classes, which previously failed in the workers if invalid.
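
For illustration, the common key type can be written out as below, together with a small validator and extractor. Only the `KeyFn` union comes from the description above; `validate_key_fn` and `extract_key` are hypothetical (this is not Ray's `_validate_key_fn`), and the meanings given to `None` (use the record itself) and `str` (look up a field) are assumptions.

```python
from typing import Any, Callable, TypeVar, Union

T = TypeVar("T")

# The common key type: no key, a column name, or a function of the record.
KeyFn = Union[None, str, Callable[[T], Any]]


def validate_key_fn(key: KeyFn) -> None:
    """Fail early, on the caller's side, if the key has an unsupported type."""
    if key is not None and not isinstance(key, str) and not callable(key):
        raise TypeError(
            f"key must be None, a str, or a callable, got {type(key).__name__}"
        )


def extract_key(record: Any, key: KeyFn) -> Any:
    validate_key_fn(key)
    if key is None:
        return record        # assumed: use the record itself
    if isinstance(key, str):
        return record[key]   # assumed: treat the key as a field name
    return key(record)       # call the user-supplied function


rows = [{"a": 3}, {"a": 1}, {"a": 2}]
ordered = sorted(rows, key=lambda r: extract_key(r, "a"))
```

Validating before any work is dispatched is what lets an invalid key surface as a clear error instead of a failure inside a worker.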
2022-01-19 11:15:13 -08:00
Yi Cheng
82103bf7c1
[gcs/ha] Fix cpp tests related to redis removal (#21628)
This PR fixes the cpp tests and also makes Ray cpp pass.
2022-01-19 01:26:34 -08:00