.. _gotchas:

Ray Gotchas
===========

Ray has some aspects of its behavior that might catch
users off guard. There may be sound arguments for these design choices.

In particular, users tend to think of Ray as running on their local machine, and
while that is often true, Ray really runs on a cluster of machines,
so code that implicitly assumes a single machine does not always behave as expected.

Environment variables are not passed from the driver to workers
---------------------------------------------------------------

**Issue**: If you set an environment variable at the command line, it is not passed to all the workers running in the cluster
if the cluster was started previously.

**Example**: If you have a file ``baz.py`` in the directory you are running Ray in, and you run the following command:

.. literalinclude:: doc_code/gotchas.py
  :language: python
  :start-after: __env_var_start__
  :end-before: __env_var_end__

**Expected behavior**: Most people would expect (as if it were a single process on a single machine) the environment variables to be the same in all workers. They won't be.

**Fix**: Use runtime environments to pass environment variables explicitly.
If you call ``ray.init(runtime_env=...)``,
then the workers will have the environment variable set.

.. literalinclude:: doc_code/gotchas.py
  :language: python
  :start-after: __env_var_fix_start__
  :end-before: __env_var_fix_end__
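
For reference, a minimal sketch of what passing an environment variable through a runtime environment can look like (the variable name ``MY_ENV_VAR`` and its value are illustrative only, not taken from the referenced snippet):

.. code-block:: python

    import os

    import ray

    # Pass the variable to every worker via the runtime environment.
    ray.init(runtime_env={"env_vars": {"MY_ENV_VAR": "some-value"}})

    @ray.remote
    def read_env_var():
        # Workers see the value set in the runtime environment.
        return os.environ.get("MY_ENV_VAR")

    print(ray.get(read_env_var.remote()))  # "some-value"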

Filenames work sometimes and not at other times
-----------------------------------------------

**Issue**: If you reference a file by name in a task or actor,
it will sometimes work and sometimes fail. This is
because if the task or actor runs on the head node
of the cluster, it will work, but if the task or actor
runs on another machine, it won't.

**Example**: Let's say we do the following command:

.. code-block:: bash

    % touch /tmp/foo.txt

And I have this code:

.. code-block:: python

    import os

    import ray

    ray.init()

    @ray.remote
    def check_file():
        foo_exists = os.path.exists("/tmp/foo.txt")
        print(f"Foo exists? {foo_exists}")
        return foo_exists

    futures = []
    for _ in range(1000):
        futures.append(check_file.remote())

    print(ray.get(futures))

then you will get a mix of True and False. If
``check_file()`` runs on the head node, or we're running
locally, it works. But if it runs on a worker node, it returns ``False``.

**Expected behavior**: Most people would expect this to either fail or succeed consistently.
It's the same code after all.

**Fix**

- Use only shared paths for such applications -- e.g. if you are using a network file system you can use that, or the files can be on S3.
- Do not rely on file path consistency; if a task needs the file's contents, ship the data itself (for example through the object store), as sketched below.
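
A minimal sketch of the second point, shipping the file's contents through Ray's object store rather than relying on a path (the helper ``process_contents`` is hypothetical, used only for illustration):

.. code-block:: python

    import ray

    ray.init()

    # Read the file once on the driver, where it is known to exist,
    # and put its contents into the object store.
    with open("/tmp/foo.txt", "rb") as f:
        contents_ref = ray.put(f.read())

    @ray.remote
    def process_contents(contents: bytes) -> int:
        # Workers receive the bytes directly; no file needs to exist locally.
        return len(contents)

    print(ray.get(process_contents.remote(contents_ref)))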

Placement groups are not composable
-----------------------------------

**Issue**: If you have a task that is called from something that runs in a placement
group, the resources are never allocated and it hangs.

**Example**: You are using Ray Tune, which creates placement groups, and you want to
apply it to an objective function, but that objective function makes use
of Ray Tasks itself, e.g.

.. code-block:: python

    def create_task_that_uses_resources():
        @ray.remote(num_cpus=10)
        def sample_task():
            print("Hello")
            return

        return ray.get([sample_task.remote() for i in range(10)])

    def objective(config):
        create_task_that_uses_resources()

    analysis = tune.run(objective, config=search_space)

This will hang forever.

**Expected behavior**: The above executes and doesn't hang.

**Fix**: In the ``@ray.remote`` declaration of tasks
called by ``create_task_that_uses_resources()``, include
``placement_group=None``.

.. code-block:: diff

    def create_task_that_uses_resources():
    + @ray.remote(num_cpus=10, placement_group=None)
    - @ray.remote(num_cpus=10)
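
For reference, here is what the complete corrected helper from the example above might look like with the fix applied (a sketch following the example, not a runnable Tune program):

.. code-block:: python

    import ray

    def create_task_that_uses_resources():
        # placement_group=None tells Ray not to schedule these tasks inside
        # the placement group created for the Tune trial.
        @ray.remote(num_cpus=10, placement_group=None)
        def sample_task():
            print("Hello")
            return

        return ray.get([sample_task.remote() for i in range(10)])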