[tune,autoscaler] Test yaml, add better distributed docs (#5403)
parent 1f8ae17f60
commit ed89897a31
10 changed files with 184 additions and 130 deletions
@@ -1,14 +1,27 @@

Tune Distributed Experiments
============================

Tune is commonly used for large-scale distributed hyperparameter optimization. Tune and Ray provide many utilities that enable an effective workflow for interacting with a cluster, including fast file mounting, one-line cluster launching, and result uploading to cloud storage.

Tune is commonly used for large-scale distributed hyperparameter optimization. This page will overview:

This page will overview the tooling for distributed experiments, covering how to connect to a cluster, how to launch a distributed experiment, and commonly used commands.

1. How to set up and launch a distributed experiment,
2. `commonly used commands <tune-distributed.html#common-commands>`_, including fast file mounting, one-line cluster launching, and result uploading to cloud storage.

Connecting to a cluster
-----------------------

**Quick Summary**: To run a distributed experiment with Tune, you need to:

One common approach to modifying an existing Tune experiment to go distributed is to set an ``argparse`` variable so that toggling between distributed and single-node is seamless. This allows Tune to utilize all the resources available to the Ray cluster.

1. Make sure your script has ``ray.init(redis_address=...)`` to connect to the existing Ray cluster.
2. If a Ray cluster does not exist, start a Ray cluster (instructions for `local machines <tune-distributed.html#local-cluster-setup>`_, `cloud <tune-distributed.html#launching-a-cloud-cluster>`_).
3. Run the script on the head node (or use ``ray submit``).

Running a distributed experiment
--------------------------------

Running a distributed (multi-node) experiment requires Ray to be started already. You can do this on local machines or on the cloud (instructions for `local machines <tune-distributed.html#local-cluster-setup>`_, `cloud <tune-distributed.html#launching-a-cloud-cluster>`_).
Across your machines, Tune will automatically detect the number of GPUs and CPUs without you needing to manage ``CUDA_VISIBLE_DEVICES``.
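If you need to control how many of these resources each trial consumes, one option is to pass ``resources_per_trial`` to ``tune.run``. A minimal sketch follows; the trainable, search space, and resource counts are illustrative placeholders rather than values from ``tune_script.py``.

.. code-block:: python

    from ray import tune

    # Illustrative sketch: request a fixed CPU/GPU allocation per trial; Tune
    # schedules trials across the cluster within these limits.
    tune.run(
        my_trainable,
        num_samples=10,
        resources_per_trial={"cpu": 2, "gpu": 1},
        config={"lr": tune.grid_search([0.01, 0.1])})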
To execute a distributed experiment, call ``ray.init(redis_address=XXX)`` before ``tune.run``, where ``XXX`` is the Ray redis address, which defaults to ``localhost:6379``. The Tune Python script should be executed only on the head node of the Ray cluster.

One common approach to modifying an existing Tune experiment to go distributed is to set an ``argparse`` variable so that toggling between distributed and single-node is seamless.

.. code-block:: python
@@ -20,58 +33,50 @@ One common approach to modifying an existing Tune experiment to go distributed i

    args = parser.parse_args()
    ray.init(redis_address=args.ray_redis_address)

Note that connecting to a cluster requires a pre-existing Ray cluster to be started already (`Manual Cluster Setup <using-ray-on-a-cluster.html>`_). The script should be run on the head node of the Ray cluster. Below, ``tune_script.py`` can be any script that runs a Tune hyperparameter search.

    tune.run(...)

.. code-block:: bash

    # Single-node execution
    $ python tune_script.py

    # On the head node, connect to an existing Ray cluster
    $ python tune_script.py --ray-redis-address=localhost:XXXX
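For reference, a complete minimal script following this pattern might look like the sketch below; the flag name, trainable, and search space are illustrative rather than taken from ``tune_script.py``.

.. code-block:: python

    import argparse

    import ray
    from ray import tune


    def my_trainable(config, reporter):
        # Illustrative trainable: report a dummy metric for a few iterations.
        for step in range(10):
            reporter(mean_accuracy=config["lr"] * step)


    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--ray-redis-address",
        default=None,
        help="Address of an existing Ray cluster; omit for single-node runs.")
    args = parser.parse_args()

    # Connects to the cluster if an address is given; otherwise starts Ray locally.
    ray.init(redis_address=args.ray_redis_address)

    tune.run(my_trainable, config={"lr": tune.grid_search([0.01, 0.1])})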
.. literalinclude:: ../../python/ray/tune/examples/mnist_pytorch.py
   :language: python
   :start-after: if __name__ == "__main__":

If you used a cluster configuration (starting a cluster with ``ray up`` or ``ray submit --start``), use:

.. code-block:: bash

Launching a cloud cluster
-------------------------

    ray submit tune-default.yaml tune_script.py --args="--ray-redis-address=localhost:6379"

.. tip::

    If you already have a list of nodes, skip down to the `Local Cluster Setup`_ section.

Ray currently supports AWS and GCP. Below, we will launch nodes on AWS that will default to using the Deep Learning AMI. See the `cluster setup documentation <autoscaling.html>`_.

.. literalinclude:: ../../python/ray/tune/examples/tune-default.yaml
   :language: yaml
   :name: tune-default.yaml

This code starts a cluster as specified by the given cluster configuration YAML file, uploads ``tune_script.py`` to the cluster, and runs ``python tune_script.py``.

.. code-block:: bash

    ray submit tune-default.yaml tune_script.py --start

.. image:: images/tune-upload.png
    :scale: 50%
    :align: center

Analyze your results on TensorBoard by starting TensorBoard on the remote head machine.

.. code-block:: bash

    # Go to http://localhost:6006 to access TensorBoard.
    ray exec tune-default.yaml 'tensorboard --logdir=~/ray_results/ --port 6006' --port-forward 6006

Note that you can customize the directory of results by running: ``tune.run(local_dir=...)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless <https://github.com/wallix/awless>`_ for easy cluster management on AWS.

1. In the examples, the Ray redis address commonly used is ``localhost:6379``.
2. If the Ray cluster is already started, you should not need to run anything on the worker nodes.
Local Cluster Setup
-------------------

If you run into issues (or want to add nodes manually), you can use the manual cluster setup `documentation here <using-ray-on-a-cluster.html>`__. At a glance, on the head node, run the following.

If you already have a list of nodes, you can follow the local private cluster setup `instructions here <autoscaling.html#quick-start-private-cluster>`_. Below is an example cluster configuration as ``tune-default.yaml``:

.. literalinclude:: ../../python/ray/tune/examples/tune-local-default.yaml
   :language: yaml

``ray up`` starts Ray on the cluster of nodes.

.. code-block:: bash

    ray up tune-default.yaml

``ray submit`` uploads ``tune_script.py`` to the cluster and runs ``python tune_script.py [args]``.

.. code-block:: bash

    ray submit tune-default.yaml tune_script.py --args="--ray-redis-address=localhost:6379"

Manual Local Cluster Setup
~~~~~~~~~~~~~~~~~~~~~~~~~~

If you run into issues using the local cluster setup (or want to add nodes manually), you can use the manual cluster setup. `Full documentation here <using-ray-on-a-cluster.html>`__. At a glance,

**On the head node**:

.. code-block:: bash
@@ -86,8 +91,51 @@ The command will print out the address of the Redis server that was started (and

    $ ray start --redis-address=<redis-address>

If you already have a list of nodes, you can follow the private autoscaling cluster setup `instructions here <autoscaling.html#quick-start-private-cluster>`_.

Then, you can run your Tune Python script on the head node like:

.. code-block:: bash

    # On the head node, execute using existing Ray cluster
    $ python tune_script.py --ray-redis-address=<redis-address>

Launching a cloud cluster
-------------------------

.. tip::

    If you already have a list of nodes, go to the `Local Cluster Setup`_ section.

Ray currently supports AWS and GCP. Below, we will launch nodes on AWS that will default to using the Deep Learning AMI. See the `cluster setup documentation <autoscaling.html>`_. Save the below cluster configuration (``tune-default.yaml``):

.. literalinclude:: ../../python/ray/tune/examples/tune-default.yaml
   :language: yaml
   :name: tune-default.yaml

``ray up`` starts Ray on the cluster of nodes.

.. code-block:: bash

    ray up tune-default.yaml

``ray submit --start`` starts a cluster as specified by the given cluster configuration YAML file, uploads ``tune_script.py`` to the cluster, and runs ``python tune_script.py [args]``.

.. code-block:: bash

    ray submit tune-default.yaml tune_script.py --start --args="--ray-redis-address=localhost:6379"

.. image:: images/tune-upload.png
    :scale: 50%
    :align: center

Analyze your results on TensorBoard by starting TensorBoard on the remote head machine.

.. code-block:: bash

    # Go to http://localhost:6006 to access TensorBoard.
    ray exec tune-default.yaml 'tensorboard --logdir=~/ray_results/ --port 6006' --port-forward 6006

Note that you can customize the directory of results by running: ``tune.run(local_dir=...)``. You can then point TensorBoard to that directory to visualize results. You can also use `awless <https://github.com/wallix/awless>`_ for easy cluster management on AWS.
Pre-emptible Instances (Cloud)
------------------------------

@@ -180,7 +228,7 @@ To summarize, here are the commands to run:

    # wait a while until after all nodes have started
    ray kill-random-node tune-default.yaml --hard

You should see Tune eventually continue the trials on a different worker node. See the `Saving and Recovery <tune-usage.html#saving-and-recovery>`__ section for more details.

You should see Tune eventually continue the trials on a different worker node. See the `Save and Restore <tune-usage.html#save-and-restore>`__ section for more details.

You can also specify ``tune.run(upload_dir=...)`` to sync results with cloud storage like S3, persisting results in case you want to start and stop your cluster automatically.
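A rough sketch of what this can look like; the trainable, bucket path, and checkpoint settings below are illustrative placeholders:

.. code-block:: python

    from ray import tune

    # Illustrative: persist results to cloud storage so they survive cluster teardown.
    tune.run(
        my_trainable,
        upload_dir="s3://my-tune-results/experiments",
        checkpoint_freq=10,   # checkpoint every 10 training iterations
        max_failures=3)       # retry failed trials from the last checkpoint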
@@ -35,7 +35,7 @@ Tune includes a distributed implementation of `Population Based Training (PBT) <

        })
    tune.run( ... , scheduler=pbt_scheduler)

When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `save and restore <tune-usage.html#saving-and-recovery>`__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.
When the PBT scheduler is enabled, each trial variant is treated as a member of the population. Periodically, top-performing trials are checkpointed (this requires your Trainable to support `save and restore <tune-usage.html#save-and-restore>`__). Low-performing trials clone the checkpoints of top performers and perturb the configurations in the hope of discovering an even better variation.
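For context, constructing such a scheduler might look roughly like the sketch below; the metric name, perturbation interval, and mutation range are illustrative, and the exact constructor arguments can differ between Ray versions.

.. code-block:: python

    import random

    from ray import tune
    from ray.tune.schedulers import PopulationBasedTraining

    # Illustrative sketch of a PBT scheduler; the trainable and values are placeholders.
    pbt_scheduler = PopulationBasedTraining(
        time_attr="training_iteration",
        reward_attr="mean_accuracy",
        perturbation_interval=10,  # exploit/explore every 10 iterations
        hyperparam_mutations={
            "lr": lambda: random.uniform(1e-4, 1e-1),
        })

    tune.run(my_trainable, num_samples=8, scheduler=pbt_scheduler)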
You can run this `toy PBT example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_example.py>`__ to get an idea of how PBT operates. When training in PBT mode, a single trial may see many different hyperparameters over its lifetime, which is recorded in its ``result.json`` file. The following figure generated by the example shows PBT optimizing an LR schedule over the course of a single experiment:
@@ -69,7 +69,7 @@ Compared to the original version of HyperBand, this implementation provides bett

HyperBand
---------

.. note:: Note that the HyperBand scheduler requires your trainable to support saving and restoring, which is described in `Tune User Guide <tune-usage.html#saving-and-recovery>`__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.

.. note:: Note that the HyperBand scheduler requires your trainable to support saving and restoring, which is described in `Tune User Guide <tune-usage.html#save-and-restore>`__. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.
Tune also implements the `standard version of HyperBand <https://arxiv.org/abs/1603.06560>`__. You can use it as such:
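A rough sketch of doing so; the metric attribute, ``max_t`` value, and trainable are illustrative, and the exact constructor arguments can differ between Ray versions:

.. code-block:: python

    from ray import tune
    from ray.tune.schedulers import HyperBandScheduler

    # Illustrative sketch of the standard HyperBand scheduler; values are placeholders.
    hyperband = HyperBandScheduler(
        time_attr="training_iteration",
        reward_attr="mean_accuracy",
        max_t=100)  # maximum iterations per trial

    tune.run(my_trainable, num_samples=20, scheduler=hyperband)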
@@ -255,8 +255,8 @@ If your trainable function / class creates further Ray actors or tasks that also

        }
    )

Saving and Recovery
-------------------
Save and Restore
----------------

When running a hyperparameter search, Tune can automatically and periodically save/checkpoint your model. Checkpointing is used for
@@ -296,7 +296,7 @@ Checkpoints will be saved by training iteration to ``local_dir/exp_name/trial_na

Trainable (Trial) Checkpointing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-------------------------------
Checkpointing assumes that the model state will be saved to disk on whichever node the Trainable is running on. You can checkpoint with three different mechanisms: manually, periodically, and at termination.
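As a rough sketch of what a class-based Trainable with checkpoint support can look like (the state being saved here is purely illustrative):

.. code-block:: python

    import os
    import pickle

    from ray.tune import Trainable


    class MyTrainable(Trainable):
        def _setup(self, config):
            self.lr = config.get("lr", 0.01)
            self.steps = 0

        def _train(self):
            self.steps += 1
            return {"mean_accuracy": min(1.0, 0.05 * self.steps)}

        def _save(self, checkpoint_dir):
            # Model state is written to disk on the node running this Trainable.
            path = os.path.join(checkpoint_dir, "state.pkl")
            with open(path, "wb") as f:
                pickle.dump(self.steps, f)
            return path

        def _restore(self, checkpoint_path):
            with open(checkpoint_path, "rb") as f:
                self.steps = pickle.load(f)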
@@ -349,7 +349,7 @@ The checkpoint will be saved at a path that looks like ``local_dir/exp_name/tria

    )

Fault Tolerance
~~~~~~~~~~~~~~~
---------------
Tune will automatically restart trials from the last checkpoint in case of trial failures/error (if ``max_failures`` is set), both in the single node and distributed setting.
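A minimal sketch of enabling this; the checkpoint frequency and retry count are illustrative, and ``MyTrainable`` is the placeholder class from the sketch above:

.. code-block:: python

    # Checkpoint periodically and retry failed trials up to 3 times
    # from their most recent checkpoint.
    tune.run(
        MyTrainable,
        checkpoint_freq=10,
        max_failures=3)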
@@ -48,14 +48,6 @@ If TensorBoard is installed, automatically visualize all trial results:

Distributed Quick Start
-----------------------

.. note::

    This assumes that you have already set up your AWS account and AWS credentials (``aws configure``). To run this example, you will need to install the following:

    .. code-block:: bash

        $ pip install ray torch torchvision filelock

1. Import and initialize Ray by appending the following to your example script.

.. code-block:: python
@@ -71,25 +63,35 @@ Distributed Quick Start

Alternatively, download a full example script here: :download:`mnist_pytorch.py <../../python/ray/tune/examples/mnist_pytorch.py>`

2. Download an example cluster yaml here: :download:`tune-default.yaml <../../python/ray/tune/examples/tune-default.yaml>`
2. Download the following example Ray cluster configuration as ``tune-local-default.yaml`` and replace the appropriate fields:

.. literalinclude:: ../../python/ray/tune/examples/tune-local-default.yaml
   :language: yaml

Alternatively, download it here: :download:`tune-local-default.yaml <../../python/ray/tune/examples/tune-local-default.yaml>`. See `Ray cluster docs here <autoscaling.html>`_.

3. Run ``ray submit`` like the following.

.. code-block:: bash

    ray submit tune-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
    ray submit tune-local-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start

This will start 3 AWS machines and run a distributed hyperparameter search across them. Append ``[--stop]`` to automatically shut down your nodes afterwards.
This will start Ray on all of your machines and run a distributed hyperparameter search across them.

To summarize, here is the full set of commands:

.. code-block:: bash

    wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/examples/mnist_pytorch.py
    wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/tune-default.yaml
    ray submit tune-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start
    wget https://raw.githubusercontent.com/ray-project/ray/master/python/ray/tune/tune-local-default.yaml
    ray submit tune-local-default.yaml mnist_pytorch.py --args="--ray-redis-address=localhost:6379" --start

Take a look at the `Distributed Experiments <tune-distributed.html>`_ documentation for more details, including setting up distributed experiments on local machines, using GCP, adding resilience to spot instance usage, and more.
Take a look at the `Distributed Experiments <tune-distributed.html>`_ documentation for more details, including:

1. Setting up distributed experiments on your local cluster
2. Using AWS and GCP
3. Spot instance usage/pre-emptible instances, and more.

Getting Started
---------------
python/ray/tests/test_autoscaler_yaml.py (new file, 34 lines added)
@@ -0,0 +1,34 @@

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import os
import unittest
import yaml

from ray.autoscaler.autoscaler import fillout_defaults, validate_config
from ray.tests.utils import recursive_fnmatch

RAY_PATH = os.path.abspath(os.path.join(__file__, "../../"))
CONFIG_PATHS = recursive_fnmatch(
    os.path.join(RAY_PATH, "autoscaler"), "*.yaml")

CONFIG_PATHS += recursive_fnmatch(
    os.path.join(RAY_PATH, "tune/examples/"), "*.yaml")


class AutoscalingConfigTest(unittest.TestCase):
    def testValidateDefaultConfig(self):
        # Every bundled autoscaler and Tune example YAML should pass
        # validation once defaults are filled in.
        for config_path in CONFIG_PATHS:
            with open(config_path) as f:
                config = yaml.safe_load(f)
            config = fillout_defaults(config)
            try:
                validate_config(config)
            except Exception:
                self.fail("Config did not pass validation test!")


if __name__ == "__main__":
    unittest.main(verbosity=2)
@@ -2,6 +2,7 @@ from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import fnmatch
import os
import subprocess
import sys

@@ -116,3 +117,15 @@ def wait_for_condition(condition_predictor,
        time_elapsed += retry_interval_ms
        time.sleep(retry_interval_ms / 1000.0)
    return False


def recursive_fnmatch(dirpath, pattern):
    """Looks at a file directory subtree for a filename pattern.

    Similar to glob.glob(..., recursive=True) but also supports 2.7
    """
    matches = []
    for root, dirnames, filenames in os.walk(dirpath):
        for filename in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(root, filename))
    return matches
@@ -1,51 +1,10 @@

# A unique identifier for the head node and workers of this cluster.
cluster_name: tune-example

# The minimum number of worker nodes to launch in addition to the head
# node. This number should be >= 0.
min_workers: 2

# The maximum number of worker nodes to launch in addition to the head
# node. This takes precedence over min_workers.
max_workers: 2

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    # Nodes are currently spread between zones by a round-robin approach,
    # however this implementation detail should not be relied upon.
    availability_zone: us-west-2a,us-west-2b

# How Ray will authenticate with newly launched nodes.
# By default Ray creates a new private keypair, but you can also use your own.
auth:
    ssh_user: ubuntu

# Provider-specific config for the head node, e.g. instance type.
head_node:
    InstanceType: c5.xlarge
    ImageId: ami-0b294f219d14e6a82  # Deep Learning AMI (Ubuntu) Version 21.0

# Provider-specific config for worker nodes, e.g. instance type.
worker_nodes:
    InstanceType: c5.xlarge
    ImageId: ami-0b294f219d14e6a82  # Deep Learning AMI (Ubuntu) Version 21.0

    # Run workers on spot by default. Comment this out to use on-demand.
    InstanceMarketOptions:
        MarketType: spot

# Files or directories to copy to the head and worker nodes. The format is a
# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
file_mounts: {
#    "/path1/on/remote/machine": "/path1/on/local/machine",
#    "/path2/on/remote/machine": "/path2/on/local/machine",
}

# List of shell commands to run to set up each node.
setup_commands:
    - pip install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.8.0.dev3-cp36-cp36m-manylinux1_x86_64.whl
    - pip install torch torchvision tabulate tensorboard filelock

cluster_name: tune-default
provider: {type: aws, region: us-west-2}
auth: {ssh_user: ubuntu}
min_workers: 3
max_workers: 3
# Deep Learning AMI (Ubuntu) Version 21.0
head_node: {InstanceType: c5.xlarge, ImageId: ami-0b294f219d14e6a82}
worker_nodes: {InstanceType: c5.xlarge, ImageId: ami-0b294f219d14e6a82}
setup_commands: # Set up each node.
    - pip install ray torch torchvision tabulate tensorboard
python/ray/tune/examples/tune-local-default.yaml (new file, 11 lines added)
@@ -0,0 +1,11 @@

cluster_name: local-default
provider:
    type: local
    head_ip: YOUR_HEAD_NODE_HOSTNAME
    worker_ips: [WORKER_NODE_1_HOSTNAME, WORKER_NODE_2_HOSTNAME, ... ]
auth: {ssh_user: YOUR_USERNAME, ssh_private_key: ~/.ssh/id_rsa}
## Typically for local clusters, min_workers == max_workers.
min_workers: 3
max_workers: 3
setup_commands: # Set up each node.
    - pip install ray torch torchvision tabulate tensorboard
@@ -10,7 +10,8 @@ import unittest

import ray
from ray import tune
from ray.tune.util import recursive_fnmatch, validate_save_restore
from ray.tests.utils import recursive_fnmatch
from ray.tune.util import validate_save_restore
from ray.rllib import _register_all
@@ -4,9 +4,7 @@ from __future__ import print_function

import base64
import copy
import fnmatch
import logging
import os
import threading
import time
from collections import defaultdict

@@ -213,18 +211,6 @@ def _from_pinnable(obj):
    return obj[0]


def recursive_fnmatch(dirpath, pattern):
    """Looks at a file directory subtree for a filename pattern.

    Similar to glob.glob(..., recursive=True) but also supports 2.7
    """
    matches = []
    for root, dirnames, filenames in os.walk(dirpath):
        for filename in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(root, filename))
    return matches


def validate_save_restore(trainable_cls, config=None, use_object_store=False):
    """Helper method to check if your Trainable class will resume correctly.