#!/bin/bash
# ray/doc/examples/lm/ray_train.sh

TOTAL_UPDATES=125000       # Total number of training steps
WARMUP_UPDATES=10000       # Warmup the learning rate over this many updates
PEAK_LR=0.0005             # Peak learning rate, adjust as needed
TOKENS_PER_SAMPLE=512      # Max sequence length
#MAX_POSITIONS=512         # Num. positional embeddings (usually same as above)
MAX_SENTENCES=8            # Number of sequences per batch on one GPU (batch size)
FIX_BATCH_SIZE=2048        # Total batch size across all GPUs (max_sentences * update_freq * n_gpus)
SAVE_INTERVAL_UPDATES=1000 # Save a checkpoint every N updates
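
# Sketch (assumption, not part of ray_train.py): preview the learning rate
# that the --lr-scheduler polynomial_decay flags below should produce. This
# assumes fairseq's defaults --power 1.0 and --end-learning-rate 0.0, since
# neither flag is set in this script. Under those defaults the LR rises
# linearly to PEAK_LR over WARMUP_UPDATES and then decays linearly to 0 at
# TOTAL_UPDATES. lr_at_update is a hypothetical helper.
# Usage: lr_at_update 67500   # prints 0.00025000 with the values above
lr_at_update() {
    # $1: update number; prints the learning rate with 8 decimal places
    awk -v t="$1" -v peak="$PEAK_LR" -v warm="$WARMUP_UPDATES" \
        -v total="$TOTAL_UPDATES" 'BEGIN {
        if (warm > 0 && t <= warm) lr = peak * t / warm        # linear warmup
        else if (t >= total) lr = 0                            # schedule finished
        else lr = peak * (1 - (t - warm) / (total - warm))     # linear decay (power 1)
        printf "%.8f\n", lr
    }'
}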

LOG_DIR=$HOME/efs/lm/log/
DATA_DIR=$HOME/efs/lm/data-bin/wikitext-103/
mkdir -p "$LOG_DIR"
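
# Sketch (assumption, not part of ray_train.py): the FIX_BATCH_SIZE comment
# above implies update_freq = FIX_BATCH_SIZE / (MAX_SENTENCES * n_gpus).
# implied_update_freq is a hypothetical helper that previews this value for
# a given GPU count. It rounds up and warns on stderr when the total batch
# size is not an exact multiple. The rounding that ray_train.py itself
# applies is not shown here, so treat the result as an estimate.
# Usage: implied_update_freq 8   # prints 32 with the values above
implied_update_freq() {
    local n_gpus=$1
    # Sequences processed per forward/backward pass across all GPUs
    local per_step=$((MAX_SENTENCES * n_gpus))
    if [ $((FIX_BATCH_SIZE % per_step)) -ne 0 ]; then
        echo "warning: FIX_BATCH_SIZE=$FIX_BATCH_SIZE is not a multiple of $per_step" >&2
    fi
    # Ceiling division so the effective batch is at least FIX_BATCH_SIZE
    echo $(( (FIX_BATCH_SIZE + per_step - 1) / per_step ))
}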

python "$HOME"/efs/lm/ray_train.py --fp16 "$DATA_DIR" \
    --task masked_lm --criterion masked_lm \
    --arch roberta_base --sample-break-mode complete --tokens-per-sample $TOKENS_PER_SAMPLE \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr $PEAK_LR --warmup-updates $WARMUP_UPDATES --total-num-update $TOTAL_UPDATES \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --max-sentences $MAX_SENTENCES \
    --fix-batch-size $FIX_BATCH_SIZE \
    --max-update $TOTAL_UPDATES --log-format simple --log-interval 1 \
    --save-interval-updates $SAVE_INTERVAL_UPDATES \
    --save-dir "$LOG_DIR" --ddp-backend=no_c10d