author    aktersnurra <gustaf.rydholm@gmail.com>  2020-09-08 23:14:23 +0200
committer aktersnurra <gustaf.rydholm@gmail.com>  2020-09-08 23:14:23 +0200
commit    e1b504bca41a9793ed7e88ef14f2e2cbd85724f2 (patch)
tree      70b482f890c9ad2be104f0bff8f2172e8411a2be /README.md
parent    fe23001b6588e6e6e9e2c5a99b72f3445cf5206f (diff)

IAM datasets implemented.

Diffstat (limited to 'README.md'):
 -rw-r--r--  README.md | 60
 1 file changed, 50 insertions(+), 10 deletions(-)
diff --git a/README.md b/README.md
index 844f2e0..5181386 100644
--- a/README.md
+++ b/README.md
@@ -17,7 +17,7 @@ TBC
- [x] Fix basic test to load model
- [x] Fix loading previous experiments
- [x] Able to set verbosity level on the logger to terminal output
-- [ ] Implement Callbacks for training
+- [x] Implement Callbacks for training
- [x] Implement early stopping
- [x] Implement wandb
- [x] Implement lr scheduler as a callback
@@ -25,9 +25,9 @@ TBC
- [x] Implement TQDM progress bar (Low priority)
- [ ] Check that dataset exists, otherwise download it from the web. Do this in run_experiment.py.
- [x] Create repr func for data loaders
-- [ ] Be able to restart with lr scheduler (May skip this BS)
+- [ ] Be able to restart with lr scheduler (May skip this)
- [ ] Implement population based training
-- [ ] Implement Bayesian hyperparameter search (with W&B maybe)
+- [x] Implement Bayesian hyperparameter search (with W&B maybe)
- [x] Try to fix shell cmd security issues S404, S602
- [x] Change prepare_experiment.py to print statements so that it can be run with tasks/prepare_sample_experiments.sh | parallel -j1
- [x] Fix caption in WandbImageLogger
@@ -38,10 +38,50 @@ TBC
- [x] Finish Emnist line dataset
- [x] SentenceGenerator
- [x] Write an Emnist line data loader
-- [ ] Implement ctc line model
- - [ ] Implement CNN encoder (ResNet style)
- - [ ] Implement the RNN + output layer
- - [ ] Construct/implement the CTC loss
-- [ ] Sweep base config yaml file
-- [ ] sweep.py
-- [ ] sweep.yaml
+- [x] Implement ctc line model
+ - [x] Implement CNN encoder (ResNet style)
+ - [x] Implement the RNN + output layer
+ - [x] Construct/implement the CTC loss
+- [x] Sweep base config yaml file
+- [x] sweep.py
+- [x] sweep.yaml
+- [x] Fix dataset splits.
+- [x] Implement predict on image
+- [x] CTC decoder
+- [x] IAM dataset
+- [x] IAM Lines dataset
+- [x] IAM paragraphs dataset
+- [ ] Visual attention:
+ - [ ] Enriched Deep Recurrent Visual Attention Model for Multiple Object Recognition
+ - [ ] DRAM (maybe)
+ - [ ] Dynamic Capacity Network
+- [ ] CNN + Transformer
+- [ ] Fix the nosec problem
+
+## Run Sweeps
+Run the following commands to execute a hyperparameter search with W&B (a sketch of a sweep config follows the commands):
+
+```
+wandb sweep training/sweep_emnist_resnet.yml
+export SWEEP_ID=...
+wandb agent $SWEEP_ID
+```
+
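+The sweep itself is defined in a YAML file such as `training/sweep_emnist_resnet.yml`. As a rough sketch of what such a config contains, an equivalent sweep can also be registered from Python; the program path, project name, metric, and hyperparameter below are illustrative assumptions, not taken from this repo:
+
+```
+import wandb
+
+# Equivalent of a minimal sweep YAML: Bayesian search over a single
+# hypothetical hyperparameter (all names here are illustrative).
+sweep_config = {
+    "program": "training/run_experiment.py",
+    "method": "bayes",
+    "metric": {"name": "val_loss", "goal": "minimize"},
+    "parameters": {
+        "learning_rate": {"min": 0.0001, "max": 0.1},
+    },
+}
+
+sweep_id = wandb.sweep(sweep_config, project="text-recognizer")
+print(sweep_id)  # pass this id to `wandb agent`
+```
+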
+## PyTorch Performance Guide
+Tips and tricks from ["PyTorch Performance Tuning Guide - Szymon Migacz, NVIDIA"](https://www.youtube.com/watch?v=9mS1fIYj1So&t=125s):
+
+* It is always better to use `num_workers > 0`, since it enables asynchronous data loading (see the data-loading sketch after this list).
+* Use `pin_memory=True` so that host-to-device copies can run asynchronously and overlap with GPU compute.
+* Tune `num_workers` for the problem at hand; with too many workers, data loading becomes slower.
+* For CNNs, set `torch.backends.cudnn.benchmark = True`; this lets the cuDNN autotuner select the best convolution algorithms.
+* Increase batch size to max out GPU memory.
+* Use an optimizer designed for large-batch training, e.g. LARS or LAMB.
+* Set `bias=False` for convolutions directly followed by BatchNorm, since the BatchNorm shift makes the bias redundant.
+* Use `for p in model.parameters(): p.grad = None` instead of `model.zero_grad()` (both tips are sketched after this list).
+* Be careful to disable debug APIs (detect_anomaly, profiler, gradcheck) in production.
+* Use `DistributedDataParallel`, not `DataParallel`: it runs one process per GPU instead of driving every GPU from a single process (sketched below).
+* It is important to load balance compute across all GPUs; with variably-sized inputs, some GPUs will otherwise sit idle.
+* Use a fused optimizer from NVIDIA Apex, e.g. `FusedAdam`.
+* Use checkpointing (`torch.utils.checkpoint`) to recompute memory-intensive but compute-cheap ops (e.g. activations, upsampling) in the backward pass instead of storing them (sketched below).
+* Use `@torch.jit.script`, especially to fuse long sequences of pointwise operations such as GELU (sketched below).
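+
+A rough sketch of the data-loading tips above; the dataset is a stand-in, and `batch_size`/`num_workers` need tuning for the actual problem:
+
+```
+import torch
+from torch.utils.data import DataLoader, TensorDataset
+
+# Let the cuDNN autotuner pick the fastest convolution algorithms.
+torch.backends.cudnn.benchmark = True
+
+dataset = TensorDataset(torch.randn(1024, 3, 32, 32), torch.randint(0, 10, (1024,)))
+loader = DataLoader(dataset, batch_size=128, num_workers=4, pin_memory=True)
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+for images, targets in loader:
+    # Pinned host memory makes these copies asynchronous, so the
+    # transfer overlaps with GPU compute.
+    images = images.to(device, non_blocking=True)
+    targets = targets.to(device, non_blocking=True)
+```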
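+
+A minimal sketch of the `bias=False` and gradient-clearing tips:
+
+```
+import torch
+import torch.nn as nn
+
+# BatchNorm applies its own affine shift, so a bias in the preceding
+# convolution is redundant work.
+model = nn.Sequential(
+    nn.Conv2d(3, 64, kernel_size=3, padding=1, bias=False),
+    nn.BatchNorm2d(64),
+    nn.ReLU(),
+)
+
+# Setting grads to None instead of calling model.zero_grad() skips the
+# memset; the next backward pass allocates fresh gradient buffers.
+for p in model.parameters():
+    p.grad = None
+```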
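+
+A bare-bones `DistributedDataParallel` sketch. It assumes one process per GPU, started by a launcher such as `python -m torch.distributed.launch`, which sets the environment variables that `init_process_group` reads; the tiny model is a placeholder:
+
+```
+import torch
+import torch.distributed as dist
+from torch.nn.parallel import DistributedDataParallel as DDP
+
+# One process per GPU; NCCL is the usual backend for GPU training.
+dist.init_process_group(backend="nccl")
+local_rank = dist.get_rank() % torch.cuda.device_count()
+torch.cuda.set_device(local_rank)
+
+model = torch.nn.Linear(10, 10).cuda(local_rank)
+model = DDP(model, device_ids=[local_rank])
+```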
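+
+A small `torch.utils.checkpoint` sketch: activations inside the checkpointed block are dropped after the forward pass and recomputed during backward, trading compute for memory. The block and shapes are placeholders:
+
+```
+import torch
+import torch.nn as nn
+from torch.utils.checkpoint import checkpoint
+
+block = nn.Sequential(
+    nn.Conv2d(64, 64, kernel_size=3, padding=1),
+    nn.ReLU(),
+    nn.Upsample(scale_factor=2),
+)
+
+x = torch.randn(8, 64, 32, 32, requires_grad=True)
+y = checkpoint(block, x)  # forward without storing intermediate activations
+y.sum().backward()        # block is re-run here to rebuild them
+```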
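+
+A `torch.jit.script` sketch for the pointwise-fusion tip, using the common tanh approximation of GELU written out as plain ops so the JIT can fuse the chain into fewer kernels:
+
+```
+import torch
+
+@torch.jit.script
+def fused_gelu(x: torch.Tensor) -> torch.Tensor:
+    # Long chain of pointwise ops; TorchScript fuses them into a
+    # single kernel instead of launching one per op.
+    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * (x + 0.044715 * x * x * x)))
+
+y = fused_gelu(torch.randn(16, 128))
+```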