# Text Recognizer
An implementation of the text recognizer project from the ["Full Stack Deep Learning" course](https://fullstackdeeplearning.com/march2019) (FSDL) in PyTorch, built to learn best practices for structuring a deep learning project. I have expanded on the original project with additional features and ideas from Claudio Jolowicz's ["Hypermodern Python"](https://cjolowicz.github.io/posts/hypermodern-python-01-setup/).

## Prerequisites

- [pyenv](https://github.com/pyenv/pyenv) (or similar) with Python 3.9.\* installed.

- [nox](https://nox.thea.codes/en/stable/index.html) for linting, formatting, and testing.

- [Poetry](https://python-poetry.org/) for dependency management and packaging (one way to install all three is sketched below).
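
A minimal install sketch, assuming [pipx](https://pipx.pypa.io/) is available and that any Python 3.9 patch release will do:

```sh
# Install a Python 3.9 interpreter with pyenv (any 3.9.x release works).
pyenv install 3.9.16

# Install nox and Poetry as isolated command-line tools via pipx.
pipx install nox
pipx install poetry
```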

## Installation

With pyenv and Poetry installed, pin the local Python version and install the project:

```sh
pyenv local 3.9.*  # replace * with an installed 3.9 patch version
make install
```
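
To check that the pinned interpreter and the project environment line up, the standard pyenv and Poetry introspection commands can be used:

```sh
pyenv version     # shows the version selected by the new .python-version file
poetry env info   # shows the virtualenv Poetry created for the project
```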

## Generate Datasets

Download and generate datasets by running:

```sh
make download
make generate
```

## Train

Use, modify, or create a new experiment config in `training/conf/experiment/`.
To run an experiment, first enter the virtual environment:

```sh
poetry shell
```
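
Alternatively, individual commands can be run inside the environment without spawning a shell by prefixing them with `poetry run`:

```sh
poetry run python main.py +experiment=conv_transformer_paragraphs
```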

Then train a new model:

```sh
python main.py +experiment=conv_transformer_paragraphs
```
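
The `+experiment=` syntax suggests the configuration is composed with [Hydra](https://hydra.cc), in which case individual config values can typically be overridden on the command line as well. The keys below are hypothetical and depend on the actual files under `training/conf/`:

```sh
# Hypothetical overrides -- the real keys depend on the config layout.
python main.py +experiment=conv_transformer_paragraphs trainer.max_epochs=10
```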

## Network

TODO: Create a picture of the network and place it here.

## Graveyard

Ideas of mine that unfortunately did not work:

* EfficientNet was apparently a terrible choice of encoder.
  - A ConvNeXt module, heavily based on lucidrains' [x-unet](https://github.com/lucidrains/x-unet),
  was far better at encoding the images into a useful representation.

* Use a VQ-VAE to pre-train a good latent representation.
  - Tests with various compression rates showed no performance increase compared to training directly end-to-end; if anything, performance decreased.
  - This is very unfortunate, as I really hoped this idea would work :(
  - I still really like this idea, and I might not have given up just yet...
  - I have now given up... :( ConvNeXt ftw

* Axial Transformer Encoder
  - Added a lot of extra parameters with no gain in performance.
  - Cool idea, but not worth the cost on a single GPU.

* Word Pieces
  - Might have worked better, but I liked the idea of single-character recognition more.

## Todo
- [ ] Remove einops (try)
- [ ] Tests
- [ ] Evaluation
- [ ] W&B artifact fetcher
- [ ] Fix linting
- [x] Modularize the decoder
- [ ] Add KV cache
- [ ] Train with LaProp
- [x] Fix stems
- [x] Residual attention
- [x] Single KV head
- [x] Fix rotary embedding
- [ ] Simplify attention with norm
- [ ] Tie embeddings
- [ ] CNN -> Transformer encoder -> Transformer decoder