Diffstat (limited to 'README.md')
-rw-r--r-- | README.md | 15 |
1 files changed, 15 insertions, 0 deletions
@@ -7,8 +7,23 @@ Implementing the text recognizer project from the course ["Full Stack Deep Learn
 TBC
+### Build word piece dataset
+Extract text from the iam dataset:
+```
+poetry run extract-iam-text --use_words --save_text train.txt --save_tokens letters.txt
+```
+Create word pieces from the extracted training text:
+```
+poetry run make-wordpieces --output_prefix iamdb_1kwp --text_file train.txt --num_pieces 100
+```
+
+Optionally, build a transition graph for word pieces:
+```
+poetry run build-transitions --tokens iamdb_1kwp_tokens_1000.txt --lexicon iamdb_1kwp_lex_1000.txt --blank optional --self_loops --save_path 1kwp_prune_0_10_optblank.bin --prune 0 10
+```
+(TODO: Not working atm, needed for GTN loss function)
 ## Todo
 - [x] create wordpieces