author Gustaf Rydholm <gustaf.rydholm@gmail.com> 2021-03-29 21:39:58 +0200
committer Gustaf Rydholm <gustaf.rydholm@gmail.com> 2021-03-29 21:39:58 +0200
commit d21594211e29c40c135b753e33b248b0737cd76f
tree ea37e2701fdc45b27815d8831e6b60ff6888e168 /README.md
parent 46a1472d33d3a4180798492e819f2ec02bc3b1a3
Refactor word piece scripts
Diffstat (limited to 'README.md')
-rw-r--r-- README.md 15
1 files changed, 15 insertions, 0 deletions
diff --git a/README.md b/README.md
index ac4acd8..cfe37ff 100644
--- a/README.md
+++ b/README.md
@@ -7,8 +7,23 @@ Implementing the text recognizer project from the course ["Full Stack Deep Learn
TBC
+### Build word piece dataset
+Extract text from the iam dataset:
+```
+poetry run extract-iam-text --use_words --save_text train.txt --save_tokens letters.txt
+```
+Create word pieces from the extracted training text:
+```
+poetry run make-wordpieces --output_prefix iamdb_1kwp --text_file train.txt --num_pieces 100
+```
+
+Optionally, build a transition graph for word pieces:
+```
+poetry run build-transitions --tokens iamdb_1kwp_tokens_1000.txt --lexicon iamdb_1kwp_lex_1000.txt --blank optional --self_loops --save_path 1kwp_prune_0_10_optblank.bin --prune 0 10
+```
+(TODO: Not working atm, needed for GTN loss function)
## Todo
- [x] create wordpieces