The following are rules applied to texts before or after they are tokenized.
test_eq(spec_add_spaces('#fastai'), ' # fastai')
test_eq(spec_add_spaces('/fastai'), ' / fastai')
test_eq(spec_add_spaces('\\fastai'), ' \\ fastai')
test_eq(rm_useless_spaces('a  b  c'), 'a b c')
It only starts replacing at 3 or more repetitions of the same character.
test_eq(replace_rep('aa'), 'aa')
test_eq(replace_rep('aaaa'), f' {TK_REP} 4 a ')
It only starts replacing at 3 or more repetitions of the same word.
test_eq(replace_wrep('ah ah'), 'ah ah')
test_eq(replace_wrep('ah ah ah'), f' {TK_WREP} 3 ah ')
test_eq(replace_wrep('ah ah ah ah'), f' {TK_WREP} 4 ah ')
test_eq(replace_wrep('ah ah ah ah '), f' {TK_WREP} 4 ah ')
test_eq(replace_wrep('ah ah ah ah.'), f' {TK_WREP} 4 ah .')
test_eq(replace_wrep('ah ah ahi'), f'ah ah ahi')
test_eq(fix_html('#39;bli#146;'), "'bli'")
test_eq(fix_html('Sarah amp; Duck...'), 'Sarah & Duck …')
test_eq(fix_html('a nbsp; #36;'), 'a   $')
test_eq(fix_html('\\" <unk>'), f'" {UNK}')
test_eq(fix_html('quot;  @.@  @-@ '), "' .-")
test_eq(fix_html('<br />text\\n'), '\ntext\n')
test_eq(replace_all_caps("I'M SHOUTING"), f"{TK_UP} i'm {TK_UP} shouting")
test_eq(replace_all_caps("I'm speaking normally"), "I'm speaking normally")
test_eq(replace_all_caps("I am speaking normally"), "i am speaking normally")
test_eq(replace_maj("Jeremy Howard"), f'{TK_MAJ} jeremy {TK_MAJ} howard')
test_eq(replace_maj("I don't think there is any maj here"), "i don't think there is any maj here")
A tokenizer is a class that must implement __call__. This method receives an iterator of texts and must return a generator with their tokenized versions. Here is the most basic example:
tok = BaseTokenizer()
test_eq(tok(["This is a text"]), [["This", "is", "a", "text"]])
tok = BaseTokenizer('x')
test_eq(tok(["This is a text"]), [["This is a te", "t"]])
tok = SpacyTokenizer()
inp,exp = "This isn't the easiest text.",["This", "is", "n't", "the", "easiest", "text", "."]
test_eq(L(tok([inp,inp])), [exp,exp])
f = TokenizeWithRules(BaseTokenizer(),rules=[replace_all_caps])
test_eq(f(["THIS isn't a problem"]), [[TK_UP, 'this', "isn't", 'a', 'problem']])
f = TokenizeWithRules(SpacyTokenizer())
test_eq(f(["This isn't a problem"]), [[BOS, TK_MAJ, 'this', 'is', "n't", 'a', 'problem']])
f = TokenizeWithRules(BaseTokenizer(split_char="'"), rules=[])
test_eq(f(["This isn't a problem"]), [['This▁isn', 't▁a▁problem']])
The main function that will be called during one of the processes handling tokenization. It will iterate through the batch of texts, apply the rules to them, and tokenize them.
texts = ["this is a text", "this is another text"]
tok = TokenizeWithRules(BaseTokenizer(), texts.__getitem__)
test_eq(tok([0,1]), [['this', 'is', 'a', 'text'],['this', 'is', 'another', 'text']])
test_eq(tokenize1("This isn't a problem", SpacyTokenizer()),
[BOS, TK_MAJ, 'this', 'is', "n't", 'a', 'problem'])
test_eq(tokenize1("This isn't a problem", tok=BaseTokenizer(), rules=[]),
['This',"isn't",'a','problem'])
Note that since this uses parallel_gen behind the scenes, the generator returned contains tuples of indices and results. There is no guarantee that the results are returned in order, so you should sort by the first item of the tuples (the indices) if you need them ordered.
res = parallel_tokenize(['0 1', '1 2'], rules=[], n_workers=2)
idxs,toks = zip(*L(res).sorted(itemgetter(0)))
test_eq(toks, [['0','1'],['1','2']])
Preprocessing function for texts in filenames. Tokenized texts will be saved in a similar fashion in a directory suffixed with _tok in the parent folder of path (override with output_dir). This directory is the return value.
The result will be in output_dir (defaults to a folder in the same parent directory as path, with _tok added to path.name) with the same structure as in path. Tokenized texts for a given file will be in the file having the same name in output_dir. Additionally, a file with a .len suffix contains the number of tokens, and the count of all words is stored in output_dir/counter.pkl.
extensions will default to ['.txt'] and all text files in path are processed unless you specify a list of folders in include. rules (which default to defaults.text_proc_rules) are applied to each text before it goes into the tokenizer.
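For instance, here is a minimal sketch of tokenizing a folder of text files with the defaults (the texts directory is hypothetical, the usual fastai.text imports are assumed, and per the description above the returned path points at the _tok directory):
from fastai.text.all import *

# Hypothetical folder of .txt files, tokenized with the default rules and tokenizer.
path = Path('texts')
out_dir = tokenize_folder(path)

# The tokenized copies mirror the input structure; word counts live in counter.pkl.
print(out_dir)
print((out_dir/'counter.pkl').exists())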
This function returns a new dataframe with the same non-text columns, a column named text that contains the tokenized texts and a column named text_lengths that contains their respective length. It also returns a counter of all seen words to quickly build a vocabulary afterward.
rules (which default to defaults.text_proc_rules) are applied to each text before it goes into the tokenizer. If mark_fields isn't specified, it defaults to False when there is a single text column, True when there are several. In that case, the texts in each of those columns are joined with FLD markers followed by the number of the field.
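As a quick sketch on a toy DataFrame (the column names are made up; defaults are kept for rules and mark_fields):
from fastai.text.all import *

df = pd.DataFrame({'text': ["This is a text", "This is another text"], 'label': [0, 1]})
# tokenize_df returns the new DataFrame plus a Counter of every word seen.
out,count = tokenize_df(df, text_cols='text', n_workers=1)
print(out.head())             # non-text columns plus the tokenized text and its length
print(count.most_common(3))   # useful for building a vocabulary afterward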
The result will be written in a new csv file in outname (defaults to the same as fname with the suffix _tok.csv) and will have the same header as the original file, the same non-text columns, and a text and a text_lengths column as described in tokenize_df.
rules (which default to defaults.text_proc_rules) are applied to each text before it goes into the tokenizer. If mark_fields isn't specified, it defaults to False when there is a single text column, True when there are several. In that case, the texts in each of those columns are joined with FLD markers followed by the number of the field.
The csv file is opened with header and optionally processed in chunks of chunksize rows at a time. If this argument is passed, each chunk is processed independently and saved in the output file as it goes, which reduces memory usage.
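A minimal sketch, assuming a hypothetical reviews.csv with a text column:
from fastai.text.all import *

# The tokenized copy is written next to the input as reviews_tok.csv,
# processed 5000 rows at a time to keep memory usage low.
tokenize_csv(Path('reviews.csv'), text_cols='text', chunksize=5000, n_workers=1)

# load_tokenized_csv reads back the tokenized texts together with the saved counter.
df_tok,count = load_tokenized_csv(Path('reviews_tok.csv'))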
def _prepare_texts(tmp_d):
    "Prepare texts in a folder structure in tmp_d, a csv file, and return a dataframe"
    path = Path(tmp_d)/'tmp'
    path.mkdir()
    for d in ['a', 'b', 'c']:
        (path/d).mkdir()
        for i in range(5):
            with open(path/d/f'text{i}.txt', 'w') as f: f.write(f"This is an example of text {d} {i}")
    texts = [f"This is an example of text {d} {i}" for i in range(5) for d in ['a', 'b', 'c']]
    df = pd.DataFrame({'text': texts, 'label': list(range(15))}, columns=['text', 'label'])
    csv_fname = tmp_d/'input.csv'
    df.to_csv(csv_fname, index=False)
    return path,df,csv_fname
with tempfile.TemporaryDirectory() as tmp_d:
    path,df,csv_fname = _prepare_texts(Path(tmp_d))
    items = get_text_files(path)
    splits = RandomSplitter()(items)
    dsets = Datasets(items, [Tokenizer.from_folder(path)], splits=splits)
    print(dsets.train[0])
    dsets = Datasets(df, [Tokenizer.from_df('text')], splits=splits)
    print(dsets.train[0][0].text)
    tst = test_set(dsets, ['This is a test', 'this is another test'])
    test_eq(tst, [(['xxbos', 'xxmaj', 'this','is','a','test'],),
                  (['xxbos','this','is','another','test'],)])
texts = [f"This is an example of text {i}" for i in range(10)]
df = pd.DataFrame({'text': texts, 'label': list(range(10))}, columns=['text', 'label'])
out,cnt = tokenize_df(df, text_cols='text', tok=SentencePieceTokenizer(vocab_sz=34), n_workers=1)
with tempfile.TemporaryDirectory() as tmp_d:
    path,df,csv_fname = _prepare_texts(Path(tmp_d))
    items = get_text_files(path)
    splits = RandomSplitter()(items)
    tok = SentencePieceTokenizer(special_toks=[])
    dsets = Datasets(items, [Tokenizer.from_folder(path, tok=tok)], splits=splits)
    print(dsets.train[0])
    dsets = Datasets(df, [Tokenizer.from_df('text', tok=tok)], splits=splits)
    print(dsets.train[0][0].text)