Get, split, and label

For most data source creation we need functions to get a list of items, split them into training and validation sets, and label them. fastai provides functions that make each of these steps easy (especially when combined with fastai.data.block).

Get

First we'll look at functions that get a list of items (generally file names).

We'll use tiny MNIST (a subset of MNIST with just two classes, 7s and 3s) for our examples/tests throughout this page.

path = untar_data(URLs.MNIST_TINY)
(path/'train').ls()

(#2) [Path('/home/jhoward/.fastai/data/mnist_tiny/train/7'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/3')]

get_files is the most general way to grab a bunch of file names from disk. If you pass extensions (including the leading .), the returned file names are filtered by that list. Only files directly in path are included unless you pass recurse, in which case all child folders are also searched recursively. folders is an optional list of directories to limit the search to.

t3 = get_files(path/'train'/'3', extensions='.png', recurse=False)
t7 = get_files(path/'train'/'7', extensions='.png', recurse=False)
t  = get_files(path/'train', extensions='.png', recurse=True)
test_eq(len(t), len(t3)+len(t7))
test_eq(len(get_files(path/'train'/'3', extensions='.jpg', recurse=False)),0)
test_eq(len(t), len(get_files(path, extensions='.png', recurse=True, folders='train')))
t

(#709) [Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/723.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/7446.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/8566.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/9200.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/7085.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/8665.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/7348.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/9283.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/9854.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/9548.png')...]
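To make the filtering and recursion rules concrete, here is a minimal standard-library sketch of the behavior described above. It is not fastai's implementation: list_files is a hypothetical name, it takes extensions as a list, and it omits the folders argument.

```python
import tempfile
from pathlib import Path

def list_files(path, extensions=None, recurse=False):
    "Hypothetical stand-in for the behavior described above; not fastai's code."
    path = Path(path)
    # rglob walks child folders too; glob('*') only looks directly in `path`
    candidates = path.rglob('*') if recurse else path.glob('*')
    exts = None if extensions is None else {e.lower() for e in extensions}
    return sorted(p for p in candidates
                  if p.is_file() and (exts is None or p.suffix.lower() in exts))

# Build a tiny throwaway folder tree that mimics train/3 and train/7
with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    for rel in ['3/a.png', '3/b.png', '7/c.png', '7/notes.txt']:
        f = root/rel
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()
    n_flat = len(list_files(root/'3', extensions=['.png']))              # files directly in 3/
    n_deep = len(list_files(root, extensions=['.png'], recurse=True))    # all .png files below root
    n_top  = len(list_files(root, extensions=['.png']))                  # nothing sits directly in root
```

Without recurse, the top-level call finds nothing because every file lives one level down, which is the same reason the tests above pass recurse=True when starting from path/'train'.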

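It's often useful to be able to create functions with customized behavior. fastai.data generally uses functions named as CamelCase verbs ending in er to create these functions. FileGetter is a simple example of such a function creator: it returns a function that takes a path, and defaults fixed at creation time (such as recurse) can still be overridden when that function is called.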

fpng = FileGetter(extensions='.png', recurse=False)
test_eq(len(t7), len(fpng(path/'train'/'7')))
test_eq(len(t), len(fpng(path/'train', recurse=True)))
fpng_r = FileGetter(extensions='.png', recurse=True)
test_eq(len(t), len(fpng_r(path/'train')))
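The function-creator pattern can be sketched in a few lines of plain Python. The names select_by_suffix and SuffixSelector are hypothetical and exist only for this illustration; the point is that the creator fixes defaults and returns a function that still accepts overrides, much like the fpng(path/'train', recurse=True) call above.

```python
def select_by_suffix(names, suffix='.png', lowercase=False):
    "Hypothetical helper: keep the names that end with `suffix`."
    if lowercase: names = [n.lower() for n in names]
    return [n for n in names if n.endswith(suffix)]

def SuffixSelector(suffix='.png', lowercase=False):
    "Hypothetical 'er'-style creator: fixes defaults, returns a function of `names`."
    def _inner(names, **overrides):
        # Defaults chosen at creation time can still be overridden per call
        kwargs = {'suffix': suffix, 'lowercase': lowercase, **overrides}
        return select_by_suffix(names, **kwargs)
    return _inner

names = ['a.png', 'B.PNG', 'c.txt']
pngs = SuffixSelector('.png')
strict = pngs(names)                     # uses the defaults fixed at creation
relaxed = pngs(names, lowercase=True)    # overrides one default for this call
```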

get_image_files is simply get_files called with a list of standard image extensions.

test_eq(len(t), len(get_image_files(path, recurse=True, folders='train')))

ImageGetter is the same as FileGetter, but for image extensions.

test_eq(len(get_files(path/'train', extensions='.png', recurse=True, folders='3')),
        len(ImageGetter(   'train',                    recurse=True, folders='3')(path)))

test_eq(ItemGetter(1)((1,2,3)),  2)
test_eq(ItemGetter(1)(L(1,2,3)), 2)
test_eq(ItemGetter(1)([1,2,3]),  2)
test_eq(ItemGetter(1)(np.array([1,2,3])),  2)

test_eq(AttrGetter('shape')(torch.randn([4,5])), [4,5])
test_eq(AttrGetter('shape', [0])([4,5]), [0])

Split

The next set of functions is used to split data into training and validation sets. The functions they create return two lists: the indices (or masks) of the training set and those of the validation set.

src = list(range(30))
f = RandomSplitter(seed=42)
trn,val = f(src)
assert 0<len(trn)<len(src)
assert all(o not in val for o in trn)
test_eq(len(trn), len(src)-len(val))
# test random seed consistency
test_eq(f(src)[0], trn)
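The idea behind a seeded random split can be sketched with the standard library alone. This is an illustration under stated assumptions, not fastai's RandomSplitter: random_split and its valid_pct parameter are names chosen for the sketch, and the resulting indices will differ from fastai's for the same seed.

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    "Hypothetical sketch: return (train_idxs, valid_idxs) for `n_items` items."
    idxs = list(range(n_items))
    random.Random(seed).shuffle(idxs)   # private RNG, so global random state is untouched
    cut = int(valid_pct * n_items)
    return idxs[cut:], idxs[:cut]

trn, val = random_split(30, valid_pct=0.2, seed=42)
trn2, val2 = random_split(30, valid_pct=0.2, seed=42)   # same seed, same split
```

Using a dedicated random.Random(seed) instance is one way to get the reproducibility that the seed consistency test above checks for.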

TrainTestSplitter uses scikit-learn's train_test_split. This allows items to be split in a stratified fashion (uniformly according to the labels distribution).

src = list(range(30))
labels = [0] * 20 + [1] * 10
test_size = 0.2

f = TrainTestSplitter(test_size=test_size, random_state=42, stratify=labels)
trn,val = f(src)
assert 0<len(trn)<len(src)
assert all(o not in val for o in trn)
test_eq(len(trn), len(src)-len(val))

# test random seed consistency
test_eq(f(src)[0], trn)

# test labels distribution consistency
# the validation set should contain a test_size fraction of the zeros and of the ones
# (indices 0-19 are labelled 0 and indices 20-29 are labelled 1)
test_eq(len([t for t in val if t < 20]) / 20, test_size)
test_eq(len([t for t in val if t >= 20]) / 10, test_size)
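Stratification itself is simple to sketch: split each label group separately, then merge the pieces. The function below is a hypothetical standard-library illustration, not scikit-learn's algorithm, and its rounding of per-class counts may differ from what train_test_split does.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=None):
    "Hypothetical sketch: hold out `test_size` of every label group."
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, lbl in enumerate(labels):
        by_label[lbl].append(i)
    trn, val = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        cut = round(test_size * len(idxs))   # per-class validation count
        val += idxs[:cut]
        trn += idxs[cut:]
    return sorted(trn), sorted(val)

labels = [0] * 20 + [1] * 10
trn, val = stratified_split(labels, test_size=0.2, seed=42)
val_zeros = sum(1 for i in val if labels[i] == 0)
val_ones = sum(1 for i in val if labels[i] == 1)
```

With 20 zeros and 10 ones, a 0.2 validation fraction holds out 4 zeros and 2 ones, so both classes keep the same 2:1 ratio in the training and validation sets.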

items = list(range(10))
splitter = IndexSplitter([3,7,9])
test_eq(splitter(items),[[0,1,2,4,5,6,8],[3,7,9]])

fnames = [path/'train/3/9932.png', path/'valid/7/7189.png', 
          path/'valid/7/7320.png', path/'train/7/9833.png',  
          path/'train/3/7666.png', path/'valid/3/925.png',
          path/'train/7/724.png', path/'valid/3/93055.png']
splitter = GrandparentSplitter()
test_eq(splitter(fnames),[[0,3,4,6],[1,2,5,7]])

fnames2 = fnames + [path/'test/3/4256.png', path/'test/7/2345.png', path/'valid/7/6467.png']
splitter = GrandparentSplitter(train_name=('train', 'valid'), valid_name='test')
test_eq(splitter(fnames2),[[0,3,4,6,1,2,5,7,10],[8,9]])

splitter = FuncSplitter(lambda o: Path(o).parent.parent.name == 'valid')
test_eq(splitter(fnames),[[0,3,4,6],[1,2,5,7]])
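The two splitters above give the same result on fnames, but they treat files outside the named folders differently. The sketch below uses hypothetical helpers (grandparent_split, func_split) written against plain path strings to show that difference; it is an illustration of the idea rather than fastai's code.

```python
from pathlib import PurePosixPath

def grandparent_split(fnames, train_name='train', valid_name='valid'):
    "Hypothetical sketch: pick indices by the name of the folder two levels up."
    def _idxs(name):
        return [i for i, f in enumerate(fnames)
                if PurePosixPath(f).parent.parent.name == name]
    return _idxs(train_name), _idxs(valid_name)

def func_split(fnames, is_valid):
    "Hypothetical sketch: anything `is_valid` rejects goes to training."
    val = [i for i, f in enumerate(fnames) if is_valid(f)]
    trn = [i for i, f in enumerate(fnames) if not is_valid(f)]
    return trn, val

files = ['train/3/9932.png', 'valid/7/7189.png', 'valid/7/7320.png',
         'train/7/9833.png', 'test/3/4256.png']
gp = grandparent_split(files)
fs = func_split(files, lambda f: PurePosixPath(f).parent.parent.name == 'valid')
```

In this sketch the file under test/ is left out of both lists by the grandparent approach, while the function-based approach puts it in the training set because the predicate only identifies validation items.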

items = list(range(6))
splitter = MaskSplitter([True,False,False,True,False,True])
test_eq(splitter(items),[[1,2,4],[0,3,5]])

with tempfile.TemporaryDirectory() as d:
    fname = Path(d)/'valid.txt'
    fname.write_text('\n'.join([Path(fnames[i]).name for i in [1,3,4]]))
    splitter = FileSplitter(fname)
    test_eq(splitter(fnames),[[0,2,5,6,7],[1,3,4]])

df = pd.DataFrame({'a': [0,1,2,3,4], 'b': [True,False,True,True,False]})
splits = ColSplitter('b')(df)
test_eq(splits, [[1,4], [0,2,3]])
#Works with strings or index
splits = ColSplitter(1)(df)
test_eq(splits, [[1,4], [0,2,3]])

items = list(range(100))
valid_idx = list(np.arange(70,100))
splits = RandomSubsetSplitter(0.3, 0.1)(items)
test_eq(len(splits[0]), 30)
test_eq(len(splits[1]), 10)

Label

The final set of functions is used to label a single item of data.

Note that parent_label doesn't have anything to customize, so it doesn't return a function; you can just use it directly.

test_eq(parent_label(fnames[0]), '3')
test_eq(parent_label("fastai_dev/dev/data/mnist_tiny/train/3/9932.png"), '3')
[parent_label(o) for o in fnames]

['3', '7', '7', '7', '3', '3', '7', '3']

RegexLabeller is a very flexible function since it handles any regex search of the stringified item. Pass match=True to use re.match (i.e. check only start of string), or re.search otherwise (default).

For instance, here's an example that replicates the previous parent_label results.

f = RegexLabeller(fr'{os.path.sep}(\d){os.path.sep}')
test_eq(f(fnames[0]), '3')
[f(o) for o in fnames]

['3', '7', '7', '7', '3', '3', '7', '3']

f = RegexLabeller(r'(\d*)', match=True)
test_eq(f(fnames[0].name), '9932')
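The difference between the two modes comes straight from Python's re module: re.search may match anywhere in the string, while re.match only succeeds at the start. A minimal sketch, where regex_label is a hypothetical helper and not fastai's RegexLabeller:

```python
import re

def regex_label(pat, item, match=False):
    "Hypothetical sketch: return group 1 of `pat` applied to str(item)."
    finder = re.match if match else re.search
    res = finder(pat, str(item))
    if res is None:
        raise ValueError(f'No match for {pat!r} in {item!r}')
    return res.group(1)

fname = 'mnist_tiny/train/3/9932.png'
from_search = regex_label(r'/(\d)/', fname)                   # pattern may match anywhere
from_match = regex_label(r'(\d+)', '9932.png', match=True)    # pattern must match at the start
try:
    regex_label(r'/(\d)/', fname, match=True)   # fails: the string does not start with '/'
    match_failed = False
except ValueError:
    match_failed = True
```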

For ColReader, cols can be a list of column names or a list of indices (or a mix of both). If label_delim is passed, the result is split using it.

df = pd.DataFrame({'a': 'a b c d'.split(), 'b': ['1 2', '0', '', '1 2 3']})
f = ColReader('a', pref='0', suff='1')
test_eq([f(o) for o in df.itertuples()], '0a1 0b1 0c1 0d1'.split())

f = ColReader('b', label_delim=' ')
test_eq([f(o) for o in df.itertuples()], [['1', '2'], ['0'], [], ['1', '2', '3']])

df['a1'] = df['a']
f = ColReader(['a', 'a1'], pref='0', suff='1')
test_eq([f(o) for o in df.itertuples()], [L('0a1', '0a1'), L('0b1', '0b1'), L('0c1', '0c1'), L('0d1', '0d1')])

df = pd.DataFrame({'a': [L(0,1), L(2,3,4), L(5,6,7)]})
f = ColReader('a')
test_eq([f(o) for o in df.itertuples()], [L(0,1), L(2,3,4), L(5,6,7)])

df['name'] = df['a']
f = ColReader('name')
test_eq([f(df.iloc[0,:])], [L(0,1)])
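The effect of pref, suff and label_delim in the tests above can be mimicked on plain dictionaries. read_col is a hypothetical single-column helper written for this sketch; it assumes string cells and does not cover the multi-column or index-based cases.

```python
def read_col(row, col, pref='', suff='', label_delim=None):
    "Hypothetical sketch for a dict-like row; not fastai's ColReader."
    value = row[col]
    if label_delim is not None:
        # An empty cell yields no labels rather than ['']
        return value.split(label_delim) if value else []
    return f'{pref}{value}{suff}'

rows = [{'a': 'a', 'b': '1 2'}, {'a': 'c', 'b': ''}]
names = [read_col(r, 'a', pref='0', suff='1') for r in rows]
labels = [read_col(r, 'b', label_delim=' ') for r in rows]
```

Treating the empty string specially matters for multi-label data: ''.split(' ') would return [''], a spurious label, whereas the tests above expect an empty list.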

t = CategoryMap([4,2,3,4])
test_eq(t, [2,3,4])
test_eq(t.o2i, {2:0,3:1,4:2})
test_eq(t.map_objs([2,3]), [0,1])
test_eq(t.map_ids([0,1]), [2,3])
test_fail(lambda: t.o2i['unseen label'])

t = CategoryMap([4,2,3,4], add_na=True)
test_eq(t, ['#na#',2,3,4])
test_eq(t.o2i, {'#na#':0,2:1,3:2,4:3})

t = CategoryMap(pd.Series([4,2,3,4]), sort=False)
test_eq(t, [4,2,3])
test_eq(t.o2i, {4:0,2:1,3:2})

col = pd.Series(pd.Categorical(['M','H','L','M'], categories=['H','M','L'], ordered=True))
t = CategoryMap(col)
test_eq(t, ['H','M','L'])
test_eq(t.o2i, {'H':0,'M':1,'L':2})

col = pd.Series(pd.Categorical(['M','H','M'], categories=['H','M','L'], ordered=True))
t = CategoryMap(col, strict=True)
test_eq(t, ['H','M'])
test_eq(t.o2i, {'H':0,'M':1})
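At its core, a category map is a vocabulary plus the reverse lookup from object to index. The sketch below reproduces the first three cases tested above with plain Python; build_vocab is a hypothetical name, and the pandas categorical handling (including strict) is left out.

```python
def build_vocab(labels, sort=True, add_na=False):
    "Hypothetical sketch: return (vocab, o2i) for a list of labels."
    items = list(dict.fromkeys(labels))   # unique values, in order of first appearance
    if sort: items = sorted(items)
    if add_na: items = ['#na#'] + items   # reserve index 0 for missing values
    o2i = {o: i for i, o in enumerate(items)}
    return items, o2i

vocab, o2i = build_vocab([4, 2, 3, 4])
vocab_na, o2i_na = build_vocab([4, 2, 3, 4], add_na=True)
vocab_raw, _ = build_vocab([4, 2, 3, 4], sort=False)
ids = [o2i[o] for o in [2, 3]]       # objects -> indices
objs = [vocab[i] for i in [0, 1]]    # indices -> objects
```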

cat = Categorize()
tds = Datasets(['cat', 'dog', 'cat'], tfms=[cat])
test_eq(cat.vocab, ['cat', 'dog'])
test_eq(cat('cat'), 0)
test_eq(cat.decode(1), 'dog')
test_stdout(lambda: show_at(tds,2), 'cat')

cat = Categorize(add_na=True)
tds = Datasets(['cat', 'dog', 'cat'], tfms=[cat])
test_eq(cat.vocab, ['#na#', 'cat', 'dog'])
test_eq(cat('cat'), 1)
test_eq(cat.decode(2), 'dog')
test_stdout(lambda: show_at(tds,2), 'cat')

cat = Categorize(vocab=['dog', 'cat'], sort=False, add_na=True)
tds = Datasets(['cat', 'dog', 'cat'], tfms=[cat])
test_eq(cat.vocab, ['#na#', 'dog', 'cat'])
test_eq(cat('dog'), 1)
test_eq(cat.decode(2), 'cat')
test_stdout(lambda: show_at(tds,2), 'cat')

cat = MultiCategorize()
tds = Datasets([['b', 'c'], ['a'], ['a', 'c'], []], tfms=[cat])
test_eq(tds[3][0], TensorMultiCategory([]))
test_eq(cat.vocab, ['a', 'b', 'c'])
test_eq(cat(['a', 'c']), tensor([0,2]))
test_eq(cat([]), tensor([]))
test_eq(cat.decode([1]), ['b'])
test_eq(cat.decode([0,2]), ['a', 'c'])
test_stdout(lambda: show_at(tds,2), 'a;c')

OneHotEncode works in conjunction with MultiCategorize, or on its own if you have one-hot encoded targets (pass a vocab for decoding and do_encode=False in this case).

_tfm = OneHotEncode(c=3)
test_eq(_tfm([0,2]), tensor([1.,0,1]))
test_eq(_tfm.decode(tensor([0,1,1])), [1,2])

tds = Datasets([['b', 'c'], ['a'], ['a', 'c'], []], [[MultiCategorize(), OneHotEncode()]])
test_eq(tds[1], [tensor([1.,0,0])])
test_eq(tds[3], [tensor([0.,0,0])])
test_eq(tds.decode([tensor([False, True, True])]), [['b','c']])
test_eq(type(tds[1][0]), TensorMultiCategory)
test_stdout(lambda: show_at(tds,2), 'a;c')
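The encode and decode steps tested above amount to very little arithmetic. Here is a list-based sketch with hypothetical helpers one_hot and one_hot_decode; fastai works on tensors and returns TensorMultiCategory, which this illustration does not attempt to reproduce.

```python
def one_hot(idxs, c):
    "Hypothetical sketch: vector of length `c` with 1.0 at each index in `idxs`."
    vec = [0.0] * c
    for i in idxs:
        vec[i] = 1.0
    return vec

def one_hot_decode(vec):
    "Hypothetical sketch: indices of the non-zero (or True) entries."
    return [i for i, v in enumerate(vec) if v]

vocab = ['a', 'b', 'c']
o2i = {o: i for i, o in enumerate(vocab)}
encoded = one_hot([o2i[o] for o in ['a', 'c']], c=len(vocab))   # labels -> indices -> one-hot
empty = one_hot([], c=len(vocab))                               # no labels at all
decoded = [vocab[i] for i in one_hot_decode([0, 1, 1])]         # one-hot -> indices -> labels
```

The two-stage pipeline (labels to indices, then indices to a fixed-length vector) mirrors chaining MultiCategorize with OneHotEncode in the Datasets example above.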

_tfm = EncodedMultiCategorize(vocab=['a', 'b', 'c'])
test_eq(_tfm([1,0,1]), tensor([1., 0., 1.]))
test_eq(type(_tfm([1,0,1])), TensorMultiCategory)
test_eq(_tfm.decode(tensor([False, True, True])), ['b','c'])

_tfm

EncodedMultiCategorize -- {'vocab': (#3) ['a','b','c'], 'add_na': False}:
encodes: (object,object) -> encodes
(object,object) -> encodes
decodes: (object,object) -> decodes
(object,object) -> decodes

_tfm = RegressionSetup()
dsets = Datasets([0, 1, 2], RegressionSetup)
test_eq(dsets.c, 1)
test_eq_type(dsets[0], (tensor(0.),))

dsets = Datasets([[0, 1, 2], [3,4,5]], RegressionSetup)
test_eq(dsets.c, 3)
test_eq_type(dsets[0], (tensor([0.,1.,2.]),))

End-to-end dataset example with MNIST

Let's show how to use these functions to load the MNIST dataset into a Datasets. First we grab all the images.

path = untar_data(URLs.MNIST_TINY)
items = get_image_files(path)

Then we split between train and validation depending on the folder.

splitter = GrandparentSplitter()
splits = splitter(items)
train,valid = (items[i] for i in splits)
train[:3],valid[:3]

((#3) [Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/723.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/7446.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/train/7/8566.png')],
 (#3) [Path('/home/jhoward/.fastai/data/mnist_tiny/valid/7/946.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/valid/7/9608.png'),Path('/home/jhoward/.fastai/data/mnist_tiny/valid/7/825.png')])

Our inputs are images that we open and convert to tensors. Our targets are categories, labeled according to the parent directory.

from PIL import Image

def open_img(fn:Path): return Image.open(fn).copy()
def img2tensor(im:Image.Image): return TensorImage(array(im)[None])

tfms = [[open_img, img2tensor],
        [parent_label, Categorize()]]
train_ds = Datasets(train, tfms)

x,y = train_ds[3]
xd,yd = decode_at(train_ds,3)
test_eq(parent_label(train[3]),yd)
test_eq(array(Image.open(train[3])),xd[0].numpy())

ax = show_at(train_ds, 3, cmap="Greys", figsize=(1,1))

assert ax.title.get_text() in ('3','7')
test_fig_exists(ax)

t = (TensorImage(tensor(1)),tensor(2).long(),TensorMask(tensor(3)))
tfm = IntToFloatTensor()
ft = tfm(t)
test_eq(ft, [1./255, 2, 3])
test_eq(type(ft[0]), TensorImage)
test_eq(type(ft[2]), TensorMask)
test_eq(ft[0].type(),'torch.FloatTensor')
test_eq(ft[1].type(),'torch.LongTensor')
test_eq(ft[2].type(),'torch.LongTensor')

mean,std = [0.5]*3,[0.5]*3
mean,std = broadcast_vec(1, 4, mean, std)
batch_tfms = [IntToFloatTensor(), Normalize.from_stats(mean,std)]
tdl = TfmdDL(train_ds, after_batch=batch_tfms, bs=4, device=default_device())

x,y  = tdl.one_batch()
xd,yd = tdl.decode((x,y))

test_eq(x.type(), 'torch.cuda.FloatTensor' if default_device().type=='cuda' else 'torch.FloatTensor')
test_eq(xd.type(), 'torch.LongTensor')
test_eq(type(x), TensorImage)
test_eq(type(y), TensorCategory)
assert x.mean()<0.0
assert x.std()>0.5
assert 0<xd.float().mean()/255.<1
assert 0<xd.float().std()/255.<0.5
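The assertions on x.mean() and the decoded values follow from simple per-pixel arithmetic: divide by 255, then subtract the mean and divide by the standard deviation. The scalar sketch below assumes the same mean and std of 0.5 used above; the helper names are hypothetical and the real transforms operate on whole tensors.

```python
def int_to_float(px, div=255.0):
    "Scale an integer pixel value into the range [0, 1]."
    return px / div

def normalize(x, mean=0.5, std=0.5):
    "Map [0, 1] to [-1, 1] when mean and std are both 0.5."
    return (x - mean) / std

def denormalize(x, mean=0.5, std=0.5):
    "Invert `normalize`."
    return x * std + mean

pixels = [0, 0, 0, 255]                                   # a mostly dark image, like an MNIST digit
normed = [normalize(int_to_float(p)) for p in pixels]
mean_normed = sum(normed) / len(normed)
restored = [round(denormalize(x) * 255) for x in normed]  # back to integer pixel values
```

Because most pixels in an MNIST image are background (0), most normalized values sit at -1, which is why the normalized batch has a negative mean while the decoded batch is back on the 0 to 255 scale.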

from fastai.vision.core import *

tdl.show_batch((x,y))

Helper functions for processing data and basic transforms

Get, split, and label

Get

`get_files`[source]

`FileGetter`[source]

`get_image_files`[source]

`ImageGetter`[source]

`get_text_files`[source]

`class` `ItemGetter`[source]

`class` `AttrGetter`[source]

Split

`RandomSplitter`[source]

`TrainTestSplitter`[source]

`IndexSplitter`[source]

`GrandparentSplitter`[source]

`FuncSplitter`[source]

`MaskSplitter`[source]

`FileSplitter`[source]

`ColSplitter`[source]

`RandomSubsetSplitter`[source]

Label

`parent_label`[source]

`class` `RegexLabeller`[source]

`class` `ColReader`[source]

`class` `CategoryMap`[source]

`class` `Categorize`[source]

`class` `Category`[source]

`class` `MultiCategorize`[source]

`class` `MultiCategory`[source]

`class` `OneHotEncode`[source]

`class` `EncodedMultiCategorize`[source]

`class` `RegressionSetup`[source]

`get_c`[source]

End-to-end dataset example with MNIST

`class` `ToTensor`[source]

`class` `IntToFloatTensor`[source]

`broadcast_vec`[source]

`class` `Normalize`[source]
