11 Data Munging with fastai’s Mid-Level API

We have seen what Tokenizer and Numericalize do to a collection of texts, and how they’re used inside the data block API, which handles those transforms for us directly using the TextBlock. But what if we want to only apply one of those transforms, either to see intermediate results or because we have already tokenized texts? More generally, what can we do when the data block API is not flexible enough to accommodate our particular use case? For this, we need to use fastai’s mid-level API for processing data. The data block API is built on top of that layer, so it will allow you to do everything the data block API does, and much much more.

11.1 Going Deeper into fastai’s Layered API

The fastai library is built on a layered API. In the very top layer there are applications that allow us to train a model in five lines of codes, as we saw in Chapter 1. In the case of creating DataLoaders for a text classifier, for instance, we used the line:

from fastai.text.all import *

dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')

The factory method TextDataLoaders.from_folder is very convenient when your data is arranged the exact same way as the IMDb dataset, but in practice, that often won’t be the case. The data block API offers more flexibility.

This is just a preview of this chapter. The rest of this chapter is not available here, but you read the source notebook which has the same content (but with less nice formatting).