curiosity-notes - Understanding Fastai’s DataBlock API

Introduction

Fastai’s DataBlock API is a flexible and easy way to get and transform your data into something ready to be fed to a model. The documentation and tutorial give some good examples of how to use this API. However, I still find it unclear how each argument in Datablock influences the different steps in the data transformation process.

In this post, I will try to visualize the data transformation process and explain the different steps the DataBlock API can hook into. Hopefully, this will make it easier for people to customize this process using this API.

DataBlock API as A Blueprint for Data Transformation

The data transformation process constructed by Fastai is depicted as follows, where the colored steps can be configured through the DataBlock API:

How different parts of DataBlock API fit in the data loading process.

get_item: This is the first optional function that takes the source as input, and returns items to be processed by the Datasets object in Fastai. If it’s not specified, the source will be passed directly to Datasets.
splitter: The splitter is a function that returns two lists of indices, one for the training data set and one for the validation data set so that Datasets and Dataloaders know how to split the data into train/valid.
getters (or get_x, get_y): getters are a list of functions that is applied to each item passed to them to generate corresponding x and y. You can also specify get_x and/or get_y specifically. They will overwerite getters as get_x + get_y. Note, the number of functions in get_x and get_y should be consistent with the number of x and y specified in blocks and n_inp (see below).
type_tfms: These are collected from your TranformBlocks you specified in blocks (explained below). They will transform your x and y respectively, according to the blocks you used for x and y.
item_tfms: A list of Transforms to be combined with the item_tfms in your TransformBlocks for x and y, respectively. These are applied to each item in the Dataloaders object before batches are created. For example, we can resize each image to the same size at this step.
batch_tfms: A list of Transforms to be combined with the batch_tfms in your TransformBlocks for x and y specified in blocks. These are applied after a batch is created.

(Note that the Transforms need to be different based on what type of object it’s applied on, whether it’s the training or validation set. This is achieved through type dispatch and the split_idx variable in Transform respectively. Please see the next section Deeper Dive for more details.)

An example usage of the DataBlock API:

dblock = DataBlock(
    blocks=(ImageBlock, BBoxBlock, BBoxLblBlock), 
    n_inp=1, 
    get_items=get_image_files,
    splitter=RandomSplitter(),
    getters=None, 
    get_x=None,
    get_y=[lambda o: img2bbox[o.name][0], lambda o: img2bbox[o.name][1]], 
    item_tfms=Resize(128),
    batch_tfms=aug_transforms()
)

dls = dblock.dataloaders(path_to_image_folder)

blocks: These can be TransformBlock objects that simultaneously specify the types of your x and y, and their associated type_tfms, item_tfms and batch_tfms to be inserted in the data transformation process for each type. The objects here do not HAVE TO be TransformBlock though. If they are not, they do not contribute any type_tfms, item_tfms and batch_tfms to the data transformation process.

A TransformBlock is simply defined as:

class TransformBlock():
    "A basic wrapper that links defaults transforms for the data block API"
    def __init__(self, type_tfms=None, item_tfms=None, batch_tfms=None, dl_type=None, dls_kwargs=None):
        self.type_tfms  =            L(type_tfms)
        self.item_tfms  = ToTensor + L(item_tfms)
        self.batch_tfms =            L(batch_tfms)
        self.dl_type,self.dls_kwargs = dl_type,({} if dls_kwargs is None else dls_kwargs)

n_inp: This specifies the number of inputs (x) so that the API knows the first n_inp TransformBlock are for x and the rest for y. In this case, we have one x and two ys. The model will need to predict for both the bounding box and the label.
item_tfms, batch_ftms: additional Transforms to be combined with those specified in the TransformBlocks in blocks.
get_items, splitter, getters, get_x, get_y have one-to-one correspondance to those explained in the blueprint.

Now, we can do a quick walkthrough of the example dblock:

path_to_image_folder is fed into get_image_files to generate a list of paths to all images.
RandomSplitter() takes this list of paths and generates randomized lists of indices for train and valid data sets. These indices will be held in Datasets and Dataloaders to generate the final train/valid data.
Because get_x is not specified, these paths are passed through directly as our x at this point.
The same list of paths is passed to get_y which has two functions, one looking up the file name to find its corresponding bounding box data and the other the corresponding label. Because n_inp=1, we get two ys.
The list of paths in x is passed through type_tfms specified in ImageBlock to open the images. type_tfms in BBloxBlock and BBoxLablBlock are applied to the bounding box and label data respectively.
The item_tfms is appended to any item_tfms specified in ImageBlock, BBloxBlock, BBoxLablBlock respectively and then applied to our x and y to resize the images and the bounding box while leaving the label unchanged. (Again, a Transform object can specify different transformations to be applied to different object types.)
Finally, after a batch is created we apply aug_transforms() to generate a list of fliped, rotated, zoomed, warped, lighting-adjusted images as augmentations of our data for training.

Deeper Dive

Transform and Pipeline

The Transform object can have multiple encodes/decodes methods defined for different argument types. The appropriate encodes/decodes method will be chosen based on the first argument’s type. This is achieved through type dispatch. For this more advanced topic, please watch Jeremy Howard’s code walk-through video on how the Transform class is built:

You can find many examples in the Transform documentation. Here I only give the most basic demostrations of Transform:

class Times2(Transform):
    def encodes(self, x):
        return x*2
    
    def decodes(self, x):
        return x/2

@Transform
def plus3(x:int): return x+3

times2 = Times2()
# tests
test_eq(times2(1), 2)
test_eq(plus3(1), 4)
test_eq(plus3(1.0), 1)

As can be seen, the plus3 function is only applied to integer types.

A Pipeline simply applies a list of Transforms to an item. You can optionally specify split_idx to turn on only transforms with the same split_idx. Please see the Pipeline doc for more details.

p = Pipeline([times2,plus3])
# tests
test_eq(p(1), 5)
test_eq(p(1.0), 2)

TfmdLists and Datasets

TfmdLists applies a list of Transforms or a Pipeline to a list of items, with the option to split the list into training data set vs. validation data set by specifying splits as lists of indices:

tf = TfmdLists([1, 2, 5], p, splits=[[0],[1,2]])
tf.train[:]
tf.valid[:]

(#1) [5]

(#2) [7,13]

Datasets further expands the functionality so that you can specifiy multiple Pipelines or lists of Transforms to a list of items, returning a list of tuples with each item in the tuple corresponding to the result of one Pipeline.

ds = Datasets([1,2,5,-9,11,15], [p, [plus3]], splits=[[0, 1, 2, 3],[4, 5]])
ds.train
ds.valid

(#4) [(5, 4),(7, 5),(13, 8),(-15, -6)]

(#2) [(25, 14),(33, 18)]

From Datasets, we can then get the dataloaders:

dls = ds.dataloaders(bs=2, shuffle=False)
dls.one_batch()

(tensor([5, 7]), tensor([4, 5]))

The same could be achieve by using the DataBlock API directly:

dblock = DataBlock(blocks = (TransformBlock(type_tfms=[times2, plus3]), 
                             TransformBlock(type_tfms=[plus3])),
                   splitter=lambda x: [[0, 1, 2, 3, 4], [5]]
                  )

dls = dblock.dataloaders([1,2,5,-9,11,15], bs=2, shuffle=False)
dls.one_batch()

(tensor([5, 7]), tensor([4, 5]))