Introduction
Fastai's DataBlock API is a flexible and easy way to get and transform your data into something ready to be fed to a model. The documentation and tutorials give some good examples of how to use this API. However, I still found it unclear how each argument in `DataBlock` influences the different steps of the data transformation process.
In this post, I will try to visualize the data transformation process and explain the different steps the DataBlock API can hook into. Hopefully, this will make it easier for people to customize this process using this API.
DataBlock API as a Blueprint for Data Transformation
The data transformation process constructed by Fastai is depicted as follows, where the colored steps can be configured through the DataBlock API:
- `get_items`: This is the first optional function. It takes the source as input and returns the items to be processed by the `Datasets` object in fastai. If it is not specified, the source is passed directly to `Datasets`.
- `splitter`: The splitter is a function that returns two lists of indices, one for the training set and one for the validation set, so that `Datasets` and `DataLoaders` know how to split the data into train/valid.
- `getters` (or `get_x`, `get_y`): getters are a list of functions applied to each item passed to them to generate the corresponding x and y. You can also specify `get_x` and/or `get_y` separately; they then override `getters` as `get_x + get_y`. Note that the number of functions in `get_x` and `get_y` should be consistent with the number of x and y specified in `blocks` and `n_inp` (see below).
- `type_tfms`: These are collected from the `TransformBlock`s you specified in `blocks` (explained below). They transform your x and y respectively, according to the blocks you used for each.
- `item_tfms`: A list of `Transform`s combined with the `item_tfms` in your `TransformBlock`s for x and y, respectively. These are applied to each item in the `DataLoaders` object before batches are created. For example, we can resize each image to the same size at this step.
- `batch_tfms`: A list of `Transform`s combined with the `batch_tfms` in your `TransformBlock`s for the x and y specified in `blocks`. These are applied after a batch is created.

(Note that a `Transform` may need to behave differently depending on the type of object it is applied to, and on whether it is processing the training or the validation set. These are achieved through type dispatch and the `split_idx` variable in `Transform`, respectively. Please see the Deeper Dive section below for more details.)
An example usage of the `DataBlock` API:

```python
dblock = DataBlock(
    blocks=(ImageBlock, BBoxBlock, BBoxLblBlock),
    n_inp=1,
    get_items=get_image_files,
    splitter=RandomSplitter(),
    getters=None,
    get_x=None,
    get_y=[lambda o: img2bbox[o.name][0], lambda o: img2bbox[o.name][1]],
    item_tfms=Resize(128),
    batch_tfms=aug_transforms()
)
dls = dblock.dataloaders(path_to_image_folder)
```
- `blocks`: These can be `TransformBlock` objects that simultaneously specify the types of your x and y, and their associated `type_tfms`, `item_tfms` and `batch_tfms` to be inserted into the data transformation process for each type. The objects here do not HAVE TO be `TransformBlock`s, though; if they are not, they contribute no `type_tfms`, `item_tfms` or `batch_tfms` to the process. A `TransformBlock` is simply defined as:

```python
class TransformBlock():
    "A basic wrapper that links defaults transforms for the data block API"
    def __init__(self, type_tfms=None, item_tfms=None, batch_tfms=None, dl_type=None, dls_kwargs=None):
        self.type_tfms  = L(type_tfms)
        self.item_tfms  = ToTensor + L(item_tfms)
        self.batch_tfms = L(batch_tfms)
        self.dl_type,self.dls_kwargs = dl_type,({} if dls_kwargs is None else dls_kwargs)
```

- `n_inp`: This specifies the number of inputs (x) so that the API knows the first `n_inp` `TransformBlock`s are for x and the rest are for y. In this case, we have one x and two ys: the model will need to predict both the bounding box and the label.
- `item_tfms`, `batch_tfms`: Additional `Transform`s to be combined with those specified in the `TransformBlock`s in `blocks`.
- `get_items`, `splitter`, `getters`, `get_x`, `get_y`: These have a one-to-one correspondence to those explained in the blueprint.
Now, we can do a quick walkthrough of the example `dblock`:

1. `path_to_image_folder` is fed into `get_image_files` to generate a list of paths to all images.
2. `RandomSplitter()` takes this list of paths and generates randomized lists of indices for the training and validation sets. These indices are held in `Datasets` and `DataLoaders` to generate the final train/valid data.
3. Because `get_x` is not specified, the paths are passed through directly as our x at this point.
4. The same list of paths is passed to `get_y`, which has two functions: one looks up the file name to find its corresponding bounding box data, and the other finds the corresponding label. Because `n_inp=1`, we get two ys.
5. The list of paths in x is passed through the `type_tfms` specified in `ImageBlock` to open the images. The `type_tfms` in `BBoxBlock` and `BBoxLblBlock` are applied to the bounding box and label data, respectively.
6. The `item_tfms` argument is appended to any `item_tfms` specified in `ImageBlock`, `BBoxBlock` and `BBoxLblBlock` respectively, and then applied to our x and y, resizing the images and bounding boxes while leaving the labels unchanged. (Again, a `Transform` object can specify different transformations for different object types.)
7. Finally, after a batch is created, we apply `aug_transforms()` to generate flipped, rotated, zoomed, warped, lighting-adjusted images as augmentations of our data for training.
Deeper Dive
Transform and Pipeline
The `Transform` object can have multiple `encodes`/`decodes` methods defined for different argument types. The appropriate `encodes`/`decodes` method is chosen based on the type of the first argument. This is achieved through type dispatch; for this more advanced topic, please watch Jeremy Howard's code walk-through video on how the `Transform` class is built.

You can find many examples in the Transform documentation. Here I only give the most basic demonstration of `Transform`:
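The original demo cell is not shown here, so below is a minimal, self-contained sketch of the dispatch behavior. This is a toy stand-in for fastcore's `Transform` (the real implementation uses `TypeDispatch` and supports multiple `encodes` methods); the `plus3` function is annotated to accept `int`:

```python
import inspect
import typing

class SimpleTransform:
    """Toy stand-in for fastcore's Transform: apply fn only when the input
    matches the type annotation on fn's first argument; otherwise pass through."""
    def __init__(self, fn):
        self.fn = fn
        hints = typing.get_type_hints(fn)
        first_param = next(iter(inspect.signature(fn).parameters))
        self.dispatch_type = hints.get(first_param, object)

    def __call__(self, x):
        return self.fn(x) if isinstance(x, self.dispatch_type) else x

def plus3(x: int): return x + 3

tfm = SimpleTransform(plus3)
print(tfm(2))    # 5   -> int, so plus3 is applied
print(tfm(2.0))  # 2.0 -> float, passes through unchanged
```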
As can be seen, the `plus3` function is only applied to integer inputs.
A `Pipeline` simply applies a list of `Transform`s to an item. You can optionally specify `split_idx` to turn on only the transforms with the same `split_idx`. Please see the Pipeline documentation for more details.
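A minimal sketch of that gating behavior (again a toy, not fastcore's actual `Pipeline`): a transform tagged with a `split_idx` runs only when it matches the pipeline's own `split_idx`. The functions and index values here are made up for illustration:

```python
class SimplePipeline:
    """Toy sketch of Pipeline's split_idx gating: compose transforms in order,
    skipping any whose split_idx is set and differs from the pipeline's."""
    def __init__(self, tfms, split_idx=None):
        self.tfms, self.split_idx = tfms, split_idx

    def __call__(self, x):
        for t in self.tfms:
            t_split = getattr(t, 'split_idx', None)
            if t_split is None or t_split == self.split_idx:
                x = t(x)
        return x

def double(x): return x * 2
def add10(x): return x + 10
add10.split_idx = 0   # only active when the pipeline's split_idx is 0 (train)

train_pipe = SimplePipeline([double, add10], split_idx=0)
valid_pipe = SimplePipeline([double, add10], split_idx=1)
print(train_pipe(3))  # 16: double(3) = 6, then add10 -> 16
print(valid_pipe(3))  # 6: add10 is skipped for split_idx=1
```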
TfmdLists and Datasets
`TfmdLists` applies a list of `Transform`s or a `Pipeline` to a list of items, with the option to split the list into a training set and a validation set by specifying `splits` as lists of indices:
`Datasets` further expands this functionality: you can specify multiple `Pipeline`s or lists of `Transform`s to apply to a list of items, returning a list of tuples, with each element of a tuple corresponding to the result of one `Pipeline`.
```
(#4) [(5, 4),(7, 5),(13, 8),(-15, -6)]
(#2) [(25, 14),(33, 18)]
```
From `Datasets`, we can then get the dataloaders:
The same could be achieved by using the DataBlock API directly: