Plot Digitizer - Produce Random Synthetic Images
I. Objectives
To build a generalized model for extracting accurate numerical values ("digits") from plots: line plots, dot-line plots, scatter plots, vertical and horizontal bar plots, and of course pie charts, etc. Here are some examples.
Sample plots of different types
In this work, we intend to solve this problem in a sequence-based manner. That is, building on Deformable DETR (deformable detection transformer) and other important query-based methods, input images will be converted into digit sequences of variable length. But it is still too soon to get into those details. In this first log, we focus on how to produce random synthetic plots for our dataset.
II. Drawing with Matplotlib
First we tackle three typical plot types: bar, scatter, and dot-line. We use matplotlib to draw them randomly.
```python
import numpy as np
from matplotlib import pyplot as plt
```
We don't want to put too much pressure on our first test model, so all images are rendered at a fixed size via:
```python
plt.rcParams['figure.figsize'] = (5, 5)
```
Thus the plots' general appearance is unified, all in the default matplotlib style. Later we will introduce a public dataset such as FigureQA to boost generalization, but for now we just want to know whether the approach works at all.
bar plot
```python
bar_count = np.random.randint(2, 10)  # random number of bars
x = np.linspace(0, 1, bar_count)
y = np.random.random(bar_count)       # random bar heights
plt.figure()
plt.bar(x, y, width=1/bar_count)
plt.close()
```
scatter plot
```python
scatter_counts = np.random.randint(9, 15)  # random number of points
x = np.random.random(scatter_counts)
y = np.random.random(scatter_counts)
plt.figure()
plt.xlim(0, 1)
plt.ylim(0, 1)
colors = np.random.rand(scatter_counts)    # random point colors
plt.scatter(x, y, s=80, c=colors, alpha=1)
plt.close()
```
dot-line plot
```python
keypoint_counts = np.random.randint(3, 10)  # random number of keypoints
x = np.linspace(0.1, 0.9, keypoint_counts)
y = np.random.random(keypoint_counts)
plt.figure()
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.plot(x, y, marker="*", ms=10, linewidth=4.0, mfc="#4CAF50", alpha=0.6)
plt.close()
```
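Each snippet above ends with plt.close() without showing how the rendered image is captured; presumably the figure is saved into an in-memory buffer first (the buff variable appears later when the image is loaded back). A minimal sketch of that step:

```python
import io

buff = io.BytesIO()
plt.savefig(buff, format='png')  # render the current figure into the buffer
buff.seek(0)                     # rewind so the image can be read back
plt.close()
```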
The so-called digits of a plot are a set of keypoints, each consisting of an x and y value expressed as a percentage of the axis range (0 ~ 1, which is enough so far). Since the x-axis is generally a continuous variable in dot-line and scatter plots, we use plt.xlim to put the whole x-axis range to use. If we knew the real start and end values of the x- and y-axes, it would be easy to translate the percentage values directly into real-world quantities. In bar plots, however, the x-axis is discrete, so only the predicted y percentage actually matters. Much more sophisticated reasoning about the x-axis and the plot legend will be needed in the end.
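For illustration, once the real axis endpoints are known, translating a percentage back to a real value is just linear interpolation (a toy sketch; the endpoint numbers are made up):

```python
def to_real_value(pct, axis_start, axis_end):
    # map a percentage in [0, 1] back onto the real axis range
    return axis_start + pct * (axis_end - axis_start)

# e.g. a keypoint at 0.25 on an x-axis running from 1990 to 2020 -> 1997.5
year = to_real_value(0.25, 1990, 2020)
```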
The number of keypoints equals the length of the final predicted sequence; it is variable and defined by the _counts variables above.
OK, so we can now produce random synthetic images, all unified into a similar form and style. It is not perfect, but enough for our first test. Let's wrap them in a subclass of torch's Dataset.
III. Dataset Class
a. DataLoader Introduction
We need a customized Dataset subclass for our data object.
> All datasets that represent a map from keys to data samples should subclass it. All subclasses should overwrite __getitem__(), supporting fetching a data sample for a given key. Subclasses could also optionally overwrite __len__(), which is expected to return the size of the dataset by many Sampler implementations and the default options of DataLoader.
```python
class DataLoader(object):
    ...
    def __next__(self):
        if self.num_workers == 0:
            indices = next(self.sample_iter)
            batch = self.collate_fn([self.dataset[i] for i in indices])
            if self.pin_memory:
                batch = _utils.pin_memory.pin_memory_batch(batch)
            return batch
```
This is part of the source code of DataLoader.__next__:
- By default, DataLoader constructs an index sampler that yields integer indices (it is an iterable, so indices can be fetched via next()); these are later used to fetch instances from the Dataset subclass. __getitem__, a built-in Python hook defined in the subclass, makes it possible to index the dataset as if it were a list.
- Each access to dataset[idx] yields only a single sample, normally a tuple holding an image and its label or bbox. The outputs then have to be reorganized into batch form by collate_fn.
- Finally, pin_memory_batch copies the batch of this iteration into page-locked (pinned) CPU memory, which speeds up the subsequent transfer to the GPU.
Thus, what we have to do here is build a customized Dataset subclass and define __getitem__ and collate_fn as needed.
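A minimal skeleton of such a subclass might look like the following (a sketch; the class name and attributes are illustrative, not our final code):

```python
from torch.utils.data import Dataset

class SyntheticPlotDataset(Dataset):
    def __init__(self, length, start_seed=0):
        self.length = length          # number of synthetic samples
        self.start_seed = start_seed  # offset for per-sample random seeds

    def __len__(self):
        return self.length

    def __getitem__(self, index):
        # draw a random plot and return (TensorImage, bbox); filled in below
        ...
```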
By the way, that is an old version of torch's DataLoader, and we only show the branch where num_workers is set to zero, since that is the easiest to illustrate. The corresponding code in the latest stable version (1.13.0) looks like:
```python
class _SingleProcessDataLoaderIter(_BaseDataLoaderIter):
    ...
    def _next_data(self):
        index = self._next_index()  # may raise StopIteration
        data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        if self._pin_memory:
            data = _utils.pin_memory.pin_memory(data, self._pin_memory_device)
        return data
```
b. Generating a Data Item
```python
def __getitem__(self, index):
    seed = index + self.start_seed
    np.random.seed(seed)
```
index corresponds to the index we got from the sampler and lets us instantiate a distinct data sample. By combining it with our original seed value, we get a different seed every time, which means plots with different appearances.
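A side effect is that the dataset becomes deterministic: the same index always reproduces the same plot. A quick sanity check, reusing the hypothetical SyntheticPlotDataset skeleton above:

```python
import torch

ds = SyntheticPlotDataset(length=1000, start_seed=42)
img_a, bbox_a = ds[7]
img_b, bbox_b = ds[7]
assert torch.equal(img_a, img_b)  # same index, same seed, identical plot
```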
```python
bbox = np.concatenate([
    x[:, None], y[:, None], np.zeros((x.shape[0], 1)), np.zeros((x.shape[0], 1))
], axis=1)
```
As mentioned before, we summarize the plot as a number of keypoints with x- and y-axis values. Here, though, we include not only the keypoint coordinates but also a width and height for the bbox. This is natural because our inspiration comes from general query-based object detection algorithms. We leave them as placeholders for now; they have no effect on training yet, since we set them all to zero and the L1 loss is only computed on the coordinates. Later, we will adjust them to introduce geometrical constraints into this format and to dilate the general unit of these keypoints.
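To make the loss remark concrete, a coordinate-only L1 loss could look like this sketch (hypothetical, not our actual training code; pred and target are padded (batch, seq, 4) tensors and mask flags the real keypoints):

```python
def keypoint_l1_loss(pred, target, mask):
    # compare only the (x, y) coordinates; w and h stay at zero and are ignored
    diff = (pred[..., :2] - target[..., :2]).abs().sum(-1)
    # zero out padded positions and average over the real keypoints
    return (diff * mask).sum() / mask.sum().clamp(min=1)
```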
A Memory Leak
An unexpected issue happened here when we tried to turn our PIL image into a Tensor (versions: Python 3.6.5, NumPy 1.19.5, Torch 1.10.1):
```python
from PIL import Image
from torchvision.transforms import ToTensor

PILImage = Image.open(buff).convert('RGB').resize((480, 480))
TensorImage = ToTensor()(PILImage)  # memory leak happens here
```
A memory leak happened here. Because of it, memory usage rapidly piled up and the process soon shut down automatically. We used memory_profiler to locate the leak; it is quite powerful. It showed that every time ToTensor() was used, some extra memory was allocated that could not be released automatically, even after the reference count dropped back to zero, i.e., after the object was finished with (I'm a beginner with Python's garbage collection mechanism).
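For reference, a minimal way to get such line-by-line memory statistics from memory_profiler (the wrapper function here is made up for illustration; it reuses the imports above):

```python
from memory_profiler import profile

@profile  # prints line-by-line memory usage when the function runs
def load_sample(buff):
    PILImage = Image.open(buff).convert('RGB').resize((480, 480))
    return ToTensor()(PILImage)
```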
We then looked into the details of the ToTensor class:
> Converts a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0] if the PIL Image belongs to one of the modes (L, LA, P, I, F, RGB, YCbCr, RGBA, CMYK, 1) or if the numpy.ndarray has dtype = np.uint8
Following that description, we reimplemented the transformation step by step to find exactly where the leak happened.
```python
img_arr = np.asarray(PILImage).astype(np.float32)  # PIL image -> float32 array
img_arr = img_arr / 255.                           # memory leak happens here
img_arr = torch.from_numpy(img_arr)
img_arr = img_arr.permute(2, 0, 1)                 # HWC -> CHW
```
So the leak occurs during the calculation on the array.
Such temporary allocation is natural for NumPy, yet what we expected, an automatic release of that memory afterwards, never happened. We even tried to release it manually via gc.collect(), but that turned out to be useless. We still don't know why. Therefore, we ended up implementing the generation in a multiprocessing style.
```python
from multiprocessing import Pool

def __getitem__(self, index):
    seed = index + self.start_seed
    with Pool(1) as p:
        return p.apply(self._generate_dataset, [seed])
```
The plot sample is generated and transformed into a Tensor inside _generate_dataset, and the memory is automatically released once the worker process terminates. It works, but costs more training time.
You may encounter the following errors when doing multiprocessing here; a detailed discussion of Python multiprocessing and the apply() and map() functions can be found in another post of mine, where you can find solutions.

```
TypeError: run() argument after * must be an iterable, not int
ValueError: not enough values to unpack (expected *, got 1)
```
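The TypeError, at least, comes from passing a bare value where apply() expects an argument sequence; a minimal illustration (f is a made-up function):

```python
from multiprocessing import Pool

def f(seed):
    return seed * 2

if __name__ == '__main__':
    with Pool(1) as p:
        # p.apply(f, 3)  -> raises the TypeError above
        result = p.apply(f, (3,))  # args must be a tuple or list
```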
Finally, we return the TensorImage and the corresponding bbox (keypoint coordinates), completing our dataset sample.
c. Converting Image Metas
> When automatic batching is enabled, collate_fn is called with a list of data samples at each time. It is expected to collate the input samples into a batch for yielding from the data loader iterator. The rest of this section describes the behavior of the default collate_fn (default_collate()). For instance, if each data sample consists of a 3-channel image and an integral class label, i.e., each element of the dataset returns a tuple (image, class_index), the default collate_fn collates a list of such tuples into a single tuple of a batched image tensor and a batched class label Tensor. In particular, the default collate_fn has the following properties:
> - It always prepends a new dimension as the batch dimension.
> - It automatically converts NumPy arrays and Python numerical values into PyTorch Tensors.
> - It preserves the data structure, e.g., if each sample is a dictionary, it outputs a dictionary with the same set of keys but batched Tensors as values (or lists if the values can not be converted into Tensors). Same for lists, tuples, namedtuples, etc.
default_collate() would not be enough for this work.
Firstly, we need to turn multiple dataset tuples, (TensorImage, bbox), into a batch and transform it into torch.Tensor form. The collate function takes a list of dataset samples as input:
```python
batch = self.collate_fn([self.dataset[i] for i in indices])  # old DataLoader
```
We regroup it via zip(*) and form a batched image tensor:
```python
TensorImages, bboxes = [*zip(*batch)]  # or via `list(zip(*batch))`
TensorImgBatch = torch.concat([_.unsqueeze(0) for _ in TensorImages])
```
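Since all images share the same fixed size, torch.stack would do the unsqueeze-and-concatenate in one call; the two forms are equivalent here:

```python
TensorImgBatch = torch.stack(TensorImages)  # same result as the concat above
```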
Secondly, we need to do some additional processing on the bbox information, converting it into metas that include the text query for decoding. As mentioned before, every plot is formed by a sequence of keypoints of variable length. We calculate the maximum of this length within the batch:
```python
max_seq_length = max([b.shape[0] for b in bboxes]) + 2  # +2 for the start "<SOS>" and end "<EOS>" markers of a sequence
```
Then, for each sample b in the batch, we pad the empty slots with zeros and output a mask over the data:
```python
b_padded = np.zeros((max_seq_length, 4))  # 4 for (x, y, w, h); w and h stay zero
b_padded[1:b.shape[0]+1] = b
b_mask = np.zeros(max_seq_length)
b_mask[1:b.shape[0]+1] = 1
```
Again, the bbox here is nothing like an actual bounding box; it is just for convenience, and w and h are ignored. We are simply framing the problem as object detection, building on query-based methods. The meta information for every plot sample looks like:
```python
img_metas.append(
    {
        ...
        'text': ','.join(["<SOS>"] + ["<thead>"]*b.shape[0] + ["<EOS>"]
                         + ["<PAD>"]*(max_seq_length - b.shape[0] - 2)),
        'bbox': b_padded,
        'bbox_masks': b_mask
    }
)
```
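Putting the pieces together, the whole collate function might look like the following sketch, assembled from the snippets above (the elided '...' keys of the meta dict are omitted):

```python
def collate_fn(batch):
    TensorImages, bboxes = list(zip(*batch))
    TensorImgBatch = torch.concat([img.unsqueeze(0) for img in TensorImages])

    max_seq_length = max(b.shape[0] for b in bboxes) + 2  # <SOS> and <EOS>
    img_metas = []
    for b in bboxes:
        b_padded = np.zeros((max_seq_length, 4))
        b_padded[1:b.shape[0] + 1] = b
        b_mask = np.zeros(max_seq_length)
        b_mask[1:b.shape[0] + 1] = 1
        img_metas.append({
            'text': ','.join(["<SOS>"] + ["<thead>"]*b.shape[0] + ["<EOS>"]
                             + ["<PAD>"]*(max_seq_length - b.shape[0] - 2)),
            'bbox': b_padded,
            'bbox_masks': b_mask,
        })
    return TensorImgBatch, img_metas
```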
Finally, we have all we need: a batch of random synthetic plots, TensorImgBatch, each described by a sequence of keypoints, and the corresponding meta information, img_metas, for the decoder, consisting of the ground truth, query text, and mask. That is the whole objective of this post.