Tuesday, October 16, 2018

Dataset and Features




To train a Machine Learning (ML) learner, we need data. Rather than feeding it raw image pixels, it is much more powerful to use ImageJ measurements (called features in ML jargon) ...

Starting from the image of Fig. 1, we need to extract some measurements describing and characterizing the shapes.

Fig. 1: Image of circles, triangles, and squares

1. Measurements in ImageJ

In ImageJ, many different measurements are available in Analyze > Set Measurements as shown in Fig. 2.

Fig. 2: Dialog Window with all the measurements available in ImageJ.

In this case, because we don't know in advance which ones best describe the shapes, I selected:
  • Area
  • Centroid (X- and Y-coordinates)
  • Fit ellipse (major, minor axes + angle)
  • Shape descriptors including:
    • Circularity
    • % of Area
    • Aspect Ratio
    • Roundness
    • Solidity
  • Feret's diameter (max and min diameters + angle + center)
Note: Centroid is only useful for display...
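
To see why shape descriptors are good candidates for telling the shapes apart, here is a minimal Python sketch of the four descriptors, written from the formulas given in the ImageJ documentation (circularity = 4π × area / perimeter², aspect ratio = major / minor, roundness = 4 × area / (π × major²), solidity = area / convex area). The function names are mine, and the shapes are ideal geometric ones; values measured on a pixel image will deviate from these.

```python
import math

def circularity(area, perimeter):
    """4*pi*area / perimeter^2: 1.0 for a perfect circle, lower otherwise."""
    return 4.0 * math.pi * area / perimeter ** 2

def aspect_ratio(major, minor):
    """Ratio of the major to the minor axis of the fitted ellipse."""
    return major / minor

def roundness(area, major):
    """4*area / (pi * major^2), based on the major axis of the fitted ellipse."""
    return 4.0 * area / (math.pi * major ** 2)

def solidity(area, convex_area):
    """Area divided by the area of the convex hull."""
    return area / convex_area

# Ideal shapes (not measured data): a circle of radius 10,
# a square of side 10, and an equilateral triangle of side 10.
circle = circularity(math.pi * 10.0 ** 2, 2.0 * math.pi * 10.0)
square = circularity(10.0 ** 2, 4.0 * 10.0)
triangle = circularity(math.sqrt(3.0) / 4.0 * 10.0 ** 2, 3.0 * 10.0)

print(f"circle:   {circle:.3f}")    # 1.000
print(f"square:   {square:.3f}")    # 0.785
print(f"triangle: {triangle:.3f}")  # 0.605
```

On ideal shapes, circularity alone already separates the three classes, which suggests it is worth keeping among the selected measurements.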

2. Image Features in ML Terminology


In ML terminology, the data (as displayed in an ImageJ table or a spreadsheet) is organized in rows and columns: each row is an observation (also termed an example or a feature vector), and each column corresponds to a feature.
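
The row/column vocabulary can be sketched in a few lines of Python. The small table below is made up for illustration (it is not taken from the real dataset), and the column names are an assumption about how the measurements are labelled in the table.

```python
import csv
import io

# Made-up values for illustration only; not taken from the real dataset.
TABLE = """Area,Circ.,AR,Solidity
314,0.90,1.00,0.98
100,0.78,1.00,1.00
43,0.60,1.15,0.97
"""

def read_observations(text):
    """Return (feature_names, feature_vectors) parsed from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    names = list(reader.fieldnames)
    # One row = one observation = one feature vector (a list of numbers).
    vectors = [[float(row[name]) for name in names] for row in reader]
    return names, vectors

names, vectors = read_observations(TABLE)
print(names)                          # the features (columns)
print(len(vectors), "observations")   # the rows
print(vectors[0])                     # the first feature vector
```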

3. Dataset = Training + Test + Validation sets

In an ML project, three steps are usually carried out, each using a different dataset. It is very important not to reuse the same data across the steps (no overlap between the various datasets). That's why the dataset is split into three non-overlapping subsets.
3.1. Training step (60% of the dataset)
During the training step, you feed the learner with the features and the targets (labels) of the feature vectors (60% of the dataset). The targets are the correct/expected outputs. Here, they are the shape types: triangle, circle, and square. Thus, for each observation, we need to set the shape type by hand.

3.2. Cross-validation step (20% of the dataset)
In this step, you compare various models generated with several ML algorithms and/or try to find the parameters that yield the best model.
3.3. Test step (20% of the dataset)
In this step, you check the accuracy and quality of the predictions of the best model.

Note: In simple projects, the validation step is skipped and only the test step is done with 40% of the dataset.
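
The 60/20/20 split described above can be sketched as follows. This is one possible way to do it in Python, not necessarily how the split is done later in this series; the function name and the fixed seed are my own choices. The observations are shuffled first so that each subset contains a mix of shapes.

```python
import random

def split_dataset(observations, seed=0):
    """Shuffle, then split 60/20/20 into training, validation, and test sets."""
    shuffled = list(observations)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    n = len(shuffled)
    n_train = n * 60 // 100
    n_valid = n * 20 // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # whatever remains (about 20%)
    return train, valid, test

# Row indices stand in for the observations of a 245-row table.
train, valid, test = split_dataset(range(245))
print(len(train), len(valid), len(test))  # 147 49 49
```

Because the three slices are taken from one shuffled list, they cannot overlap, which is the property required above.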

4. Final dataset: IJ Measurements + Vertices


The complete dataset (downloadable here) contains 245 graphic objects with their ImageJ measurements and the targets (circle, square, and triangle).
Note: The Results table in ImageJ only accepts numeric values. Thus, the shape types were replaced by the number of vertices: 0 for a circle, 3 for a triangle, and 4 for a square. A column termed "Vertices" was added to the dataset.
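
The numeric encoding of the targets can be sketched as a simple lookup in both directions. The helper names below are hypothetical; only the mapping itself (0, 3, 4) comes from the dataset description.

```python
# Shape type -> number of vertices, as used in the "Vertices" column.
VERTICES = {"circle": 0, "triangle": 3, "square": 4}
# Reverse lookup: number of vertices -> shape type.
SHAPES = {count: name for name, count in VERTICES.items()}

def encode(labels):
    """Replace shape names by their number of vertices."""
    return [VERTICES[label] for label in labels]

def decode(vertices):
    """Replace vertex counts by shape names (accepts 3 or 3.0)."""
    return [SHAPES[int(v)] for v in vertices]

print(encode(["circle", "square", "triangle"]))  # [0, 4, 3]
print(decode([3.0, 0.0, 4.0]))  # ['triangle', 'circle', 'square']
```

`decode` converts each value with `int` because numbers read back from a table are often floats.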

Download this file in CSV (comma-separated values) format, then open it in ImageJ in a Results window with File > Open...

Note: The training set was obtained from Fig. 1 by applying Process > Binary > Make Binary, then Process > Binary > Watershed, and finally Analyze > Analyze Particles... The shape types were then assigned visually by appending a new column termed Vertices.





5. Other crazybiocomputing posts

The complete dataset.

Further readings are available in ...
