Tuesday, October 16, 2018

Dataset and Features




To train a Machine Learning (ML) learner, we need data. However, rather than feeding it raw image pixels, it is much more powerful to use ImageJ measurements (called features in ML jargon) ...

Starting from the image of Fig. 1, we need to extract some measurements describing and characterizing the shapes.

Fig. 1: Image of circles, triangles, and squares

1. Measurements in ImageJ

In ImageJ, many different measurements are available in Analyze > Set Measurements as shown in Fig. 2.

Fig. 2: Dialog Window with all the measurements available in ImageJ.

In this case, because we don't know in advance which measurements best describe the shapes, I selected:
  • Area
  • Centroid (X- and Y-coordinates)
  • Fit ellipse (major, minor axes + angle)
  • Shape descriptors including:
    • Circularity
    • Aspect Ratio
    • Roundness
    • Solidity
  • Area fraction (% of Area)
  • Feret's diameter (max and min diameters + angle + start coordinates)
Note: Centroid is only useful for display...
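
As a rough illustration of what some of these shape descriptors measure, here is a minimal Python sketch. The formulas follow the definitions given in the ImageJ documentation for Analyze > Set Measurements; the function names are made up for this example and are not part of ImageJ. The numeric values come from the first object of the dataset excerpt shown in section 2.

```python
import math

def aspect_ratio(major, minor):
    """AR: major axis / minor axis of the fitted ellipse."""
    return major / minor

def roundness(area, major):
    """Round: 4 * area / (pi * major^2)."""
    return 4.0 * area / (math.pi * major ** 2)

def circularity(area, perimeter):
    """Circ.: 4 * pi * area / perimeter^2 (1.0 for a perfect circle)."""
    return 4.0 * math.pi * area / perimeter ** 2

# Object 1 of the dataset excerpt: Area=405, Major=22.832, Minor=22.585
print(round(aspect_ratio(22.832, 22.585), 3))  # 1.011, as in the AR column
print(round(roundness(405, 22.832), 3))        # 0.989, as in the Round column
```

Circularity cannot be checked against the excerpt because the perimeter was not among the selected measurements.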

2. Image Features in ML Terminology

A sample of the dataset:
+++ dataset +++
#  Area  X        Y       Major   Minor   Angle    Circ.  Feret   %Area  FeretX  FeretY  FeretAngle  MinFeret  AR     Round  Solidity  Vertices
1  405   422.483  15.547  22.832  22.585  110.833  0.712  28.46   100    418     2       108.435     21.04     1.011  0.989  0.914     4
2  577   28.5     16.282  27.121  27.089  90.000   0.947  28.302  100    17      8       147.995     27        1.001  0.999  0.96      0
3  1155  559.324  23.077  38.484  38.213  168.464  0.925  39.661  100    542     33      33.69       38        1.007  0.993  0.965     0
4  116   877.716  17.06   17.760  8.316   90.422   0.493  20.881  100    877     4       106.699     10.915    2.136  0.468  0.856     3
5  554   503.204  22.15   26.800  26.319  96.838   0.793  33.242  100    501     6       96.911      24.855    1.018  0.982  0.917     4
+++ End of dataset +++

In ML terminology, the data (as displayed in an ImageJ Results table or a spreadsheet) is organized in rows and columns: each row is an observation (also termed an example or a feature vector), and each column corresponds to a feature. Each cell thus holds the value of one feature for one observation.
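
To make the vocabulary concrete, the following Python sketch turns the five rows of the excerpt above into a list of feature vectors and a list of targets. It is only an illustration: the variable names are arbitrary, the "#" header for the row-index column is added here for convenience, and treating the last column (Vertices) as the target anticipates section 4.

```python
# Excerpt copied from the table above; "#" names the row-index column.
TABLE = """\
# Area X Y Major Minor Angle Circ. Feret %Area FeretX FeretY FeretAngle MinFeret AR Round Solidity Vertices
1 405 422.483 15.547 22.832 22.585 110.833 0.712 28.46 100 418 2 108.435 21.04 1.011 0.989 0.914 4
2 577 28.5 16.282 27.121 27.089 90.000 0.947 28.302 100 17 8 147.995 27 1.001 0.999 0.96 0
3 1155 559.324 23.077 38.484 38.213 168.464 0.925 39.661 100 542 33 33.69 38 1.007 0.993 0.965 0
4 116 877.716 17.06 17.760 8.316 90.422 0.493 20.881 100 877 4 106.699 10.915 2.136 0.468 0.856 3
5 554 503.204 22.15 26.800 26.319 96.838 0.793 33.242 100 501 6 96.911 24.855 1.018 0.982 0.917 4
"""

lines = TABLE.strip().splitlines()
header = lines[0].split()
feature_names = header[1:-1]  # drop the row index and the target column

feature_vectors = []  # one list of feature values per observation (row)
targets = []          # one label per observation (number of vertices)
for line in lines[1:]:
    values = line.split()
    feature_vectors.append([float(v) for v in values[1:-1]])
    targets.append(int(values[-1]))

print(len(feature_vectors), "observations,", len(feature_names), "features")
print(targets)
```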

3. Dataset = Training + Test + Validation sets

In an ML project, three steps are usually carried out, each with a different dataset. It is very important not to use the same data for all the steps (no overlap between the various datasets). That's why the dataset is split into three non-overlapping subsets.
3.1. Training step (60% of the dataset)
During the training step, you feed the learner with the features plus the targets (labels) of the feature vectors in the training set (60% of the dataset). The targets correspond to the correct/expected outputs. Here, this is the type of shape: triangle, circle, or square. Thus, for each observation, we need to set the shape's type by hand.

3.2. Cross-validation step (20% of the dataset)
In this step, you compare various models generated with several ML algorithms and/or try to find the parameters that yield the best model.

3.3. Test step (20% of the dataset)
In this step, you check the accuracy and quality of the predictions of the best model.

Note: In simple projects, the validation step is skipped and only the test step is done with 40% of the dataset.
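
The 60/20/20 split described in this section can be sketched in a few lines of Python. This is one possible way to do it, not a prescribed method: the function name `split_dataset` and the fixed random seed are choices made for this example, and 245 is the number of objects in the complete dataset presented in section 4.

```python
import random

def split_dataset(rows, seed=0):
    """Shuffle rows, then split them 60/20/20 into train/validation/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # shuffle so each subset mixes all shapes
    n_train = len(shuffled) * 60 // 100
    n_valid = len(shuffled) * 20 // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # the remainder goes to the test set
    return train, valid, test

# Row indices stand in for the 245 observations of the complete dataset.
train, valid, test = split_dataset(range(245))
print(len(train), len(valid), len(test))  # 147 49 49
```

Because the three subsets are slices of one shuffled list, they cannot overlap, which is the property this section insists on.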

4. Final dataset: IJ Measurements + Vertices


The complete dataset (downloadable here) contains 245 graphic objects with their ImageJ measurements plus the targets (circle, square, and triangle).
Note: The Results table in ImageJ only accepts numeric values. Thus, the shape types were replaced by the number of vertices: 0 for a circle, 3 for a triangle, and 4 for a square. A column termed "Vertices" was added to the dataset.
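
When the dataset is later processed outside ImageJ, the numeric codes can be translated back into shape names. A minimal Python sketch of this encoding, with dictionary and function names chosen only for this example:

```python
# Numeric encoding used in the "Vertices" column of the dataset.
VERTICES_TO_SHAPE = {0: "circle", 3: "triangle", 4: "square"}
SHAPE_TO_VERTICES = {shape: n for n, shape in VERTICES_TO_SHAPE.items()}

def decode(vertices):
    """Translate a list of vertex counts into shape names."""
    return [VERTICES_TO_SHAPE[int(v)] for v in vertices]

def encode(shapes):
    """Translate a list of shape names into vertex counts."""
    return [SHAPE_TO_VERTICES[s] for s in shapes]

# Vertices column of the five objects listed in section 2.
print(decode([4, 0, 0, 3, 4]))
```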

Download this file in CSV (comma-separated values) format, then open it in ImageJ in a Results window with File > Open...

Note: The training set was obtained from Fig. 1 by applying Process > Binary > Make Binary, then Process > Binary > Watershed, and finally Analyze > Analyze Particles... The shape types were then visually assigned by appending a new column termed Vertices.





5. Other crazybiocomputing posts

The complete dataset.

Further readings are available in ...
