crazybiocomputing: Dataset and Features

To feed a Machine Learning (ML) learner, we need data. But rather than using raw image pixels, it is much more powerful to use ImageJ measurements (called features in ML jargon) ...

Starting from the image of Fig. 1, we need to extract some measurements describing and characterizing the shapes.

Fig. 1: Image of circles, triangles, and squares

1. Measurements in ImageJ

In ImageJ, many different measurements are available in Analyze > Set Measurements as shown in Fig. 2.

Fig. 2: Dialog Window with all the measurements available in ImageJ.

In this case, because we don't know in advance which ones best describe the shapes, I selected:

Area
Centroid (X- and Y-coordinates)
Fit ellipse (major, minor axes + angle)
Shape descriptors including:

Circularity
Aspect Ratio
Roundness
Solidity

Area fraction (% of Area)
Feret's diameter (max and min diameters + angle + starting coordinates)

Note: Centroid is only useful for display...
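To make the shape descriptors a bit more concrete, here is a minimal Python sketch of how three of them can be computed. The formulas are the ones given in the ImageJ documentation; the function names are mine and are not part of ImageJ. The numbers come from row 4 of the small dataset shown in section 2.

```python
import math

def aspect_ratio(major, minor):
    """AR = major axis / minor axis of the fitted ellipse."""
    return major / minor

def roundness(area, major):
    """Round = 4 * area / (pi * major^2)."""
    return 4.0 * area / (math.pi * major ** 2)

def circularity(area, perimeter):
    """Circ. = 4 * pi * area / perimeter^2 (1.0 for a perfect circle)."""
    return 4.0 * math.pi * area / perimeter ** 2

# Row 4 of the small dataset: Area, Major, and Minor columns
area, major, minor = 116.0, 17.760, 8.316

ar = aspect_ratio(major, minor)
rnd = roundness(area, major)

# The table reports AR = 2.136 and Round = 0.468 for this row
print(round(ar, 3), round(rnd, 3))
```

Because the fitted ellipse has the same area as the particle, Round is simply the inverse of AR. These two features carry the same information, which is worth remembering when choosing features for a learner.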

2. Image Features in ML Terminology

A dataset

+++ dataset +++

	Area	X	Y	Major	Minor	Angle	Circ.	Feret	%Area	FeretX	FeretY	FeretAngle	MinFeret	AR	Round	Solidity	Vertices
1	405	422.483	15.547	22.832	22.585	110.833	0.712	28.46	100	418	2	108.435	21.04	1.011	0.989	0.914	4
2	577	28.5	16.282	27.121	27.089	90.000	0.947	28.302	100	17	8	147.995	27	1.001	0.999	0.96	0
3	1155	559.324	23.077	38.484	38.213	168.464	0.925	39.661	100	542	33	33.69	38	1.007	0.993	0.965	0
4	116	877.716	17.06	17.760	8.316	90.422	0.493	20.881	100	877	4	106.699	10.915	2.136	0.468	0.856	3
5	554	503.204	22.15	26.800	26.319	96.838	0.793	33.242	100	501	6	96.911	24.855	1.018	0.982	0.917	4

File: shapes_dataset_small.csv

+++ End of dataset +++

In ML terminology, the data (as displayed in an ImageJ table or a spreadsheet) is composed of rows (termed observations, examples, or feature vectors in ML), and each column corresponds to a feature. Each cell therefore holds the value of one feature for one observation.
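As an illustration of this vocabulary, the following Python sketch reads a few columns of the table above and separates the feature vectors from the last column. It only uses the standard library. The names `CSV_TEXT` and `load_dataset` are hypothetical, and only a subset of the columns is kept to stay readable.

```python
import csv
import io

# A subset of the columns of the small dataset shown above
CSV_TEXT = """Area,Circ.,AR,Round,Solidity,Vertices
405,0.712,1.011,0.989,0.914,4
577,0.947,1.001,0.999,0.96,0
1155,0.925,1.007,0.993,0.965,0
116,0.493,2.136,0.468,0.856,3
554,0.793,1.018,0.982,0.917,4
"""

def load_dataset(text, target="Vertices"):
    """Return (feature names, feature vectors, targets) from CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    feature_names = [name for name in reader.fieldnames if name != target]
    vectors, targets = [], []
    for row in reader:
        # One row = one observation = one feature vector
        vectors.append([float(row[name]) for name in feature_names])
        targets.append(int(row[target]))
    return feature_names, vectors, targets

feature_names, vectors, targets = load_dataset(CSV_TEXT)
print(len(vectors), "observations,", len(feature_names), "features")
print(vectors[3], "->", targets[3])
```

Each inner list in `vectors` is one feature vector (one row of the table), and `feature_names` gives the meaning of each position in it.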

3. Dataset = Training + Test + Validation sets

In an ML project, three steps are usually carried out, each using a different dataset. It is very important that you don't use the same data for all the steps (no overlap between the various datasets). That's why the dataset is split into three non-overlapping subsets.

3.1. Training step (60% of the dataset)

During the training step, you feed the learner with the features plus the targets (labels) of the feature vectors, using 60% of the dataset. The targets correspond to the correct/expected outputs. Here, the target is the type of shape: triangle, circle, or square. Thus, for each observation, we need to set the shape's type by hand.

3.2. Cross-validation step (20% of the dataset)

In this step, you compare various models generated with several ML algorithms and/or try to find the parameters that generate the best model.

3.3. Test step (20% of the dataset)

Check the accuracy and quality of the prediction for the best model.

Note: In simple projects, the validation step is skipped and only the test step is done with 40% of the dataset.
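The 60/20/20 split described in this section could be sketched in Python as follows. This is only one possible way to do it: shuffling the rows before cutting and using a fixed seed are my assumptions (they are not prescribed above), and `split_dataset` is a hypothetical helper name.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Split rows into non-overlapping training, validation, and test sets."""
    shuffled = list(rows)
    # Assumption: shuffle first so that each subset mixes all shape types
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains (about 20%)
    return training_set, validation_set, test_set

# Stand-in for the observations: 245 row indices
rows = list(range(245))
training_set, validation_set, test_set = split_dataset(rows)
print(len(training_set), len(validation_set), len(test_set))
```

For the simpler case mentioned in the note, calling `split_dataset(rows, validation=0.0)` leaves the validation set empty and puts the remaining 40% in the test set.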

4. Final dataset: IJ Measurements + Vertices

The complete dataset (downloadable here) contains 245 graphic objects with their ImageJ measurements plus the targets (circle, square, and triangle).

Note: The Results table in ImageJ only accepts numeric values. Thus, the shape types were replaced by the number of vertices: 0 for a circle, 3 for a triangle, and 4 for a square. A column termed "Vertices" was added to the dataset.
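Outside ImageJ, this numeric encoding can be translated back to shape names with a simple lookup table. A small Python sketch, where the dictionary and function names are hypothetical:

```python
# Encoding used in the "Vertices" column
SHAPE_TO_VERTICES = {"circle": 0, "triangle": 3, "square": 4}
VERTICES_TO_SHAPE = {v: name for name, v in SHAPE_TO_VERTICES.items()}

def encode(shape):
    """Shape name -> number of vertices (the numeric target)."""
    return SHAPE_TO_VERTICES[shape]

def decode(vertices):
    """Number of vertices -> shape name; accepts 4 or 4.0."""
    return VERTICES_TO_SHAPE[int(vertices)]

# "Vertices" values of the five rows of the small dataset
labels = [decode(v) for v in (4, 0, 0, 3, 4)]
print(labels)
```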

Download this file in CSV (comma-separated values) format, then open it in an ImageJ Results window with File > Open...

Note: The training set was obtained from Fig. 1 by applying Process > Binary > Make Binary, then Process > Binary > Watershed, and finally Analyze > Analyze Particles... The shape types were then assigned visually by appending a new column termed Vertices.

5. Other crazybiocomputing posts

The complete dataset.

Further readings are available in ...

Machine Learning Glossary
Machine Learning in ImageJ Series [Link]
JavaScript/ECMAScript TOC [Link]

Tuesday, October 16, 2018