## Evaluation Process
The PUMA Challenge consists of two evaluation tracks.
For both tracks, submissions must include algorithms for Task 1 (tissue segmentation) and Task 2 (nuclei segmentation). For example evaluation code, JSON annotations, and TIFF annotations, see:

- PUMA Challenge Evaluation - Track 1
- PUMA Challenge Evaluation - Track 2
- PUMA Challenge Baseline - Track 1
- PUMA Challenge Baseline - Track 2
The prediction outputs should include:

- Task 1: One `.tif` file with segmentation results and metadata, including `XResolution`, `YResolution`, `SMinSampleValue` (excluding background), and `SMaxSampleValue` (see the sketch below).
- Task 2: One `.json` file containing nuclei segmentation results.
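As an illustration of the Task 1 output format, here is a minimal sketch using Python with the `tifffile` library. The mask contents, resolution values, and file name are placeholders of ours; the evaluation and baseline repositories linked above remain the authoritative reference.

```python
import numpy as np
import tifffile

# Placeholder output: a uint8 tissue label mask with values 0-5 (see class map below).
mask = np.zeros((1024, 1024), dtype=np.uint8)

# SMinSampleValue (TIFF tag 340) and SMaxSampleValue (TIFF tag 341) as extra tags.
# We assume the foreground label range 1-5, since background (0) is excluded.
extratags = [
    (340, "H", 1, 1, True),  # SMinSampleValue, excluding background
    (341, "H", 1, 5, True),  # SMaxSampleValue
]

# resolution=(x, y) writes the XResolution/YResolution tags.
tifffile.imwrite(
    "task1_output.tif",
    mask,
    resolution=(300, 300),  # placeholder values; check the challenge spec
    extratags=extratags,
)
```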
The labels in the tissue segmentation output should follow the class map:
| Class | Label |
|---|---|
| tissue_white_background | 0 |
| tissue_stroma | 1 |
| tissue_blood_vessel | 2 |
| tissue_tumor | 3 |
| tissue_epidermis | 4 |
| tissue_necrosis | 5 |
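For convenience, the class map translates directly into a lookup table; a minimal sketch in Python (the constant name is ours, not from the challenge code):

```python
# Tissue class map for Task 1 outputs (label values as in the table above).
TISSUE_CLASSES = {
    "tissue_white_background": 0,
    "tissue_stroma": 1,
    "tissue_blood_vessel": 2,
    "tissue_tumor": 3,
    "tissue_epidermis": 4,
    "tissue_necrosis": 5,
}
```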
## Ranking Metrics
Submissions are evaluated on two metrics:
- Task 1 (Tissue Segmentation): The micro Dice score is calculated by concatenating all segmentation results along one axis and then averaging the Dice score across all tissue classes; `tissue_white_background` is excluded from the metric calculation. A minimal sketch of this computation appears after the list below.
- Task 2 (Nuclei Segmentation): The macro F1 score is determined using a hit criterion based on confidence score and centroid distance for each nucleus class (a matching sketch also follows the list). The evaluation process is as follows:
- Extract annotations: Nuclei predictions and ground-truth nuclei are extracted from JSON files, with centroids calculated for each polygon.
- Filter predictions: Predictions are censored based on a 15-pixel radius; only predictions within this distance of any ground-truth nucleus are considered for further matching.
- Match predictions to ground-truth nuclei based on:
  - the highest confidence score, if available;
  - otherwise, the nearest ground-truth nucleus.
- Censor matched ground truth: Once a match is made, the corresponding ground-truth nucleus is marked as used and not considered for further matches.
- Class alignment: Check whether the prediction and ground-truth classes align.
  - If aligned, count as a True Positive (TP).
  - If not aligned, or if no match is found, count as a False Positive (FP).
- Remaining unmatched ground-truth nuclei are counted as False Negatives (FN).
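As a reference for the Task 1 metric, here is a minimal sketch of the micro Dice computation in Python with NumPy. Function and parameter names are ours, and the official evaluation repositories linked above remain authoritative.

```python
import numpy as np

def micro_dice(preds, targets, num_classes=6, ignore=(0,)):
    """Micro Dice: concatenate all masks, then average Dice over tissue classes.

    preds, targets: lists of integer label masks (one pair per case).
    ignore: labels excluded from the metric (0 = tissue_white_background).
    A sketch of the described procedure, not the official evaluation code.
    """
    p = np.concatenate([m.ravel() for m in preds])
    t = np.concatenate([m.ravel() for m in targets])
    scores = []
    for c in range(num_classes):
        if c in ignore:
            continue
        tp = np.sum((p == c) & (t == c))
        denom = np.sum(p == c) + np.sum(t == c)
        if denom > 0:
            scores.append(2.0 * tp / denom)
    return float(np.mean(scores)) if scores else 0.0
```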
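Similarly, the Task 2 matching and macro F1 procedure could look roughly like the sketch below. The data structures and the exact handling of censored predictions are assumptions on our part, not the official implementation.

```python
import numpy as np
from collections import defaultdict

def macro_f1(preds, gts, radius=15.0):
    """Greedy hit-criterion matching followed by macro F1 (a sketch).

    preds: list of (centroid_xy, class_name, confidence) tuples.
    gts:   list of (centroid_xy, class_name) tuples.
    """
    # Match highest-confidence predictions first.
    order = sorted(range(len(preds)), key=lambda i: -preds[i][2])
    used = [False] * len(gts)
    counts = defaultdict(lambda: [0, 0, 0])  # class -> [TP, FP, FN]

    for i in order:
        (px, py), pcls, _ = preds[i]
        # Nearest unmatched ground-truth nucleus within the 15-pixel radius.
        best_j, best_d = None, radius
        for j, ((gx, gy), gcls) in enumerate(gts):
            if used[j]:
                continue
            d = np.hypot(px - gx, py - gy)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is None:
            counts[pcls][1] += 1   # no match found -> FP
            continue
        used[best_j] = True        # censor matched ground truth
        if gts[best_j][1] == pcls:
            counts[pcls][0] += 1   # classes align -> TP
        else:
            counts[pcls][1] += 1   # class mismatch -> FP

    for j, (_, gcls) in enumerate(gts):
        if not used[j]:
            counts[gcls][2] += 1   # unmatched ground truth -> FN

    # Macro F1: average the per-class F1 scores.
    f1s = []
    for tp, fp, fn in counts.values():
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(f1s)) if f1s else 0.0
```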
## Final Ranking
Final rankings are based on the mean of each team's rank across both tasks: for example, a team ranked 2nd on Task 1 and 5th on Task 2 receives a mean rank of 3.5.
Good luck to all participants!