Evaluation Process

The PUMA Challenge consists of two evaluation tracks.

For both tracks, submissions must include algorithms for Task 1 (tissue segmentation) and Task 2 (nuclei segmentation). Example evaluation code, JSON annotations, and TIFF annotations are available at:

  • PUMA Challenge Evaluation - Track 1
  • PUMA Challenge Evaluation - Track 2
  • PUMA Challenge Baseline - Track 1
  • PUMA Challenge Baseline - Track 2

The prediction outputs should include:

  • Task 1: One .tif file with segmentation results and metadata, including XResolution, YResolution, SMinSampleValue (excluding background), and SMaxSampleValue.
  • Task 2: One .json file containing nuclei segmentation results.
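For illustration, here is a minimal, unofficial sketch of writing the Task 1 .tif with the tifffile library. The file name, image size, resolution values, and exact tag encodings are assumptions; consult the example repositories above for the authoritative output format.

```python
# Minimal sketch (not the official submission code) of writing the Task 1 output
# with tifffile. Shapes, resolution values, and tag values are illustrative only.
import numpy as np
import tifffile

segmentation = np.zeros((1024, 1024), dtype=np.uint8)  # per-pixel tissue labels

tifffile.imwrite(
    "tissue_segmentation.tif",
    segmentation,
    resolution=(300, 300),  # writes XResolution / YResolution (assumed values)
    extratags=[
        # (tag code, dtype, count, value, writeonce)
        (340, "H", 1, (1,), False),  # SMinSampleValue: lowest label, background excluded
        (341, "H", 1, (5,), False),  # SMaxSampleValue: highest label
    ],
)
```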

The labels in the tissue segmentation output should follow the class map:

Class                      Label
tissue_white_background    0
tissue_stroma              1
tissue_blood_vessel        2
tissue_tumor               3
tissue_epidermis           4
tissue_necrosis            5
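For convenience, the same class map expressed as a Python dictionary (the variable name is illustrative):

```python
# Task 1 tissue class map; label values taken from the table above.
TISSUE_CLASS_MAP = {
    "tissue_white_background": 0,
    "tissue_stroma": 1,
    "tissue_blood_vessel": 2,
    "tissue_tumor": 3,
    "tissue_epidermis": 4,
    "tissue_necrosis": 5,
}
```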

Ranking Metrics

Submissions are evaluated on two metrics:

  • Task 1 (Tissue Segmentation): The micro Dice score is calculated by concatenating all segmentation results along one axis and then averaging the Dice score across all tissue classes; tissue_white_background is excluded from the metric calculation. A sketch of this computation follows this list.

  • Task 2 (Nuclei Segmentation): The macro F1 score is determined using a hit criterion based on confidence score and centroid distance for each nucleus class; a matching sketch also follows this list. The evaluation process is as follows:

  1. Extract annotations: Nuclei predictions and ground-truth nuclei are extracted from JSON files, with centroids calculated for each polygon.
  2. Filter predictions: Predictions outside a 15-pixel radius of any ground-truth nucleus are censored; only predictions within this distance are considered for further matching.
  3. Match predictions to ground-truth nuclei based on:
    • The highest confidence score, if available.
    • Otherwise, the nearest ground-truth nucleus.
  4. Censor matched ground truth: Once a match is made, the corresponding ground-truth nucleus is marked as used and not considered for further matches.
  5. Class alignment: Check if the prediction and ground-truth nuclei classes align.
    • If aligned, count as a True Positive (TP).
    • If not aligned or no match is found, count as a False Positive (FP).
  6. Remaining unmatched ground-truth nuclei are counted as False Negatives (FN).
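To make the Task 1 metric concrete, here is a minimal, unofficial sketch of a micro Dice computation over concatenated label masks with the background class excluded. The array shapes, the concatenation axis, and the handling of classes absent from both masks are assumptions; refer to the evaluation repository for the exact implementation.

```python
# Unofficial sketch of the Task 1 micro Dice score (assumed shapes and edge cases).
import numpy as np

def micro_dice(predictions, ground_truths, num_classes=6, background_label=0):
    """Average Dice over tissue classes, computed on concatenated masks."""
    # Concatenate all cases along one axis so the score is "micro" over pixels.
    pred = np.concatenate([p.ravel() for p in predictions])
    gt = np.concatenate([g.ravel() for g in ground_truths])

    scores = []
    for label in range(num_classes):
        if label == background_label:
            continue  # tissue_white_background is excluded from the metric
        p = pred == label
        g = gt == label
        denom = p.sum() + g.sum()
        if denom == 0:
            continue  # class absent in both prediction and ground truth (assumed handling)
        scores.append(2.0 * np.logical_and(p, g).sum() / denom)
    return float(np.mean(scores)) if scores else 0.0
```

Likewise, the Task 2 hit criterion can be illustrated with a rough centroid-matching sketch. The input data structures, the tie-breaking order, and how censored (out-of-radius) predictions are counted are assumptions rather than the official evaluation logic.

```python
# Rough, unofficial sketch of the Task 2 hit criterion (15-pixel centroid distance).
import numpy as np

def match_nuclei(preds, gts, radius=15.0):
    """preds/gts: lists of dicts with 'centroid' (x, y), 'class', and optional 'score'."""
    tp = fp = 0
    used = [False] * len(gts)

    # Consider high-confidence predictions first when scores are available.
    preds = sorted(preds, key=lambda p: p.get("score", 0.0), reverse=True)

    for pred in preds:
        # Distances from this prediction's centroid to every ground-truth centroid.
        dists = [np.hypot(pred["centroid"][0] - gt["centroid"][0],
                          pred["centroid"][1] - gt["centroid"][1]) for gt in gts]
        # Candidates: unused ground-truth nuclei within the 15-pixel radius.
        candidates = [i for i, d in enumerate(dists) if d <= radius and not used[i]]
        if not candidates:
            fp += 1  # no match found
            continue
        best = min(candidates, key=lambda i: dists[i])  # nearest ground truth
        used[best] = True  # censor the matched ground truth
        if gts[best]["class"] == pred["class"]:
            tp += 1  # classes align
        else:
            fp += 1  # matched, but classes do not align
    fn = used.count(False)  # remaining unmatched ground-truth nuclei
    return tp, fp, fn
```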

Final Ranking

Final rankings are based on the mean of each submission's ranks across the two tasks.

Good luck to all participants!