Custom Automatic Segmentation Models for Medicine and Biology Based on FastSAM
SANTIAGO PARAMÉS-ESTÉVEZ1,2, DIEGO PÉREZ-DONES3,4, IGNACIO REGO-PÉREZ5,
NATIVIDAD OREIRO-VILLAR5, FRANCISCO J. BLANCO5, JAVIER ROCA PARDIÑAS6,
GERMÁN GONZÁLEZ PAZÓ7, DAVID G. MÍGUEZ3,4, ALBERTO P. MUÑUZURI1,2
1Group of NonLinear Physics,
University of Santiago de Compostela,
Facultade de Física, Rúa Xosé María Suárez Núñez, s/n, 15782, Santiago de Compostela,
SPAIN
2Centro de Investigación e Tecnoloxía Matemática de Galicia, CITMAga,
Plaza do Obradoiro, Colexio de San Xerome, s/n, 15705, Santiago de Compostela,
SPAIN
3Departamento de Física de la Materia Condensada,
Universidad Autónoma de Madrid,
Avda. Reina Mercedes, s/n, 28792, Miraflores de la Sierra, Madrid,
SPAIN
4Centro de Biología Molecular Severo Ochoa,
Universidad Autónoma de Madrid,
Calle de Lavoisier, 4, 28049, Madrid,
SPAIN
5Servicio de Reumatología, GIR-INIBIC,
Hospital Universitario de A Coruña, Sergas, University of A Coruña,
As Xubias, 84, 15006 A Coruña,
SPAIN
6Department of Statistics and O.R. & SiDOR Group,
University of Vigo,
Faculty of Economic and Business Sciences, As Lagoas, Marcosende, 36310, Vigo,
SPAIN
7Healthcare Innovation Advisor,
Merasys,
Avenida Ramiro Pascual S/N- Nave C 36213, Vigo, Pontevedra,
SPAIN
Abstract: - FastSAM, a public image segmentation model trained on everyday images, is used to build customizable, state-of-the-art segmentation models with minimal training in two completely different scenarios. The first example considers macroscopic X-ray images of the knee area. In the second example, images of the volumetric zebrafish embryo retina were acquired by microscopy at a much smaller spatial scale. In both cases, we analyze the minimum set of images required to segment while keeping state-of-the-art standards. The effect of filters on the images and the specificities of considering a 3D volume for the retina images are also analyzed.
Key-Words: - Automatic segmentation, FastSAM, X-ray images, microscopy images, Low-Resource Friendly,
Generalizable Approach.
Received: April 19, 2024. Revised: October 6, 2024. Accepted: November 9, 2024. Published: December 13, 2024.
WSEAS TRANSACTIONS on BIOLOGY and BIOMEDICINE, Volume 21, 2024, pp. 373-384. DOI: 10.37394/23208.2024.21.38. E-ISSN: 2224-2902.
1 Introduction
Training and designing an image segmentation model from scratch is not accessible to everyone or to every project. In most cases, the obstacle is the impossibility of accessing large numbers of images. Building competent custom models requires experts, enormous amounts of data, computational power, and time. These resources are scarce for small companies and research groups, leading to an undesired imbalance in our globalized world.
It has been noted in the literature that, while large groups develop and extend existing ideas with their abundant resources, it is underfunded small groups that, proportionally, propose more of the groundbreaking ideas that revolutionize science, as studied in [1]. One of the main advantages of large teams is their access to big databases and computational resources. Achieving similar results with less data and computing power would greatly benefit small groups and, therefore, science.
Automatic segmentation is becoming increasingly relevant, specifically in medical and biological environments, where large amounts of images must be processed to screen patients or to monitor the evolution of biological experiments. Examples can be found in completely different fields ranging from cardiology, oncology, and radiology to biology in general and cell segmentation, as can be seen in studies [2], [3], [4], [5], [6]. A study with more details on the benefits of implementing automatic segmentation in existing workflows can be found in [7].
In the absence of public or commercial tools for a specific case study, this task tends, by default, to be performed manually, resulting in some cases in suboptimal health service or a reduction in the scope of the biological studies. In addition, useful data may be discarded instead of being used to train models that accelerate or even fully automate acquisition or labeling processes. With this work, we would like to help small groups develop their own custom models by showing the viability of our approach for creating performant models with as few resources as possible.
In this direction, several studies have been
conducted to apply recent general segmentation
models like the Segment Anything Model (SAM)
from MetaAI, described in [8]. In [9], SAM is
directly evaluated on tens of thousands of medical
images extracted from openly available datasets.
The model requires the user to specify points or
regions of interest to segment the desired object.
This general approach works very well in normal
photographs, where objects are usually easy to
identify. Nevertheless, for medical or biological
images, the result is not always as good as expected
or requires too much user input to be used in an
automated framework. The alternative is to finetune SAM; for that purpose, a dataset of 1.5 million images was developed in [10] and used in conjunction with 20 A100 GPUs to train and test MedSAM. This model is a finetuned version of SAM that achieves high performance at segmenting several kinds of medical conditions (tumors, cuts, dark spots, etc.) and image types (X-ray, CT, MRI, etc.). It should be noted that having access to all those resources is not trivial, even for big companies. Therefore, finding approaches that achieve similar results with a small fraction of that computational power and data is of great interest for reproducibility and future studies. That is why, in this study, we explore the possibilities of using FastSAM for similar purposes, since it is presented as a lighter alternative to SAM in [11].
This manuscript uses two completely different sets of images to prove the viability of this approach. On the one hand, knee X-ray images from the OAI (Osteoarthritis Initiative), presented in [12], and on the other, microscopy images of the 3D retina nuclei of a zebrafish embryo. In the first case, the bones observed are, in most cases, clearly separated from the surrounding tissue, while, in the second example, the objects to analyze are composed of a myriad of small objects, complicating the task of recognizing the ensemble for a non-trained eye.
Our approach consists of finetuning FastSAM with unconventional data to see whether its behavior can be generalized to more scientific, non-trivial settings. FastSAM can tell apart objects with clear boundaries, but subtler cases, which require instruction even for human eyes, are harder to solve.
To tackle this problem with as little data as possible, we trade FastSAM's generality for performance at finding a single type of object, drastically reducing the resources needed to achieve significant results.
In this work, we demonstrate how FastSAM, a
public image segmentation model trained on
everyday images, can be used to achieve a
customizable and state-of-the-art segmentation
model with very few resources.
The next section describes the images used and their specificities, and introduces the parameters that quantify the goodness of the training. The following section shows the results for the two cases considered, and the manuscript finishes with a discussion and conclusions section.
2 Background and Methodology
To approach the problem of giving small groups access to custom segmentation models, we moved away from conventional segmentation models such as U-Net, which are more challenging to implement and may require someone in the research group with experience in artificial intelligence, something not available to all small groups. Nevertheless, in [13], similar efficiencies for FastSAM and U-Net were observed when segmenting brain tissue (Dice score of ~0.95). That work relied on the general capabilities of the pre-trained FastSAM model to segment the brain visible during surgery. This was done by giving it an ROI (region of interest), which in their setting is easy to define since the camera is always fixed. If the camera were to move, this process would benefit from finetuning FastSAM, as we will show in this work.
Another alternative, as mentioned in the previous section, is to use SAM directly, but this also has disadvantages, such as how heavy the model is to train and evaluate. In [13], an implementation with SAM was also tested, but it was deemed impractical due to computational times much greater than the time needed for a manual segmentation. This shows the relevance of evaluation speed for our purpose: the models obtained must be fast to be useful.
FastSAM is intended to be a model capable of segmenting any object given a prompt from the user, just like SAM, except that FastSAM also allows text prompts. When FastSAM is fine-tuned to recognize only one class, named "object", human interaction is removed by simply prompting that keyword to obtain the segmentation.
All of this makes FastSAM a very powerful candidate for building custom, automatic segmentation models with few resources, as we will show in this section.
2.1 Model
FastSAM is a model based on YOLOv8, presented in [14], that has been trained with 2% of the SA-1B dataset. It achieves results similar to those of SAM, described in [8], but roughly 50 times faster. To finetune it, the datasets must be crafted in the COCO format, described in [15].
The model is designed so it can take images
with any aspect ratio. To do so, the shortest
dimension is padded to create a square image and
then scaled to the size specified when loading the
model to train or evaluate. This is particularly useful
for evaluating X-ray images from different
equipment, which may have varying resolutions and
shapes.
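As an illustration of this preprocessing, the following is a minimal sketch of the pad-to-square-and-resize step; the training framework performs an equivalent operation internally, and the function name and default size below are our own choices:

```python
import cv2
import numpy as np

def pad_to_square_and_resize(image: np.ndarray, target_size: int = 1536) -> np.ndarray:
    """Pad the shorter dimension to obtain a square image, then resize it."""
    h, w = image.shape[:2]
    side = max(h, w)
    # Distribute the padding evenly on both sides of the short dimension.
    top = (side - h) // 2
    bottom = side - h - top
    left = (side - w) // 2
    right = side - w - left
    squared = cv2.copyMakeBorder(image, top, bottom, left, right,
                                 borderType=cv2.BORDER_CONSTANT, value=0)
    return cv2.resize(squared, (target_size, target_size),
                      interpolation=cv2.INTER_LINEAR)
```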
The models were trained on a node with 5 NVIDIA A100 GPUs hosted by CESGA. All the models trained for this work were evaluated on a desktop computer with an Intel i5-13500 CPU (note that this is common equipment for any lab). The parameters chosen to train the models, together with approximate training and inference times, can be seen in Table 1.
Table 1. Summary of the parameters chosen to train models with each type of image. Image size has a noticeable impact on training time, but inference time per image is unchanged.

Model     Image Size    Batch    Train (min)    Evaluate (s/image)
Tibia     1536x1536     30       ~50            ~5
Retina    1024x1024     40       ~15            ~5
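As a hedged illustration of how such a finetuning run can be launched through the Ultralytics API on which FastSAM is built, a sketch using the tibia settings from Table 1 follows; the checkpoint file name (FastSAM-x.pt), the data.yaml describing the single-class dataset, the epoch count, and the example image name are assumptions of ours, not values reported in this work:

```python
from ultralytics import YOLO

# Load the pretrained FastSAM checkpoint (illustrative file name).
model = YOLO("FastSAM-x.pt")

# Finetune on the single-class dataset described by data.yaml,
# using the tibia configuration from Table 1 (1536x1536 images, batch 30).
model.train(data="data.yaml", epochs=100, imgsz=1536, batch=30)

# After finetuning to a single class, segmenting a new image reduces to a
# plain prediction call (the text prompt "object" mentioned above is an
# alternative route through FastSAM's prompt interface).
results = model.predict("knee_xray.png", imgsz=1536)
masks = results[0].masks  # predicted mask(s) for the single "object" class
```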
2.2 Data Acquisition and Preprocessing
We will be working with large-scale X-ray images and with microscopy images of a 3D object, to show the viability of this method independently of the application, the scale of the objects analyzed, etc.
X-ray data was acquired from the OAI, a database with more than 4000 patients whose images were taken periodically, over up to 10 years, with several devices at different hospitals. Each image contains two knees; to train the model, they were split into two images so that the model only sees one knee at a time. Originally, images were stored in 16-bit format but were later converted to 8-bit RGB grayscale images (the same value at each pixel for the three channels).
Suboptimal acquisition conditions (i.e., wrong placement of the patient in the X-ray machine, non-standard settings of the machine, presence of spurious objects, etc.) can lead to the appearance of fog and illumination gradients in X-ray images, hiding features that can be hard to see even under normal conditions. To aid in the manual segmentation of tibias, the knee images were enhanced with CLAHE (Contrast Limited Adaptive Histogram Equalization), which has been reported to be a good filter for medical images in [16], [17]. This filter reduces both effects, clarifying most of the cases.
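A sketch of this enhancement step using OpenCV's CLAHE implementation is shown below; the clip limit, tile size, and file names are illustrative defaults rather than the exact values used in this work:

```python
import cv2

# Load the 8-bit grayscale knee radiograph.
img = cv2.imread("knee_xray.png", cv2.IMREAD_GRAYSCALE)

# CLAHE: local histogram equalization with a cap on contrast amplification,
# which attenuates fog and smooth illumination gradients.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)

cv2.imwrite("knee_xray_clahe.png", enhanced)
```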
The segmented region of the tibia excludes areas that become visible when the knee is slightly rotated during the image acquisition. Under ideal conditions, the front and the back of the tibia's head align, showing a clear border in the image. When the bone is rotated (due to imperfections during the image acquisition process or to the non-standard shape of the patient's knee), the desired border becomes much less visible due to the superposition. The images were meticulously segmented in collaboration with a medical specialist to find this border. The process was repeated for the femur, but its discussion is omitted here, since the femur must be segmented as a whole and the problem is not as challenging (a summary is nevertheless presented in Appendix A).
To obtain the other set of images, on the microscopic scale, a transgenic zebrafish line was used that expresses a fusion protein composed of histone 2B and RFP, which labels all nuclei in the fish. Embryos were collected at around 20 hours post fertilization, based on visual inspection, as specified in [18], and immobilized for imaging in a 1.5% agarose gel matrix. Images were taken with a STELLARIS 8 confocal microscope coupled to a DMi8 inverted microscope (Leica), capturing the whole volume of the retina with a section thickness of 1 µm and an overlap between sections of 0.2 µm. Typically, 151 retina sections are taken at every instant of time, the first corresponding to the upper part of the retina and the following ones moving progressively deeper into it. Frames were taken with a 1 h time step between them. The white-light laser of the microscope was tuned to emit at a wavelength of 555 nm to maximize fluorophore excitation and emission.
Histone 2B is a well-characterized protein produced by all cells that works as a scaffold for chromatin packaging, as described in [19]. This protein, along with other histones, has been extensively fused to fluorescent proteins to label cell nuclei in in vivo tissues.
In our specific case, this histone has been fused with RFP, a fluorescent protein that emits reddish fluorescence when excited, described in [20]. The general idea of the process is that, as the H2B protein is expressed in all cells and located in their nuclei, when the fusion protein is exposed to a specific light wavelength (555 nm) the fluorophore emits fluorescence at a different wavelength (583 nm), which is then captured by the microscope camera. By this means, each H2B-RFP molecule present in the nuclei of the tissue emits a fluorescent signal that can be analyzed later.
In other words, the fusion protein provides two different things: H2B guarantees nuclear localization of the fluorescent protein, whereas RFP provides the signal that can be captured with a fluorescence microscope.
The animals are maintained and bred according to the protocols established in [21].
To improve cell nuclei visibility, images were also enhanced with a global histogram contrast equalization and then normalized. This effect is most noticeable at the bottom slices of each time frame, where light has to traverse more tissue and is attenuated by absorption and scattering, so less light reaches the microscope.
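A sketch of this preprocessing with OpenCV and NumPy, assuming 8-bit single-channel slices (the exact normalization used in this work may differ):

```python
import cv2
import numpy as np

# Load one retina slice as an 8-bit grayscale image.
slice_img = cv2.imread("retina_slice.png", cv2.IMREAD_GRAYSCALE)

# Global histogram equalization spreads the intensity range, which mostly
# helps the dim, deep slices where little light reaches the detector.
equalized = cv2.equalizeHist(slice_img)

# Normalize intensities to the [0, 1] range.
normalized = equalized.astype(np.float32) / 255.0
```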
2.3 Statistical and Control Parameters
To evaluate the performance of the trained models, we have chosen as evaluation parameters the precision (fraction of the prediction correctly guessed), the recall (fraction of the ground truth correctly predicted), and the Dice or F1 score, as seen in the literature [2], defined in equations (1) and (2),
Precision = TP / (TP + FP),  Recall = TP / (TP + FN),   (1)
Dice score = 2·TP / (2·TP + FP + FN),   (2)
where TP, TN, FP, and FN are the fractions of pixels classified as true positives, true negatives, false positives, and false negatives, respectively [2], [3].
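A pixel-wise implementation of these three metrics for a pair of binary masks can be as simple as the following NumPy sketch:

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """Precision, recall, and Dice score for two boolean masks of equal shape."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    dice = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn > 0 else 0.0
    return {"precision": precision, "recall": recall, "dice": dice}
```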
The performance of the retina models was also compared with the mean image brightness and with the number of retina nuclei in each slice.
The number of objects per slice was obtained using an in-house algorithm, described in [22] and based on the top-hat transform. To count the retina nuclei, the algorithm was fed with the images intersected with the ground truth, so that only the nuclei of interest are counted.
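The counting algorithm itself is described in [22]; purely to illustrate the general idea, a simplified top-hat-based counter might look as follows. The structuring-element size and the thresholding choice are illustrative assumptions, not those of [22], and the input is assumed to be the slice already intersected with the ground-truth mask:

```python
import cv2
import numpy as np

def count_nuclei(masked_slice: np.ndarray, kernel_size: int = 15) -> int:
    """Rough nucleus count on an 8-bit slice restricted to the ground-truth region."""
    # White top-hat: keep bright structures smaller than the structuring element,
    # removing the slowly varying background.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    tophat = cv2.morphologyEx(masked_slice, cv2.MORPH_TOPHAT, kernel)

    # Binarize (Otsu) and count connected components, excluding the background label.
    _, binary = cv2.threshold(tophat, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    num_labels, _ = cv2.connectedComponents(binary)
    return num_labels - 1
```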
2.4 Dataset Creation
All the segmentations were stored as binary images and converted to the COCO dataset format to train FastSAM; to generate masks with holes for the retina dataset, code from [23] was used.
A dataset consists of two folders, one with the images and the other with a text file per image, in which the contour of each mask is specified as one row per object. The first number of each row is the class associated with the object, while the rest are its contour points stored as alternating pairs of normalized x and y coordinates (x1, y1, x2, y2, …, xn, yn). In this work, training samples only have one object per image and models were trained with only one class. These choices greatly reduce the complexity of the task and the amount of information required to build a performant model.
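As a sketch of how a binary mask can be turned into one such label row (single class 0, largest outer contour only; masks with holes require the fuller conversion from [23]), one could write:

```python
import cv2
import numpy as np

def mask_to_label_row(mask: np.ndarray, class_id: int = 0) -> str:
    """Convert a binary mask to 'class x1 y1 x2 y2 ...' with normalized coordinates."""
    h, w = mask.shape[:2]
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)  # one object per image
    coords = []
    for point in contour.reshape(-1, 2):
        coords.extend([point[0] / w, point[1] / h])  # normalize x, y
    return f"{class_id} " + " ".join(f"{c:.6f}" for c in coords)
```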
The total number of segmented images is 166 tibias and 1960 retinas. Since the objective is to achieve usable models with a training set of images as small as possible, only some of those images were used for training, with progressively increasing numbers of training samples.
The number of samples used to train each model was increased progressively by adding more images to the original set (3, 15, 41, 83, and 125 for the tibia and 124, 248, and 372 for the retina). The data not included in the training sets was used to test all the models; keeping the evaluation set constant ensures that all the models are evaluated under the same conditions. FastSAM was finetuned with the original images as well as with their enhanced versions to study whether the improvement in visibility for the human eye also helps FastSAM achieve better results.
The combination of all these changes results in
a total of 16 different models, 10 for tibias and 6 for
retinas, that will be evaluated in the following
section.
3 Results
A total of 16 models were evaluated using their corresponding test datasets (10 for tibias and 6 for retinae). Dice score, recall, and precision values for each one can be found in Fig. 1. The mean performance of the tibia models is almost independent of whether the images were filtered and of the number of images used to train the model. This is not the case for the retina models, which seem to benefit from using enhanced images or more training data. More information can be found in Table 2, which lists the average of each of the distributions shown in Fig. 1; these parameters have also been calculated using the femur segmentations, as shown in Table A1 and Figure A1 in the Appendix, respectively. This gives a numerical reference of the performance of all the trained models. Increasing the number of training images has a slight positive effect on the performance of the tibia models, shown by a subtle increase in the average Dice score. The effect of enhancing the tibia images is almost imperceptible: with 3 training samples the performance decreases compared with the non-enhanced counterpart, but for the rest of the models the performance seems to be independent of the number of training samples.
A possible explanation for this phenomenon
could be that standardizing the brightness reduces,
even more, the complexity of the dataset, making
the gradual progression observed for the non-
enhanced case imperceptible. In other words, a
performant model is achieved faster. This effect is
also present in the retina models. The non-enhanced model with the fewest training images has a much worse performance than its enhanced counterpart, probably due to the diversity in brightness in the retina set of slices, which depends on depth and time of acquisition. This can be seen, for example, in Fig. 3a, where the image is brighter the closer it gets to the eye surface (slice 15) and darker at the deepest slices (slice 140). In contrast, all enhanced retina images (e.g., Fig. 3d) have similar brightness levels and therefore the models can generalize more easily.
Graphic comparisons between enhanced and non-enhanced models are shown in Fig. 2 for the tibia and, in the Appendix, in Figure A2 for the femur. Once again, the knee model performances are almost invariant, while the retina models seem to benefit from increasing the number of training images. This makes sense, since one frame (the whole 3D image of the retina) has 151 different slices; training with less data means more extrapolation to unseen slices and thus more failures. Also, the slices obtained for a retina are quite different depending on their position; therefore, a significant number of slices covering the whole retina needs to be used for training to achieve acceptable results. We observe that for all the cases considered the performance of the models is very high; only for the retina images does the performance improve significantly as the number of training images increases to 248 and beyond. Also, note that filtering the images to enhance the contrast did not result in any significant advantage (except for the model trained with 124 zebrafish images). To summarize, the performance of a retina model trained with few images is far worse than its analogue for tibias. This is related to the fact that each zebrafish image is taken at a different focal plane and thus effectively corresponds to a different object, as the retina geometry differs greatly.
All knee models have very similar and high performances, as exemplified in Fig. 2, where a slight improvement in the Dice score can be appreciated as the number of patients used for training increases. This is particularly relevant in the areas close to the boundary, which are better detected as the number of training images increases.
Since the most difficult section of the tibia is the head, to check for a bias in the scores due to the long straight bone regions, the scores were recalculated for each model using only the upper third of each segmentation. Since the results were almost identical to those shown in Fig. 1, they have been omitted.
The retina model trained with the fewest non-enhanced images had the lowest performance overall (Fig. 3a), with predicted masks that could not accurately find the retina nuclei. With more training images, the retina is properly recovered.
Fig. 1: Comparison of the performance of each model with enhanced and non-enhanced images. (a, b, c) Tibias. (d, e, f) Retinae. The distribution of values is shown with box plots, which mark the values from Q1 to Q3 with a box; the median is marked with a black vertical line in each box. Points found further than 1.5 times the box length are marked as outliers with small translucent circles. Models trained with Non-Enhanced (NE) images are shown in blue, and models trained with Enhanced (E) images in orange.
Table 2. Summary of the Dice, Recall (Rec.), and Precision (Prec.) average values achieved for each model.

                      Non Enhanced                Enhanced
Training images     Dice    Rec.    Prec.      Dice    Rec.    Prec.
Tibia
  3                  0.951   0.939   0.967      0.811   0.786   0.960
  15                 0.965   0.960   0.971      0.969   0.970   0.969
  41                 0.966   0.964   0.968      0.969   0.970   0.968
  83                 0.969   0.970   0.968      0.969   0.969   0.968
  124                0.970   0.972   0.968      0.969   0.972   0.967
Retina
  124                0.503   0.925   0.360      0.856   0.848   0.897
  248                0.843   0.856   0.872      0.870   0.868   0.886
  372                0.869   0.875   0.889      0.862   0.860   0.899
Fig. 2: Contours of ground truth (white) and predictions (red). Rows a), b), c), d), and e) correspond to models trained with the original images, while f), g), h), i), and j) to models trained with their enhanced counterparts. At the left of each row, the number of images used to finetune the applied model is shown.
Fig. 3: Contours of ground truth (white) and predicted masks for each model (red). To illustrate the performance of the models at three sections of the retina (beginning, middle, and end), the index of the corresponding slice is marked at the top of each column; the total number of slices per frame is 151. Rows a), b), and c) were evaluated with models trained on 124, 248, and 372 of the original images, respectively, while rows d), e), and f) with 124, 248, and 372 of the enhanced images.
It is important to note that, in this example, the retina is composed of many small-size objects (making it difficult for the non-trained eye to detect them), but the models can recover the structure with high precision when the number of training images is 248 or above.
As indicated in the methods section, the retinae correspond to a 3D structure; thus, for each experiment a stack of images is acquired, corresponding to different sections of the retina. Typically, between 100 and 200 sections are acquired; in the present case, 151 slices were captured at each instant of time. Fig. 4 shows the quality indicators for each slice, averaged over 11 different acquisitions, comparing enhanced and non-enhanced images. For each slice (corresponding to a particular section of the retina) we plot the Dice score, the average light intensity in the image, and the number of nuclei (the small objects that constitute the whole retina). Note that when non-enhanced images are used (Fig. 4a), the Dice score is not equally good at all depths; in particular, the first and last slices are poorly recovered. Nevertheless, when the initial enhancing filter is applied (Fig. 4b), the Dice score improves and the last slices are now well recovered. This indicates that when the number of constituents of the retina (nuclei) is large (as happens in the last slices), the contrast enhancement may play a significant role, while for the first slices, where the number of nuclei is small, increasing the contrast does not improve the result.
Fig. 4: Several retina image properties compared with the Dice score at each slice. The curves were obtained by averaging the values measured at each slice over 11 human-labeled frames; the shaded regions correspond to one standard deviation from the mean. Both models were trained with 372 images. Pixel intensity and number of cells were normalized. a) shows values obtained for non-enhanced images, while b) for their enhanced counterparts.
In this section, we have characterized the performance of the finetuned models for each dataset and how the variation of some of the parameters, such as the number of training samples or the preprocessing, affects their overall performance. We have also shown how our approach can be applied to both 2D and 3D images, very common types of data in medical and biological settings as well as in several branches of science and engineering. As a result, we have obtained very performant, specific, and light models that anyone with enough labeled images can reproduce.
4 Conclusion
We demonstrate in this manuscript that the use of a model trained with general images, such as FastSAM, makes it possible to achieve competitive results for specific non-trivial applications. We consider two examples with images acquired by completely different means and at different scales, corresponding to situations with different difficulties. In both cases the protocol succeeded, and the segmentation was possible even after training the models with a small number of images. Additional cases such as the femur (also present in the X-ray images) were also analyzed and successfully segmented (see results in Appendix A).
With the zebrafish retina, we have also shown that, for volumetric images, a single model is enough to obtain a performant segmentation, and that the images can be filtered to compensate for experimental acquisition conditions and improve the performance of the model.
This study also shows how finetuning FastSAM can be advantageous, since it gives a lot of flexibility in the format of the input data and is also faster and cheaper than finetuning SAM. It also requires less knowledge than training a U-Net (or finetuning one, if a compatible model is available) or other typical image segmentation models from the literature. All these factors prove once again that this approach is not only very viable for low-resource or non-expert groups but also useful to expand the frontiers of science, giving power to those who lack resources but exude innovation.
Following the approach presented in this
contribution, it is possible to envision a path to
develop equivalent tools for many diverse
applications in a great variety of systems such as the
ones presented here. We demonstrated that our
training mechanism achieves results with equivalent
accuracy to other non-pretrained methods, thus,
competing with state-of-the-art models by taking
advantage of all the information already embedded
in FastSAM.
Acknowledgement:
Model training was done at CESGA (Supercomputing Center of Galicia), and we acknowledge their support.
References:
[1] L. Wu, D. Wang, and J. A. Evans, “Large
teams develop and small teams disrupt
science and technology,” Nature, vol. 566,
no. 7744, pp. 378–382, Feb. 2019, doi:
10.1038/s41586-019-0941-9.
[2] B. Serrano-Antón, A. Otero-Cacho, D.
López-Otero, B. Díaz-Fernández, M. Bastos-
Fernández, V. Pérez-Muñuzuri, J. R.
González-Juanatey, and A. P. Muñuzuri,
“Coronary Artery Segmentation Based on
Transfer Learning and UNet Architecture on
Computed Tomography Coronary
Angiography Images,” IEEE Access, vol. 11,
pp. 75484–75496, 2023, doi:
10.1109/ACCESS.2023.3293090.
[3] S. P. Primakov, A. Ibrahim, J. E. van
Timmeren, G. Wu, S. A. Keek, M. Beuque,
R. W. Y. Granzier, E. Lavrova, M.
Scrivener, S. Sanduleanu, E. Kayan, I.
Halilaj, A. Lenaers, J. Wu, R. Monshouwer,
X. Geets, H. A. Gietema, L. E. L. Hendriks,
O. Morin, et al., “Automated detection and
segmentation of non-small cell lung cancer
computed tomography images,” Nat
Commun, vol. 13, no. 1, p. 3423, Jun. 2022,
doi: 10.1038/s41467-022-30841-3.
[4] P. Cheng, Y. Yang, H. Yu, and Y. He,
“Automatic vertebrae localization and
segmentation in CT with a two-stage Dense-
U-Net,” Sci Rep, vol. 11, no. 1, p. 22156,
Nov. 2021, doi: 10.1038/s41598-021-01296-
1.
[5] T. Piotrowski, O. Rippel, A. Elanzew, B.
Nießing, S. Stucken, S. Jung, N. König, S.
Haupt, L. Stappert, O. Brüstle, R. Schmitt,
and S. Jonas, “Deep-learning-based multi-
class segmentation for automated, non-
invasive routine assessment of human
pluripotent stem cell culture status,” Comput
Biol Med, vol. 129, p. 104172, Feb. 2021,
doi: 10.1016/j.compbiomed.2020.104172.
[6] C. Wen, M. Matsumoto, M. Sawada, K.
Sawamoto, and K. D. Kimura, “Seg2Link:
an efficient and versatile solution for semi-
automatic cell segmentation in 3D image
stacks,” Sci Rep, vol. 13, no. 1, p. 7109, May
2023, doi: 10.1038/s41598-023-34232-6.
[7] G. R. Sarria, F. Kugel, F. Roehner, J. Layer,
C. Dejonckheere, D. Scafa, M. Koeksal, C.
Leitzen, and L. C. Schmeel, “Artificial
Intelligence–Based Autosegmentation:
Advantages in Delineation, Absorbed Dose-
Distribution, and Logistics,” Adv Radiat
Oncol, vol. 9, no. 3, p. 101394, Mar. 2024,
doi: 10.1016/j.adro.2023.101394.
[8] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C.
Rolland, L. Gustafson, T. Xiao, S.
Whitehead, A. C. Berg, W.-Y. Lo, P. Dollár,
and R. Girshick, “Segment Anything,”
ArXiv, Apr. 2023.
[9] M. A. Mazurowski, H. Dong, H. Gu, J.
Yang, N. Konz, and Y. Zhang, “Segment
Anything Model for Medical Image
Analysis: an Experimental Study,” ArXiv,
Apr. 2023, doi:
10.1016/j.media.2023.102918.
[10] J. Ma, Y. He, F. Li, L. Han, C. You, and B.
Wang, “Segment anything in medical
images,” Nat Commun, vol. 15, no. 1, p. 654,
Jan. 2024, doi: 10.1038/s41467-024-44824-
z.
[11] X. Zhao, W. Ding, Y. An, Y. Du, T. Yu, M.
Li, M. Tang, and J. Wang, “Fast Segment
Anything,” ArXiv, Jun. 2023.
[12] G. Lester, “The Osteoarthritis Initiative: A
NIH Public–Private Partnership,” HSS
Journal, vol. 8, no. 1, pp. 62–63, Feb. 2012,
doi: 10.1007/s11420-011-9235-y.
[13] C. Li, X. Fan, R. B. Duke, K. L. Chen, L. T.
Evans, and K. D. Paulsen, “Intraoperative
stereovision cortical surface segmentation
using fast segment anything model,” in
Medical Imaging 2024: Image-Guided
Procedures, Robotic Interventions, and
Modeling, SPIE, Mar. 2024, p. 21. doi:
10.1117/12.3006873.
[14] G. Jocher, A. Chaurasia, and J. Qiu,
“Ultralytics YOLO,” GitHub. Github, 2023.
Accessed: May 29, 2024. [Online].
Available:
https://github.com/ultralytics/ultralytics
[15] T.-Y. Lin, M. Maire, S. Belongie, L.
Bourdev, R. Girshick, J. Hays, P. Perona, D.
Ramanan, C. L. Zitnick, and P. Dollár,
“Microsoft COCO: Common Objects in
Context,” ArXiv, May 2014.
[16] R. Sharma and A. Kamra, “A Review on
CLAHE Based Enhancement Techniques,”
in 2023 6th International Conference on
Contemporary Computing and Informatics
(IC3I), IEEE, Sep. 2023, pp. 321–325. doi:
10.1109/IC3I59117.2023.10397722.
[17] B. Kurt, V. V. Nabiyev, and K. Turhan,
“Medical images enhancement by using
anisotropic filter and CLAHE,” in 2012
International Symposium on Innovations in
Intelligent Systems and Applications, IEEE,
Jul. 2012, pp. 1–4. doi:
10.1109/INISTA.2012.6246971.
[18] C. B. Kimmel, W. W. Ballard, S. R. Kimmel,
B. Ullmann, and T. F. Schilling, “Stages of
embryonic development of the zebrafish,”
Developmental Dynamics, vol. 203, no. 3,
pp. 253–310, Jul. 1995, doi:
10.1002/aja.1002030302.
[19] S. P. Khare, F. Habib, R. Sharma, N.
Gadewal, S. Gupta, and S. Galande,
“HIstome--a relational knowledgebase of
human histone proteins and histone
modifying enzymes,” Nucleic Acids Res, vol.
40, no. D1, pp. D337–D342, Jan. 2012, doi:
10.1093/nar/gkr1125.
[20] A. Miyawaki, D. M. Shcherbakova, and V. V
Verkhusha, “Red fluorescent proteins:
chromophore formation and cellular
applications,” Curr Opin Struct Biol, vol. 22,
no. 5, pp. 679–688, Oct. 2012, doi:
10.1016/j.sbi.2012.09.002.
[21] M. Westerfield, The zebrafish book. A guide
for the laboratory use of zebrafish (Danio
rerio), 4th ed. Univ. of Oregon Press,
Eugene., 2000.
[22] M. Ledesma-Terrón, D. Pérez-Dones, D.
Mazó-Durán, and D. G. Míguez, “High-
throughput three-dimensional
characterization of morphogenetic signals
during the formation of the vertebrate
retina,” bioRxiv, Apr. 2024,
https://doi.org/10.1101/2024.04.09.588672.
[23] G. Jocher, “ultralytics/COCO2YOLO: Improvements.” Zenodo, May 11, 2019. doi: 10.5281/zenodo.2738322.
APPENDIX
The same techniques used on the tibia have
been applied to the femur, resulting in 10 additional
models evaluated under the same conditions.
Table A1. Summary of the average value of Dice score (Dice), recall (Rec.), and precision (Prec.) for each model.

                      Non Enhanced                Enhanced
Training images     Dice    Rec.    Prec.      Dice    Rec.    Prec.
Femur
  3                  0.883   0.846   0.978      0.864   0.828   0.979
  15                 0.973   0.977   0.969      0.971   0.972   0.970
  41                 0.974   0.975   0.973      0.974   0.975   0.973
  83                 0.974   0.976   0.973      0.974   0.975   0.973
  124                0.975   0.976   0.974      0.974   0.975   0.973
Fig. A1: Graphic summaries of all test images evaluated with all femur models. Models trained with Non-Enhanced (NE) images are shown in blue, and models trained with Enhanced (E) images in orange.
Given the nature of the femur segmentations, the problem is closer to what FastSAM was trained for: the segmentation of an object as a whole. As shown in Fig. A1, this translates into very good performance as soon as a few femurs are shown to the model, adjusting it to find this type of object while applying the same strategy used in its original training dataset, SA-1B. Also, no significant differences were found between models trained with the same amounts of the two types of images (enhanced and non-enhanced). In Table A1, a numerical representation of the performance of each femur model is shown through the average of the Dice score distributions presented in Fig. A1. As in the tibia case, the non-enhanced models gradually increase their performance with more training samples, while the enhanced version saturates faster, needing fewer samples to achieve similar results, probably because of a reduction in the complexity of the problem due to the standardization of illumination conditions and overall visibility.
In Fig. A2, examples of the performance of each femur model are shown. With 3 patients in the training dataset, the shape is almost completely captured, and from 15 training images onwards the predictions improve asymptotically.
Fig. A2: Contours of ground truth (white) and predictions (red). Rows a), b), c), d), and e) correspond to models trained with the original images, while f), g), h), i), and j) to models trained with their enhanced counterparts. At the left of each row, the number of images used to finetune the applied model is shown.
Contribution of Individual Authors to the
Creation of a Scientific Article (Ghostwriting
Policy)
Data curation, S.P.-E. and D.P-D.; Formal analysis,
S.P.-E., and D.P.-D.; Funding acquisition, F.J.B.,
J.R.P., G.G.P., D.G.M., and A.P.M.; Investigation,
S.P.-E., and D.P.-D.; Methodology, S.P.-E., D.P.-
D., and A.P.M.; Project administration, A.P.M.;
Resources, S.P.-E., D.P.-D., I.R.-P., N.O.-V.,
D.G.M. and, A.P.M.; Software, S.P.-E., and D.P.-
D.; Supervision, A.P.M.; Validation, S.P.-E., and
D.P.-D.; Visualization, S.P.-E.; Writing—original
draft, S.P.-E., and D.P.-D.; Writing—review and
editing, A.P.M. All authors have read and agreed to the published version of the manuscript.
Sources of Funding for Research Presented in a
Scientific Article or Scientific Article Itself
We acknowledge financial support under grant PID2022-138322OB-100 funded by MCIN/AEI and by “ERDF A way of making Europe”. Xunta de Galicia also funded this research under Research Grant No. 2021-PG036. We also acknowledge the research network RED2022-134573-T, funded by Ministerio de Ciencia e Innovación (MCIN/AEI/10.13039/501100011033) and by “ERDF: A way of making Europe”, European Union. Finally, we acknowledge project PMPTA22/00115 from the ISCIII, Madrid, Spain.
Conflict of Interest
We declare no conflict of interest.
Creative Commons Attribution License 4.0
(Attribution 4.0 International, CC BY 4.0)
This article is published under the terms of the
Creative Commons Attribution License 4.0
https://creativecommons.org/licenses/by/4.0/deed.en_US