Ovarian Cancer Classification: Lessons Learned in a Kaggle Competition
Late last year, I decided to dive headfirst into a Kaggle competition hosted by UBC. The mission: build a model to classify subtypes of ovarian cancer from histopathological images and detect outliers. It had been a while since I'd done real nitty-gritty machine learning work, so this was simultaneously exhilarating and daunting. I ended up snagging 22nd place out of 1,327 teams, which I'm certainly proud of!
I learned a ton during this process, and I want to share those insights with all of you. At the end of this post, I'll lay out a few key guidelines I wish I had known when I started. But before we get there, let me walk you through the journey of this particular competition and some of the things I tried.
BTW, if you want to check out all my code, here's the repo containing my raw workspace from the end of the competition: https://github.com/prestoj/kaggle-ubc-ocean
My journey through this competition
Data Preprocessing
The first thing I noticed about the dataset was that it was huge. Like, 700 gigabytes huge, larger than I could fit on my computer huge. Despite this, there were only 538 distinct samples. Microscopic imaging is no joke, apparently.
I knew I had to make these images friendlier to work with, so the first order of business was to cut up each image into a bunch of "tiles". Through some quick experimentation, I found that the exact tile size didn't matter too much. I used 768x768 tiles for most of my models.
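Here's a minimal sketch of the kind of tiling I mean, using pyvips, which can read regions of very large images without decoding the whole slide into RAM. The paths, function name, and background-filtering threshold are illustrative rather than my exact pipeline:

```python
import os
import pyvips

TILE_SIZE = 768

def tile_slide(image_path: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    slide = pyvips.Image.new_from_file(image_path)
    # Walk the slide in non-overlapping TILE_SIZE x TILE_SIZE windows.
    for y in range(0, slide.height - TILE_SIZE + 1, TILE_SIZE):
        for x in range(0, slide.width - TILE_SIZE + 1, TILE_SIZE):
            tile = slide.crop(x, y, TILE_SIZE, TILE_SIZE)
            # Skip tiles that are mostly white background (threshold chosen by eye).
            if tile.avg() > 240:
                continue
            tile.write_to_file(os.path.join(out_dir, f"tile_{x}_{y}.png"))
```

Throwing away the near-empty background tiles keeps the downstream dataset a manageable size and stops the model from wasting capacity on blank glass.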
Modeling
The goal of modeling in this competition was to take the tiles and map them to the 5 possible classes: HGSC, CC, EC, LGSC, or MC. This is one of the most common problems in computer vision, so there's a lot of research I could draw on. And I really did try everything under the sun: training my own ViT from scratch, fine-tuning a kajillion different pre-trained models, and building my own reimplementation of DINO.
There's a lot that could be said about each of these approaches. Instead of going into detail on every one, I think it's more productive to share a few things I learned:
- There wasn't enough data to train a model from scratch -- 538 samples just isn't enough, even when each of those samples holds over a gigabyte of pixels. Vision models trained from scratch need far more independent examples than that.
- Pre-trained models are VERY good, even if your dataset is out of distribution (see the sketch after this list).
- Training algorithms with funky learning dynamics, like DINO, are so much fun to implement. But they're hard to debug. Understand this trade-off if you want to go down this path.
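To make the pre-trained-model point concrete, here's a rough sketch of the kind of fine-tuning setup I'm talking about, using timm. The backbone name is just an example, not necessarily what I ended up with, and the tiles get resized down to the model's expected input resolution:

```python
import timm
import torch

CLASSES = ["HGSC", "CC", "EC", "LGSC", "MC"]

# Pre-trained ImageNet weights with a freshly initialized 5-way classification head.
model = timm.create_model(
    "vit_base_patch16_224", pretrained=True, num_classes=len(CLASSES)
)

# Optionally freeze most of the backbone at first and train only the head,
# which helps when the labeled dataset is this small.
for name, param in model.named_parameters():
    if "head" not in name:
        param.requires_grad = False
```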
Things I wish I knew before I started
I'm going to keep this super brief because the lessons really are that simple:
- Unless you have a very large dataset, fine-tune a pre-trained model. This Hugging Face list should be your jumping-off point.
- Use RandAugment and label smoothing to avoid overfitting.
- Use an exponential moving average (EMA) of your model parameters to smooth everything out (all three tips are sketched below).
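Here's a hedged sketch of how those three tips can fit into an ordinary PyTorch training loop. `model` and `train_loader` are assumed to already exist (for example, from the fine-tuning sketch above), and the hyperparameters are illustrative rather than the values I actually tuned:

```python
import torch
from torch import nn
from torchvision import transforms

# RandAugment applies a random augmentation policy to each training tile
# (this transform would be plugged into the Dataset, not shown here).
train_transform = transforms.Compose([
    transforms.RandAugment(),
    transforms.ToTensor(),
])

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # softened targets
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Maintain an exponential moving average of the weights; evaluate with ema_model.
ema_model = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda ema, cur, num_averaged: 0.999 * ema + 0.001 * cur
)

for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    ema_model.update_parameters(model)
```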
That's it. Those will get you 90% of the way there; the remaining 10% is up to you to figure out.
Conclusion
For me, the journey of solving problems as complex as this is as valuable as the final outcome -- I found the whole process incredibly rewarding. Even though I didn't finish in the top 5 to claim the competition's prize, I thoroughly enjoyed myself and learned so much. I can't wait to solve more problems like this in the future :)