Ovarian Cancer Classification: Lessons Learned in a Kaggle Competition
Late last year, I decided to dive headfirst into a Kaggle competition hosted by UBC. The mission: build a model to classify subtypes of ovarian cancer from histopathological images and detect outliers. It had been a while since I'd done real nitty-gritty machine learning work, so this was simultaneously exhilarating and daunting. I ended up snagging 22nd place out of 1,327 teams, which I'm certainly proud of!
I learned a ton during this process, and I want to share those insights with all of you. At the end of this post, I'll lay out a few key guidelines I wish I had known when I started. But before we get there, let me walk you through the journey of this particular competition and some of the things I tried.
BTW, if you want to check out all my code, here's the repo containing my raw workspace from the end of the competition: https://github.com/prestoj/kaggle-ubc-ocean
My journey through this competition
Data Preprocessing
The first thing I noticed about the dataset was that it was huge. Like, 700 gigabytes huge, larger than I could fit on my computer huge. Despite this, there were only 538 distinct samples. Microscopic imaging is no joke, apparently.
I knew I had to make these images friendlier to work with, so the first order of business was to cut up each image into a bunch of "tiles". Through some quick experimentation, I found that the exact tile size didn't matter too much. I used 768x768 tiles for most of my models.
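Here's a minimal sketch of the kind of tiling I mean, using pyvips, which can read regions of very large images without decoding the whole slide into RAM. The paths, function name, and background-filtering threshold are illustrative rather than my exact pipeline:

```python
import os
import pyvips

TILE_SIZE = 768

def tile_slide(image_path: str, out_dir: str) -> None:
    os.makedirs(out_dir, exist_ok=True)
    slide = pyvips.Image.new_from_file(image_path)
    # Walk the slide in non-overlapping TILE_SIZE x TILE_SIZE windows.
    for y in range(0, slide.height - TILE_SIZE + 1, TILE_SIZE):
        for x in range(0, slide.width - TILE_SIZE + 1, TILE_SIZE):
            tile = slide.crop(x, y, TILE_SIZE, TILE_SIZE)
            # Skip tiles that are mostly white background (threshold chosen by eye).
            if tile.avg() > 240:
                continue
            tile.write_to_file(os.path.join(out_dir, f"tile_{x}_{y}.png"))
```

Throwing away the near-empty background tiles keeps the downstream dataset a manageable size and stops the model from wasting capacity on blank glass.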
Modeling
The goal of modeling in this competition was to take the tiles and map them to the 5 possible classes: HGSC, CC, EC, LGSC, or MC. This is one of the most common problems in computer vision, so there's a lot of research I could draw on. And I really did try everything under the sun: training my own ViT from scratch, fine-tuning a kajillion different pre-trained models, and building my own reimplementation of DINO.
There's a lot that could be said about each of these approaches. Instead of going into detail on every one, I think it's more productive to share a few things I learned:
- There wasn't enough data to train a model from scratch -- 538 samples just isn't enough, even when each of those samples holds over a gigabyte of pixels. Vision models trained from scratch need far more independent examples than that.
- Pre-trained models are VERY good, even if your dataset is out of distribution (see the sketch after this list).
- Training algorithms with funky learning dynamics, like DINO, are so much fun to implement. But they're hard to debug. Understand this trade-off if you want to go down this path.
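To make the pre-trained-model point concrete, here's a rough sketch of the kind of fine-tuning setup I'm talking about, using timm. The backbone name is just an example, not necessarily what I ended up with, and the tiles get resized down to the model's expected input resolution:

```python
import timm
import torch

CLASSES = ["HGSC", "CC", "EC", "LGSC", "MC"]

# Pre-trained ImageNet weights with a freshly initialized 5-way classification head.
model = timm.create_model(
    "vit_base_patch16_224", pretrained=True, num_classes=len(CLASSES)
)

# Optionally freeze most of the backbone at first and train only the head,
# which helps when the labeled dataset is this small.
for name, param in model.named_parameters():
    if "head" not in name:
        param.requires_grad = False
```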
Things I wish I knew before I started
I'm going to keep this super brief because the lessons really are that simple:
- Unless you have a very large dataset, fine-tune a pre-trained model. This Hugging Face list should be your jumping-off point.
- Use RandAugment and label smoothing to avoid overfitting.
- Use an exponential moving average (EMA) of your model parameters to smooth everything out (all three tips are sketched below).
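Here's a hedged sketch of how those three tips can fit into an ordinary PyTorch training loop. `model` and `train_loader` are assumed to already exist (for example, from the fine-tuning sketch above), and the hyperparameters are illustrative rather than the values I actually tuned:

```python
import torch
from torch import nn
from torchvision import transforms

# RandAugment applies a random augmentation policy to each training tile
# (this transform would be plugged into the Dataset, not shown here).
train_transform = transforms.Compose([
    transforms.RandAugment(),
    transforms.ToTensor(),
])

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # softened targets
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Maintain an exponential moving average of the weights; evaluate with ema_model.
ema_model = torch.optim.swa_utils.AveragedModel(
    model, avg_fn=lambda ema, cur, num_averaged: 0.999 * ema + 0.001 * cur
)

for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    ema_model.update_parameters(model)
```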
That's it. Those will get you 90% of the way there; the remaining 10% is up to you to figure out.
Conclusion
For me, the journey of solving problems as complex as this is as valuable as the final outcome -- I found the whole process incredibly rewarding. Even though I didn't finish in the top 5 to claim the competition's prize, I thoroughly enjoyed myself and learned so much. I can't wait to solve more problems like this in the future :)