Multi-Modal Model Based on Distilled DeepSeek R1

This is just a note for my own study. It is unrelated to any of my actual work. No proprietary knowledge is used.