This is a submission for the Cloudflare AI Challenge.
What I Built
You're probably familiar with the phrases "an image tells a thousand words" and "music is a way to express yourself". I combined the core ideas of both and came up with ImageHarmoni, an AI tool that uses Cloudflare Workers AI with the Llama and Mistral models to generate lyrics for a user-selected genre based on a user-provided image.
The Whisper model is also used, but only for some additional fun. 🤫
Demo
The GIF is sped up, since generation takes around 30 seconds.
My Code
Used Tools, Components and Inspiration
Dark Mode Toggle from Justin Schroeder - CodePen
Color Themes for light and dark mode - Realtime Colors
Inspiration for moving emojis in background - CSS glassmorphism generator
Journey
Since I had never worked with any AI models before and the Cloudflare environment was also completely new to me, I searched for some beginner guidance and luckily stumbled upon this amazing Cloudflare Workers tutorial on YouTube.
The tutorial explains the difference between the user and system roles and how you can use them. Basically, the system content tells the model how it should respond to the user request. This was the point where I came up with my project idea, because my genre selection uses exactly this behavior.
For example, if you select the classical music genre, the llama-2 model is told that it is now Wolfgang Amadeus Mozart and that its task is to create a classical song. I also added the three adjectives that best fit the selected genre to get somewhat better lyrics.
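To make the role split a bit more concrete, here is a minimal TypeScript sketch of how such a genre persona could be turned into a system message. It is not the project's actual code: the names `GenrePersona`, `PERSONAS` and `buildLyricMessages`, the exact prompt wording and the three adjectives are all assumptions for illustration. Only the Mozart persona itself comes from the description above.

```typescript
// Hypothetical types and names; the project's real code may look different.
interface ChatMessage {
  role: "system" | "user";
  content: string;
}

interface GenrePersona {
  persona: string;                      // who the model should pretend to be
  songType: string;                     // what kind of song it should write
  adjectives: [string, string, string]; // three adjectives describing the genre
}

// Assumed example values; only the Mozart persona is mentioned in the post.
const PERSONAS: Record<string, GenrePersona> = {
  classical: {
    persona: "Wolfgang Amadeus Mozart",
    songType: "classical song",
    adjectives: ["elegant", "harmonious", "timeless"],
  },
};

// The system message sets the persona and task; the user message carries the request.
function buildLyricMessages(genre: string, userRequest: string): ChatMessage[] {
  const p = PERSONAS[genre];
  if (!p) {
    throw new Error(`Unknown genre: ${genre}`);
  }
  const system =
    `You are ${p.persona}. Your task is to write a ${p.songType} ` +
    `that is ${p.adjectives.join(", ")}.`;
  return [
    { role: "system", content: system },
    { role: "user", content: userRequest },
  ];
}
```

An array like this is the kind of `messages` input that chat-style text generation models accept; switching the genre then only swaps the system message, while the user request stays the same.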
Now I only needed a fitting user request to generate lyrics with. Since a simple input directly from the user would be a bit boring, I looked through the models available for the challenge and saw that there were also two that could scan images for their content. So I had a complete concept for my tool.
I tested a bit with the resnet-50 model and mistral and ended up using mistral, because for my case it gave the better description of the provided images. Now, I could have taken the short route and given the image description directly to llama-2 to generate the lyrics, but the results, to put it gently, left room for improvement.
Therefore, I used llama-2 a second time, this time with the task of extracting all persons, objects, emotions, moods ... the stuff you need for lyrics, from the given image description. These extracted keywords are then fed to the llama-2 call that is responsible for the lyric generation itself.
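The two-step chain described above could be sketched roughly like this in TypeScript. This is only an illustration under assumptions: `RunModel` is a hypothetical stand-in for whatever function actually calls the model, `descriptionToLyrics` is a made-up name, the `"llama-2"` label is just a placeholder for the real model identifier, and the prompt texts are invented.

```typescript
// Hypothetical abstraction over a model call: takes a model label, a system
// prompt and a user prompt, and resolves to the model's text answer.
type RunModel = (model: string, system: string, user: string) => Promise<string>;

// Chains two calls: first extract keywords from the image description,
// then generate lyrics from those keywords using the genre's system prompt.
async function descriptionToLyrics(
  run: RunModel,
  imageDescription: string,
  genreSystemPrompt: string,
): Promise<string> {
  // Step 1: reduce the description to lyric-relevant keywords.
  const keywords = await run(
    "llama-2",
    "Extract all persons, objects, emotions and moods from the text. " +
      "Answer with a comma-separated list only.",
    imageDescription,
  );

  // Step 2: generate the lyrics from the keywords instead of the raw description.
  return run(
    "llama-2",
    genreSystemPrompt,
    `Write song lyrics using these keywords: ${keywords}`,
  );
}
```

Passing the model call in as a parameter keeps the chain easy to try out with a fake runner, which is handy when every real request costs neurons.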
And then there is the story of the whisper model. If you open ImageHarmonie, the UI experts among you will probably wonder why I chose a 3 by 3 grid for the genre selection when you can only select 8 genres. But when night falls (toggle dark mode) and you look closely, you can see a very cool vegetable that asks you something about fruits, and to answer him you have to talk. Depending on your response, the 3 by 3 grid in the genre selection actually makes sense.
When you talk with the vegetable, I use the whisper model to transcribe the user input. At first I only checked whether the needed code word appears in the input, but if the user says something like "no code word", a pure presence check would still trigger. So it was once again time for llama-2 to shine. This time, the task was to determine whether the user input was pro or contra code word. (Unfortunately, the pro/contra determination is not optimal, so I have left the evaluation in as a console.log to show how llama-2 decided.)
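The problem with the presence-only check is easy to show in a few lines of TypeScript. This is a sketch, not the project's code: the function names and the prompt wording are assumptions, and the literal string "code word" simply stands in for the real secret.

```typescript
// Naive check: true whenever the code word appears anywhere in the transcript.
function containsCodeWord(transcript: string, codeWord: string): boolean {
  return transcript.toLowerCase().includes(codeWord.toLowerCase());
}

// Hypothetical system prompt asking a model to judge the stance instead.
function buildStancePrompt(codeWord: string): string {
  return (
    `Decide whether the user is for or against "${codeWord}". ` +
    `Answer with exactly one word: pro or contra.`
  );
}

// Turns the model's free-text answer into a strict result.
// Anything that is not clearly pro or contra is treated as unknown.
function parseStance(answer: string): "pro" | "contra" | "unknown" {
  const normalized = answer.trim().toLowerCase();
  if (normalized.startsWith("pro")) return "pro";
  if (normalized.startsWith("contra")) return "contra";
  return "unknown";
}

// Both transcripts pass the naive check, although only the first one is "pro".
const positive = containsCodeWord("I say the code word", "code word");
const negated = containsCodeWord("no code word", "code word");
```

Here `positive` and `negated` are both `true`, which is exactly the false trigger described above. Parsing the model's answer strictly, with an `unknown` fallback, is one way to avoid unlocking anything when the model's reply is unclear.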
While waiting for the used neurons to reset (since testing whether the lyrics were somewhat okay-ish took quite a few requests), I styled the site a bit with good old CSS, and to make the tool usable on the go I made it responsive with media queries. Because the deadline was quite short, I took no time to look further into Pages or how to use Workers with a framework like Angular, so the complete code is basically in one file. Sorry to all clean code enthusiasts, I vow to do better in the future.
Multiple Models and/or Triple Task Types
llama-2:
- lyric generation based on selected genre
- extracting keywords from image description
- check whether input is pro or contra code word
mistral:
- generating image description
whisper:
- listening to user for code word input