In the past I worked on a project called Spoken Wardrobe. The concept was a camera display where the user could speak to their on-screen reflection, and whatever they said would be used as a prompt to Stable Diffusion to create an AI-generated inpainting of the person wearing whatever they wished for.

Why?
This project relied on a proxy API hosted on Glitch, made by Dan O'Sullivan. Unfortunately, that API recently stopped working, so I thought I could update my project to use transformers.js instead.
Whisper
- The first part of the project involved using Whisper to convert speech to text, which would then be used to prompt Stable Diffusion.
- I found the Whisper sketch built with p5.js and used it as the basis for incorporating and loading the Whisper model into my code.
- Since I was updating my code anyway, I thought I might as well try to improve my audio-to-text transcription.
- My current version of the code used a fairly rudimentary system: every 3 seconds, the Whisper model was prompted with whatever audio blob had been recorded since the last transcription request. This was a simple way to create the illusion of consistent, constant transcription (a minimal sketch of this setup appears after this list).
- Looking at the example shown in Joshua’s slides, I really wanted to update my code to produce the same kind of smooth, continuous transcription shown there:
- Reference example: https://hf.co/spaces/Xenova/realtime-whisper-webgpu (screen recording: Screen Recording 2025-03-21 at 12.13.52 AM.mov)
- My first attempts at even loading the transformers.js model in my code were failing for some reason; nothing was being transcribed. So I first tried it out on the original sketch.js example.
- Somehow even that was not working anymore. No matter what I said, it just transcribed the word “you”, and I was puzzled for a long while. I tried changing the model: nothing. I tried altering how the audio was being recorded: nothing. I was nearly out of patience and about to call this a complete failure when I ran the project one last time and it suddenly worked! It turns out it only works when my AirPods mic is connected, for reasons I still don't understand. Very peculiar behavior. Nonetheless, the amount of time I spent talking to my computer with absolutely no results was quite astonishing.
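To make the approach described above concrete, here is a minimal sketch of loading the Whisper pipeline with transformers.js and driving it with my naive "transcribe whatever was recorded every 3 seconds" scheme. The model checkpoint (Xenova/whisper-tiny.en) and the MediaRecorder-based capture are assumptions for illustration, not necessarily what my project actually uses.

```js
// Minimal sketch: load Whisper via transformers.js and transcribe the audio
// recorded since the last request, once every 3 seconds.
// Assumptions: the Xenova/whisper-tiny.en checkpoint and MediaRecorder capture.
import { pipeline } from '@xenova/transformers';

let transcriber;
let recorder;
let audioCtx;
let chunks = [];

async function init() {
  // Load the speech-to-text pipeline once, up front.
  transcriber = await pipeline(
    'automatic-speech-recognition',
    'Xenova/whisper-tiny.en'
  );

  audioCtx = new AudioContext({ sampleRate: 16000 });

  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.start();

  // Every 3 seconds, restart the recorder and transcribe what was captured.
  setInterval(transcribeLatest, 3000);
}

async function transcribeLatest() {
  // Stopping the recorder flushes a final dataavailable event before onstop.
  const stopped = new Promise((resolve) => (recorder.onstop = resolve));
  recorder.stop();
  await stopped;

  const blob = new Blob(chunks, { type: recorder.mimeType });
  chunks = [];
  recorder.start(); // immediately resume recording for the next window

  // Decode the blob into 16 kHz mono samples, which is what Whisper expects.
  const decoded = await audioCtx.decodeAudioData(await blob.arrayBuffer());
  const { text } = await transcriber(decoded.getChannelData(0));
  console.log(text); // in the project this text becomes the Stable Diffusion prompt
}

init();
```

This reproduces the old behavior, flaws included: if a transcription takes longer than 3 seconds, requests start to pile up, which is exactly what the worker-based example discussed below avoids.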


- Finally, I felt I could start moving forward. Once I had gotten it to work in my p5.js sketch example and realized that my mic was the issue, it also suddenly worked in my main project codebase.
- This is when I began to encounter many more errors!
- Most of these errors originated from the rather archaic method I was using to simulate constant transcription (i.e. constantly starting and stopping the transcription).
- I began looking more carefully at how the example Joshua provided works and realized that it uses a worker.js script.
- I have not been able to fully understand the complexities of this script, and the rest of their code is not especially intuitive either.
- Even after feeding it to Claude to try to get a better understanding, it still baffles me.
- Overall, what I have understood is that the audio is processed in smaller chunks than in my own project, and I believe the chunks are processed one at a time: they are constantly added to an array of chunks that is then depleted as the transcription occurs (sketched below).
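Here is a rough sketch of that chunk-queue pattern as I currently understand it. This is my interpretation, not the example's actual worker.js: the message shapes and the model checkpoint are made up for illustration.

```js
// worker.js (sketch): my reading of the chunk-queue pattern.
// The main thread posts Float32Array audio chunks; the worker queues them and
// transcribes them one at a time, posting results back as they finish.
import { pipeline } from '@xenova/transformers';

const transcriberPromise = pipeline(
  'automatic-speech-recognition',
  'Xenova/whisper-tiny.en'
);

const queue = [];   // chunks waiting to be transcribed
let busy = false;   // true while a transcription is in flight

self.addEventListener('message', (event) => {
  // Each message is assumed to carry one chunk of 16 kHz mono samples.
  queue.push(event.data.audio);
  processQueue();
});

async function processQueue() {
  if (busy) return; // only ever process one chunk at a time
  busy = true;

  const transcriber = await transcriberPromise;
  while (queue.length > 0) {
    const chunk = queue.shift(); // the array is depleted as transcription occurs
    const { text } = await transcriber(chunk);
    self.postMessage({ text });
  }

  busy = false;
}
```

On the main thread, this would presumably be paired with something like `worker.postMessage({ audio: samples })` whenever a new slice of microphone audio is available, with the worker created as a module worker (`new Worker('worker.js', { type: 'module' })`) so the import works.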

Stable Diffusion
- To my disbelief, there is actually not much support yet for Stable Diffusion models in the ONNX community. The only model I managed to find listed, as the pipeline that “could” be used to load it, some obscure reference that I could not track down.

Future objectives
- I want to get this to work and will be booking office hours for help implementing the Stable Diffusion model.
- I think I can get the Whisper side working on my own.