This is how the game MZR came to be. I always wanted to have visuals synched with music. That was such an integral part of the game idea that, after some initial failures to get it going, I gave up on the whole game for a couple of months.
Part of my inspiration came from an early encounter in the XNA scene with ColdBeamGames’ game Beat Hazard (http://www.coldbeamgames.com/). It is an excellent example of how synchronising visuals and gameplay with music can work really well. It was definitely the direction I wanted to go in, although ColdBeamGames’ stuff in that area is just on another level.
Again, this is a vaguely technical post. It’s going to be fairly simple stuff though.
In order to synchronise visuals with music you want to be able to turn a waveform signal (audio signal) into a signal that drives your graphics. This can be done in a couple of ways (that I know of):
- processing the raw audio input and extracting the amplitude of different frequencies (drums would be fairly low frequency, for example 100Hz, voice is in the middle ones, etc.), then using the resulting signal to drive graphics.
- tagging – visualising the waveform and using a tool to place various events on the track, matching the beats and various other music facets. During the game you can then synchronise the tag stream with the music stream and have the tags drive the visuals or the game. I imagine that’s how most “guitar hero” games are done.
Both approaches have advantages and disadvantages. Automatic processing can handle all sorts of music and produce fidelity you can’t ever achieve with tagging. On the other hand, automatic processing only detects frequencies – it’s as simple as that. If you want anything more complex, matched to something like a song chorus or a specific musical phrase, you want tagging. Also, nothing stops you from using both approaches together.
In MZR I use automatic processing. In this post I’ll describe how I got there.
I tried FFT first
Fast Fourier Transform is (better explanation from Wikipedia) an algorithm that computes the discrete Fourier transform. A Fourier transform takes a signal from the time domain (amplitude over time, the waveform) to the frequency domain (amplitude per frequency). In essence you provide an array of values which is the sound (at 44.1kHz you get 44,100 values per second for each channel) and you get back an array of frequencies. Each item in that array describes one narrow frequency range (a bin): strictly it is a complex number, and its magnitude is the amplitude of that frequency.
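To make the "array of samples in, array of amplitudes out" idea concrete, here is a minimal sketch in Python using only the standard library. It is a textbook recursive radix-2 FFT, not the code used in MZR; the names `fft`, `magnitudes`, `SAMPLE_RATE` and `WINDOW` are made up for illustration, and the input length is assumed to be a power of two.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def magnitudes(samples):
    """Amplitude per frequency bin; for real input only the first half is useful."""
    spectrum = fft(samples)
    half = len(samples) // 2
    return [abs(c) / half for c in spectrum[:half]]


SAMPLE_RATE = 44100  # samples per second (assumed)
WINDOW = 1024        # samples per analysis window, so each bin is about 43 Hz wide

# A test tone that sits exactly on bin 10 (about 430.7 Hz).
tone_hz = 10 * SAMPLE_RATE / WINDOW
window = [math.sin(2 * math.pi * tone_hz * i / SAMPLE_RATE) for i in range(WINDOW)]
amps = magnitudes(window)
loudest_bin = max(range(len(amps)), key=lambda i: amps[i])
```

With a pure tone as input, almost all of the energy lands in a single bin, which is what makes the result usable for driving graphics.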
You can find FFT implementations on the internet – there are ones in almost every programming language I can imagine. It’s a known numerical recipe [points for those who got this pun ;)].
So, what do you do with this array of frequencies?
- find the dominant frequency – this is as simple as finding the array item with the largest value
- lookup a frequency range that you are interested in
- visualise it – in a traditional graphic equaliser the array items would each be a bar and the item values would be how high those bars are lit up
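The three uses listed above can be sketched as follows, given an array of amplitudes such as an FFT would produce. This is a hedged Python illustration: the function names and the tiny eight-bin example are assumptions, and the bins are assumed to be spread evenly from 0 Hz up to half the sample rate.

```python
def bin_to_hz(index, num_bins, sample_rate):
    """Start frequency of a bin; num_bins covers 0 Hz up to sample_rate / 2."""
    return index * (sample_rate / 2) / num_bins


def dominant_frequency(amps, sample_rate):
    """Frequency of the array item with the largest value."""
    peak = max(range(len(amps)), key=lambda i: amps[i])
    return bin_to_hz(peak, len(amps), sample_rate)


def band_level(amps, sample_rate, low_hz, high_hz):
    """Largest amplitude among the bins that fall inside [low_hz, high_hz]."""
    in_band = [a for i, a in enumerate(amps)
               if low_hz <= bin_to_hz(i, len(amps), sample_rate) <= high_hz]
    return max(in_band, default=0.0)


def bar_heights(amps, num_bars, max_height):
    """Group bins into bars, graphic-equaliser style, scaled to integer heights."""
    per_bar = len(amps) // num_bars
    peak = max(amps) or 1.0  # avoid dividing by zero on silence
    return [round(max(amps[b * per_bar:(b + 1) * per_bar]) / peak * max_height)
            for b in range(num_bars)]


# Eight made-up bins at a 16 kHz sample rate, so each bin is 1000 Hz wide.
example = [0.0, 0.0, 1.0, 0.0, 0.0, 0.5, 0.0, 0.0]
loudest_hz = dominant_frequency(example, 16000)
mid_level = band_level(example, 16000, 4000, 6000)
bars = bar_heights(example, 4, 10)
```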
So why didn’t I use this? Where’s the gotcha with using FFT?
FFT is successfully used for this exact purpose. However, I had a couple of issues with it, some of them unrelated to the algorithm.
First and foremost, I had made a mistake in my waveform array calculations. Because of that I was feeding corrupt data into the FFT and getting results that were really wrong. As there are complex numbers involved and I don’t fully understand them, I assumed I had implemented it wrongly or was using it incorrectly. After hours of experiments and visualisation, I abandoned this solution and looked for another one.
Another, not so valid, reason is that FFT can be costly to calculate. I grew up in the ’90s, when we were counting every unorthodox operation, and if a simpler solution could do the job, that was the preferred solution. Today’s computers, even mobile devices, are pretty powerful and wouldn’t bat an eyelid at this algorithm.
Not long after, I had a chat with a friend of mine, Neil Baldwin, and sought his advice. He is far better informed in all matters audio: audio programming, audio electronics, music, etc., besides being an all-round fantastic fellow. Check out his site, especially if you like chiptunes and NES stuff – you won’t be disappointed.
Anyway, that’s when I learned about band pass filters.
You have probably heard of low-pass and high-pass filters, right? If not, a low-pass filter is one that once applied leaves only the low-frequencies in the signal. I’ll leave you guessing what frequencies a “high-pass” filter leaves.
A band-pass filter is similar to the low-pass and high-pass ones but can be applied to an arbitrary frequency band. So you can say stuff like: apply a band-pass for a band that is centred at 440Hz and covers 200Hz on each side. See the graph image at the top of the article and the link under it – that has a good and more in-depth explanation of what a band-pass filter is.
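As an illustration only, here is one common way to write a band-pass filter: a two-pole "biquad" using the widely published Audio EQ Cookbook coefficients. This is an assumption for the sake of example and not necessarily the filter MZR uses. The function name and parameters are hypothetical, and the "200Hz on each side" is mapped to the filter's quality factor only approximately.

```python
import math


def band_pass(samples, sample_rate, centre_hz, half_width_hz):
    """Biquad band-pass with unity gain at centre_hz (Audio EQ Cookbook form)."""
    q = centre_hz / (2.0 * half_width_hz)  # quality factor = centre / bandwidth
    w0 = 2.0 * math.pi * centre_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0       # b1 is zero for this filter
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0

    out = []
    x1 = x2 = y1 = y2 = 0.0                # previous two inputs and outputs
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out


def sine(freq_hz, sample_rate, count):
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(count)]


def rms(values):
    return math.sqrt(sum(v * v for v in values) / len(values))


RATE = 44100
# One second of a tone inside the band and one second of a tone far outside it.
inside = band_pass(sine(440, RATE, RATE), RATE, 440, 200)
outside = band_pass(sine(5000, RATE, RATE), RATE, 440, 200)
# Skip the first half second so the filter's start-up transient is ignored.
inside_level = rms(inside[RATE // 2:])
outside_level = rms(outside[RATE // 2:])
```

The 440Hz tone passes through at close to its original level, while the 5000Hz tone comes out much quieter, which is all a band-pass filter is meant to do.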
I believe my implementation was based on this internet post and code sample on the topic.
I thought the band-pass filter was a lot simpler to implement than an FFT and a lot easier to understand. I used it instead of my FFT implementation. No joy! That’s when I left this venture and took a break.
A couple of months later…
… I suddenly realised that the problem wasn’t with the FFT or the band-pass implementation. I had made a mistake in how I calculated the wave data size and converted the stereo channels into a mono stream to process. Once that was fixed, everything fell into place.
It’s worth mentioning a couple of things about debugging this stuff, even though I may not be the best person to advise on that given the above story. At least I knew I was doing it wrong, and that was because I did this:
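For reference, this is roughly what a correct size calculation and stereo-to-mono conversion can look like. It is a sketch under stated assumptions: the audio is taken to be interleaved 16-bit little-endian PCM, which the post does not specify, and `pcm16_to_mono` is a made-up name. The easy thing to get wrong is mixing up bytes, samples and frames.

```python
import struct


def pcm16_to_mono(raw, channels=2):
    """Average interleaved 16-bit little-endian PCM frames into floats in [-1, 1]."""
    bytes_per_frame = 2 * channels             # 2 bytes per sample, one sample per channel
    frame_count = len(raw) // bytes_per_frame  # frames, not bytes and not samples
    values = struct.unpack("<%dh" % (frame_count * channels),
                           raw[:frame_count * bytes_per_frame])
    mono = []
    for f in range(frame_count):
        frame = values[f * channels:(f + 1) * channels]
        mono.append(sum(frame) / (channels * 32768.0))
    return mono


# Three stereo frames: both channels at half scale, opposite channels, both at minimum.
raw = struct.pack("<6h", 16384, 16384, 16384, -16384, -32768, -32768)
mono = pcm16_to_mono(raw)
```

Twelve bytes of stereo data become three mono values, so the mono stream has one value per frame and its duration is simply its length divided by the sample rate.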
- visualise your input stream and output stream – I wrote a quick app in C# that drew the wave data and then overlaid the processed data on top to see if they matched. That’s how you know the algorithm is working.
- play the music at the same time, visualising the playback time, to see if the processed data matches the music playback in any usable form. That’s how you know you can use the output you are getting.
So how did I end up not spotting the problem if I had such a visualisation? My mistake was such that the calculated input signal data didn’t match the length of the music. I was visualising only the start of the music track (a few seconds) and things looked ok-ish… however, once used in game, things quickly went wrong and I couldn’t figure out why. Once I visualised the whole music track I could clearly see that the stream was ending well before the end of the music track. I guess the moral of the story is: if I have debugging tools I should use them in every possible way, not just in the one narrow-minded way I started with.
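A cheap automated check would have caught this kind of mismatch without any drawing at all. The sketch below is hypothetical (the function name, the tolerance and the three-minute example are not from the game): it simply compares the duration implied by the decoded sample count with the known length of the track.

```python
def covers_whole_track(sample_count, sample_rate, track_seconds, tolerance_seconds=0.05):
    """True when the decoded mono stream lasts as long as the music track."""
    stream_seconds = sample_count / sample_rate
    return abs(stream_seconds - track_seconds) <= tolerance_seconds


# A three-minute track at 44.1kHz should decode to 7,938,000 mono samples.
full_stream_ok = covers_whole_track(7_938_000, 44100, 180.0)
# A stream that is only half as long ends well before the music does.
short_stream_ok = covers_whole_track(3_969_000, 44100, 180.0)
```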
Actual game data
Once I got this working, I could see two ways to use these filters:
- pre-process the data offline and bake it into a file, then load it in the game and use it synchronised with the music
- use the live audio stream as sent to the audio output of the device and process that
I chose the first method. My decision was mainly driven by the fact that the second method would mean implementing audio output capture on multiple platforms. The iOS capture looked complicated enough. Plus, I wasn’t planning on changing music tracks or using the user’s music library (like Beat Hazard mentioned above).
I run four band-pass filters (one per band) offline on all music tracks in the game. I store the result at 60fps – my target frame rate. Each item is a vector4 with x, y, z and w being the output for each band, so I get 60 vector4 values per second of music.
At game run-time I load the pre-processed data and track where the music playback has reached, then I sample the vector4 value and attenuate it gradually in game.
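The baking step might look something like the following sketch. It assumes the four filtered sample streams already exist and reduces each video frame's worth of samples to the peak absolute amplitude per band. Whether the game stores a peak, an average or something else is not stated in the post, so treat that choice, the function name and the toy numbers as assumptions.

```python
def bake_frames(band_signals, sample_rate, fps=60):
    """Reduce filtered sample streams to one value per band per video frame.

    band_signals is a list of equal-length sample lists, one per band.
    Each output item is a tuple (x, y, z, w when there are four bands).
    """
    total = len(band_signals[0])
    frame_count = total * fps // sample_rate   # a partial last frame is dropped
    frames = []
    for f in range(frame_count):
        start = f * sample_rate // fps
        end = (f + 1) * sample_rate // fps
        frames.append(tuple(max(abs(s) for s in band[start:end])
                            for band in band_signals))
    return frames


# Toy data: one second at 600 samples per second, so 10 samples per frame.
low = [0.0] * 600
low[15] = -0.8                                 # a single "drum hit" in frame 1
bands = [low, [0.1] * 600, [0.2] * 600, [0.3] * 600]
baked = bake_frames(bands, sample_rate=600)
```

At 44.1kHz and 60fps each frame covers exactly 735 samples, so the baked data is tiny compared with the audio itself.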
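A rough sketch of that run-time lookup is below. The class name, the decay rate and the "jump up to new peaks, fade down otherwise" behaviour are one interpretation of "attenuate it gradually", not a description of the actual MZR code.

```python
class BandTracker:
    """Looks up baked band values by playback time and fades them out smoothly."""

    def __init__(self, frames, fps=60, decay_per_second=4.0):
        self.frames = frames
        self.fps = fps
        self.decay_per_second = decay_per_second  # hypothetical tuning value
        self.levels = (0.0,) * len(frames[0])

    def update(self, playback_seconds, delta_seconds):
        # Clamp the index so it stays valid before the start and after the end.
        index = int(playback_seconds * self.fps)
        index = max(0, min(index, len(self.frames) - 1))
        target = self.frames[index]
        fade = max(0.0, 1.0 - self.decay_per_second * delta_seconds)
        # Rise instantly to a louder value, otherwise let the old value decay.
        self.levels = tuple(max(t, old * fade)
                            for t, old in zip(target, self.levels))
        return self.levels


# One loud frame on the first band followed by silence.
frames = [(1.0, 0.0, 0.0, 0.0)] + [(0.0, 0.0, 0.0, 0.0)] * 59
tracker = BandTracker(frames)
first = tracker.update(0.0, 1 / 60)
second = tracker.update(0.02, 1 / 60)
```

The decay keeps the visuals from flickering: a beat lights something up at once and it then fades over the next few frames instead of switching off abruptly.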
That way in-game I always have a vector4 value representing 4 frequency bands that I can synchronise my visuals with.
Most of the time you don’t have to deal with this. I mean, signal processing is fascinating but it’s hard to get right. Most game engines these days have built-in functionality to give you the FFT of an audio stream. For example, Unity provides something called AudioSource.GetSpectrumData. Check it out.
See you next time.