Month: August 2014

MZR: Music synchronisation with a band-pass filter


This is how the game MZR came to be. I always wanted to have the visuals synched with the music. That was such an integral part of the game idea that, after some initial failures to get it going, I gave up on the whole game for a couple of months.

Band-pass filter graph. Image from http://www.rane.com/note170.html. If you end up using band-pass filters you’ll probably need to read that page too.

Part of my inspiration came from an early encounter in the XNA scene with ColdBeamGames’ game Beat Hazard (http://www.coldbeamgames.com/) – an excellent example of how music synchronisation of visuals and gameplay can work really well. Definitely a direction I wanted to go in, although ColdBeamGames’ work in that area is just on another level.

Again this is a vaguely technical post. It’s going to be fairly simple stuff though.

In order to synchronise visuals with music you want to be able to turn a waveform signal (audio signal) into a signal that drives your graphics. This can be done in a couple of ways (that I know of):

  • processing the raw audio input and extracting the amplitude of different frequencies (drums would be fairly low frequencies – around 100Hz for example, voice sits in the middle ones, etc.), then using the resulting signal to drive the graphics.
  • tagging – visualising the waveform and using a tool to place various events on the track, matching the beats and various other facets of the music. In game you can then synchronise the tag stream with the music stream and have the tags drive the visuals or gameplay. I imagine that’s how most “guitar hero” games are done.

Both approaches have advantages and disadvantages. Automatic processing can handle all sorts of music and produce fidelity you can’t ever achieve with tagging. On the other hand, automatic processing detects frequencies – it’s as simple as that. If you want anything more complex matched, something like a song chorus or a specific musical phrase, you want tagging. Also, nothing stops you from using both approaches together.

In MZR I use automatic processing. In this post I’ll describe how I got there.

I tried FFT first

The Fast Fourier Transform (there’s a better explanation on Wikipedia) is an algorithm that can compute the discrete Fourier transform. A Fourier transform is one that takes a signal from the time domain (the amplitude-over-time waveform) to the frequency domain (the amplitude of each frequency). In essence you provide an array of values which is the sound (at 44kHz you get 44000 values per second) and you get back an array of frequencies. In the frequencies array each item is the amplitude of that frequency.

You can find FFT implementations on the internet – there are ones in almost every programming language I can imagine. It’s a known numerical recipe [points for those who got this pun ;)].

So, what do you do with this array of frequencies?

  • find the dominant frequency – this is as simple as finding the array item with the largest value
  • look up a frequency range that you are interested in (see the sketch below)
  • visualise it – in a traditional graphic equaliser each array item would be a bar and the item values would be how high those bars are lit up
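
As a rough illustration of the first two points, here is a minimal sketch (not the code from MZR) of reading such a magnitude array. The key bit is the bin-to-frequency mapping: bin i of an N-point FFT at sample rate Fs corresponds to roughly i*Fs/N Hz.

```cpp
#include <cstddef>
#include <vector>

// Sketch: given the magnitude array produced by an N-point FFT of audio
// sampled at 'sampleRate', find the dominant frequency and the average
// magnitude of an arbitrary band. Assumes magnitudes[i] holds the amplitude
// of bin i, where bin i corresponds to roughly i * sampleRate / N Hz.
float binToHz(std::size_t bin, std::size_t fftSize, float sampleRate)
{
    return static_cast<float>(bin) * sampleRate / static_cast<float>(fftSize);
}

float dominantFrequency(const std::vector<float>& magnitudes,
                        std::size_t fftSize, float sampleRate)
{
    std::size_t best = 0;
    for (std::size_t i = 1; i < magnitudes.size(); ++i)
        if (magnitudes[i] > magnitudes[best])
            best = i;
    return binToHz(best, fftSize, sampleRate);
}

// Average magnitude across a band, e.g. 80-120Hz for a kick drum.
float bandAmplitude(const std::vector<float>& magnitudes,
                    std::size_t fftSize, float sampleRate,
                    float lowHz, float highHz)
{
    float sum = 0.0f;
    std::size_t count = 0;
    for (std::size_t i = 0; i < magnitudes.size(); ++i)
    {
        const float hz = binToHz(i, fftSize, sampleRate);
        if (hz >= lowHz && hz <= highHz) { sum += magnitudes[i]; ++count; }
    }
    return count ? sum / static_cast<float>(count) : 0.0f;
}
```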

So why didn’t I use this? Where’s the gotcha with using FFT?

FFT is successfully used for this exact purpose. However I had a couple of issues with it, some of them unrelated to the algorithm.

First and foremost, I had made a mistake in my waveform array calculations. Due to that I was feeding corrupt data into the FFT and was getting results that were really wrong. As there are complex numbers involved and I don’t fully understand them, I assumed I had implemented the FFT wrongly or was using it incorrectly. The result, after hours of experiments and visualisation, was to abandon this solution and look for another one.

Another, not so valid reason is that FFT can be costly to calculate. I grew up in the ’90s when we were counting every operation, and if a simpler solution was available that could do the job, that was the preferred one. Today’s computers, even on mobile devices, are pretty powerful and wouldn’t bat an eyelid at this algorithm.

Not long after, I had a chat and sought advice from a friend of mine – Neil Baldwin. He is far better informed in all matters audio: audio programming, audio electronics, music, etc., besides being an all-round fantastic fellow. Check out his site, especially if you like chiptunes and NES stuff – you won’t be disappointed.

Anyway, that’s when I learned about band pass filters.

Band-pass filter

You have probably heard of low-pass and high-pass filters, right? If not, a low-pass filter is one that, once applied, leaves only the low frequencies in the signal. I’ll leave you guessing what frequencies a “high-pass” filter leaves.

A band-pass filter is similar to the low- and high-pass ones but can be applied to an arbitrary frequency band. So you can say stuff like: apply a band-pass for a band that is centred at 440Hz and covers 200Hz on each side. See the graph image at the top of the article and the link under it – that has a good and more in-depth explanation of what a band-pass filter is.

I believe my implementation was based on this internet post  and code sample on the topic.
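
For reference, this kind of filter is commonly implemented as a biquad. Below is a minimal sketch based on the widely used RBJ audio EQ cookbook band-pass formulation – not the exact code from that post, but the same idea:

```cpp
#include <cmath>

// Sketch of a biquad band-pass filter (RBJ audio EQ cookbook, constant
// 0dB peak gain). centreHz is the band centre, q controls the width.
// Feed it one mono sample at a time and it returns the filtered sample.
class BandPassFilter
{
public:
    BandPassFilter(float sampleRate, float centreHz, float q)
    {
        const float w0    = 2.0f * 3.14159265f * centreHz / sampleRate;
        const float alpha = std::sin(w0) / (2.0f * q);
        const float a0    = 1.0f + alpha;

        // Normalised coefficients.
        b0 =  alpha / a0;
        b1 =  0.0f;
        b2 = -alpha / a0;
        a1 = -2.0f * std::cos(w0) / a0;
        a2 = (1.0f - alpha) / a0;
    }

    float process(float x)
    {
        const float y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2;
        x2 = x1; x1 = x;   // shift input history
        y2 = y1; y1 = y;   // shift output history
        return y;
    }

private:
    float b0 = 0, b1 = 0, b2 = 0, a1 = 0, a2 = 0;
    float x1 = 0, x2 = 0, y1 = 0, y2 = 0;
};
```

To turn the filtered signal into something like the bars in the screenshot below, you push every sample through the filter and keep the peak (or RMS) of the output over each 1/60th of a second.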

I thought the band-pass filter was a lot simpler to implement than an FFT and a lot easier to understand, so I used it instead of my FFT implementation. No joy! That’s when I left this venture and took a break.

This is how my visualisation looks. Absolutely basic – it took minutes to get going and visualise the processing. The middle bit is the waveform; the bars at the bottom are the amplitude of the selected band at 60fps.

Couple of months later…

… I suddenly realised that the problem wasn’t with the FFT or the band-pass implementation. I had made a mistake in how I calculated the wave data size and converted the stereo channels into a mono stream for processing. Once that was fixed, everything fell into place.
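
For what it’s worth, the conversion itself is simple once the sizes are right. A minimal sketch, assuming interleaved 16-bit stereo PCM (the exact format depends on how you decode the audio):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: convert interleaved 16-bit stereo PCM to a mono float stream
// in the range [-1, 1]. byteCount must be the size of the raw PCM data
// only, excluding any file headers - mis-calculating sizes like this is
// exactly the kind of mistake described above.
std::vector<float> stereoToMono(const int16_t* samples, std::size_t byteCount)
{
    const std::size_t frameCount = byteCount / (2 * sizeof(int16_t)); // L+R per frame
    std::vector<float> mono;
    mono.reserve(frameCount);
    for (std::size_t i = 0; i < frameCount; ++i)
    {
        const float left  = samples[2 * i]     / 32768.0f;
        const float right = samples[2 * i + 1] / 32768.0f;
        mono.push_back(0.5f * (left + right));
    }
    return mono;
}
```

A quick sanity check: mono.size() divided by the sample rate should equal the track length in seconds – exactly the kind of check that would have caught my bug much earlier.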

It’s good to mention a couple of things about debugging this stuff, even though I may not be the best person to advise on that given the above story. At least I knew I was doing something wrong, and that was because I did this:

  • visualise your input stream and output stream – I wrote a quick app in C# that drew the wave data and then overlaid the processed data on top to see if they matched. That’s how you know the algorithm is working.
  • play the music on top, visualising the playback time, to see if the processed data matches the music playback in any usable form. That’s how you know you can use the output you are getting.

How did I end up not spotting the problem then, if I had such visualisation? My mistake was such that the calculated input signal data didn’t match the length of the music. I was visualising only the start of the music track (a few seconds) and things looked OK-ish… however once used in game things quickly went wrong and I couldn’t figure out why. Once I visualised the whole music track I could clearly see the stream ending well before the end of the track. I guess the moral of the story is: if I have debugging tools, I should use them in every possible way, not just in the one narrow-minded way I started with.

Actual game data

Once I got this working, I could see two ways to use these filters:

  • pre-process the data offline and bake it into a file, then load in a game and use it synchronised
  • use the live audio stream as sent to the audio output of the device and process that

I chose the first method. My decision was mainly driven by having to implement the audio output capture on multiple platforms. The iOS capture looked complicated enough. Plus, I wasn’t planning on changing music tracks or using the user’s music library (like Beat Hazard mentioned above).

I run 4 band-pass filters offline on all the music tracks in the game. I store the result at 60fps – my target frame rate. Each item is a vector4 with x, y, z and w being the output of each band – so I get 60 vector4 values per second of music.
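
A minimal sketch of that offline step, reusing the BandPassFilter class from the earlier sketch (the band centres and widths here are illustrative, not the ones MZR actually uses):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vector4 { float x, y, z, w; };

// Sketch of the offline bake: run 4 band-pass filters over the mono stream
// and store the peak amplitude of each band for every 1/60th of a second.
std::vector<Vector4> bakeBands(const std::vector<float>& mono, float sampleRate)
{
    BandPassFilter bands[4] = {
        BandPassFilter(sampleRate,  100.0f, 1.0f),  // lows (kick/bass)
        BandPassFilter(sampleRate,  400.0f, 1.0f),  // low mids
        BandPassFilter(sampleRate, 1500.0f, 1.0f),  // mids (voice)
        BandPassFilter(sampleRate, 6000.0f, 1.0f)   // highs
    };

    const std::size_t samplesPerFrame =
        static_cast<std::size_t>(sampleRate / 60.0f); // 60fps target
    std::vector<Vector4> baked;

    for (std::size_t start = 0; start < mono.size(); start += samplesPerFrame)
    {
        float peak[4] = { 0, 0, 0, 0 };
        const std::size_t end = std::min(mono.size(), start + samplesPerFrame);
        for (std::size_t i = start; i < end; ++i)
            for (int b = 0; b < 4; ++b)
                peak[b] = std::max(peak[b], std::fabs(bands[b].process(mono[i])));
        baked.push_back({ peak[0], peak[1], peak[2], peak[3] });
    }
    return baked; // one Vector4 per frame - 60 per second of music
}
```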

At game run-time I load the pre-processed data and track how far the music playback has got – then I sample the vector4 value and attenuate it gradually in game.

That way in-game I always have a vector4 value representing 4 frequency bands that I can synchronise my visuals with.
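
The run-time side can look something like this sketch (it reuses the Vector4 type from the bake sketch above; the names and the decay speed are made up for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Sketch of the run-time side: look up the baked Vector4 for the current
// playback position and let it decay gradually so the visuals pulse rather
// than flicker.
class MusicDriver
{
public:
    explicit MusicDriver(std::vector<Vector4> baked) : m_baked(std::move(baked)) {}

    // playbackSeconds: how far the music has got; dt: this frame's delta time.
    Vector4 update(float playbackSeconds, float dt)
    {
        if (m_baked.empty())
            return m_current;

        const std::size_t frame = std::min(
            m_baked.size() - 1,
            static_cast<std::size_t>(playbackSeconds * 60.0f)); // 60 entries per second
        const Vector4& target = m_baked[frame];

        const float decay = 4.0f * dt; // attenuation speed, tuned to taste
        m_current.x = std::max(target.x, m_current.x - decay); // jump up, fall slowly
        m_current.y = std::max(target.y, m_current.y - decay);
        m_current.z = std::max(target.z, m_current.z - decay);
        m_current.w = std::max(target.w, m_current.w - decay);
        return m_current;
    }

private:
    std::vector<Vector4> m_baked;
    Vector4 m_current { 0.0f, 0.0f, 0.0f, 0.0f };
};
```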

Finally

Most of the time you don’t have to deal with this. I mean, signal processing is fascinating but it’s hard to get right. Most game engines these days have built-in functionality to give you the FFT of an audio stream. For example Unity provides AudioSource.GetSpectrumData. Check it out.

That’s it!

See you next time.

 

MZR leadergrid programming tricks


The MZR leadergrid.

In MZR (now on the App Store: http://bit.ly/MZR_Game) there is an alternative approach to the concept of leaderboards. As the height of the player’s tower is their score, we decided it might be a good idea to have everyone else’s towers around: see your friends and their scores at a glance. As they are all represented as square columns on a grid, I call this the MZR leadergrid.

The user can browse around, zoom in or out and see the global picture – scores aggregated by country as well as their friends’.

In this post I’ll talk about the leaderboard implementation. In general it’s a collection of programming tricks and shortcuts, and I wanted to describe some of those.

Most of the techniques are aimed at providing visual complexity while reducing the number of draw calls. A high number of draw calls is possibly the main thing that will hurt your performance – any opportunity to reduce that number without impact on quality should be considered.

The implementation of the MZR leadergrid score representation is split into two parts.

  • The local grid where every tower is an accurate representation of a real active player.
  • The global grid where a representation of the global leadergrid is rendered based on aggregated data from our servers.

Local grid

Local in this case refers to the locality of the other users’ towers to the central player tower. There is a grid of 9×9 towers around the main one.

This local grid contains a few types of entries:

  • friend entries – these are the towers of friends logged in via Game Center or Facebook. Every friend gets their avatar picture on top of their tower, as well as their score
  • active player entries – these are towers of players who have played the game recently. The idea is to get a selection of players who are currently playing the game.
  • global grid entries – these are entries that could not be filled by friends or active players and are sampled in a similar way to the global grid (described below)
Local leadergrid – 9×9 towers of friends and active players. Animation showing each separate render call: towers, frames, friend avatars and scores.

In a way this local bit is the “friends” section of a normal leaderboard, but extended so it can always fill a 9×9 matrix.

This local grid is the immediate vicinity of the player’s playing field and as such is visible during gameplay. I also wanted the nearby columns (towers) to indicate how the current player’s score compares to them.

The local grid is accurate. Every column (friends and active players) is an actual player with their score as reported by our servers. Every time a player submits a high-score to the servers, it is stored in our database and then reported to the player’s friends when they play.

Rendering of the local grid

All of the columns in the local grid are a single render mesh. That allows efficient draw call submission – 1 call is better than 9×9 calls. The writing and avatars are additional draw calls as they are translucently blended on top. That also allows me to dynamically alter/grow columns as players upload higher scores. Each column is always in the same place in the vertex array so I can calculate what part of the vertex array I’m changing when the column grows.
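
To illustrate the idea with a sketch (using a deliberately simplified vertex layout, not the actual MZR mesh code): because every column owns a fixed slice of the big vertex array, growing one column means rewriting and re-uploading only that slice.

```cpp
#include <cstddef>
#include <vector>

// Sketch: each column owns a fixed, contiguous slice of one big vertex
// array, so growing a column means rewriting only that slice. The layout
// here (5 quads per column as a triangle list) is simplified for
// illustration; the real mesh layout differs.
struct Vertex { float x, y, z; };

constexpr std::size_t kVertsPerColumn = 30; // 5 quads * 2 triangles * 3 verts

// Raise every vertex of column 'index' that sits at the old top height up
// to the new height. Assumes column bases sit at y = 0 and oldHeight > 0.
void growColumn(std::vector<Vertex>& gridVerts, std::size_t index,
                float oldHeight, float newHeight)
{
    const std::size_t first = index * kVertsPerColumn;
    for (std::size_t v = first; v < first + kVertsPerColumn; ++v)
        if (gridVerts[v].y >= oldHeight)
            gridVerts[v].y = newHeight;
    // After this, only the range [first, first + kVertsPerColumn) needs to
    // be re-uploaded to the GPU vertex buffer.
}
```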

A similar approach is taken with player scores. A single draw call renders all player scores. They use the same font and can change dynamically. That’s possible because I always leave space in the vertex array for 4 digits. By using degenerate (zero-area) triangles I can display scores with fewer than 4 digits while maintaining the vertex count for the full 4 digits.
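
The degenerate-triangle trick looks roughly like this (again a sketch – the quad layout and font UVs are simplified): unused digit slots are collapsed to a single point so the vertex count never changes.

```cpp
#include <cstddef>
#include <vector>

// Sketch: each score always occupies 4 digit slots (quads) in the vertex
// array. Unused slots are collapsed to a point, producing zero-area
// triangles that rasterise to nothing, so the mesh size never changes.
// 'verts' must already contain firstVert + 16 vertices (4 per digit quad).
struct ScoreVertex { float x, y, u, v; };

void writeScoreQuads(std::vector<ScoreVertex>& verts, std::size_t firstVert,
                     int score, float originX, float originY, float digitW)
{
    // Right-align the digits; leading slots stay blank (degenerate).
    char digits[4] = { ' ', ' ', ' ', ' ' };
    if (score == 0)
        digits[3] = '0';
    for (int i = 3; i >= 0 && score > 0; --i, score /= 10)
        digits[i] = char('0' + score % 10);

    for (int d = 0; d < 4; ++d)
    {
        ScoreVertex* q = &verts[firstVert + d * 4];
        if (digits[d] == ' ')
        {
            // Degenerate quad: all 4 vertices at the same point.
            for (int v = 0; v < 4; ++v) q[v] = { originX, originY, 0.0f, 0.0f };
            continue;
        }
        // Square digit cells and a simple 10-glyph font strip for simplicity.
        const float x0 = originX + d * digitW, x1 = x0 + digitW;
        const float u0 = (digits[d] - '0') / 10.0f, u1 = u0 + 0.1f;
        q[0] = { x0, originY,          u0, 0.0f };
        q[1] = { x1, originY,          u1, 0.0f };
        q[2] = { x1, originY + digitW, u1, 1.0f };
        q[3] = { x0, originY + digitW, u0, 1.0f };
    }
}
```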

Finally, the avatars. The avatars can change at any time as the player logs in to Game Center or Facebook and gets lists of friends… and their avatar pictures are downloaded from the respective network. Furthermore, to optimise the draw call for those avatar images they need to be in the same texture – not separate textures. Again, a single draw call is a lot better than 81 (9×9) of them.

To achieve that I pack all the images into a dynamic texture atlas. I use a render-to-texture technique to pack each avatar picture into a separate place on the same texture. Then I just need to update the avatar display mesh with the texture coordinates of where that avatar image ends up in the final atlas texture. A bonus of this atlas rendering is that I can apply a custom shader effect to the avatar images, achieving the specific MZR look they have.
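
A sketch of the bookkeeping side of that (the render-to-texture call itself is API-specific, so it is omitted): each avatar gets a cell in a fixed grid inside the atlas, and the avatar’s quad in the mesh gets that cell’s UV rectangle.

```cpp
#include <cstddef>

// Sketch of the dynamic avatar atlas bookkeeping: avatars are packed into
// a fixed grid of cells inside one atlas texture. The render-to-texture
// step (drawing the downloaded image, plus the MZR-style shader, into the
// cell) is omitted here. Grid size is an illustrative choice.
struct UVRect { float u0, v0, u1, v1; };

constexpr int   kAtlasCells = 10;              // 10x10 grid = up to 100 avatars
constexpr float kCellSize   = 1.0f / kAtlasCells;

// Which atlas cell (and therefore UV rectangle) avatar 'slot' lives in.
UVRect atlasRectForSlot(std::size_t slot)
{
    const int cx = static_cast<int>(slot % kAtlasCells);
    const int cy = static_cast<int>(slot / kAtlasCells);
    return { cx * kCellSize, cy * kCellSize,
             (cx + 1) * kCellSize, (cy + 1) * kCellSize };
}
```

When a new avatar image arrives it is drawn into its cell once, and the corresponding quad in the avatar mesh just takes those four UV values.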

Finally, I have the height indicators rendered on the closest friend columns, indicating how high the current player’s run has reached. At the same time, if a local grid column is overtaken, it flashes briefly by changing the colours of the appropriate vertices in the grid vertex array.

Global grid

The global grid is everything outside the 9×9 local grid. It repeats the space infinitely so the user can scroll around in any direction when they browse. By repeats infinitely, I mean that the camera is teleported back once it reaches a certain limit – wrapping back to the opposite side of the area – which gives the impression of endless scrolling.

The global grid is not a 1-to-1 representation of all the players in the world. It is a statistical representation. This is done for a couple of reasons:

  1. to reduce the data transfer between the app and our servers
  2. to show the player a bigger picture of the scores around the world.

    Scores are aggregated by country on the server. This table shows the top 10 countries (by max score) at the time of writing this article. (score = average score, score_count = number of players who submitted a score from that country, score_max = the top score from that country)

To achieve that I do a certain aggregation on the server with respect to players’ scores. At regular intervals a routine runs that calculates stats for every country in the world. The stats calculated are the number of players from that country, the average score for that country and the maximum score.

The game client downloads the country score stats from the server. The global leadergrid is then rendered using that information. You can look at this from a data compression point of view: it’s a lossy compression method where the data aggregation is the compression and the visualisation is the decompression part.

In terms of visualisation, each country looks like a small mountain of columns. The more people play in that country, the larger the space it occupies. The top score for that country is the highest point of the mountain. The average score of the country dictates how “sharp” the mountain is. If the average score is low (a lot lower than the max) then we get a mountain that’s fairly pointy – naturally, the people with high scores are not many. If the average score is high, then we get a meatier mountain, as there are more people with relatively high scores.

Using average and max scores to control the curve of the country statistical representation.

I achieve that by using a power function to approximate the curve slope. I take the average score (av) and the max score (max) for a given country and calculate the ratio av/(max*0.5). That way an average score that is half the max score for that country means a power of 1. Say we have an average score of 50 and a max score of 100 – given the formula we get 50/(100*0.5)=1. We then use the function f(x) = 1-pow(x, av/(max*0.5)) to get the height at various points x, where x is the distance of the sample point from the centre of the country region (normalised by the country’s radius).
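
In code the whole thing is only a few lines. A minimal sketch, assuming x is that normalised distance and scaling the curve by the max score so the centre reaches the country’s highest point:

```cpp
#include <algorithm>
#include <cmath>

// Sketch: height of a country's score mountain at a given sample point.
// distance is the sample point's distance from the country centre, radius
// the country's footprint; averageScore/maxScore are the country's stats.
float mountainHeight(float distance, float radius,
                     float averageScore, float maxScore)
{
    const float x     = std::min(1.0f, distance / radius);     // normalised 0..1
    const float power = averageScore / (maxScore * 0.5f);      // 1 when av = max/2
    return maxScore * (1.0f - std::pow(x, power));             // 0 at the edge
}
```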

Note: the mathematical grounding behind this was that if we imagine these mountains as cones (or a swept/revolved curve that makes a kind of strange cone), then the volume of that cone would be the total of all the scores of all players playing. That can be found by multiplying the number of players by the average score. We know the height of the cone (the max score) and the radius of the cone (the number of players playing). Having the radius and the height of the cone, the unknown would be the curve which, when swept, would produce that volume. The connection between the curve and the cone volume would be an integral of that curve over a circle.

Solving this  however goes beyond my mathematical skills and seemed extremely OTT when I had a sensible approximation already.

Global grid country field

Global country leadergrid. Canada has the lead. Notice the relatively pointy score mountain due to the relatively low average score of 213 (see the table above) for CA compared to the top score of 735.

Once the country scores have been downloaded from the server, they are inserted into a CountryField object that places them on a plane and uses a relaxation algorithm to make sure they are spaced evenly according to their respective radii. After that I can query the CountryField at any point in the world and it will return a height for that point based on which countries overlap and influence that point. The relaxation is seeded randomly, so every time a user gets the countries from the servers the global grid looks different.

When sampling at a certain point I consider all the countries the sampling point overlaps with. Then I apply the calculation described above and take the highest sample. Layered on top of that is a deterministic noise function so the score mountains don’t look uniform.

The rendering routine then uses this CountryField object to sample the heights for each of the global grid columns.

Adaptive subdivision 

Rendering all global grid towers in one go proved slow – too much geometry to update at the same time. I opted for an adaptive subdivision technique where columns close to the camera are rendered individually, while columns further away get converted into a bigger column that is the average of their heights (it’s actually biased above the average, as that proved better looking). This is done on a 3×3 column basis. As the camera moves around, some of these combined columns split up dynamically and others coalesce. Combined with some non-linear interpolation, this gives the unique MZR look when the user is browsing around.
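
The per-block decision is small. A sketch of it below – the distance threshold and the bias factor are made-up values, not the tuned ones from the game:

```cpp
#include <algorithm>
#include <cmath>

// Sketch of the adaptive subdivision decision for one 3x3 block of columns:
// near the camera the 9 columns are kept individual, far away they collapse
// into one column whose height is the average biased upwards slightly.
struct BlockLod
{
    bool  merged;       // true -> render one big column instead of 9
    float mergedHeight; // height of the combined column when merged
};

BlockLod chooseBlockLod(const float heights[9],
                        float blockCentreX, float blockCentreZ,
                        float cameraX, float cameraZ)
{
    const float dx = blockCentreX - cameraX;
    const float dz = blockCentreZ - cameraZ;
    const float distance = std::sqrt(dx * dx + dz * dz);

    BlockLod lod { distance > 12.0f, 0.0f }; // merge beyond ~12 columns away

    float sum = 0.0f, highest = 0.0f;
    for (int i = 0; i < 9; ++i)
    {
        sum += heights[i];
        highest = std::max(highest, heights[i]);
    }
    const float average = sum / 9.0f;
    // Bias above the average (towards the tallest column) - a plain average
    // looked too flat.
    lod.mergedHeight = average + 0.35f * (highest - average);
    return lod;
}
```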

This adaptive technique also means that the geometry rendered is kept under control. It also allows the adaptive algorithm to work on only a fraction of the geometry when interpolating and updating the vertex array. The entire global grid is a single mesh and therefore a single draw call.

Country names

On top, like clouds, the country initials (alpha-2 country codes) and the country top scores are rendered. This is again a single mesh, so I can render it in a single draw call.

 

Learnings

MZR has a unique look when it comes to leaderboards. In some ways, however, it is a bit “style before function”. That’s OK. MZR has always been about trying something new and being willing to go where games have traditionally done things another way.

With that in mind here’s what I think could be better:

  • No easy way to compare. In a traditional leaderboard the user gets a list of entries with themselves inserted in the middle. They immediately know who is ahead of them and who is behind. This information is more difficult to read in MZR. The user gets a bigger-picture view but the details remain a bit hazy – they are still there but not so easy to compare.
  • No easy way to compare country scores. The user has to browse around. It’s been fun to watch different countries take the lead (as of writing this post Canada is in front) but it would have been better to have a clear indication of who leads.
  • In retrospect, just using alpha-2 country codes makes for a more alienating experience. Using something like full names or flags would have worked better.

 

That’s it. If you are interested in reading more about any of the areas I describe, do let me know and I can write that up in more detail in a separate post.

See you next time.