The first time I saw the Shazam app, I was floored. The app listens to a clip of music through your phone’s microphone, and after a few seconds it is able to identify the recording. True to its name, the software works like magic. Even with significant background noise (for instance, in a noisy bar) the app can recognize that Feels Like We Only Go Backwards is playing on the jukebox.
I recently came across a fantastic writeup by Will Drevo about how “audio fingerprinting” schemes are able to correctly identify a recording. Will wrote an open source Python library called dejavu for audio fingerprinting. His writeup gives a detailed overview of how the software works. Will’s explanation is a gem of expository writing about a technical subject. He gives an overview of the requisite knowledge of signal processing necessary to understand how the fingerprinting works, then describes the scheme itself in some detail. Essentially the software works in four steps:
- Build a spectrogram of a recording.
- Find “peaks” in the spectrogram.
- Identify patterns or “constellations” among the peaks which will serve as fingerprints for the recording.
- Apply a hash function to the fingerprints to compress them to a more reasonable size.
Will’s explanation of how and why these steps work to give a robust way of recognizing audio recordings is vivid and accessible. It is a remarkable narrative about how signal processing, optimization, and data structures combine to create a truly miraculous product.