If what you're talking about is starting from an audio mix and estimating the complete set of notes that produced it, you could try Silvet (http://code.soundsoftware.ac.uk/projects/silvet), a (C++) implementation of a polyphonic note estimator from audio.
It's realised as a Vamp plugin which you can run in a host like Sonic Visualiser to review the results, play them back, and export as MIDI. (I'm involved with both projects.)
The general shape of this method, and of many related methods, is:
* convert audio to a time-frequency representation using some variation on the short-time Fourier transform
* match each time step of the time-frequency grid against a set of templates extracted from frequency profiles of various instruments, using some statistical approximation technique
* take the resulting pitch probability distributions and estimate which note objects they might correspond to, using simple thresholding (as in Silvet) or a Markov model of note transitions, etc.
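The steps above can be sketched in a toy form. This is an illustrative approximation only, not Silvet's actual algorithm: the harmonic-comb templates, cosine-similarity matching, and the 0.5 threshold are all invented for the example, and real systems use far richer templates and statistical estimation.

```python
import numpy as np

SR = 8000       # sample rate (Hz)
N_FFT = 1024    # FFT size
HOP = 512       # hop between analysis frames

def stft_mag(x):
    """Step 1: magnitude short-time Fourier transform (frame, window, FFT)."""
    win = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i*HOP : i*HOP + N_FFT] * win for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))   # shape (n_frames, N_FFT//2 + 1)

def harmonic_template(midi, n_bins, n_harmonics=5):
    """A crude instrument 'template': spikes at the first few harmonics,
    with 1/h weights. Real templates are learned from instrument recordings."""
    f0 = 440.0 * 2 ** ((midi - 69) / 12)         # MIDI note -> frequency
    t = np.zeros(n_bins)
    for h in range(1, n_harmonics + 1):
        b = int(round(h * f0 * N_FFT / SR))      # nearest FFT bin of harmonic h
        if b < n_bins:
            t[b] = 1.0 / h
    return t / np.linalg.norm(t)

def detect_pitches(mag, midis, threshold=0.5):
    """Steps 2-3: match each frame against the templates (cosine similarity),
    then keep the best match only if it clears a simple threshold."""
    T = np.stack([harmonic_template(m, mag.shape[1]) for m in midis])
    norms = np.linalg.norm(mag, axis=1, keepdims=True) + 1e-9
    scores = (mag / norms) @ T.T                 # (n_frames, n_pitches)
    best = scores.argmax(axis=1)
    return [midis[b] if scores[i, b] > threshold else None
            for i, b in enumerate(best)]

# Test signal: one second of A4 (440 Hz, MIDI 69) with two decaying harmonics
t = np.arange(SR) / SR
x = (np.sin(2*np.pi*440*t) + 0.5*np.sin(2*np.pi*880*t)
     + 0.25*np.sin(2*np.pi*1320*t))

mag = stft_mag(x)
pitches = detect_pitches(mag, midis=list(range(60, 73)))  # C4..C5
```

Even this toy version shows where the hard part lives: a single steady tone is easy, but once several notes overlap, their harmonics collide in the spectrum and the per-frame matching becomes ambiguous, which is why the third step (note tracking over time) matters so much.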
Silvet is a useful and interesting implementation, but if you try it, you'll also learn the limitations of current methods when used against complete musical mixes. (Some of this is intrinsic to the problem -- the information might not be there, and humans can't always transcribe it either.)
I've heard Melodyne solves this problem very successfully, and the demos look impressive. Any idea what it's doing? Is it patented / secret / witchcraft? Or just has more templates?
I don't have any worthwhile insight, I'm afraid. I expect it's partly high-quality methods, partly a lot of refinement for common inputs and use cases.
Academic methods tend to aim at a very general problem, such as "transcribing a music recording". A tool intended for specific real users can approach the problem from a perhaps more realistic perspective.