CPU manufacturers recognised several years ago that profiling was increasing in importance. As a result many CPUs, including the MIPS R10000, the Alpha/AXP, and the Intel Pentium series, provide at least some hardware support to assist a software-based profiler. At one extreme, bolt-on hardware has been produced to assist in profiling, for example [profileme].
One of the simplest things a CPU can provide is a high-resolution timestamp counter, such as the Pentium's TSC. Reading the counter immediately before and after an operation gives an interstitial timing harness that measures the operation's latency to a high degree of accuracy.
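A timing harness of this kind can be sketched as follows. This is only an illustration under stated assumptions: Python's `time.perf_counter_ns()` stands in for a hardware timestamp counter (it counts nanoseconds, not cycles), and the names `read_timestamp`, `elapsed` and `measure_latency` are hypothetical.

```python
import time


def read_timestamp():
    # Assumption: perf_counter_ns() is a stand-in for a hardware timestamp
    # counter such as the TSC. It is monotonic, but ticks in nanoseconds.
    return time.perf_counter_ns()


def elapsed(operation):
    """Bracket one call of operation() with two timestamp reads."""
    start = read_timestamp()
    operation()
    end = read_timestamp()
    return end - start


def measure_latency(operation, repeats=1000):
    """Return the smallest observed latency of operation(), in timer ticks."""
    # Timing an empty operation estimates the cost of the harness itself.
    overhead = min(elapsed(lambda: None) for _ in range(repeats))
    # Taking the minimum over many runs discards runs disturbed by
    # interrupts, scheduling or cold caches.
    best = min(elapsed(operation) for _ in range(repeats))
    return max(best - overhead, 0)
```

Taking the minimum rather than the mean is one possible design choice; it reports the undisturbed latency, whereas a mean would also include the cost of whatever else the machine happened to be doing.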
At the next level of complexity there are performance counters. These are typically registers that count events of interest, such as cache line misses. The benefits of such counters are well known [mipsr10000][monitor]: actual data from the hardware that can be attributed to sections of source code removes much of the black magic previously associated with performance analysis.
Typically, software using such counters either periodically checks the value of the counter or, if possible, uses counter overflow events to generate an interrupt; the interrupt handler then logs the overflow event against the currently executing code.
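The overflow-driven approach can be illustrated with a small simulation. This is a sketch under stated assumptions rather than a description of any particular profiler: the trace format, the function name `sample_on_overflow` and the parameter `reset_count` are hypothetical, and a real implementation would run in an interrupt handler, not a loop.

```python
from collections import Counter


def sample_on_overflow(trace, reset_count):
    """Attribute simulated counter overflows to the code running at the time.

    trace: iterable of (symbol, events) pairs, a hypothetical record of how
           many counted events (e.g. cache line misses) each step caused.
    reset_count: number of events between two overflow interrupts.
    """
    samples = Counter()
    remaining = reset_count            # events left until the next overflow
    for symbol, events in trace:
        remaining -= events
        while remaining <= 0:          # one step may cause several overflows
            samples[symbol] += 1       # the "handler" logs against this code
            remaining += reset_count   # re-arm the counter
    return samples
```

For the trace `[("memcpy", 7), ("parse", 1), ("memcpy", 12)]` and a `reset_count` of 5, the counter overflows four times, each while `memcpy` is executing, so `parse` receives no samples. A smaller `reset_count` gives finer attribution at the cost of more interrupts.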
More recent architectures go further still, providing much of the data collection machinery in hardware [ia64][ia32].