Mainly the CPU's TLB (with consideration for the L1-L3 caches) has to be aware of the pages the GPU is using and modifying.
Again, this isn't a problem when the entire bus is shared. When the CPU and GPU have to 'fight' for access to the same bus, they can be kept coherent, naturally.
AGP was a specialized 32-bit PCI bus that also did DiME (Direct Memory Execute), but Intel platforms were still shared-bus. Once the 32-bit Athlon MP came out (which actually used the 40-bit EV6 interconnect from the true 64-bit Alpha 21264), the ability of CPUs, and even I/O, to access memory directly, independently of other 'points' on the 'crossbar switch', introduced the biggest coherency mess.
Intel knew they'd eventually have to follow AMD's lead, especially once Opteron and, correspondingly, AMD64 hit (and literally started smacking Intel out of datacenters). So they designed serial PCI Express (PCIe) to better support some features for DiME; at the time, the effort was referred to as Third Generation I/O (3GIO). But it would be many more years before QuickPath Interconnect (QPI) actually showed up and Intel was able to take advantage of the aggregate throughput possible.
Had Intel actually had QPI in its processors when PCIe was introduced, we might not all have PCIe video cards today. We'd probably have a direct GPU slot sitting right on QPI itself, which would solve a lot of concurrency issues.
Side Note: Nehalem, QPI and Intel's first 38-bit (256GiB) Physical Address Extension (PAE) capable x86-64 chips (before then, Intel's x86 and x86-64 processors were only capable of 32-36-bit, 4-64GiB) actually had major issues in their first revision, especially when it came to multi-socket. It was the most radical system interconnect change for Intel since the old Pentium Pro of the mid-'90s, which introduced 36-bit (64GiB) PAE in the first place with the i686.
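The bit-width to capacity figures quoted throughout (36-bit = 64GiB, 38-bit = 256GiB, and so on) are just powers of two. A quick sketch (my own helper, not from any spec) to check them:

```python
# Sketch: an address width of n bits reaches 2**n bytes of memory.
def capacity(bits: int) -> str:
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = 1 << bits          # 2**bits bytes
    i = 0
    while size >= 1024 and i < len(units) - 1:
        size //= 1024         # step up one binary unit at a time
        i += 1
    return f"{size}{units[i]}"

for bits in (32, 36, 38, 48, 52):
    print(bits, capacity(bits))
# 32 -> 4GiB, 36 -> 64GiB, 38 -> 256GiB, 48 -> 256TiB, 52 -> 4PiB
```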
The irony here is that Digital largely helped Intel develop the 32-bit Pentium Pro (i686) TLB (and later sued Intel over it, long story), in addition to the ALU, because Intel had design failures in the original i586 Pentium (the ALU sucked, which is why Intel loaded integers through the FPU, a 'Pentium optimization' that was stupid but necessary) on top of the limited i486 TLB. The resulting 3-level, 36-bit (64GiB) PAE paging became the 'blueprint' for AMD's "Long Mode" that most people know as x86-64 today: 48-bit (256TiB) flat addressing mapped to (up to) 4-level, 52-bit (4PiB) PAE paging.
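To make that 48-bit-to-52-bit mapping concrete, here's a minimal sketch (mine, not from any vendor doc) of how long mode splits a 48-bit virtual address into its 4-level paging indices; each level is a 9-bit table index, one more level than the Pentium Pro's 3-level PAE scheme:

```python
# Sketch: decompose an x86-64 long-mode 48-bit virtual address into
# its 4-level page-table indices (9 bits each) plus the 12-bit offset.
def decode_va(va: int):
    offset = va & 0xFFF            # bits 0-11: offset within a 4KiB page
    pt     = (va >> 12) & 0x1FF    # bits 12-20: Page Table index
    pd     = (va >> 21) & 0x1FF    # bits 21-29: Page Directory index
    pdpt   = (va >> 30) & 0x1FF    # bits 30-38: Page Directory Pointer Table index
    pml4   = (va >> 39) & 0x1FF    # bits 39-47: PML4 index
    return pml4, pdpt, pd, pt, offset

# 4 levels x 9 bits + 12-bit offset = 48 bits of flat virtual address;
# the physical addresses stored in the entries can be up to 52 bits (4PiB).
print(decode_va(0x0000_7FFF_FFFF_F000))  # (255, 511, 511, 511, 0)
```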
Windows users didn't see it, not even Windows Server Datacenter edition users, but our Linux servers with 128-512GiB RAM (yes, RAM ... in 2007; only AMD was capable of 512GiB at the time) ran into it. I'm under NDA, but the 5 Intel errata from the time are in the Linux kernel release notes. Most people were unaware, because it was just early adopters who needed >>64GiB RAM, like Wall Street, Hollywood, etc., who hit it. It made us quickly run back to AMD.
Now AMD had similar issues with Family 10h, as they moved to the full 48-bit (256TiB) platform addressing that x86-64 is capable of, from the prior 40-bit (1TiB) inherited from the Alpha 21264 lineage (all the way back to the original, 32-bit Athlon MP, which was really the platform prototype for the x86-64 Opteron). But unlike Intel, AMD held off releasing their multi-socket Family 10h parts when they discovered a new coherency issue, to much industry criticism.
But some in the media picked up the fact that Intel had major issues with its initial QPI multi-socket products. Being at a Wall Street customer, I got to deal with this, but couldn't say a word at the time.

Now AMD did create its own CPU/GPU/communication slot with HyperTransport Extension (HTX), and even had several custom, high-end visualization systems sporting unreal throughput for GPUs, as well as supercomputers with InfiniBand (after dealing with InfiniBand, you hate Ethernet). But since AMD cannot influence commodity OEMs of systems and accessories, most people never saw it. HyperTransport was designed by API NetWorks, who AMD eventually acquired, before buying ATI a couple years later. API stands for Alpha Processor, Inc. (API), the entity created to avoid anti-trust issues when Intel bought the Alpha assets from Digital in their "sell-off-a-thon" in the late '90s.
That's part of the reason why AMD had some of the brightest designers in the '00s, while Intel was surviving on its fabrication lead alone (which really boils down to cash for fab investment), and even then was losing to AMD designs fabbed on process technologies 2-3 years behind. I think at one point in the '90s, Digital Semiconductor owned 75% of the communication/networking and 50% of the system and peripheral interconnect IC market.
Today's AMD APU and System-on-a-Chip (SoC) designs do sport the ATI GPU directly on the HyperTransport interconnect, though. And this is where AMD is headed with their ARM-based products. They're largely aimed at the growing MicroServer market for now, because power and cooling are even more important (let alone far more profitable for AMD) in datacenters. But a future ARM-based console is not really a matter of 'if' but 'when', in a future revision of both the Microsoft and Sony consoles.
The AMD Jaguar/Puma-based x86-64 APUs (single packages with dual CPU modules) are currently in both consoles, including the recent refits. But that is another story.