A years-long headache for Zurich's cultural memory institutions moved closer to a practical fix this week, as ETH Zurich's Visual Computing Group confirmed that its duplicate-detection pipeline has entered live testing across three partner archives in the city. The system, built on perceptual hashing and neural embedding comparison, is designed to flag near-identical images that differ only by cropping, colour correction or file format — the kind of clutter that has quietly ballooned inside municipal and institutional image libraries for more than a decade.
The problem is more costly than it sounds. When the same photograph exists in a database under five slightly different filenames, every downstream system — from public search portals to rights-management software — either has to process all five copies or risks surfacing the wrong version. For institutions managing tens of thousands of images, that overhead compounds quickly. The timing matters now because several Zurich organisations are midway through major digitisation pushes, and cleaning house before those projects finish is considerably cheaper than doing it afterwards.
Who Is Affected and Where
Two of the three testing partners were confirmed this week: the Zentralbibliothek Zürich on Zähringerplatz and the Stadt Zürich's own digital asset management unit, which oversees the photographic holdings of the Stadtarchiv on Neumarkt. Both institutions hold hundreds of thousands of image files, accumulated since the early days of digital photography in the mid-1990s. Staff at the Zentralbibliothek have reportedly been dealing with a manual review backlog since at least 2023, when a migration to a new content management system surfaced a larger-than-expected number of near-duplicate scans from the same original prints.
The Museum für Gestaltung, located at Ausstellungsstrasse 60 in Zürich West, is also understood to be monitoring the ETH project with interest, given that its design collection spans both digitised analogue photographs and born-digital material from more recent acquisitions. The museum has not confirmed a formal partnership with the ETH group at this stage.
The technical core of the ETH tool is not new in principle (perceptual hashing has existed since at least 2004), but the current iteration layers a secondary comparison using image embeddings generated by a convolutional neural network fine-tuned on archival photography. That combination is what allows the system to catch duplicates that were rescanned at higher resolution or converted from TIFF to JPEG, cases that simpler hash-only approaches can miss.
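For readers who want a concrete picture of a two-stage comparison of this kind, the sketch below shows one way it could work. It is not the ETH group's code, which has not been published. The average-hash variant, both thresholds, the "hash match or embedding match" rule and every function name are assumptions made for illustration, and the embedding vectors are assumed to come from some separate model that is not shown.

```python
import math


def average_hash(pixels, hash_size=8):
    """Return a 64-bit average hash for a grayscale image.

    `pixels` is a 2D list of brightness values (0-255). The image is reduced
    to a hash_size x hash_size grid by block averaging, and each grid cell
    contributes one bit: 1 if it is brighter than the grid mean, else 0.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for row in range(hash_size):
        y0 = row * height // hash_size
        y1 = max((row + 1) * height // hash_size, y0 + 1)
        for col in range(hash_size):
            x0 = col * width // hash_size
            x1 = max((col + 1) * width // hash_size, x0 + 1)
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two integer hashes."""
    return bin(hash_a ^ hash_b).count("1")


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(pixels_a, pixels_b, emb_a, emb_b,
                      max_hamming=10, min_cosine=0.95):
    """Flag a pair if the hashes are close OR the embeddings are close.

    Both thresholds are illustrative assumptions, not published values.
    """
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    if distance <= max_hamming:
        return True  # stage 1: cheap perceptual-hash match
    # Stage 2: fall back to the (assumed, precomputed) embedding vectors.
    return cosine_similarity(emb_a, emb_b) >= min_cosine
```

In this sketch the hash comparison acts as a cheap first pass and the embedding comparison is consulted only for pairs the hash does not match, which is one plausible reading of "layering" a secondary comparison on top of perceptual hashing.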
What the Numbers Look Like
In a preliminary internal report circulated to partner institutions in June 2026, the ETH group indicated that test runs on a sample collection of roughly 80,000 images turned up a duplication rate of approximately 12 percent — meaning around 9,600 files were flagged as redundant copies of images already present in the same database. Institutions reviewing those findings noted the figure aligned with informal estimates their own archivists had been using for internal planning, though staff had not previously had a systematic way to verify the scale. At the Zentralbibliothek, where digitisation contracts are priced partly on the number of unique assets handled, a 12 percent reduction in active file count has direct budget implications.
The ETH group is working under a grant cycle that runs through the end of 2026, which creates pressure to complete the live-testing phase and publish findings before the funding window closes in December. A public workshop for interested institutions is being planned for late September, likely at the ETH main building at Rämistrasse 101, where archivists from across the German-speaking region will be invited to review the methodology.
For organisations not involved in the current pilot, the practical advice from archivists familiar with the project is straightforward: avoid importing new image batches into legacy systems until duplicate-detection standards are clearer. Running even a basic hash-check on incoming files before ingestion costs almost nothing in processing time and can prevent problems that take staff weeks to untangle later. Institutions with active digitisation contracts may want to ask their service providers specifically whether near-duplicate detection is included in the scope — it often is not, unless the contract explicitly requires it.
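As a rough illustration of what such a pre-ingestion check might look like, the sketch below computes a SHA-256 digest for each incoming file and sets aside any file whose bytes match one already seen. The function names and the idea of passing in a set of known digests are assumptions made for this example, not part of any system described above. A check of this kind catches only byte-identical copies. It will not flag a cropped, colour-corrected or re-encoded version of the same image.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def filter_incoming(paths, known_digests):
    """Split incoming files into (accepted, rejected) lists.

    A file is rejected if its digest is already in `known_digests` or matches
    a file seen earlier in the same batch. `known_digests` is not modified.
    """
    seen = set(known_digests)
    accepted, rejected = [], []
    for path in paths:
        file_digest = sha256_of_file(path)
        if file_digest in seen:
            rejected.append(path)
        else:
            seen.add(file_digest)
            accepted.append(path)
    return accepted, rejected
```

An archive could build the set of known digests once from its existing holdings and then call `filter_incoming` on each new batch, sending the rejected list to staff for review rather than deleting anything automatically.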