Zurich Archives and Tech Firms Move to Stamp Out Duplicate Images in Digital Collections This Week
A surge in redundant digital files is straining storage systems across the city's institutions, prompting a coordinated push to clean up the mess.

Zurich's cultural and research institutions spent much of this week racing to address a growing headache in their digital archives: thousands of duplicate images clogging databases, inflating storage costs, and slowing down public access tools that residents and researchers depend on daily.
The problem is not new, but it reached a practical tipping point in recent days. ETH Zurich's library services confirmed they are running a structured deduplication sweep across their digitised holdings, a collection that spans more than 150 years of scientific photography, cartographic records, and manuscript scans. The sweep, which began on Monday 29 June, is expected to run through the end of July. Automated hash-matching software flags near-identical image files, and human staff then make the final deletion calls.
The timing is no accident. Pricing benchmarks in a February 2026 industry survey by the Swiss ICT association ICT-Switzerland show that enterprise cloud storage costs have risen roughly 12 percent year-on-year for mid-sized institutional accounts. For institutions already squeezed by flat public budgets, carrying redundant files is increasingly expensive rather than merely untidy.
Stadt Zürich's own digital archive unit, housed at the Stadtarchiv on Alfred-Escher-Strasse, acknowledged this week that a preliminary audit completed in June identified duplicate image clusters across its digitised Baugesuch records, building permit documents going back to the 1960s. No final figure on the number of duplicates has been released, but the audit reportedly covered around 400,000 scanned files.
Beyond cost, the issue touches directly on usability. The city's open-data portal, accessible via data.stadt-zuerich.ch, lists more than 80 active datasets that include image assets. When duplicate files carry conflicting metadata — slightly different filenames, mismatched date stamps — they create confusion for researchers and for the machine-learning pipelines that cultural heritage projects now rely on. A researcher at the University of Zurich's Digital Society Initiative on Rämistrasse noted in a published blog post this week that duplicate-image noise in training data can degrade the accuracy of computer vision models by measurable margins, though the post stopped short of citing a specific figure.
Across the Limmat, the Zentralbibliothek Zürich on Zähringerplatz is taking a different approach. Rather than a one-time purge, it is piloting a preventive intake protocol: every new image batch donated or digitised on-site now passes through a perceptual-hash check before it enters the main repository. The pilot started on 1 July. Staff there said in a written institutional update that the goal is to avoid the kind of retrospective cleanup now under way at other institutions.
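For readers curious what a perceptual-hash check involves, the following is a minimal sketch of one common variant, the "average hash". It is an illustration only, not the Zentralbibliothek's software: the function names, the 8x8 grid, and the distance threshold of 5 bits are all assumptions made for the example, and images are represented as plain grids of grayscale values rather than real files.

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint of a grayscale image.

    `pixels` is a 2D list of brightness values (0-255) with at least
    `size` rows and `size` columns. The image is shrunk to a size x size
    grid by block averaging; each cell brighter than the grid's mean
    contributes a 1 bit, every other cell a 0 bit.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_near_duplicates(new_hash, known_hashes, max_distance=5):
    """Return IDs of stored images whose fingerprint is within
    `max_distance` bits of the incoming one (hypothetical threshold)."""
    return [
        image_id
        for image_id, stored in known_hashes.items()
        if hamming(new_hash, stored) <= max_distance
    ]
```

In an intake workflow of the kind described, a new scan would be fingerprinted and compared against the fingerprints already in the repository, and any match would be held back for a human to review. Unlike a byte-level checksum, this kind of fingerprint tolerates small changes such as a slight brightness shift or a re-save at a different quality, which is why it suits near-identical rather than only identical files.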
Private-sector players are also involved. Several Zurich-based software firms operating in the Technopark on Technoparkstrasse are marketing deduplication tools specifically built for GLAM institutions — galleries, libraries, archives, and museums. At least two of those firms have been in discussions with cantonal cultural bodies this month, according to procurement notices posted to the cantonal tender platform simap.ch.
The practical picture for anyone who uses these collections: expect some temporary gaps in search results over the coming weeks as records are pruned and re-indexed. The ETH library has posted a service notice advising users that its image search tool may return incomplete results between 7 July and 25 July while the deduplication process runs. The Stadtarchiv's Baugesuch portal is unaffected for now.
Longer term, institutions are pushing toward a shared metadata standard that would let Zurich's major archives cross-reference holdings before they ingest new material, reducing duplication at source rather than cleaning it up after the fact. Talks toward that standard are ongoing between ETH Zurich, the Zentralbibliothek, and Stadt Zürich archivists. No formal agreement has been announced yet, but participants have signalled they want a draft framework in place before the end of 2026.
Published by The Daily Zurich