ETH Zurich's Image Archive Lab confirmed this week that a pilot sweep of the Swiss National Museum's digitised collection identified more than 14,000 duplicate or near-duplicate image files, in the largest such audit the two institutions have jointly conducted. The finding, reported in an internal technical brief circulated on July 2nd, is prompting urgent conversations across the city's cultural sector about data hygiene, storage costs, and the reliability of public-facing digital collections.
The timing matters. Swiss federal cultural funding is under pressure after the Confederation trimmed the Pro Helvetia budget by 4.2 percent for the 2025–2028 period, squeezing institutions that spent heavily on digitisation drives during the pandemic years. Storing redundant image files is a recurring expense: cloud archival rates for the uncompressed TIFF files museums use run to roughly CHF 0.023 per gigabyte per month on standard Swiss-hosted infrastructure, and a single high-resolution scan of a textile or painting can exceed 800 MB. If each of the 14,000 flagged files were that large, the duplicates would occupy about 11 terabytes and cost roughly CHF 3,100 a year at that rate, before any backup or replica copies are counted.
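For institutions that want to test the storage figures against their own holdings, the arithmetic can be sketched in a few lines of Python. The function name, the decimal conversion (1 GB = 1,000 MB), and the assumption that every file is the same size are illustrative choices and do not come from the ETH brief.

```python
def annual_storage_cost_chf(file_count, file_size_mb, rate_chf_per_gb_month=0.023):
    """Estimate the yearly cost of storing `file_count` files of equal size.

    The default rate is the per-gigabyte monthly figure quoted in the article.
    """
    total_gb = file_count * file_size_mb / 1000  # decimal gigabytes (assumption)
    return total_gb * rate_chf_per_gb_month * 12


if __name__ == "__main__":
    # 14,000 flagged files, each assumed to be an 800 MB scan.
    cost = annual_storage_cost_chf(14_000, 800)
    print(f"CHF {cost:,.2f} per year")
```

The estimate scales linearly with file count and file size, so a larger collection, bigger scans, or additional replica copies raise the total in proportion.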
What the Audit Revealed
The sweep used a perceptual hashing algorithm developed within ETH Zurich's Distributed Computing Group, based on the Hönggerberg campus. The technique compares image fingerprints rather than raw file data, catching near-duplicates—slightly different crops, colour-corrected versions, or files re-exported under different metadata—that a simple byte-for-byte check would miss. Of the 14,000-plus flagged files in the Swiss National Museum's Landesmuseum Zürich collection, around 3,800 were exact duplicates and the remainder were near-matches requiring human review.
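The brief does not describe the ETH algorithm itself, so the following is only a minimal sketch of the general idea, using a difference hash (dHash), one common perceptual hashing method. It works on a grayscale image supplied as a plain list of pixel rows. The function names and the synthetic test images are hypothetical, and a real pipeline would decode actual image files with an imaging library.

```python
import hashlib


def dhash(pixels, hash_size=8):
    """Difference hash of a grayscale image given as a list of rows (values 0-255)."""
    height, width = len(pixels), len(pixels[0])
    cols, rows = hash_size + 1, hash_size
    # Nearest-neighbour downscale to (hash_size + 1) columns by hash_size rows.
    small = [
        [pixels[r * height // rows][c * width // cols] for c in range(cols)]
        for r in range(rows)
    ]
    bits = 0
    for row in small:
        # One bit per adjacent pair: is the left sample brighter than the right?
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left > right)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes; small means visually similar."""
    return bin(a ^ b).count("1")


def byte_digest(pixels):
    """SHA-256 of the raw pixel values, standing in for a byte-for-byte check."""
    return hashlib.sha256(bytes(v for row in pixels for v in row)).hexdigest()


# Synthetic 64x64 test images (hypothetical data, not museum scans).
original = [[200 - 3 * x for x in range(64)] for _ in range(64)]
brightened = [[v + 10 for v in row] for row in original]  # simulated colour correction
mirrored = [row[::-1] for row in original]                # a visually different image

print(byte_digest(original) == byte_digest(brightened))      # False: the bytes differ
print(hamming_distance(dhash(original), dhash(brightened)))  # 0: same fingerprint
print(hamming_distance(dhash(original), dhash(mirrored)))    # 64: every bit differs
```

In practice a reviewer would choose a distance threshold: pairs at distance zero with identical byte digests are exact duplicates, while pairs within the threshold are near-matches that go to human review.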
Zentralbibliothek Zürich, which holds one of the largest photographic archives in the German-speaking world, confirmed it has begun a parallel internal review using similar methodology after learning of the Landesmuseum findings. A spokesperson for the library, which sits on Zähringerplatz in the Hochschulen district, said staff are prioritising the Graphische Sammlung's digitised print holdings, where duplication is suspected to be particularly high due to successive scanning campaigns run in 2018, 2021, and 2023.
The deduplication effort sits within a broader push coordinated through the Zurich cantonal government's Kulturdigitalisierung initiative, which has channelled CHF 2.1 million into digital preservation projects since 2022. Critics of that programme have argued publicly that too much money went into raw scanning contracts and too little into data management standards—a complaint the current audit appears to validate. The city's archives office on Neumarkt has not yet said whether it will commission a similar review of its own photographic holdings.
What Institutions Should Do Now
Archivists and digital preservation specialists contacted this week say the Zurich findings are not unusual by international standards but are unusually well documented. The British Library estimated in a 2024 report that 8 to 12 percent of digitised cultural heritage collections worldwide contain significant duplication, a figure that aligns closely with what the ETH team found at the Landesmuseum.
For institutions still running their own reviews, the practical advice is to act before the next budget cycle closes in autumn. The ETH Distributed Computing Group has made its perceptual hashing tool available to Swiss public institutions under a non-commercial licence; documentation is accessible through the ETH Zürich Research Collection portal. Any institution that completed a major scanning project between 2019 and 2024 without a post-processing deduplication step should treat itself as high-risk.
The Landesmuseum expects to complete its human review of the 10,000-plus flagged near-duplicate files by September, after which it will publish a methodology document it hopes other Swiss cantonal museums can adapt. Zentralbibliothek Zürich is aiming for a preliminary report by late August. Both institutions say the cleaned collections will improve search accuracy for researchers using their public portals. That benefit, which goes beyond the storage savings, has drawn quiet interest from the universities and from private digital humanities projects funded through the Swiss National Science Foundation.