A few people have been writing to ask about an update to the English Wikipedia ZIM file, and we figured this would warrant a post of its own.
TL;DR: the last run failed with new bugs; we’re releasing mediawiki offliner 1.15.1 and pushing through. The new ETA is the end of June.
About 99% of Wikipedia ZIM files are now back on a monthly update schedule. The remaining 1% (24 wikis) are impacted by a variety of edge cases, listed here. It is not always the big wikis that fail, but the chances of encountering such edge cases are mechanically higher when there are many articles to crawl.
- Roughly 2/3 of the failed recipes are related to errors preventing the retrieval of a given article; we’ve chosen to replace these errors on a case-by-case basis, only after analysis; we continue to discover new errors (InvariantException, etc.). Some of these are actually on the Wikimedia end of things, but we are talking and they have people working on them as well.
- The other 1/3 of the failures are scraper bugs that remain to be fixed. Some are fairly trivial and are being hot-fixed in 1.15.1; others are more subtle and will be fixed in milestone 1.16.
- The problem is that we don’t know when or where failures will occur. As a reminder, it takes anywhere between 6 and 20 days of crawl and compute to generate a ZIM file with 7M entries. If a failure happens at the beginning of the crawl, it sucks. If it happens at the end of the crawl, it sucks big time.
Current timeframe
The English Wikipedia bug is part of the 1.15.1 milestone and is therefore a priority: the crawl should restart before the end of this week.
By the end of June, we will also have fixed most of the other impactful bugs listed in 1.16 and will restart the remaining recipes.
Seeing how long it has been (and how many people have been asking or waiting for an update), we are also seriously considering accepting a number of missing entries: 100? 1000? Out of 7 million entries that is peanuts, but if some of the missing entries (which we cannot predict) happen to be on this list, it is a problem. We’ll probably go ahead anyway, but we’ll cross that bridge when we get there.
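To make that trade-off concrete, here is a rough sketch of what such a tolerance check could look like. It is purely illustrative and not how the scraper actually works: the function name `should_publish`, the `max_missing` budget and the `must_have_titles` list are all hypothetical stand-ins for whatever we end up settling on.

```python
# Illustrative sketch only: names and thresholds are assumptions,
# not part of mediawiki offliner.

TOTAL_ENTRIES = 7_000_000  # approximate size of the English Wikipedia ZIM


def missing_share(missing_count, total=TOTAL_ENTRIES):
    """Fraction of the archive that would be missing."""
    return missing_count / total


def should_publish(failed_titles, must_have_titles, max_missing=1000):
    """Decide whether a crawl with some failed articles is acceptable.

    Returns (ok, reason). The crawl is rejected if any failed article is
    on the must-have list, or if the number of failures exceeds the budget.
    """
    failed = set(failed_titles)
    critical = failed & set(must_have_titles)
    if critical:
        return False, f"{len(critical)} must-have article(s) missing"
    if len(failed) > max_missing:
        return False, f"{len(failed)} missing entries exceed the budget of {max_missing}"
    return True, f"{len(failed)} missing entries within the budget of {max_missing}"


if __name__ == "__main__":
    for budget in (100, 1000):
        print(f"{budget} missing entries = {missing_share(budget):.4%} of the archive")
```

Even the larger budget is a tiny fraction of the archive, which is why the must-have list, rather than the raw count, is the part that really matters.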