File: 9fee46955193aee74345648d1a2697c7ef50efb5c5d05b6f68b578597725cebf.jpg (169.37 KiB)
Kiwix ZIM files
Has all Wikimedia and StackExchange sites for offline browsing. ZIM is a custom format for XZ-compressed web content (e.g., HTML); you can also download the "portable" files (.zip), which include the search index as well.
Some places where you can download ZIM files (a small download sketch follows the list):
https://ftp.fau.de/kiwix/
https://mirrors.dotsrc.org/kiwix/
https://download.kiwix.org/
https://ftp.nluug.nl/pub/kiwix/
https://ftp.acc.umu.se/mirror/kiwix.org/
https://mirror.isoc.org.il/pub/kiwix/ (Israeli server)
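The bigger ZIMs run into tens of GB, so use something that can resume. A minimal resumable-download sketch (stdlib Python only, assuming the mirror honours Range requests); the filename below is just an example, browse the mirror listing for the exact ZIM you want:
```
# Resumable download of a ZIM file from a Kiwix mirror.
# The filename is only an example; check the mirror listing for real names.
import os
import urllib.request

MIRROR = "https://download.kiwix.org/zim/wikipedia/"
FILENAME = "wikipedia_en_all_nopic.zim"  # example name, not a real listing entry


def download(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Download url to dest, resuming from a partial file if one exists."""
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url)
    if start:
        # Ask the mirror to send only the bytes we don't have yet.
        req.add_header("Range", f"bytes={start}-")
    with urllib.request.urlopen(req) as resp:
        if start and resp.status != 206:
            raise RuntimeError("mirror ignored the Range header; delete the partial file and retry")
        with open(dest, "ab") as out:
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)


if __name__ == "__main__":
    download(MIRROR + FILENAME, FILENAME)
```
Once it's on disk, kiwix-serve or the Kiwix desktop app can browse the .zim directly, search index included if you grabbed the portable version.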
>>1894
Damn that image turned out shit when thumbnailed and compressed.
File: e16886dacf1b73adca017e46d157ac1dd1caff5e814a367b9d931edb20b2f8f4.png (6.19 KiB)
Common Crawl
https://commoncrawl.org/
>The Common Crawl corpus contains petabytes of data collected over 8 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
>Access to the Common Crawl corpus hosted by Amazon is free. You may use Amazon’s cloud platform to run analysis jobs directly against it or you can download parts or all of it.
Downloading the raw WARC, metadata, or text extracts is probably way too much for any individual, but Common Crawl also releases PageRank and host-to-host link data, for example (a small parsing sketch follows this listing):
>Host- and Domain-Level Web Graphs Aug/Sep/Oct 2018
https://commoncrawl.org/2018/11/web-graphs-aug-sep-oct-2018/
>Host-level graph
>5.66 GB cc-main-2018-aug-sep-oct-host-vertices.paths.gz nodes ⟨id, rev host⟩, paths of 42 vertices files
>23.60 GB cc-main-2018-aug-sep-oct-host-edges.paths.gz edges ⟨from_id, to_id⟩, paths of 98 edges files
>9.63 GB cc-main-2018-aug-sep-oct-host.graph graph in BVGraph format
>2 kB cc-main-2018-aug-sep-oct-host.properties
>10.83 GB cc-main-2018-aug-sep-oct-host-t.graph transpose of the graph (outlinks inverted to inlinks)
>2 kB cc-main-2018-aug-sep-oct-host-t.properties
>1 kB cc-main-2018-aug-sep-oct-host.stats WebGraph statistics
>13.47 GB cc-main-2018-aug-sep-oct-host-ranks.txt.gz harmonic centrality and pagerank
>Domain-level graph
>0.60 GB cc-main-2018-aug-sep-oct-domain-vertices.txt.gz nodes ⟨id, rev domain, num hosts⟩
>5.95 GB cc-main-2018-aug-sep-oct-domain-edges.txt.gz edges ⟨from_id, to_id⟩
>3.24 GB cc-main-2018-aug-sep-oct-domain.graph graph in BVGraph format
>2 kB cc-main-2018-aug-sep-oct-domain.properties
>3.39 GB cc-main-2018-aug-sep-oct-domain-t.graph transpose of the graph
>2 kB cc-main-2018-aug-sep-oct-domain-t.properties
>1 kB cc-main-2018-aug-sep-oct-domain.stats WebGraph statistics
>1.89 GB cc-main-2018-aug-sep-oct-domain-ranks.txt.gz harmonic centrality and pagerank
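The domain-level vertices and edges files are plain gzipped text, so you don't need the WebGraph tooling just to poke at them. A minimal sketch, assuming the files have been downloaded locally and the fields are tab-separated in the order the listing gives (⟨id, rev domain, num hosts⟩ and ⟨from_id, to_id⟩); the delimiter and the example domain are assumptions on my part:
```
# Look up one domain's id in the vertices file, then count its in-links
# by streaming the edges file. Memory-light: nothing is held but counters.
import gzip

VERTICES = "cc-main-2018-aug-sep-oct-domain-vertices.txt.gz"
EDGES = "cc-main-2018-aug-sep-oct-domain-edges.txt.gz"
TARGET = "org.wikipedia"  # domains are stored in reversed notation


def find_domain_id(path: str, rev_domain: str) -> int:
    """Scan the vertices file for the numeric id of one reversed domain."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            fields = line.rstrip("\n").split("\t")
            if fields[1] == rev_domain:
                return int(fields[0])
    raise KeyError(rev_domain)


def count_inlinks(path: str, target_id: int) -> int:
    """Count edges pointing at target_id, i.e. domains linking to it."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            from_id, to_id = line.split("\t")
            if int(to_id) == target_id:
                count += 1
    return count


if __name__ == "__main__":
    domain_id = find_domain_id(VERTICES, TARGET)
    print(TARGET, "id:", domain_id)
    print("in-links:", count_inlinks(EDGES, domain_id))
```
The .graph/.properties files are for the WebGraph framework (BVGraph); the .txt.gz files are enough for simple one-off questions like this.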
You can search their URL index here:
http://index.commoncrawl.org/
Here are the instructions for downloading the Common Crawl data:
https://commoncrawl.org/the-data/get-started/
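The URL index speaks the usual CDX-server API, so you can look up a URL and then pull just that one record out of a WARC with an HTTP Range request instead of downloading whole crawl segments. A hedged sketch; the crawl id and the data URL prefix are assumptions on my part, check index.commoncrawl.org for the current crawl list:
```
# Query the Common Crawl URL index, then fetch one captured page via a
# byte-range request against the WARC file named in the index record.
import gzip
import json
import urllib.parse
import urllib.request

INDEX = "https://index.commoncrawl.org/CC-MAIN-2018-47-index"  # one crawl's index (example id)
DATA_PREFIX = "https://data.commoncrawl.org/"                  # assumed WARC host prefix


def lookup(url: str, limit: int = 1):
    """Return index records (one JSON object per line) for a URL."""
    query = urllib.parse.urlencode({"url": url, "output": "json", "limit": limit})
    with urllib.request.urlopen(f"{INDEX}?{query}") as resp:
        return [json.loads(line) for line in resp.read().splitlines()]


def fetch_record(record: dict) -> bytes:
    """Fetch one gzipped WARC record by offset/length and decompress it."""
    offset, length = int(record["offset"]), int(record["length"])
    req = urllib.request.Request(DATA_PREFIX + record["filename"])
    req.add_header("Range", f"bytes={offset}-{offset + length - 1}")
    with urllib.request.urlopen(req) as resp:
        return gzip.decompress(resp.read())


if __name__ == "__main__":
    rec = lookup("commoncrawl.org/")[0]
    warc = fetch_record(rec)  # WARC headers + HTTP headers + HTML body
    print(warc[:400].decode("utf-8", "replace"))
```
The decompressed bytes are the WARC header block, then the HTTP response headers, then the page body.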
File: 593249a2f2f5be70d0b8ddff764c99aa70f48ee6e67818dac472b5b8862a2baa.png (1.83 KiB)
The Pirate Bay (no seed/leech ratio metadata; a quick search sketch for these dumps is at the end of this post)
https://thepiratebay.org/static/dump/csv/
Kickass Torrents June 2015
https://web.archive.org/web/20150609001718if_/http://kat.cr/dailydump.txt.gz (~640 MB)
Info: https://web.archive.org/web/20150518164224/https://kat.cr/api/
TorrentProject July 2016
https://web.archive.org/web/20160721213429if_/https://torrentproject.se/dailydump.txt.gz (~610 MB)
Info: https://web.archive.org/web/20160721213302/https://torrentproject.se/api
Bitsnoop May 2016
https://web.archive.org/web/20160327181910if_/http://ext.bitsnoop.com/export/b3_all.txt.gz
https://web.archive.org/web/20170324033525/https://bitsnoop.com/info/api.html
OfflineBay (Electron-based software)
https://github.com/techtacoriginal/offlinebay
https://www.offlinebay.com/
https://pirates-forum.org/Thread-Release-OfflineBay-v2-Open-source-and-No-more-Java-dependency
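About searching the dumps above: the dailydumps are just gzipped text with one torrent per line, so you can stream-search them without unpacking ~600 MB onto disk. This sketch only does a substring match and assumes nothing about the column order (that differs per site, see the linked API/info pages); the TPB CSVs work the same way minus the gzip.
```
# Stream a gzipped torrent dump and print every line matching a query.
import gzip
import sys


def search_dump(path: str, needle: str) -> None:
    """Print every line of a gzipped dump containing needle (case-insensitive)."""
    needle = needle.lower()
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if needle in line.lower():
                sys.stdout.write(line)


if __name__ == "__main__":
    # e.g. python search_dump.py dailydump.txt.gz debian
    search_dump(sys.argv[1], sys.argv[2])
```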
>>1896
>Damn that image turned out shit when thumbnailed and compressed.
I think hokage's thumbnailer routine is prioritizing file size over the image not looking like shit.
A thread for sharing data dumps. Below are some dumps to get the thread going:
4chan/pol/ 2013-2019 (18 GB for posts; 42.4 GB for thumbnails)
https://archive.org/details/4plebs-org-data-dump-2019-01
https://archive.org/details/4plebs-org-thumbnail-dump-2019-01
Reddit 2006-2018 (446 GB for comments; 145 GB for submissions; see the streaming sketch at the end of the post)
https://files.pushshift.io/reddit/
Gab.ai 2016-2018 (4.06 GB)
https://files.pushshift.io/gab/
Hacker News 2006-2018 (2.04 GB)
https://files.pushshift.io/hackernews/
Google Books Ngrams 1505-2008 (lots of GB if you want 3+ Ngrams)
https://storage.googleapis.com/books/ngrams/books/datasetsv2.html
Stack Exchange till 2018-12 (~59 GB)
https://archive.org/details/stackexchange
GB != GiB
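Since the Pushshift Reddit files are the biggest item here: they are newline-delimited JSON, one object per line, compressed per month (.bz2 for the early years, .xz later; the newest .zst files need the third-party zstandard module, not covered here). A minimal sketch that streams one monthly comments file and tallies comments per subreddit; the file name is just an example:
```
# Stream one Pushshift monthly comments dump and count comments per subreddit.
# "subreddit" is a standard field in Pushshift comment objects.
import bz2
import json
import lzma
from collections import Counter


def open_dump(path: str):
    """Open a .bz2 or .xz monthly dump as a text stream."""
    if path.endswith(".bz2"):
        return bz2.open(path, "rt", encoding="utf-8", errors="replace")
    if path.endswith(".xz"):
        return lzma.open(path, "rt", encoding="utf-8", errors="replace")
    raise ValueError("expected a .bz2 or .xz file")


def comments_per_subreddit(path: str, top: int = 20):
    """Return the top subreddits by comment count in one monthly file."""
    counts = Counter()
    with open_dump(path) as fh:
        for line in fh:
            counts[json.loads(line)["subreddit"]] += 1
    return counts.most_common(top)


if __name__ == "__main__":
    for subreddit, n in comments_per_subreddit("RC_2018-10.xz"):  # example file name
        print(f"{n:>10}  {subreddit}")
```
As far as I know the Gab and Hacker News dumps from pushshift are the same newline-delimited JSON style, just with different fields.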