I need to download all postings of a large forum.

You are currently reading a thread in /g/ - Technology

Thread replies: 25
Thread images: 2
File: a.png (522 KB, 1134x768)
I need to download all postings of a large forum. It has around 800k members. All content and links are public. The forum is limited to around one connection per second. To not download for half a year, I am looking for someone with a lot of IPs under his control to download the stuff for me in parallel. Like, a botnet herder. What would be a good place to look for this?
>>
>>54467383
Go ask the admin for a dump
>>
>>54467383
Not here.
>>
Hmm. Low chance of success, but it would remove some traffic from them. I'll try, I guess.
>>
>>54467446
this
>>
>>54467383
AWS, or dump.
>>
> AWS, or dump.
With AWS, I'd need to rent a few dozen independent machines to have a few dozen different IPs, I suppose?
>>
Why do you need to download everything?
>>
> Why do you need to download everything?
I'm trying to match some written text to one of the forum's members via linguistic methods. Basically to catch my hacker.
>>
>>54467383


IPs = $$$

No one will do it free for you.
>>
Just reset your router every time, you idiot. New IP for free.
>>
>>54467446
How likely is it to get a dump this way? Anyone tried it before? Doesn't it make more sense for admins to provide a dump than to endure scraping?

I usually run wget or a Python script from a bunch of machines. After all, you rent them by the hour, so it isn't expensive. It probably looks like shit on the server, though.
>>
>>54468011
Completely depends on the admin and how you ask. Got a rough estimate on cost to rent those machines?
>>
>>54468050
Check out the hourly pricing on Digital Ocean, for example:
>https://www.digitalocean.com/pricing/

You set up one machine (the cheapest one will do for scraping, $0.007/hr) and just clone it a bunch of times. I don't know if there's a limit on how many you can set up at once. I think Digital Ocean says in their ToS that you could be shut down for scraping if someone complains, but I never came close to causing a denial of service and always chose a slow pace to scrape at.
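For what "a slow pace" can look like in a script, here is a minimal sketch of a throttle that enforces a minimum gap between requests. The class name `Throttle`, the 2-second interval and the `fetch` call in the usage comment are all made up for illustration, not anything from this thread.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests (illustrative helper)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between two requests
        self._clock = clock               # injectable so the class can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Usage (fetch() is a hypothetical download function):
#
#     throttle = Throttle(2.0)
#     for url in urls:
#         throttle.wait()
#         fetch(url)
```

The first call to `wait()` returns immediately; later calls sleep only for whatever part of the interval has not already passed.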
>>
>>54467897
No one said I expect to get that for free.
I'll pay up to a hundred bucks in Bitcoin for that.

>>54467932
You have no clue whatsoever. The problem is rate-limiting per IP, not "banning" or whatever you experience when you have to reset your router. Protip: you can simply reconnect too.

>>54468011
I started with httrack, where I can set connections/sec and threads. I am effectively limited to 1 HTML download per second, so it takes ten days just to download the list of users.
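The ten-day figure checks out as back-of-the-envelope arithmetic, if you assume roughly one page per member (that part is a guess; the thread only gives the 800k member count and the 1 request/second limit). The function name and the `parallel_sources` parameter are made up for this sketch.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(pages, requests_per_second=1.0, parallel_sources=1):
    """Days needed to fetch `pages` pages at a fixed request rate.

    parallel_sources is an assumption: it only applies when each source
    enforces its own independent limit.
    """
    return pages / (requests_per_second * parallel_sources) / SECONDS_PER_DAY


# 800k member pages at 1 request/second from a single source:
print(round(crawl_days(800_000), 1))  # 9.3
```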
>>
>>54467383
>limited to around one connection per second
so dont close the connection? (use pipelining)
or are you trying to say one request per second?
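If the limit really is on new connections rather than on requests, this is roughly what reusing one connection looks like with Python's standard `http.client`. Note this is plain keep-alive (send a request, read the full response, send the next one), not true pipelining, which `http.client` doesn't do. The host and paths in the usage comment are hypothetical, and whether the forum's limit actually works per connection is not known from this thread.

```python
import http.client  # only needed for the real connection in the usage comment


def fetch_over_one_connection(conn, paths):
    """Send several GET requests over a single already-created connection.

    `conn` is expected to behave like http.client.HTTPConnection.
    Returns a list of (path, status, body_length) tuples.
    """
    results = []
    for path in paths:
        conn.request("GET", path)
        resp = conn.getresponse()
        body = resp.read()  # drain the body before sending the next request
        results.append((path, resp.status, len(body)))
    return results


# Usage (hypothetical host and paths):
#
#     conn = http.client.HTTPSConnection("forum.example", timeout=10)
#     try:
#         print(fetch_over_one_connection(conn, ["/", "/members?page=1"]))
#     finally:
#         conn.close()
```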
>>
>>54468435
also why dont you download the not so recent stuff from google cache and then only download the diff from source? if that shit is Biblically available google must have indexed it
>>
>>54468471
>Biblically
kek, meant to say publicly, fucking autocorrect
>>
>>54468471
Not cached by google, but it's on archive.org. Will check if there's a usable mirror there, thanks!
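One way to check for a mirror is the Wayback Machine availability endpoint. A rough sketch follows; the response shape (`archived_snapshots` / `closest` / `available` / `url`) is how I remember that API and should be treated as an assumption, and the function names and the `forum.example` URL in the comment are made up.

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"


def availability_url(page_url, timestamp=None):
    """Build the query URL for one page; timestamp is optional (YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return API + "?" + urllib.parse.urlencode(params)


def closest_snapshot(api_response):
    """Return the snapshot URL from a decoded response, or None if there is none."""
    closest = api_response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(page_url):
    """Ask archive.org whether a snapshot of page_url exists (needs network)."""
    with urllib.request.urlopen(availability_url(page_url), timeout=10) as resp:
        return closest_snapshot(json.load(resp))


# Usage (hypothetical URL):
#
#     print(lookup("forum.example/members"))
```

A `None` result only means no snapshot was reported for that exact URL; other pages of the forum may still be archived.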
>>
>>54468471
>>54468544
In fact it is cached by google! I'll look into that, although at some point archive.org and google will rate-limit or block me as well.
>>
>>54468577
i doubt google even cares about the rate you fetch cached read-only content from them, they probably have that shit stored across multiple instances and with each request you'll be just ping-ponging between them
they have _massive_ data uploading capabilities, your requests are like a rain drop in an ocean
gl
>>
>>54468655
Very true! Still, why do I have to solve captchas to prove I am a human on google search, every other month? Ah well, every independent source helps! archive.org, google and the original forum. Then with three different IPs, that's a tenfold speedup already. Maybe I can do it on my own, after all! Thank you, helpful anon.
>>
>>54468218
>hundred bucks to scrape an entire site with bots
wew lad
>>
>>54468695
>why do I have to solve captchas to prove I am a human on google search
cause the last gateway between google and you is used by more people? (you have a shared ip address)

and i better leave it at that, i dont really wanna ask what country you live in or this thread will turn to shit
>poo in loo?
god damn it, i cant help it, sry
>>
File: realretarded.jpg (130 KB, 530x692)
>>54467932

>what is DHCP lease time
This is a 4chan archive - all of the content originated from them. If you need IP information for a Poster - you need to contact them. This website shows only archived content.