[Boards: 3 / a / aco / adv / an / asp / b / biz / c / cgl / ck / cm / co / d / diy / e / fa / fit / g / gd / gif / h / hc / his / hm / hr / i / ic / int / jp / k / lgbt / lit / m / mlp / mu / n / news / o / out / p / po / pol / qa / r / r9k / s / s4s / sci / soc / sp / t / tg / toy / trash / trv / tv / u / v / vg / vp / vr / w / wg / wsg / wsr / x / y ] [Home]

You are currently reading a thread in /g/ - Technology

Thread replies: 26
Thread images: 2
File: lrg%20(1).jpg (71 KB, 500x656)
I'm about to scrape a whole lot of porn. Aiming for a few TB with a scraper I built that will net me about 150 pictures/minute, and I'll be collecting mostly from aggregation sites like imagefap.com

Should I be worried about whether downloading that much, that quickly, will cause any trouble with the websites? I feel like there's a DDoS scare out there these days and that a large data transfer like I'm about to do would set off somebody's alarm. Obviously I know nothing about networking.

tl;dr Max transfer speed when scraping a lot of content from a website?
>>
>>53848524
nah bro as long as you don't drop the connection it's not a DDoS
>>
>>53848524
>tl;dr Max transfer speed when scraping a lot of content from a website?


There are several ways a firewall or IDS can flag traffic as a DDoS. It can be configured to block multiple connections from the same IP, to block when pages are visited out of order (like what you are doing: you are only downloading the pics and not the pages or ads), to enforce a download limit per connection, or to reject a user agent it doesn't like (easy to fix). There are more kinds of DDoS detection rules than that, and a site can combine all of them or just some of them.

It might be better to reduce the pics to a normal speed on one connection only. It will take longer, but most sites really only care that you are not draining resources from other users. Configure it to go whole hog for 5 mins, pause for an hour, then come back with way slower speeds and only one connection. When I worked on a site that watched customers' data consumption through Bro IDS, I would see people drain my traffic like mad and I would block them, but if people showed some respect to our site by not burning it into the ground, I would let them go ahead and get full use of their money. They showed me they knew we wanted our website up, and we knew they wanted data.

tl;dr don't be a huge jerk downloading vast amounts of data.
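
The burst, pause, then slow schedule described above could be sketched in Python roughly like this. This is only an illustration under assumptions: `next_delay` and `scrape` are made-up names, `fetch` stands in for whatever download function the scraper already has, the 0.4 second fast delay is just OP's 150 pictures/minute, and the 2 second slow delay is an arbitrary guess, not a known-safe number for any site.

```python
import time

def next_delay(elapsed, burst_seconds=300.0, pause_seconds=3600.0,
               fast_delay=0.4, slow_delay=2.0):
    """Return how long to wait before the next download.

    elapsed is the number of seconds since the scrape started.
    All defaults are assumptions chosen for illustration.
    """
    if elapsed < burst_seconds:
        # Burst phase: 0.4 s between downloads is about 150 per minute.
        return fast_delay
    if elapsed < burst_seconds + pause_seconds:
        # Pause phase: wait out whatever is left of the hour-long break.
        return burst_seconds + pause_seconds - elapsed
    # Slow phase: one connection, long gaps between requests.
    return slow_delay

def scrape(urls, fetch, clock=time.monotonic, sleep=time.sleep):
    """Fetch each URL in order on a single connection, pausing in between.

    fetch is a placeholder for the scraper's own download function.
    clock and sleep are injectable so the schedule can be tested.
    """
    start = clock()
    for url in urls:
        fetch(url)
        sleep(next_delay(clock() - start))
```

Everything runs sequentially, so there is only ever one connection open, which matches the "one connection only" advice.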
>>
any decent website will block you, send captchas or do something that will otherwise stop your scripts as soon as you start jamming their shit without consideration. i've been scraping shit for more than 5 years and you will be forced to be reasonable with your automation

also, if you're a retard about it, you break the system for everyone and it's just gonna make things harder. so low skilled programmers like you are the ones with the most to lose
>>
>>53848524

>porn

Degenerate.
>>
>>53848524
I gotta ask - what's the point?
Do these aggregation sites have some meticulous file structure that supports tags & categories and you have a very specific fetish? If all you're doing is pulling porn in bulk - why? Are you worried there won't be porn on the internet tomorrow?
>>
>>53852745

Maybe he wants to rehost it to earn shekels like some anon did on a previous thread.
>>
Also interested in this topic, but from a hoster perspective.

How would I counter this? Would fail2ban do this?
>>
>>53853022
Maybe you shouldn't host content openly on the internet if you don't want people downloading it.
>>
>>53853022
1. Block anything without a User-Agent header, because most people who scrape don't set one
2. Set redirects on your content, which should at least deal with people who don't know how to properly handle redirection
3. Take a look at iptables to track the frequency of requests; if an IP exceeds x requests per second, block it
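
To make points 1 and 3 of the list above concrete, here is a rough application-level sketch in Python. It is not iptables and not how fail2ban works internally; it only illustrates the same two rules (no User-Agent means rejected, too many requests per window means rejected). `RequestFilter`, its parameters, and the assumption that headers arrive as a plain dict keyed by "User-Agent" are all hypothetical.

```python
import time
from collections import defaultdict, deque

class RequestFilter:
    """Reject requests with no User-Agent or too many hits per IP."""

    def __init__(self, max_requests=5, window_seconds=1.0, clock=time.monotonic):
        self.max_requests = max_requests      # the "x" in "x per second"
        self.window_seconds = window_seconds  # length of the sliding window
        self.clock = clock                    # injectable for testing
        self.hits = defaultdict(deque)        # ip -> timestamps of allowed requests

    def allow(self, ip, headers):
        # Rule 1: no User-Agent header (or an empty one) is rejected outright.
        if not headers.get("User-Agent"):
            return False
        now = self.clock()
        recent = self.hits[ip]
        # Forget timestamps that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        # Rule 3: too many requests inside the window means the IP is refused.
        if len(recent) >= self.max_requests:
            return False
        recent.append(now)
        return True
```

A real setup would more likely do this at the firewall or reverse proxy rather than in application code, and would probably ban the IP for a while instead of refusing single requests.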
>>
>>53853089
How dense can you be?

Read
>>53850973
>>
what's the rush? steal slow, steal long, my friend
>>
Why fap to images when you can fap to videos?
>>
>>53850973
>also, if you're a retard about it, you break the system for everyone and it's just gonna make things harder.
So that's the issue. I know you can be a retard about it, but I don't have a frame of reference to know if I'm crossing the line or not. Have any advice on what measures I should be adding to my scripts?

>>53850701
>It might be better to reduce the pics to a normal speed on one connection only.
What would be considered a "normal speed", and is it possible to regulate bandwidth from my script? Rest of your post was helpful too.

>>53852745
>I gotta ask - what's the point?
To focus my debilitating porn addiction into a positive learning experience.
>>
>>53853461
>What would be considered a "normal speed", and is it possible to regulate bandwidth from my script?

I'm not sure what a normal speed would be, but you can increase the delay between successive downloads using the sleep() function in Python's time module.

You might want to add a delay of like 1 second between downloads, to avoid putting a strain on their servers.
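
As a rough sketch of both ideas (a delay between downloads, plus a cap on bandwidth inside the script), something like the following should work with only the standard library. The function names, the 200 KB/s cap, the chunk size, and the "example-scraper/0.1" User-Agent string are all assumptions made up for illustration, not recommended values.

```python
import time
import urllib.request

def pause_needed(bytes_received, elapsed, max_bytes_per_sec):
    """Seconds to sleep so the average rate stays at or below the cap."""
    return max(0.0, bytes_received / max_bytes_per_sec - elapsed)

def download_throttled(url, path, max_bytes_per_sec=200_000,
                       chunk_size=16_384, user_agent="example-scraper/0.1"):
    """Download url to path, sleeping between chunks to limit bandwidth."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    start = time.monotonic()
    received = 0
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break  # end of the response body
            out.write(chunk)
            received += len(chunk)
            # If we are ahead of the byte budget, sleep until we are not.
            time.sleep(pause_needed(received, time.monotonic() - start,
                                    max_bytes_per_sec))
    return received

def download_all(urls_and_paths, delay_seconds=1.0):
    """Download (url, path) pairs one at a time with a fixed delay between them."""
    for url, path in urls_and_paths:
        download_throttled(url, path)
        time.sleep(delay_seconds)  # the 1 second gap suggested above
```

Reading in small chunks and sleeping whenever the transfer gets ahead of its budget keeps the average rate under the cap. The network can still deliver short bursts, so treat the cap as approximate.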
>>
>>53848524
Ryan's Defcon talk is pretty interesting
https://www.youtube.com/watch?v=PADKIdSPOsc
>>
What if I scrape from behind Tor?
>>
>>53854796
Then you just piss more people off.
>>
>>53854874

Explain
>>
>>53854912
You're going to be using up the very limited bandwidth of the volunteers who provide Tor nodes, and further slowing down an already very slow network
>>
>>53854995
Here's an interesting concept:
Provide your own nodes in place of the bandwidth you're using, much like in IRL carbon trading schemes.
>>
>>53848524
amazon aws free tier is good for scraping, you can launch like 100 vms, giving you 100 different ips.

use something like zmq to orchestrate them and you'll scrape like a mother fucker.

i used this method to scrape google search results 24/7, without hitting a captcha
>>
>>53848524
if you plan on writing your own webscraper, don't.
I had the same idea once, until I discovered that every porn site is already prepared for fags who want to scrape their content.

Just get something like jdownloader and pray that they support your porn of choice. Otherwise it's not worth the effort.
>>
File: 1459000566529.jpg (27 KB, 604x604)
Cuts deep man
>>
If you didn't disconnect, it's OK.
Unless you are using the site's API and it has a rule on how fast you can send requests.
>>
>>53855612
explain for what? was it part of a project?

All trademarks and copyrights on this page are owned by their respective parties. Images uploaded are the responsibility of the Poster. Comments are owned by the Poster.
If a post contains personal/copyrighted/illegal content you can contact me at [email protected] with that post and thread number and it will be removed as soon as possible.
DMCA Content Takedown via dmca.com
All images are hosted on imgur.com, send takedown notices to them.
This is a 4chan archive - all of the content originated from them. If you need IP information for a Poster - you need to contact them. This website shows only archived content.