Best ways to search offline HTML files?

You are currently reading a thread in /g/ - Technology

Thread replies: 18
Thread images: 1
File: screenshot_w1.jpg (30 KB, 703x573)
I'm currently using HTTrack to download a large forum that's going offline next month. I'm going to end up with ~70,000 threads & 1.7m posts.

As I'm doing it, I'm wondering - what are the best tools to search offline HTML files?

Maybe it's a stupid question, I am hungover. If it is, please say why.

Thanks.
>>
>>54683612
recursive wget
>>
Reminds me of the old Palm PDAs, where you could download a website and view it offline. It was amazing.
>>
which forum, oniichan
>>
>>54683938
I second this.

Being a young /g/'er, I tried to download all of Bulbapedia to my NAS.

I accidentally left the option for external links checkmarked.

It didn't end well.
>>
>>54683938
>>54685031

GuildWarsGuru

I'm not downloading the whole forum - just a targeted subset that contains the real discussion (and drama). I'm also doing it section by section to get the higher-value stuff first. It's a bit of a sad case, I guess. The site is really the last notable English-language community for the game. Every other one has died, had accidental wipes, etc. over the last couple of years. The only place left once it's gone will be reddit (yuk), but the community there is tiny and has only sprung up over the last few years, so it doesn't have the good-old-days history.

Also, the whole forum would be 3x the numbers quoted in the OP, but so much of that is things like people selling items. I've pruned a lot of the junk pages out via filters, of course - screw the user profiles, single-post views, external images, etc. I want to keep it as light (and fast) as I can.
>>
>>54685675
So are you just copying these for memories or what?
>>
>>54685868
pretty much. I don't play the game anymore, and i only visit the site maybe 2-3 times a year, but i still want a copy of it for when i get the itch to look at it.
>>
>>54685675
d
>>
>>54685675
How are you going to search your pages?
To be more precise, what do you want to search them for?
The answer tells you which tools you should use and how you need to prep your raw data.
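
One example of "prepping" would be stripping the markup so searches hit post text rather than tags and attributes. A crude sketch only: the function name html_to_text is made up for illustration, and the sed expression only handles tags that open and close on the same line and does not decode entities like &amp;.

```shell
# html_to_text FILE: print FILE with anything that looks like an HTML tag removed.
# Hypothetical helper; a crude regex strip, not a real HTML parser.
html_to_text() {
    # First expression deletes <...> runs on a single line,
    # second expression drops lines that are left blank.
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d' "$1"
}
```

Usage would be something like html_to_text thread.html > thread.txt, after which plain text tools see only the post content.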
>>
>>54683612
grep
>>
Web based:
- htdig : fast, but you'll need to rebuild the index every time you sync the website.
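
To show why an index is fast to query but has to be rebuilt after every sync, here is a toy word index using standard tools. This is not how htdig works internally, just a sketch of the general idea; build_index, lookup_word and the "word TAB file" layout are all invented for illustration.

```shell
# build_index DIR INDEX_FILE: write one "word<TAB>file" line per distinct word per HTML file.
# Hypothetical helper; has to be re-run whenever the mirrored pages change.
build_index() {
    find "$1" -type f -name '*.html' | while IFS= read -r page; do
        # Replace tags with spaces, split into one lowercase word per line, dedupe,
        # then append the file name to each word.
        sed 's/<[^>]*>/ /g' "$page" |
            tr -cs '[:alnum:]' '\n' |
            tr '[:upper:]' '[:lower:]' |
            sort -u |
            awk -v page="$page" 'NF { print $0 "\t" page }'
    done > "$2"
}

# lookup_word INDEX_FILE WORD: print every file whose text contains WORD (case-insensitive).
lookup_word() {
    word=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    awk -F '\t' -v word="$word" '$1 == word { print $2 }' "$1"
}
```

A lookup only reads the index file instead of every page, which is the trade-off: quick queries, stale results until the next rebuild.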

Commandline:
find /location -type f -exec grep -H "somethingtosearch" {} \;
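
A variant of the find command above, wrapped in a function. Ending -exec with + instead of \; hands many files to each grep process rather than starting one grep per file, which should matter with tens of thousands of pages. The name search_html and the *.html / *.htm filter are assumptions for illustration.

```shell
# search_html DIR PATTERN: list HTML files under DIR that contain PATTERN.
# Hypothetical helper; assumes the mirrored pages end in .html or .htm.
search_html() {
    # -l prints only the names of matching files, -i ignores case,
    # -F treats PATTERN as a fixed string rather than a regex.
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) \
        -exec grep -l -i -F -e "$2" {} +
}
```

Example: search_html /location "somethingtosearch"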
>>
>>54683612
perl regular expressions
>>
grep -R 'your needle' path/to/haystack

super-fast:
ag 'your needle' path/to/haystack
https://github.com/ggreer/the_silver_searcher
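
If you want the pages that mention the needle most at the top, grep can count matching lines per file and sort can rank them. A sketch, assuming a grep with --include (GNU and BSD grep have it, it is not POSIX) and paths without colons in them; rank_threads is a made-up name.

```shell
# rank_threads PATTERN DIR: print "file:count" for every HTML file with at least
# one matching line, highest count first. Hypothetical helper.
rank_threads() {
    # -c counts matching lines per file (not total occurrences), -i ignores case.
    grep -R -i -c --include='*.html' -e "$1" "$2" |
        grep -v ':0$' |
        sort -t: -k2,2nr
}
```

Example: rank_threads 'your needle' path/to/haystack | head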
>>
>>54683612
Spotlight would be the best.
>>
>>54683612
>Maybe its a stupid question, I am hungover. If it is, please say why.
It is stupid because all you need is grep.
>>
>>54685675
Is it not archived by archive.org? CBA to check, desu.

This is a 4chan archive - all of the content originated from them. If you need IP information for a Poster - you need to contact them. This website shows only archived content.