I'm currently using HTTrack to download a large forum that's going offline next month. I'm going to end up with ~70,000 threads & 1.7m posts.
As I'm doing it, I'm wondering - what are the best tools to search offline html files?
Maybe it's a stupid question, I am hungover. If it is, please say why.
Thanks.
>>54683612
recursive wget
Reminds me of the old Palm PDAs, where you could download a website and view it offline. It was amazing.
which forum, oniichan
>>54683938
I second this.
Being a young /g/'er, I tried to download all of Bulbapedia to my NAS.
I accidentally left the option for external links checkmarked.
It didn't end well.
>>54683938
>>54685031
GuildWarsGuru
I'm not downloading the whole forum - just a targeted subset that contains the real discussion (and drama). I'm also doing it section by section to get the higher value stuff first. It's a bit of a sad case I guess. The site is really the last notable English-lang community for the game. Every other one has died, had accidental wipes, etc. over the last couple of years. The only place left once it's gone will be reddit (yuk), but the community there is tiny and has only sprung up over the last few years, so it doesn't have the good old days history.
Also, the whole forum would be 3x the numbers quoted in the OP, but so much of that is things like people selling items. I've pruned a lot of the junk pages out via filters of course - screw the user profiles, single post views, external images, etc. I want to keep it as light (and fast) as I can.
>>54685675
So are you just copying these for memories or what?
>>54685868
Pretty much. I don't play the game anymore, and I only visit the site maybe 2-3 times a year, but I still want a copy of it for when I get the itch to look at it.
>>54685675
How are you going to search your pages?
To be more precise, what do you want to search them for?
The answer tells you which tools you should use and how you're going to prep your raw data.
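On the "prep your raw data" part: one option is to strip the markup so that searches hit post text instead of tags and attributes. This is only a crude sketch. `strip_tags` is a made-up helper name, and the sed pattern only handles tags that open and close on the same line, so it is no substitute for a real HTML parser.

```shell
# strip_tags: read HTML on stdin, write rough plain text on stdout.
# Hypothetical helper; it removes <...> tags that start and end on the
# same line, then drops lines left empty or whitespace-only.
strip_tags() {
    sed -e 's/<[^>]*>//g' -e '/^[[:space:]]*$/d'
}
```

Usage would be something like `strip_tags < thread.html > thread.txt`, where `thread.html` stands in for any saved page.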
>>54683612
grep
Get raped and kill yourself, you retarded fucking faggot sack of shit with down syndrome.
Web based:
- htdig: fast, but you'll need to rebuild the index every time you sync the website.
Commandline:
find /location -type f -exec grep -H "somethingtosearch" {} \;
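A variation on the find command above, as a sketch: it only looks at `.html`/`.htm` files, batches them into as few grep calls as possible with `{} +`, and prints just the names of matching files. `search_mirror` is a hypothetical wrapper name, and the search term is treated as a fixed string (`-F`), case-insensitively (`-i`); drop `-F` if you want a regex.

```shell
# search_mirror DIR PATTERN
# Hypothetical helper: list the HTML files under DIR that contain PATTERN.
# -l prints file names only, -i ignores case, -F treats PATTERN literally.
search_mirror() {
    find "$1" -type f \( -name '*.html' -o -name '*.htm' \) \
        -exec grep -l -i -F -e "$2" {} +
}
```

For example, `search_mirror /location "somethingtosearch"` should list the same files as the one-liner, minus the matching lines themselves.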
>>54683612
perl regular expressions
grep -R 'your needle' path/to/haystack
super-fast:
ag 'your needle' path/to/haystack
https://github.com/ggreer/the_silver_searcher
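If you'd rather see which threads mention the term most instead of every matching line, plain grep can count per file. A rough sketch, assuming GNU grep (for `-R` and `--include`) and paths that contain no colons, since sort splits on `:` here. `rank_hits` is a made-up name.

```shell
# rank_hits DIR PATTERN
# Hypothetical helper: print "file:count" for every .html file under DIR
# with at least one line matching PATTERN, highest count first.
# Assumes GNU grep and no ':' characters in the file paths.
rank_hits() {
    grep -R -i -c --include='*.html' -e "$2" "$1" \
        | grep -v ':0$' \
        | sort -t: -k2,2nr
}
```

Something like `rank_hits path/to/haystack 'your needle' | head` would then show the files with the most matching lines.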
>>54683612
Spotlight would be the best.
>>54683612
>Maybe it's a stupid question, I am hungover. If it is, please say why.
It is stupid because all you need is grep.
>>54685675
Is it not archived by archive.org? CBA to check, desu.