How would you read the text from a webpage with Java? Without

You are currently reading a thread in /g/ - Technology

Thread replies: 22
Thread images: 1
File: cs.jpg (669 KB, 1280x1240)
How would you read the text from a webpage with Java? Without 3rd party libraries


Stack Overflow suggested using the Html.fromHtml method, but it doesn't seem to be part of Java 8.
>>
>>52649295
Just use Html.fromHtml
>>
>>52649295
streamreader
>>
>>52649482
That's how I download the entire page, but I need something to distinguish the HTML shit from actual text
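
For reference, a minimal sketch of the "streamreader" download step using only the standard library. The class name `PageDownloader`, the UTF-8 assumption, and the example.com address are illustrative and not from the thread. The result is the raw markup, tags included, which is exactly the problem described above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageDownloader {
    // Drain any Reader into a String, keeping line breaks intact.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buffer = new char[4096];
        int n;
        while ((n = reader.read(buffer)) != -1) {
            sb.append(buffer, 0, n);
        }
        return sb.toString();
    }

    // Fetch the raw HTML of a page.
    // Assumption: the page is UTF-8; a real page may declare another charset.
    static String download(String address) throws IOException {
        try (Reader in = new BufferedReader(new InputStreamReader(
                new URL(address).openStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder address, needs network access.
        System.out.println(download("https://example.com/"));
    }
}
```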
>>
you need to be more specific about "read the text"
>>
>>52649295
>using java to scrape web pages
aaaand another fine example of using the wrong tech
>>
Regexes
>>
>>52649517
not op, but what would you suggest?
>>
>>52649500
Parse the header; that will tell you how big it is. Cut it off and the rest is HTML. If you want to parse the HTML from scratch you can gf.
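
A rough sketch of that idea, assuming "the header" means the HTTP response header and that the response was read raw from a socket (with `URL.openStream()` the stream already contains only the body, so this step is not needed there). The class name `RawHttpResponse` is hypothetical. Note that `Content-Length` counts bytes, not characters.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class RawHttpResponse {
    final Map<String, String> headers = new LinkedHashMap<>();
    final String body;

    RawHttpResponse(String raw) {
        // A blank line (CRLF CRLF) separates the header block from the body.
        int split = raw.indexOf("\r\n\r\n");
        if (split < 0) {
            throw new IllegalArgumentException("no header/body separator found");
        }
        String[] lines = raw.substring(0, split).split("\r\n");
        // lines[0] is the status line; the rest are "Name: value" pairs.
        for (int i = 1; i < lines.length; i++) {
            int colon = lines[i].indexOf(':');
            if (colon > 0) {
                String name = lines[i].substring(0, colon).trim().toLowerCase(Locale.ROOT);
                headers.put(name, lines[i].substring(colon + 1).trim());
            }
        }
        // Everything after the separator is the HTML.
        body = raw.substring(split + 4);
    }

    public static void main(String[] args) {
        RawHttpResponse r = new RawHttpResponse(
                "HTTP/1.1 200 OK\r\nContent-Length: 12\r\n\r\n<p>hello</p>");
        System.out.println(r.headers.get("content-length"));
        System.out.println(r.body);
    }
}
```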
>>
>>52649534
Python + beautiful soup
>>
>>52649534
That would probably be 3 or 4 poor lines of code in PHP without any lib. Python comes to mind too.
>>
>>52649527
:^)
>>
>>52649534
Python + default HTMLParser
>>
>>52649534
Python
>>
>>52649295
Use Python, Scala, Ruby or Perl.

Those are some languages with less annoying parsing.

Java has decent parsing power, but using it is so fucking verbose and annoying...
>>
>>52649614
OP here: I have to use java for this shitty uni class
>>
>>52649644
Uni class... so no Jsoup or HtmlCleaner or Jericho either? (Three decent HTML parser libs for Java.)

Well, I guess you'll just have to deal with the verbosity. It's not hard. Just annoying. Too much work for real life projects, eh.
>>
>>52649527
How do you read something like this?
>replaceAll("\\<[^>]*>","")

Replace all \\ and everything between < >?
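
For what it's worth, here is how that pattern reads, as a small sketch (the class and method names are made up for illustration). In Java source, `"\\<"` is the two-character regex `\<`, which just matches a literal `<`; no backslashes in the input are touched. The pattern only strips tags, so text inside `<script>` or `<style>` elements and entities like `&amp;` are left as they are.

```java
class StripTags {
    // "\\<"   -> regex \<  : a literal '<' (the backslash is a redundant escape)
    // "[^>]*" -> any run of characters that are not '>'
    // ">"     -> the closing '>' of the tag
    // Each match (one whole tag) is replaced with the empty string.
    static String stripTags(String html) {
        return html.replaceAll("\\<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>")); // prints "Hello world"
    }
}
```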


>>52649702
Nope, no 3rd party libraries. Sudoku inbound
>>
>>52649731
> Nope, no 3rd party libraries. Sudoku inbound
It's not *that* bad, DESU. Just verbose.

Feel free to do the same exercise with one of the languages I suggested; it should be easier when it's some ugly ass real life HTML. If it's only an exercise HTML or one of the few neat web pages, there's nothing to be afraid of...
>>
>>52649808
>Feel free to do the same exercise with one of the languages I suggested
This is not bad advice, desu. But if you're strapped for time, don't bother obviously.

I suggest using regexes. Just Google regex helper or something like that (I think I use regex101) and copy and paste the file you're supposed to be parsing into there. Fiddle with the regex until everything you're looking for is selected, then boom you're done. All you've got to do is look through the Java documentation for how to apply that bad boy and you're good to go.
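
A sketch of what applying such a regex looks like with `java.util.regex`. The pattern here (the contents of every `<p>` element) is only a hypothetical example of something you might have worked out in a regex tester; the class and method names are also made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtract {
    // Hypothetical pattern: capture whatever sits between <p ...> and </p>.
    // \b keeps it from matching <pre>; DOTALL lets '.' span line breaks;
    // CASE_INSENSITIVE accepts <P> as well.
    private static final Pattern PARAGRAPH = Pattern.compile(
            "<p\\b[^>]*>(.*?)</p>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static List<String> paragraphs(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = PARAGRAPH.matcher(html);
        // find() walks the input from one match to the next.
        while (m.find()) {
            out.add(m.group(1).trim());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(paragraphs("<p>one</p><pre>skip</pre><p>two</p>"));
    }
}
```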

Regexes are bullshit to learn, but they're too useful not to use in the long run. Might as well get started now.
>>
>>52650150
>using regex to parse xml
kill yourself
>>
>>52650175
How do you suggest he do it in Java with no external libs then? I don't know enough about Java and its standard library to think of any less insane solution right now.
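
One possible answer, as a sketch rather than the definitive way: the JDK ships an HTML parser in Swing (`javax.swing.text.html.parser.ParserDelegator`), so it does not count as a 3rd party library. The class name `TextExtractor` and the choice to join text chunks with a single space are assumptions made here. The parser targets old HTML versions, so results on modern pages may vary.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TextExtractor {
    // Collect every text chunk the parser reports and ignore the tags.
    static String extractText(Reader html) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Assumption: separate chunks with one space.
                text.append(data).append(' ');
            }
        };
        // true = ignore charset declarations, since we already hand in characters.
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><p>Hello <b>world</b></p></body></html>";
        System.out.println(extractText(new StringReader(page)));
    }
}
```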
