Web scraping, wtf?!

You are currently reading a thread in /g/ - Technology

Thread replies: 9
Thread images: 5
File: I_Want_THIS.png (301 KB, 1679x1008)
I want to extract a single number from a website into a .csv file every second throughout the day. I have done that using iMacros, but that means keeping a browser open all day, so I want to do it with R/Python/C++/C#/JScript/Java instead. I watched a lot of tutorials and read examples, but every time I try applying them to my desired website, I just get "- -" as the value, so no value at all. I figure it has something to do with the value being dynamically generated or something (.aspx website?). Here's the URL:

http://www.zertifikate.commerzbank.de/MarketOverview/MarketOverviewDetails.aspx?pc=42&c=2193946&ar=.GDAXI&a=15000&isin=DEM_DAX_CASH&mkt=CBUL&pname=DAX&pdp=2

See also pic related. I extracted the whole HTML code, but there are still no values in it, and some XPath approaches in R didn't work either. Please help, any solutions?
>>
It's possible that the number is generated dynamically after the page has loaded. I think Selenium WebDriver might be one way to approach it.
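
A minimal sketch of the Selenium route, assuming Firefox plus geckodriver are installed and that the value can be located with a CSS selector. The selector is something you would have to find yourself in the dev tools, and the German number format ("9.876,54") is a guess about how the site displays quotes. The idea is to let a real browser run the page's JavaScript and wait until the "- -" placeholder has been replaced:

```python
from typing import Optional


def parse_quote(text: str) -> Optional[float]:
    """Turn displayed text into a float; return None for placeholders like '- -'.

    Assumes German formatting ('.' for thousands, ',' for decimals), which is a guess.
    """
    cleaned = text.strip().replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None


def read_rendered_value(url: str, css_selector: str, timeout: float = 15.0) -> Optional[float]:
    """Load the page in a real browser and wait until the value is filled in."""
    # Imported here so parse_quote() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        # Keep polling until the element's text parses as a number.
        WebDriverWait(driver, timeout).until(
            lambda d: parse_quote(d.find_element(By.CSS_SELECTOR, css_selector).text) is not None
        )
        return parse_quote(driver.find_element(By.CSS_SELECTOR, css_selector).text)
    finally:
        driver.quit()


print(parse_quote("- -"))       # None
print(parse_quote("9.876,54"))  # 9876.54
```

This still drives a browser under the hood, so it only replaces iMacros with something scriptable; it does not remove the browser dependency.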
>>
File: iMacros.png (81 KB, 838x1005)
It's as if the website protects all of its information. Only through iMacros extraction have I managed to get the info (pic related), but that's not a long-term solution.
>>
>data-field
The numbers are rendered using JS, but don't lose hope yet: there's obviously an API endpoint that feeds the page.
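
To see which elements the page's JS fills in, you can list everything carrying a `data-field` attribute in the static HTML. A rough sketch using only the standard library follows. The sample markup and the field names `last` and `bid` are made up for illustration, since the real page's markup isn't shown here:

```python
from html.parser import HTMLParser


class DataFieldFinder(HTMLParser):
    """Collect [tag, data-field value, static text] for each element with data-field."""

    def __init__(self):
        super().__init__()
        self.found = []    # one [tag, field, text] entry per matching element
        self._open = None  # entry that is currently collecting text

    def handle_starttag(self, tag, attrs):
        field = dict(attrs).get("data-field")
        if field is not None:
            self._open = [tag, field, ""]
            self.found.append(self._open)

    def handle_data(self, data):
        if self._open is not None:
            self._open[2] += data

    def handle_endtag(self, tag):
        # Simplification: stop collecting at the first matching close tag.
        if self._open is not None and tag == self._open[0]:
            self._open = None


# Hypothetical markup: the static HTML only holds placeholders.
SAMPLE = (
    '<div><span data-field="last">- -</span>'
    '<span data-field="bid">- -</span>'
    '<span class="x">DAX</span></div>'
)

finder = DataFieldFinder()
finder.feed(SAMPLE)
fields = [(field, text.strip()) for _tag, field, text in finder.found]
print(fields)  # [('last', '- -'), ('bid', '- -')]
```

The field names found this way are a hint for what to look for in the Network tab, where the actual values arrive.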
>>
>>53715236
thank you for the hint, will look into that

>>53715253
thank you, currently looking up 'Scraping JS generated data with R/Python', hopefully this will finally solve it.
>>
>>53715415
I haven't tried Selenium before, but I hear it helps.

Also, I was taking a look at the HTML and found this: http://www.zertifikate.commerzbank.de/Products/ProductGraphPopoutPage.aspx?isin=DEM_DAX_CASH&mkt=CBUL&pname=DAX

This should help you out since it cuts down the clutter and you still get the number you want. Take a look at the browser dev tools, like the console. It seems that the cbcm object makes a connection to a Lightstreamer server like http://warrantspushserver.commerzbank.de/
>>
File: feelbonacci.png (107 KB, 841x797)
Apparently the data is being live-streamed with Lightstreamer, much like in an instant messaging service. There are some demos on their website that would be helpful to reverse engineer. Good luck.

http://demos.lightstreamer.com/?p=lightstreamer&t=client&f=finance
>>
File: CBK.png (255 KB, 839x1005)
>>53715589
Thanks a lot for that, it really removes some clutter. If I go into the HTML source, however, the value still can't be extracted, and you're right about the communication with the Lightstreamer push service.
>>
File: network.png (291 KB, 1706x1006)
>>53715589
However, by changing the timeframe of the graph I managed to pin down where the data for the chart is coming from. In the Network section of the dev tools I discovered that it draws from a simple plain-text webpage with historical data, which is updated throughout the day (pic related). Seems like my problem is solved.
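
A sketch of how polling such a plain-text page into a .csv file could look, using only the standard library. The endpoint URL is whatever shows up in the Network tab (not reproduced here), and the line format `timestamp;value` is purely an assumption, so `parse_history` would need adjusting to the real layout:

```python
import csv
import time
import urllib.request
from typing import Iterable, List, Optional, Tuple

Row = Tuple[str, str]  # (timestamp, value), both kept as text


def parse_history(text: str, sep: str = ";") -> List[Row]:
    """Parse 'timestamp<sep>value' lines (assumed format); skip blank or malformed lines."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split(sep)]
        if len(parts) >= 2 and parts[0] and parts[1]:
            rows.append((parts[0], parts[1]))
    return rows


def new_rows(rows: Iterable[Row], last_seen: Optional[str]) -> List[Row]:
    """Keep rows newer than last_seen (assumes sortable timestamps, oldest first)."""
    return [r for r in rows if last_seen is None or r[0] > last_seen]


def poll(url: str, csv_path: str, interval: float = 1.0) -> None:
    """Fetch the page every `interval` seconds and append unseen rows to a CSV file."""
    last_seen = None
    while True:
        with urllib.request.urlopen(url, timeout=10) as resp:
            text = resp.read().decode("utf-8", errors="replace")
        fresh = new_rows(parse_history(text), last_seen)
        if fresh:
            with open(csv_path, "a", newline="") as fh:
                csv.writer(fh).writerows(fresh)
            last_seen = fresh[-1][0]
        time.sleep(interval)


# Made-up sample lines, only to show what the parser keeps and drops.
sample = "2016-03-26T09:00:01;9851.35\n\nbad line\n2016-03-26T09:00:02;9851.80\n"
rows = parse_history(sample)
print(rows)
print(new_rows(rows, "2016-03-26T09:00:01"))
```

Calling `poll(url, "dax.csv")` would then run until interrupted. No browser is needed, but error handling (timeouts, HTTP errors) is left out to keep the sketch short.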

Thanks a lot! Now I will just have to wait and see if that page is actually filled with data as frequently (every second) as the webpage is displaying new data.
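
One way to check the update frequency is to look at the gaps between consecutive timestamps in the downloaded data. A small sketch, assuming ISO-style timestamps (the real format may differ, and the sample values are made up):

```python
from datetime import datetime
from typing import List


def update_gaps(timestamps: List[str], fmt: str = "%Y-%m-%dT%H:%M:%S") -> List[float]:
    """Seconds between consecutive timestamps (input assumed oldest first)."""
    times = [datetime.strptime(t, fmt) for t in timestamps]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Made-up timestamps for illustration.
stamps = ["2016-03-26T09:00:01", "2016-03-26T09:00:02", "2016-03-26T09:00:05"]
gaps = update_gaps(stamps)
print(gaps)            # [1.0, 3.0]
print(max(gaps) <= 1)  # False: at least one update came later than one second
```

If the largest gap stays at one second over a trading day, the plain-text page is as fresh as what the website displays.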
This is a 4chan archive - all of the content originated from them. If you need IP information for a Poster - you need to contact them. This website shows only archived content.