• Drake

    It’s awesome that you got a Qt version working! It works fine for me on Ubuntu.
    Do you know a way to make WebKit wait until any AJAX calls have completed before emitting the loadFinished() signal?

  • I’m afraid that’s not possible, because loadFinished is a standard signal that fires when the page itself has finished loading, and additional AJAX requests are not part of that. It’s even harder because AJAX calls are asynchronous: even if you know a function was called, you don’t know when the AJAX request will finish.

    Perhaps you can look for some changes the AJAX calls will make in the page’s DOM. Monitor the page’s HTML for these kinds of changes and start processing when you detect them.
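
A minimal sketch of that polling idea, not a definitive implementation. The helper name `wait_until` and its parameters are made up for illustration. The `pump` argument stands for whatever keeps the page alive while you wait; with PyQt that would presumably be `app.processEvents`, and the predicate would inspect `frame.toHtml()`.

```python
import time


def wait_until(predicate, pump=None, timeout=10.0, interval=0.1):
    """Poll predicate() until it is true or `timeout` seconds have passed.

    `pump`, if given, is called before every check. With Qt this is assumed
    to be something like app.processEvents, so that network replies and
    JavaScript keep running while we wait.
    Returns True if the condition was met, False on timeout.
    """
    deadline = time.time() + timeout
    while True:
        if pump is not None:
            pump()
        if predicate():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)


# Hypothetical use with PyQt4 under Python 2 (the element id is an assumption):
#   frame = page.mainFrame()
#   done = wait_until(lambda: 'id="result"' in unicode(frame.toHtml()),
#                     pump=app.processEvents, timeout=15.0)
```

The timeout matters: if the AJAX call fails or never changes the DOM, the loop gives up and returns False instead of hanging.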

  • Drake

    > Perhaps you can look for some changes the AJAX calls will make in the page’s DOM

    Yeah, that would be a reasonable workaround.

    To start the AJAX calls I would need to trigger certain JavaScript functions, either directly or indirectly, e.g. through button click events.
    I suppose I could do that by modifying the HTML to put the necessary JavaScript functions in the onLoad event, but that is a hack. Do you know the proper way to trigger JavaScript events?
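
One possible approach, sketched under assumptions: QtWebKit's `QWebFrame` has an `evaluateJavaScript()` method, so you can build a small script that dispatches a click on an element and run it after the page has loaded. The helper name `click_script` and the `#load-more` selector are hypothetical, and the snippet assumes the page's WebKit supports `document.querySelector`.

```python
import json


def click_script(css_selector):
    """Return JavaScript source that fires a click event on the first
    element matching `css_selector`. The script evaluates to true if an
    element was found and false otherwise."""
    # json.dumps produces a safely quoted JavaScript string literal.
    selector = json.dumps(css_selector)
    return (
        "(function () {"
        "  var el = document.querySelector(%s);"
        "  if (!el) { return false; }"
        "  var ev = document.createEvent('MouseEvents');"
        "  ev.initEvent('click', true, true);"
        "  el.dispatchEvent(ev);"
        "  return true;"
        "})()" % selector
    )


# Hypothetical use once loadFinished has fired:
#   page.mainFrame().evaluateJavaScript(click_script("#load-more"))
```

Calling a page function directly works the same way: pass its call expression as the script string instead of the click helper.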

  • OH, Man! Oh, man! Oh, man!

    You just made my day a lot happier! Thank you!
    A million times, thank you! =)

  • Hello. This article was very useful for me, but I had to create something that fit my needs. Here[1]’s the source, as a contribution.

    [1]: http://github.com/emyller/webkitcrawler

  • wxuan

    I use your script to crawl web pages on Windows, but the Chinese characters are garbled. Can you give me some advice on how to fix this problem? Thank you!
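
It is not clear from the comment where the garbling happens. One common cause is writing the page out with the platform's default encoding, which on Windows is usually not UTF-8. A sketch that writes the text explicitly as UTF-8 follows; the helper name `save_html` is made up, and it assumes you already hold the page as a unicode string (under PyQt4 and Python 2 that would presumably be `unicode(frame.toHtml())` rather than `str(...)`).

```python
import io


def save_html(html, path):
    """Write `html` (a unicode string) to `path`, always encoded as UTF-8
    regardless of the platform's default encoding."""
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(html)


# Hypothetical use with PyQt4 under Python 2:
#   save_html(unicode(page.mainFrame().toHtml()), "out.html")
```

If the file still looks wrong, check that the viewer you open it with is also reading it as UTF-8.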

  • Antonio

    Hi, I got this error:
    cannot connect to X server
    Wasn’t one of the requirements that the script run without X?
    Thanks

  • Install Xvfb!

  • Sina

    I used the exact same code on Windows, but the output file still has all the original JavaScript! Do you guys have any idea why?

  • I don’t have access to a Windows machine, so I can’t help you…

  • I tried those two examples, but with no luck.

    On Ubuntu 11.04 this link helped me: http://github.com/emyller/webkitcrawler, but the explanations from Tudor were excellent.

    Ubuntu python-webkitgtk … cannot be found in the 11.04 repository.

    Thanks

  • I made a web scraping tool using WebKit and PySide; I hope it can help others.

  • @Suncokret – the package is ‘python-webkit’, which is pywebkitgtk. So much for meaningful names.

  • Damian

    I cannot grab pages with this (it cannot connect to most of the resources), for example on my blog topsidershoes.org.
    I don’t know why. I tried both examples.

  • Kevin Audleman

    Tudor,

    What’s the correct way to run this script? When I try simply running it from the shell I get the “Cannot connect to the X server” error (even though I have xvfb installed).

    I have managed to get it to run by passing it as a parameter to xvfb-run (command below), though it only returns .

    xvfb-run -a -s "-screen 0 640x480x16" python qttest.py -u=www.google.com -fout.html

    Thanks,
    Kevin
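
For anyone who wants to launch the same command from Python instead of the shell, here is a rough sketch. It only wraps the xvfb-run invocation quoted above and does not address the empty output. The function names are made up, and the script name and arguments are simply the ones from the comment.

```python
import subprocess


def xvfb_command(script, script_args, screen="640x480x16"):
    """Build the argument list for running a Python script under xvfb-run."""
    return [
        "xvfb-run",
        "-a",                           # pick a free display number automatically
        "-s", "-screen 0 %s" % screen,  # options passed through to Xvfb
        "python", script,
    ] + list(script_args)


def run_under_xvfb(script, script_args):
    """Run the script under xvfb-run and return its exit code."""
    # Passing a list (no shell) keeps the "-screen ..." argument in one piece.
    return subprocess.call(xvfb_command(script, script_args))


# Mirrors the shell command from the comment:
#   run_under_xvfb("qttest.py", ["-u=www.google.com", "-fout.html"])
```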

  • Rick

    Thank you for this post; I have found it very informative. I am still having trouble getting the DOM after everything has loaded. The URL in question is “http://director.flyerservices.com/LCL/AccessibleFlyer/AccessibleCitySelector.aspx?OrganizationId=797d6dd1-a19f-4f1c-882d-12d6601dc376&BannerId=3d5f3800-c099-11d9-9669-0800200c9a66&BannerName=LOB&PublicationType=1&Language=en&Version=TEXT&NoRedirect=true&province=9”. Instead of getting a list of cities I just get the tags. The only success I have had is using Crowbar to render the contents and save them to a file, but I would rather do everything in Python. Any suggestions?

  • ballu

    Hi,
    thanks for sharing such nice code and tips.
    However, I am still getting a segmentation fault (core dump) with Qt.

    Do you know any reason for this?
