21 Responses

  1. Drake

    That’s awesome you got a Qt version working! – it works fine for me on Ubuntu.
    Do you know a way to make Webkit wait until any AJAX calls have completed before emitting the loadFinished() signal?

  2. Drake

    > Perhaps you can look for some changes the AJAX calls will make in the page’s DOM

    yeah that would be a reasonable workaround.

    To start the AJAX calls I would need to trigger certain JavaScript functions, either directly or indirectly, e.g. through button click events.
    I suppose I could do that by modifying the HTML to put the necessary JavaScript functions in the onLoad event, but that is a hack. Do you know the proper way to trigger JavaScript events?
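The "look for a change in the DOM" workaround discussed above boils down to polling a condition with a timeout. Here is a minimal sketch of that idea in plain Python 3; `wait_for` and `FakePage` are hypothetical names used only for illustration. In a real Qt script the condition would presumably run a small JavaScript check against the page (for example through `QWebFrame.evaluateJavaScript`, which can also be used to call page functions directly), and `tick` would pump the Qt event loop instead of sleeping. Both of those are assumptions about how you would wire it up, not something taken from the article.

```python
import time


def wait_for(condition, timeout=10.0, interval=0.1, tick=time.sleep):
    """Poll `condition` until it is truthy or `timeout` seconds have passed.

    Returns True if the condition was met, False on timeout.
    `tick` is called with `interval` between polls; it defaults to sleeping.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        tick(interval)


class FakePage:
    """Stand-in for a page whose AJAX callback finishes after a few polls."""

    def __init__(self, polls_until_ready):
        self.polls_left = polls_until_ready

    def has_results(self):
        self.polls_left -= 1
        return self.polls_left <= 0


page = FakePage(polls_until_ready=3)
loaded = wait_for(page.has_results, timeout=2.0, interval=0.01)
```

The timeout matters: a page whose AJAX call fails would otherwise keep the script waiting forever.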

  3. Guilherme

    OH, Man! Oh, man! Oh, man!

    You just made my day a lot happier! Thank you!
    A million times, thank you! =)

  4. Xiaonei album backup program

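    [...] Google it, then. I remember searching that night in a mix of Chinese and English until three in the morning, trying several methods, and finally finding a fairly reliable one: Downloading a page’s content with python and WebKit [...]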

  5. Evandro Myller Carvalho Vieira

    Hello. This article was very useful for me, but I had to create something that fit my needs. Here[1] is the source, as a contribution.

    [1]: http://github.com/emyller/webkitcrawler

  6. wxuan

    I use your script to crawl web pages on Windows, but the Chinese characters are garbled. Can you give me some advice on how to fix this problem? Thank you!
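One common cause of garbled Chinese text on Windows is writing the page out with the platform's default code page instead of UTF-8. The sketch below assumes that is the problem here, which the comment does not confirm; the page itself could also declare a different charset. `save_html` and `load_html` are hypothetical helper names, and the sample string is made up for the round trip.

```python
import io
import os
import tempfile


def save_html(html, path):
    """Write text to `path` as UTF-8, regardless of the platform default encoding."""
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(html)


def load_html(path):
    """Read the file back, decoding it explicitly as UTF-8."""
    with io.open(path, "r", encoding="utf-8") as f:
        return f.read()


# Round-trip a short sample containing Chinese characters.
sample = u"<p>校内网相册</p>"
path = os.path.join(tempfile.mkdtemp(), "out.html")
save_html(sample, path)
round_trip = load_html(path)
```

If the file is then opened in a browser or editor, it also needs to be told the file is UTF-8, for instance through the page's own charset declaration.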

  7. Antonio

    Hi, I got this error:
    cannot connect to X server
    Wasn’t one of the requirements that the script run without X?
    Thanks
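The "cannot connect to X server" message usually means that no X display is reachable from the process. The GUI toolkits behind WebKit expect one even when nothing is drawn on screen, and a virtual framebuffer such as Xvfb can provide it. As a rough sketch, a launcher could check the `DISPLAY` variable and prefix the command with `xvfb-run` when it is missing. `needs_virtual_display`, `wrap_command` and `script.py` are hypothetical names, and the sketch assumes `xvfb-run` is installed.

```python
import os


def needs_virtual_display(environ=None):
    """Return True when no X display is advertised through DISPLAY."""
    env = os.environ if environ is None else environ
    return not env.get("DISPLAY")


def wrap_command(argv, environ=None):
    """Prefix argv with `xvfb-run -a` when no display is set; otherwise copy it."""
    if needs_virtual_display(environ):
        # -a asks xvfb-run to pick a free server number automatically.
        return ["xvfb-run", "-a"] + list(argv)
    return list(argv)


# With an empty environment the command is wrapped.
headless_cmd = wrap_command(["python", "script.py"], environ={})
# With a display already set, the command is left alone.
desktop_cmd = wrap_command(["python", "script.py"], environ={"DISPLAY": ":0"})
```

The resulting list can be handed to `subprocess.call` to start the script.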

  8. Sina

    I used the exact same code on Windows, but the output file still has all the original JavaScript! Do you guys have any idea why?

  9. neumino » Blog Archive » Parallel crawling and javascript

    [...] using things like wget or curl was out of the question. I eventually found a nice way to do things here. The last thing that I needed was to parallelize it (the webpages were quite [...]

  10. Suncokret

    I tried those two examples, but with no luck.

    On Ubuntu 11.04 this link helped me: http://github.com/emyller/webkitcrawler. The explanations from Tudor were excellent, though.

    The Ubuntu package python-webkitgtk … cannot be found in the 11.04 repository.

    Thanks

  11. philips

    I made a web scraping tool using WebKit and PySide; I hope it can help others.

  12. trash80

    @Suncokret – the package is ‘python-webkit’, which is pywebkitgtk. So much for meaningful names.

  13. Scraping AJAX web pages (Part 2) « The Ubuntu Incident

    [...] Downloading a page’s content with python and WebKit :: Downloading a page’s content afte… [...]

  14. Damian

    I cannot grab pages with this (it cannot connect to most of the resources), for example on my blog topsidershoes.org.
    I don’t know why. I tried both examples.

  15. Kevin Audleman

    Tudor,

    What’s the correct way to run this script? When I try simply running it from the shell I get the “Cannot connect to the X server” error (even though I have xvfb installed).

    I have managed to get it to run by passing it as a parameter to xvfb-run (command below), though it only returns .

    xvfb-run -a -s "-screen 0 640x480x16" python qttest.py -u=www.google.com -fout.html

    Thanks,
    Kevin

  16. Python - Javascript DOM?

    [...] can help us mimic browsers more closely Found some interesting ones called Pywebkitgtk and PyQt http://blog.motane.lu/2009/07/07/dow…on-and-webkit/ Or even compiling something like firefox into xbmc? Not sure how feasible that is, but saw some [...]

  17. Rick

    Thank you for this post, I have found it to be very informative. I am still having trouble getting the DOM after everything has loaded. The URL in question is “http://director.flyerservices.com/LCL/AccessibleFlyer/AccessibleCitySelector.aspx?OrganizationId=797d6dd1-a19f-4f1c-882d-12d6601dc376&BannerId=3d5f3800-c099-11d9-9669-0800200c9a66&BannerName=LOB&PublicationType=1&Language=en&Version=TEXT&NoRedirect=true&province=9”. Instead of getting a list of cities I just get the tags. The only success I have had is using Crowbar to render the contents and save them to a file, but I would rather do everything in Python. Any suggestions?

  18. ballu

    Hi,
    thanks for sharing such nice code and the tip.
    However, I am still getting a segmentation fault (core dump) with Qt.

    Do you know of any reason for this?
