42 Responses

  1. Radu

    I definitely have to try writing some more Python code when I finish my school exams. It’s just fucking amazing how quickly you become productive on things that would take at least three times as many lines of code in other languages (Python vs. Java/C/C++/C#).

    You could say it’s laziness over speed (compared to C/C++), but after all productivity is more important than speed, and more important than both is readability, which Python has as a built-in feature rather than as a virtue of the programmer.

  2. Downloading a page's content with python and WebKit :: Downloading a page's content after the javascript executed | Tudor Barbu's professional blog

    [...] been bragging with this post for quite some time now. Well, I won’t do that any more, because it seems that pywebkitgtk [...]

  3. Sergey

    Oh, man! I had been looking for exactly this for several hours already :) I’m totally new to Python/WebKit and it was not obvious to me how to solve my problem. Thanks a lot!

  4. Oliver

    hi

    Thanks for the examples
    I’m working on Windows and I get this error:

    Do you have any clue about what could be causing it?
    Thanks

  5. Oliver

    Thanks
    the Python version is 2.6.1
    At home I use Ubuntu and it works; at work I’ll install a virtual machine with Linux.

    Thanks for your answer

  6. Michael

    Thanks… that was really helpful. I was wondering if it is possible to make an AJAX request from Python. I mean, once the page is loaded I want to go to another page by clicking on a link in that page. The HTML code for the link is as follows.

    When I click on this link, an AJAX request is sent and it updates the box on the current page. Is it possible to send the request from Python and get the response?

  8. Nicholas Herriot

    This may help people: the very first example does not work on Ubuntu 9.10 because the GTK threads are not initialized.

    I found that by adding:

    gtk.gdk.threads_init()

  9. Nicholas Herriot

    Oops, I did not finish my post!

    Yes, you need to add gtk.gdk.threads_init() to initialize the GTK threads. I added it just after my last import.
    You also have to add the import statement to use threads.

    Hence:
    import threads

    Hope this helps someone… ;-)

  10. Nicholas Herriot

    Got a question. On the last example you don’t define the ‘sys’ object anywhere. So how did this work?

    Hence the line:

    >print 'You must specify an URL.',sys.argv[0],'--help for more details'

    will cause a Python error:

    NameError: global name 'sys' is not defined

    What is the proper syntax? As sys seems to have a list of arguments?
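A NameError like this normally means the module was never imported: `sys` is a standard-library module, and `sys.argv` is the list of command-line arguments, with the script name at index 0. The following is a minimal sketch and not the article's actual script; the `usage` and `main` helper names are hypothetical, and only the message text is taken from the line quoted above.

```python
import sys  # without this import, any reference to sys.argv raises NameError


def usage(argv):
    """Build the usage message; argv[0] is the name of the script."""
    return "You must specify an URL. %s --help for more details" % argv[0]


def main(argv):
    """Return an exit code: 1 if no URL was given, 0 otherwise."""
    # argv[1:] holds whatever was typed after the script name
    if len(argv) < 2:
        print(usage(argv))
        return 1
    print("URL: %s" % argv[1])
    return 0
```

In a real script you would end with `sys.exit(main(sys.argv))` so that the exit code reaches the shell.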

  11. Florentin

    Tudor, you saved me a lot of work with this tutorial so I send you many thanks
    I have discovered that you can use the ‘console-message’ signal to read content
    from the html.

    New code:

    This works well if there are no other console.log messages from the source HTML.
    If there are, you can add some keys to your console messages to recognize that they’re yours.

    In self._finished_loading you can add third-party JavaScript sources, run JavaScript commands through view.execute_script… then do view.execute_script("console.log…") to get the content back to the Python script.
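The code from this comment did not survive, so what follows is only a sketch of the idea it describes, not Florentin's original. It assumes the handler receives the arguments of WebKitGTK's `console-message` signal (view, message, line, source id); the `MARKER` key and all helper names are hypothetical. The key-matching logic is plain Python, so it can be tried without pywebkitgtk installed.

```python
# Hypothetical key that marks console.log output produced by our own script
MARKER = "PYWEBKIT:"

collected = []  # payloads received from the page


def extract_payload(message, marker=MARKER):
    """Return the text after the marker, or None for unrelated messages."""
    if message.startswith(marker):
        return message[len(marker):]
    return None


def build_dump_script(marker=MARKER):
    """JavaScript that logs the rendered HTML, prefixed with the marker."""
    return "console.log('%s' + document.documentElement.innerHTML);" % marker


def on_console_message(view, message, line, source_id):
    """Keep only the messages that carry our marker."""
    payload = extract_payload(message)
    if payload is not None:
        collected.append(payload)
    return True  # True tells WebKit the message was handled


# Assumed wiring inside the browser class (requires pywebkitgtk):
#   view.connect('console-message', on_console_message)
#   view.execute_script(build_dump_script())  # e.g. from _finished_loading
```

Prefixing each message with a key keeps the page's own console.log calls out of `collected`, which is the problem the comment warns about.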

  12. BrochesterL

    Definitely use the main_resource from the main frame. As you will see in the Python API for WebKit, it is what holds the bulk of all the AJAX-type stuff that is running. The only problem is that I haven’t been able to render JavaScript and “dump” (as it were) an entire page from this API; it needs some rethinking to be a true crawler.
    But here is what is required to view your Gmail inbox in pywebkitgtk,
    if you run the pywebkit browser demo

    When you are in your Gmail inbox, it will dump all your mail from the lists in the JavaScript source.
    Now if I could just get the links extracted, I could navigate and recursively download my mail. Only because my current provider blocks all SMTP!

  13. ryan

    It seems the code’s execution has nothing to do with python-webkitgtk: I had not installed it, yet the code above runs and does the work correctly.

  14. Jason

    It’d be great if anyone could write a Scrapy extension for this.

  15. Ajijor

    Thanks !
    This is wicked.

    Interestingly, I get an exception for the import of threads.
    I am on Ubuntu 10.10 with Python 2.6.6.
    I removed the import and stuff seems to be working ….

  16. Sérgio Basto

    Replace
    gtk.gdk.threads_init()
    with
    import gobject
    gobject.threads_init()

    It works even better :)
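Putting the thread-initialisation comments together: the fix is a single call made before the GTK main loop starts, and it needs no module called `threads` (the Python standard library has none, which would explain the import exception reported above). A minimal sketch, assuming PyGTK is installed; the `init_threads` helper name is hypothetical, and the import is guarded so the snippet also loads on machines without PyGTK.

```python
# Guarded import: gobject ships with PyGTK and may be missing elsewhere.
try:
    import gobject
except ImportError:
    gobject = None


def init_threads():
    """Initialise thread support; call once, before gtk.main().

    Returns True if gobject was available and initialised, False otherwise.
    (init_threads is a hypothetical helper name, not part of PyGTK.)
    """
    if gobject is None:
        return False
    gobject.threads_init()
    return True


threads_ready = init_threads()
```

`gtk.gdk.threads_init()` can be called at the same point if you prefer the variant from the earlier comment.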

  17. Passerby

    Hi,

    Using Selenium can be a solution as well.

  18. Sourabh Singi

    I have a problem installing python-webkitgtk. I googled for a solution, but couldn’t find much help.

    Has anyone faced the same problem and solved it?

  19. Ignacio

    hi.
    I have copied and pasted your code, but
    I don’t know why:

  20. Danilo

    I need help installing pywebkitgtk on Windows… please =(

  21. Rune K. Svendsen

    Here are the packages I had to install on Ubuntu 12.04 to get it working (at least the test works):

  22. Saksow

    I’ve tried all the PyQt and PyGTK WebKit scripts for JS scraping, but I always have the same problem: the script works as I want only the first time I run it, then it goes random… Has anyone tried this script multiple times on the same URL?

  23. Python crawl ajax page uses pywebkit and threadpool?

    [...] I want to crawl web pages rendered by Ajax, I refered to this post: http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/ [...]

  24. Python crawl ajax page uses pywebkit and threadpool?

    [...] want to crawl web pages rendered by Ajax, I refered to this post: http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/ it works. But I want to know how to use multi gtk instance run simultaneously to crawl pages. I [...]

  25. Yi

    How can I combine pywebkit with multiple threads?
    Does anyone know how to crawl pages simultaneously?

  26. Giovanni

    Hi, I would like to extend your code with the ability to read a .js file containing actions… I’ve made this:

    So now I call it with -u URL -f FILE -j FILE.JS
    but this doesn’t work.
    I need help.

  27. Giovanni

    My last problem is gone… now I have a new one.
    How can I get the source code of a page without the JavaScript being executed?

  28. Sergey

    Thank you so much for this article, it’s just amazing!!!! I was looking for a solution like this for the last ‘few’ days! I was looking at things like Selenium, python-spidermonkey, pyv8 and PhantomJS, and was trying to connect ELinks (a Linux text browser) to Python, and many other things! That’s just great!

  29. gion

    For all Windows/OS X lovers: you can use PySide (it also has WebKit) instead of GTK, go here http://qt-project.org/wiki/PySideDownloads.

    For your task, I recommend using http://code.google.com/p/selenium/wiki/ChromeDriver and if you want it headless, you can use a frame buffer server (Xvfb)
    I am sure there is a way to inject script with some of these “selenium” bricks.

    Good luck!

  30. Part

    The PySide link is down. Is it down only for me, or has something happened to the site? Thanks

  31. Ekta

    Tudor, very useful post. Could you also point out how this would work with pages that require authentication (passing the inputs via a POST call to the form)?
    I tried using BeautifulSoup, Selenium (with a lot of tweaks, which is how I landed on your webpage) and PyQt4 (WebKit) as described here: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/. While WebKit was pretty neat (see the sample render example on that page), I kept getting errors while trying to “see”/print the HTML object that I saved. In short, that did not solve my problem of redirection after “authentication” when using WebKit as the “middleware”.

    Question: with your module above, can I still plug in BeautifulSoup after the full render, once “_finished_loading” is finished? It would be great if you could give me specific pointers or redirect me on this.
    The redirection logic works seamlessly with a cookie cache (urllib2) and BeautifulSoup. I was hoping I could reuse that learning here. Can I?

  32. JoeyChan

    Hey, Tudor: The program code you provide doesn’t work for some URLs, such as the site http://www.csdn.net. I didn’t obtain the content that is generated by the JavaScript code after loading the page. Can you give me some ideas? Many thanks.

  33. Problems adjusting | nerdy world problems

    [...] and documentation on how to use it for what we need have been pretty scarce (although we have found a couple). I’ve been trying to write a script that uses pywebkitgtk to work with the issuer [...]

  34. zyong

    Hey Tudor: I want to use pywebkit to test pages. I’d like it to run without disturbing me: I don’t need a window to be shown, just a program that executes JS and Python. Can it do that?
