• I definitely have to try writing some more Python code when I finish my school exams. It’s just fucking amazing how quickly you become productive on things that in other languages would take at least three times as many lines of code (Python vs. Java/C/C++/C#).

    You could say it’s laziness over speed (compared to C/C++), but after all productivity is more important than speed, and more important than both is readability, which Python has as a built-in feature rather than as a virtue of the programmer.

  • Pingback: Downloading a page's content with python and WebKit :: Downloading a page's content after the javascript executed | Tudor Barbu's professional blog()

  • Oh, man! I had been looking for exactly this for several hours already 🙂 I’m totally new to Python/WebKit and it was not obvious to me how to solve my problem. Thanks a lot!

  • Try the Qt version, as it worked better for me! More details here.

  • Oliver

    hi

    Thanks for the examples
    I’m working on Windows and I get this error:

    Do you have any clue about what could be causing it?
    Thanks

  • What version of Python are you using? Is pywebkitgtk properly installed? I must admit that I haven’t tested on Windows…

    PS: perhaps the Qt version will yield better results, so have a look over this link: http://blog.motane.lu/2009/07/07/downloading-a-pages-content-with-python-and-webkit/ .

  • Oliver

    Thanks
    the Python version is 2.6.1
    At home I use Ubuntu and it works, at work I’ll install a virtual machine with linux.

    Thanks for your answer

  • I don’t use Windows so I don’t test my work in it.

  • Michael

    Thanks… that was really helpful. I was wondering if it is possible to make an AJAX request from Python. Once the page is loaded, I want to go to another page by clicking a link in that page. The HTML code for the link is as follows.

    When I click this link, an AJAX request is sent and it updates the box on the current page. Is it possible to send the request from Python and get the response?

  • This may help people: the very first example does not work on Ubuntu 9.10, because the GTK threads are not initialized.

    I found that by adding:

    gtk.gdk.threads_init()

  • Oops, did not finish my post!

    Yes, you need to add gtk.gdk.threads_init() to initialize the GTK threads. I added it just after my last import.
    You also have to add the import statement to use threads.

    Hence:
    import threads

    Hope this helps someone…. 😉

  • Got a question. On the last example you don’t define the ‘sys’ object anywhere. So how did this work?

    Hence line:

    >print 'You must specify an URL.', sys.argv[0], '--help for more details'

    will cause a Python error.

    NameError: global name ‘sys’ is not defined

    What is the proper syntax? As sys seems to have a list of arguments?

  • That got pasted wrong. I always ran it with the right arguments, so it never reached the sys.argv line. Python only resolves names when a line actually runs, so the missing import went unnoticed.

    I haven’t tested it on Koala yet, nor will I anytime soon, since that was a project for my former company. But it worked on Jaunty and on Debian (can’t remember the version).

    Added import sys and gtk.gdk.threads_init() in the code.
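
The behaviour discussed in this exchange can be reproduced in a few lines. This is only a sketch: `broken_usage`, `fixed_usage` and `undefined_module` are hypothetical names used for illustration, with `undefined_module` standing in for the `sys` module that was never imported in the pasted script.

```python
import sys


def broken_usage(argv):
    """Mimics the pasted script: the bad name is only hit without a URL."""
    if len(argv) < 2:
        # `undefined_module` stands in for the un-imported `sys`.
        # Python raises NameError here only when this line executes.
        return 'You must specify an URL. %s --help' % undefined_module.argv[0]
    return argv[1]


def fixed_usage(argv=None):
    """Same check after the fix: `sys` is imported at the top of the file."""
    argv = sys.argv if argv is None else argv
    if len(argv) < 2:
        return 'You must specify an URL. %s --help for more details' % argv[0]
    return argv[1]
```

Calling `broken_usage(['prog', 'http://example.com'])` returns the URL without complaint, while `broken_usage(['prog'])` raises `NameError`, which matches what was reported above.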

  • Tudor, you saved me a lot of work with this tutorial, so many thanks!
    I have discovered that you can use the ‘console-message’ signal to read content
    from the HTML.

    New code:

    This works well if there are no other console.log messages from the source HTML.
    If there are, you can add a key to your console messages to recognize that they’re yours.

    In self._finished_loading you can add third-party JavaScript sources, run JavaScript commands through view.execute_script…, then do view.execute_script("console.log…") to get the content back to the Python script.
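
The key-prefix idea described in this comment could look roughly like the sketch below. It rests on assumptions: the `PAGE_DUMP:` marker and the `ConsoleCollector` class are made up for illustration, and the handler signature `(view, message, line, source_id)` is what the WebKitGTK `console-message` signal is understood to pass, so check it against your pywebkitgtk version. The handler is plain Python, so it can be tried without GTK installed.

```python
MARKER = 'PAGE_DUMP:'  # hypothetical key; pick anything unlikely to clash


class ConsoleCollector(object):
    """Collects only the console messages that carry our marker."""

    def __init__(self, marker=MARKER):
        self.marker = marker
        self.payloads = []

    def on_console_message(self, view, message, line, source_id):
        # Skip console.log output produced by the page's own scripts.
        if not message.startswith(self.marker):
            return False  # not ours: let other handlers see it
        # Strip the marker and keep the rest as the payload.
        self.payloads.append(message[len(self.marker):])
        return True  # ours: stop further handling of this message

    def dump_script(self):
        # JavaScript that logs the rendered page, prefixed with the marker.
        return ("console.log('%s' + document.documentElement.innerHTML);"
                % self.marker)
```

With a real view this would presumably be wired up with `view.connect('console-message', collector.on_console_message)` and triggered from `_finished_loading` via `view.execute_script(collector.dump_script())`.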

  • Definitely use the main_resource from the main frame. As you will see in the Python API for WebKit, it is what holds the bulk of the data from all the AJAX-type stuff running. The only problem is that I haven’t been able to render the JavaScript and “dump” (as it were) an entire page from this API; it needs some rethinking to be a true crawler.
    But here is what is required to view your Gmail inbox in pywebkitgtk,
    if you run the pywebkit browser demo:

    When you are in your Gmail inbox, this will dump all your mail from the lists in the JavaScript source.
    Now if I could just get the links extracted so I can navigate and recursively download my mail. Only because my current provider blocks all SMTP!

  • ryan

    It seems the code’s execution has nothing to do with python-webkitgtk: I have not installed it, yet the code above still does the work correctly~~

  • Jason

    It’d be great if anyone could write a Scrapy extension for this.

  • Ajijor

    Thanks !
    This is wicked.

    Interestingly, I get an exception for the import of threads.
    I am on Ubuntu 10.10 with Python 2.6.6.
    I removed the import and stuff seems to be working ….

  • Sérgio Basto

    Replace
    gtk.gdk.threads_init()
    with
    import gobject
    gobject.threads_init()

    works even better 🙂

  • Passerby

    Hi,

    Using selenium can be a solution as well

  • Sourabh Singi

    I have a problem installing python-webkitgtk. I googled for a solution, but couldn’t find much help.

    Has anyone faced the same problem and solved it?

  • Ignacio

    hi.
    I have copied and pasted your code, but I get the following
    and I don’t know why:

  • There’s a problem with the indentation when you pasted the code. Just re-indent everything using either tabs or spaces, not a combination of the two.
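
A quick way to find the offending lines before re-indenting is sketched below. The helper names `indent_report` and `is_consistent` are hypothetical; the check only looks at leading whitespace and does not parse the code.

```python
def indent_report(source):
    """Group 1-based line numbers by the kind of leading whitespace used."""
    report = {'tabs': [], 'spaces': [], 'mixed': []}
    for number, line in enumerate(source.splitlines(), 1):
        # Leading run of spaces and tabs on this line.
        indent = line[:len(line) - len(line.lstrip(' \t'))]
        if not indent or not line.strip():
            continue  # unindented or blank lines tell us nothing
        if ' ' in indent and '\t' in indent:
            report['mixed'].append(number)
        elif '\t' in indent:
            report['tabs'].append(number)
        else:
            report['spaces'].append(number)
    return report


def is_consistent(source):
    """True when the file indents with tabs only or with spaces only."""
    report = indent_report(source)
    return not report['mixed'] and not (report['tabs'] and report['spaces'])
```

For a pasted script, `indent_report(open('script.py').read())` (with `script.py` as a placeholder file name) lists the lines to fix. The standard library's `tabnanny` module offers a similar check from the command line.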

  • Sourabh Singi> Can you please give more details on what distribution you are using?

  • Danilo

    I need help installing pywebkitgtk in windows…please =(

  • Sorry mate, I don’t do windows…

  • Rune K. Svendsen

    Here are the packages I had to install in Ubuntu 12.04 to get it working (at least the test works):

  • Saksow

    I’ve tried all the PyQt and PyGTK WebKit scripts for JS scraping, but I always have the same problem: the script works as I want only the first time I run it, then its behaviour becomes random… Has anyone tried this script multiple times on the same URL?

  • Pingback: Python crawl ajax page uses pywebkit and threadpool?()

  • Yi

    How can I combine pywebkit with multithreading?
    Does anyone know how to crawl pages simultaneously?

  • Giovanni

    Hi, I would like to extend your code so that it can read a .js file with actions… I’ve made this:

    So now I call it with -u URL -f FILE -j FILE.JS,
    but this doesn’t work.
    I need help.

  • Giovanni

    My last problem is gone… now I have a new one.
    How can I get the source code of a page without the JavaScript being executed?

  • Sergey

    Thank you so much for this article, it’s just amazing!!!! I was looking for a solution like this for the last ‘few’ days! I was looking at things like Selenium, python-spidermonkey, pyv8 and PhantomJS, and was trying to connect ELinks (a Linux text browser) to Python, among many other things! That’s just great!

  • For all Windows/OS X lovers: you can use PySide (it also has WebKit) instead of GTK, go here: http://qt-project.org/wiki/PySideDownloads.

    For your task, I recommend using http://code.google.com/p/selenium/wiki/ChromeDriver and if you want it headless, you can use a frame buffer server (Xvfb)
    I am sure there is a way to inject script with some of these “selenium” bricks.

    Good luck!

  • The PySide link is down. Is it down only for me, or has something happened to the site? Thanks

  • Pingback: 【已实现】想要通过python脚本实现抓取百度空间上的文章,评论,图片 v2011-12-20 | 在路上()

  • Tudor – Very useful post. Could you also point to how this would work with pages that require authentication (passing the inputs via a POST call to the form)?
    I tried using BeautifulSoup, Selenium (with a lot of tweaks, which is how I landed on your webpage) and PyQt4 (WebKit) as described here: http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/. While WebKit was pretty neat (see the sample render example on that page), I kept getting errors while trying to “see”/print the HTML object that I saved. In short, that did not solve my problem of redirection after authentication when using WebKit as the “middleware”.

    Question – With your module above, can I still plug in BeautifulSoup after the full render, once “_finished_loading” has finished? It would be great if you could give me specific pointers or redirect me on this.
    The redirection logic works seamlessly with a cookie cache (urllib2) and BeautifulSoup. I was hoping I could reuse that learning here. Can I?

  • Hey, Tudor: The program code you provide doesn’t work for some URLs, such as the site http://www.csdn.net. I didn’t obtain the content that is generated by the JavaScript code after loading the page. Can you give me some ideas? Many thanks.

  • Pingback: Problems adjusting | nerdy world problems()

  • zyong

    Hey Tudor: I want to use pywebkit to test a page. I’d like it to run unobtrusively: I don’t need a window to show, just a program that executes the JS and Python. Can it do that?

  • Amit

    Hey…. I am using Python 3.6 and PySide is not available for this version. I am finding it difficult to install WebKit and Qt4 as well. Is there any alternative to the above example? I need to scrape data from Flash.

    • Hui Jun

      Kinda late but you might want to set up a Python 2.7 virtual environment, or any other versions for that matter. That should allow you to install any legacy packages.

  • Amit

    I am using windows 10….

  • Tom Mori

    Can you rewrite it for Python 3? Otherwise, I may just do it. 🙂

    This would be much better than Selenium, as the latter rarely works for me when I run it: ChromeDriver or geckodriver does not want to load the URL, no matter how recent the browser and the servers are.

    • Tudor Barbu

      I personally don’t need it anymore. Please keep in mind that this post is from 2009 🙂
