Scraping JavaScript: Chromium won't install
Hi there
I'm currently taking the web scraping course and I'm having a lot of fun. The lectures are easy to follow and everything is explained nicely. I'm almost done with the course; only the lecture on scraping JavaScript-rendered HTML is left. However, I've run into a little problem and I'd appreciate your help.
Course: Web Scraping and API Fundamentals in Python
Section: The requests-html package
Video / lecture: Scraping JavaScript
From minute 2:20 you explain how the code 'await r.html.arender()' should download and install the Chromium browser before running the JavaScript on the page we're working on. However, Chromium won't install on my machine. This is the error I get:
[W:pyppeteer.chromium_downloader] start chromium download.
Download may take a few minutes.
---------------------------------------------------------------------------
Error                                     Traceback (most recent call last)
~\anaconda3\lib\site-packages\urllib3\contrib\pyopenssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname)
    484         try:
--> 485             cnx.do_handshake()
    486         except OpenSSL.SSL.WantReadError:
...
~\anaconda3\lib\site-packages\urllib3\util\retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    434
    435         if new_retry.is_exhausted():
--> 436             raise MaxRetryError(_pool, url, error or ResponseError(cause))
    437
    438         log.debug("Incremented Retry for (url='%s'): %r", url, new_retry)

MaxRetryError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /chromium-browser-snapshots/Win_x64/575458/chrome-win32.zip (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))

Any advice on how I should proceed further? :( Thanks in advance and thank you for the great course!
6 answers (0 marked as helpful)
Hi Joey,
You can try downgrading the 'urllib3' package to 1.25.8:
pip install urllib3==1.25.8
Best,
365 Team
Hi Nikola,
Thanks for your suggestion. Unfortunately, it didn't work. I got the same error.
Here's what I did:
1. I ran 'pip install urllib3==1.25.8' in Jupyter.
2. I restarted the kernel and ran all lines anew.
3. I ran 'pip show urllib3' to check whether the version was downgraded correctly. I got a positive response, namely:
Name: urllib3
Version: 1.25.8
Summary: HTTP library with thread-safe connection pooling, file post, and more.
Home-page: https://urllib3.readthedocs.io/
Author: Andrey Petrov
Author-email: andrey.petrov@shazow.net
License: MIT
Location: c:\users\dimit\anaconda3\lib\site-packages
Requires:
Required-by: requests, pyppeteer
Note: you may need to restart the kernel to use updated packages.
4. I ran 'await r.html.arender()', but got the same error as before, namely:
MaxRetryError: HTTPSConnectionPool(host='storage.googleapis.com', port=443): Max retries exceeded with url: /chromium-browser-snapshots/Win_x64/575458/chrome-win32.zip (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
I'm really frustrated, because I'm almost done with the course and this little thing is preventing me from finishing it entirely. Please help.
Never mind. Found a solution which worked out great. Here's the link:
https://github.com/miyakogi/pyppeteer/issues/258
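For anyone hitting the same certificate error: the workaround in that thread amounts to getting the Chromium snapshot onto disk yourself instead of letting pyppeteer download it. As a rough, stdlib-only sketch (the revision 575458 comes from the error message above; treat the helper below as illustrative, not pyppeteer's API):

```python
def snapshot_url(revision: str, platform: str = "Win_x64") -> str:
    """Build the Chromium snapshot URL pyppeteer tries to download
    (same shape as the URL in the error message above)."""
    return (
        "https://storage.googleapis.com/chromium-browser-snapshots/"
        f"{platform}/{revision}/chrome-win32.zip"
    )

# Revision number taken from the error message above.
print(snapshot_url("575458"))
# https://storage.googleapis.com/chromium-browser-snapshots/Win_x64/575458/chrome-win32.zip
```

Once you've downloaded the zip (a regular browser usually handles the certificate chain without trouble), extract it into the folder pyppeteer expects; if pyppeteer is installed, `pyppeteer.chromium_downloader.DOWNLOADS_FOLDER` should tell you where that is on your machine.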
I am glad you managed to fix the issue!
Thank you for sharing the solution as well!
Best,
365 Team
Now I have another problem. x-(
I'm working on the exercise 'Scraping YouTube' from the same 'Web Scraping and API Fundamentals in Python' course. Here's my code so far:
from requests_html import AsyncHTMLSession  # Load requests-html package
session = AsyncHTMLSession()  # Start a session
base_site = 'https://www.youtube.com'
r = await session.get(base_site)  # Send a GET request
await r.html.arender()  # Render the JS code on the website
However, I can't go on because I get the following error:
RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.
I tried rerunning the code and the error changed to:
Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error Target.detachFromTarget: Target closed.')>
pyppeteer.errors.NetworkError: Protocol error Target.detachFromTarget: Target closed.
Future exception was never retrieved
future: <Future finished exception=NetworkError('Protocol error (Target.sendMessageToTarget): No session with given id')>
pyppeteer.errors.NetworkError: Protocol error (Target.sendMessageToTarget): No session with given id
I can't get the code to run without an error. Please help.
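For what it's worth, the RuntimeError above is an event-loop problem rather than a requests-html bug: Jupyter already runs an asyncio event loop, and the synchronous HTMLSession tries to start a second one inside it. A small stdlib-only sketch (no requests-html needed) reproduces the mechanics and shows why `await` works where `run_until_complete()` fails:

```python
import asyncio

async def fake_render():
    # Stand-in for 'await r.html.arender()'; real code would drive Chromium.
    await asyncio.sleep(0)
    return "rendered"

async def main():
    # Inside a running loop (as in Jupyter), run_until_complete() fails,
    # because an event loop cannot be re-entered while it is running:
    coro = fake_render()
    try:
        asyncio.get_event_loop().run_until_complete(coro)
        nested = "no error"
    except RuntimeError:
        nested = "RuntimeError"
        coro.close()  # silence the "coroutine was never awaited" warning
    # Awaiting the coroutine directly, as AsyncHTMLSession expects, is fine:
    result = await fake_render()
    return nested, result

nested, result = asyncio.run(main())
print(nested, result)  # RuntimeError rendered
```

The 'Target closed' / 'No session with given id' messages on the rerun are most likely leftovers from the browser process of the earlier failed render; restarting the kernel before running the cells again usually clears them.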
Too bad the guys from the team weren't able to help. Anyway, I found a solution to my second problem (see the link below). The method described in the link is different from the one in the lecture, but it got the job done. Now I'm done with the exercise and the course. Scraping is a lot of fun. :)
https://towardsdatascience.com/data-science-skills-web-scraping-javascript-using-python-97a29738353f
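The article's approach differs from the lecture's, but a common pattern once you have the page source (rendered or not) is to pull structured data straight out of an embedded <script> blob with a regex plus json. A toy sketch on hypothetical HTML; the real YouTube variable is 'ytInitialData', but this sample only mimics the shape:

```python
import json
import re

# Hypothetical page source standing in for a real YouTube page.
html = """
<html><body>
<script>var ytInitialData = {"videos": [{"title": "Intro to Scraping"},
{"title": "Rendering JavaScript"}]};</script>
</body></html>
"""

# Grab the JSON object assigned to the script variable and parse it.
match = re.search(r"var ytInitialData = (\{.*?\});", html, re.DOTALL)
data = json.loads(match.group(1))
titles = [v["title"] for v in data["videos"]]
print(titles)  # ['Intro to Scraping', 'Rendering JavaScript']
```

On a real page the JSON is far larger and the non-greedy regex may need tightening, but the same extract-then-parse idea carries over.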