Posted on: 19 Apr 2022

Scraping multiple pages in Steam

My goal is to scrape information about Action games on Steam: each game's name, tags, and price.
The libraries I use are requests and BeautifulSoup.
URL: https://store.steampowered.com/tags/en/Action/#p=0&tab=ConcurrentUsers

I got it working for the first page and then tried to scrape 15 pages. My plan was to replace "/Action/#p=0" with "/Action/#p=1" in the URL and send a GET request, expecting the HTML response to contain the games from the next page. For some reason this did not work: even with "#p=15" I still get the HTML for the first page. I then inspected the pagination elements (1, 2, 3, 4, ...), but they do not contain any links. Next, I watched "Inspect > Network tab" for a request that looked like it loads the next page, and I found one whose response did contain the games from the next page. URL for the second page: https://store.steampowered.com/contenthub/querypaginated/tags/ConcurrentUsers/render/?query=&start=15&count=15&cc=BG&l=english&v=4&tag=Action&tagid=19
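A likely reason the "#p=..." change has no effect: everything after "#" is a URL fragment, which HTTP clients do not send to the server, so every such URL requests the same resource. The sketch below only splits the URLs from this post with the standard library; `request_target` is a hypothetical helper name introduced for illustration.

```python
from urllib.parse import urlsplit

BASE = "https://store.steampowered.com/tags/en/Action/"


def request_target(url):
    """Return (host, path-plus-query), i.e. the URL without its fragment."""
    parts = urlsplit(url)
    target = parts.path
    if parts.query:
        target += "?" + parts.query
    # parts.fragment ("p=0&tab=...") is intentionally left out here,
    # because the fragment is not part of the HTTP request.
    return parts.netloc, target


first = request_target(BASE + "#p=0&tab=ConcurrentUsers")
later = request_target(BASE + "#p=15&tab=ConcurrentUsers")

print(urlsplit(BASE + "#p=15&tab=ConcurrentUsers").fragment)  # p=15&tab=ConcurrentUsers
print(first == later)  # True: both URLs ask the server for the same thing
```

If this is what is happening, the page number is handled by JavaScript in the browser, which would also explain why the pagination elements carry no links.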

This is page 2: the "start" value divided by 15 (the "count" value) gives the zero-based page index, so start=15 is index 1. Unfortunately, the content is unusable because the tag hierarchy is messed up. For example:

           <span class="top_tag">
            FPS
           </span>
           <span class="top_tag">
            , Shooter
           </span>

comes back as:

       <span class='\"top_tag\"'>
        FPS&lt;\/span&gt;
        <span class='\"top_tag\"'>
         , Shooter&lt;\/span&gt;

The second span ends up as a child of the first, when it should be its sibling. Both examples were produced with BeautifulSoup's prettify() method using utf-8.
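The `\"` and `<\/span>` sequences look like JSON string escaping, which suggests the response is a JSON document with the HTML embedded as a string, and that the raw text was passed to BeautifulSoup without decoding. A minimal sketch of decoding first, assuming the HTML sits under a key named `results_html` (an assumption; check the keys of the actual payload). The sample payload is made up to mirror the spans above.

```python
import json

# Hand-written sample that mimics the escaping seen in the response.
raw_text = (
    r'{"success": 1, "results_html": '
    r'"<span class=\"top_tag\">FPS<\/span>'
    r'<span class=\"top_tag\">, Shooter<\/span>"}'
)


def extract_html(text, html_key="results_html"):
    """Decode the JSON body and return the embedded HTML string.

    html_key is an assumed key name, not confirmed by the post.
    """
    payload = json.loads(text)
    return payload[html_key]


def top_tags(html):
    """Return the text of every span.top_tag in an HTML snippet."""
    # Imported here so the JSON step above runs without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    return [span.get_text(strip=True) for span in soup.select("span.top_tag")]


html = extract_html(raw_text)
print(html)
# <span class="top_tag">FPS</span><span class="top_tag">, Shooter</span>
```

After decoding, the two spans are siblings again, so `top_tags(html)` can walk them normally. With requests, `response.json()` performs the same decoding as `json.loads(response.text)`.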

Is there a better way to do this? I am aware I could do it with regex or Selenium, but I wonder whether this task can be done with just BeautifulSoup and requests.
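The page-to-"start" mapping described above can be sketched as a small URL builder. The parameter values are copied from the second-page URL in this post; whether all of them are required, and whether they stay valid, is not confirmed. `page_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

ENDPOINT = (
    "https://store.steampowered.com/contenthub/querypaginated/"
    "tags/ConcurrentUsers/render/"
)
PAGE_SIZE = 15  # matches count=15 in the observed URL


def page_url(page_index, count=PAGE_SIZE):
    """Build the URL for a zero-based page index (start = index * count)."""
    params = {
        "query": "",
        "start": page_index * count,
        "count": count,
        "cc": "BG",
        "l": "english",
        "v": 4,
        "tag": "Action",
        "tagid": 19,
    }
    return ENDPOINT + "?" + urlencode(params)


urls = [page_url(i) for i in range(15)]
print(urls[1])  # same as the second-page URL found in the Network tab
```

Each URL could then be fetched with `requests.get(url)` and the body decoded as JSON before handing any HTML to BeautifulSoup.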
