Webscraping with Octoparse

Purpose: This process extracts data from webpages. It uses Octoparse, a partially free program (as of 2020) that allows up to 10 webscraping tasks (1,000 URLs in total) at a time. URLs are fed into Octoparse, which opens each webpage, extracts text from the same location on every page, and exports the results to a spreadsheet-formatted file.


Tools Needed:

Airtable

Google Analytics

Excel

Octoparse (Windows Machine)


Procedures:

  1. Set up Octoparse

    1. Sign up for a batch of URLs in the “URLS to process” tab of the Excel sheet

      1. Do not exceed 100 URLs at a time. Octoparse struggles with more than that.

    2. Open Octoparse

    3. From the Advanced Mode dropdown, add a new task (delete old tasks if you need to clear space; the limit is 10 tasks at a time)

    4. Copy the 100 URLs into the Website box and click Save URL

    5. For each batch of 100, run the following web scrapes
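The batch limits in step 1 can be sketched in a few lines of Python. This is only an illustration for preparing the URL list before pasting it into Octoparse; the function name `make_batches` and the placeholder URLs are hypothetical, and the limits (100 URLs per task, 10 tasks) are taken from the steps above.

```python
from typing import Iterable, List

MAX_URLS_PER_TASK = 100  # from the procedure: do not exceed 100 URLs at a time
MAX_TASKS = 10           # from the procedure: Octoparse allows 10 tasks at a time


def make_batches(urls: Iterable[str], size: int = MAX_URLS_PER_TASK) -> List[List[str]]:
    """Split URLs into consecutive batches of at most `size`, skipping blank entries."""
    cleaned = [u.strip() for u in urls if u and u.strip()]
    return [cleaned[i:i + size] for i in range(0, len(cleaned), size)]


# Hypothetical example data: 250 placeholder URLs, not real catalog searches.
example_urls = [f"http://example.org/search?q={n}" for n in range(250)]
batches = make_batches(example_urls)
print([len(b) for b in batches])  # [100, 100, 50]

if len(batches) > MAX_TASKS:
    print("More batches than Octoparse tasks; run them in several rounds.")
```

Each inner list corresponds to one Octoparse task.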

  2. Collect Search Result Listing

    1. Begin each batch with the training URL (this URL shows title covers): http://discover.lib.usu.edu/iii/encore/plus/C__SAckermann%20human%20evolution__Orightresult__U?lang=eng&suite=cobalt

    2. Collect the titles

      1. In the image of the webpage at the bottom of the Octoparse tool, click on the title of the first result.

      2. In the Action Tips box to the right

        1. Click on “Select all”

        2. Note whether the Link selected message says that 25 similar URLs were found. If it reports fewer, quickly scan the page to make sure all titles are highlighted after you click “Select all.” Click on any unhighlighted titles and click “Select all” wherever possible. This happens most frequently with titles that appear over images of covers. Look for “25 selected” to know when you have all titles selected.

      3. Click on “Extract text of the selected link”

    3. Collect the links

      1. In the image of the webpage at the bottom of the Octoparse tool, click on the title of the first result.

      2. In the Action Tips box to the right

        1. Click on “Select all”

        2. Note whether the Link selected message says that 25 similar URLs were found. If it reports fewer, quickly scan the page to make sure all titles are highlighted after you click “Select all.” Click on any unhighlighted titles and click “Select all” wherever possible. This happens most frequently with titles that appear over images of covers. Look for “25 selected” to know when you have all titles selected.

      3. Click on “Extract the URL of the selected link”

    4. Collect the search terms used

      1. Click on the text box that contains the search terms OR click on the second half of the “Results” string where the search string is listed. Choose one or the other, not both. (Note: if you select the text box for the search terms, the data will be hidden in the data preview but will still export correctly.)

        1. Select “Extract text”

    5. In the “Add predefined field” area, add two fields and move them to the top of the data preview in this order:

      1. Add Current Page Info -> Web page URL

      2. Current Time

    6. NOTE: You should have five fields at this point:

      1. Page_URL

      2. Current Time

      3. Field1_Text (titles)

      4. Field2_Link (urls)

      5. Field3_Text (search terms)

    7. Click on Save and Run

    8. Click on Local Extraction

    9. Click on Export Data

      1. Save file as “Encore_Batch#_Item#-#_Date”

    10. Click on Finish or the hyperlinked filename
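Before numbering the results, it can help to confirm that the export contains the five expected fields and that no search page produced more than 25 rows. The sketch below assumes the data was exported as CSV with a header row that uses exactly the field names listed in step 6; the actual header text in an Octoparse export may differ, and `check_export` is a hypothetical helper, not part of Octoparse.

```python
import csv
import io
from collections import Counter

# Assumed header names, copied from the field list in the procedure.
EXPECTED_FIELDS = ["Page_URL", "Current Time", "Field1_Text", "Field2_Link", "Field3_Text"]
MAX_RESULTS_PER_PAGE = 25  # from the procedure: no search should have more than 25 results


def check_export(csv_file):
    """Return a list of human-readable problems found in an exported CSV file object."""
    reader = csv.DictReader(csv_file)
    missing = [f for f in EXPECTED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    counts = Counter(row["Page_URL"] for row in reader)
    for url, n in counts.items():
        if n > MAX_RESULTS_PER_PAGE:
            problems.append(f"{url}: {n} rows (expected at most {MAX_RESULTS_PER_PAGE})")
    return problems


# Hypothetical two-row export used only to show the call.
sample = io.StringIO(
    "Page_URL,Current Time,Field1_Text,Field2_Link,Field3_Text\n"
    "http://example.org/s1,2020-05-01 10:00,Title A,http://example.org/a,evolution\n"
    "http://example.org/s1,2020-05-01 10:00,Title B,http://example.org/b,evolution\n"
)
print(check_export(sample))  # an empty list means no problems were found
```

For a real file, pass `open(path, newline="", encoding="utf-8")` instead of the `io.StringIO` sample; the encoding is also an assumption.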

  3. Number the results

    1. Open the Excel Sheet

    2. Delete all the results for Ackermann (because this was only used to calibrate Octoparse)

      1. Insert a column between the Current Time and the Title fields and call it Result #

      2. Add 1, 2, 3, etc. to indicate which position each URL held in the list (note that not all URLs will have 25 options but NONE should have more than 25.)

    3. Copy the data into Airtable (Batch X Search Results tab, “Adding Records” view)

      1. Page_URL = URL

      2. Current Time = Data Link Extracted

      3. Result # = Result #

      4. Field1_Text = Item Title

      5. Field2_Link = Item URL

    4. Record the Batch and ID numbers in the following fields:

      1. SID – Link to the Search ID from the URLS tab

        1. This is the most tedious part: you must be sure you are linking to the correct URL. Searching by the Page URL will not work. TIP: Open the Airtable base in two windows side by side, with one window set to the “URLs” tab and the other to the batch being pulled (e.g., “Batch 1 Search Results”). Search the URLS tab for the Page URL, identify the URL ID (UID), and then use that UID in the SID column on the appropriate batch tab.

      2. Batch # - Link to the batch # in the Stats tab

        1. Copy and paste down. All items on a single tab will share the same Batch # (e.g., Batch 1 Search Results = Batch-1 or COVIDBatch-1)
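The numbering and field mapping in step 3 can also be sketched in Python for anyone who prefers to prepare the rows outside Excel. This is an illustration under assumptions, not the documented workflow: `number_results` is a hypothetical helper, rows are assumed to be dictionaries keyed by the Octoparse field names, and training rows are recognized by a fragment of the training URL from step 2. Linking the SID and Batch # still has to be done in Airtable as described above.

```python
from typing import Dict, List

# Fragment of the training URL from the procedure; rows from that page are discarded.
TRAINING_MARKER = "SAckermann%20human%20evolution"


def number_results(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Drop training rows, number results per Page_URL, and rename fields for Airtable."""
    position: Dict[str, int] = {}
    numbered = []
    for row in rows:
        url = row["Page_URL"]
        if TRAINING_MARKER in url:
            continue  # calibration data only
        position[url] = position.get(url, 0) + 1  # 1, 2, 3, ... within each search page
        numbered.append({
            "URL": url,
            "Data Link Extracted": row["Current Time"],
            "Result #": position[url],
            "Item Title": row["Field1_Text"],
            "Item URL": row["Field2_Link"],
        })
    return numbered


# Hypothetical rows used only to show the call.
example_rows = [
    {"Page_URL": "http://example.org/C__SAckermann%20human%20evolution__U",
     "Current Time": "t0", "Field1_Text": "Training", "Field2_Link": "http://example.org/t"},
    {"Page_URL": "http://example.org/s1",
     "Current Time": "t1", "Field1_Text": "Title A", "Field2_Link": "http://example.org/a"},
    {"Page_URL": "http://example.org/s1",
     "Current Time": "t1", "Field1_Text": "Title B", "Field2_Link": "http://example.org/b"},
]
for record in number_results(example_rows):
    print(record["Result #"], record["Item Title"])
```

The numbering relies on the rows staying in the order Octoparse exported them, so sort or filter only after the Result # values are assigned.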
