Webscraping with Octoparse
Purpose: This process is used to extract data from a webpage. It utilizes a partially free (as of 2020) program that will allow for up to 10 groups of webscrapping processes (1,000 URLs) at a time. In this process, URLs are fed into the webscrapping program, which opens up the webpage and extracts text from the same location on each webpage and exports them to a spreadsheet formatted file.
Tools Needed:
Airtable
Google Analytics
Excel
Octoparse (Windows Machine)
Procedures:
Set up Octoparse
Sign up for a batch of URLS in the “URLS to process” tab of the Excel sheet
Do not exceed 100 URLs at a time. Octoparse struggles with more than that.
Open Octoparse
From the Advanced Mode dropdown, add a new task (delete old tasks, if you need to clear space. Limit is 10 tasks at a time)
Copy the 100 URLs into the Website box and click Save URL
For each batch of 100, run the following web scrapes
Collect Search Result Listing
Begin each batch with the training URL (this URL shows title covers): http://discover.lib.usu.edu/iii/encore/plus/C__SAckermann%20human%20evolution__Orightresult__U?lang=eng&suite=cobalt
Collect the titles
In the image of the webpage at the bottom of the Octoparse tool, click on the title of the first results.
In the Action Tips box to the right
Click on “Select all”
Note if the Link selected does not say that 25 similar URLs were found (if fewer, quickly scan to make sure all are highlighted after you click “Select all.” Click on any unhighlighted titles and “select all” wherever possible. This will happen most frequently with titles that over images of covers. Look for 25 selected” to know when you have all titles selected.
Click on “Extract text of the selected link”
Collect the links
In the image of the webpage at the bottom of the Octoparse tool, click on the title of the first results.
In the Action Tips box to the right
Click on “Select all
Note if the Link selected does not say that 25 similar URLs were found (if fewer, quickly scan to make sure all are highlighted after you click “Select all.” Click on any unhighlighted titles and “select all” wherever possible. This will happen most frequently with titles that over images of covers. Look for 25 selected” to know when you have all titles selected.)
Click on “Extract the URL of the selected link”
Collect the search terms used
Click on the text box that contains the search terms OR click on the second half of the “Results” string where the search string is listed (not both, just one or the other) (Note: if you select the text box for the search terms, the data will be hidden in the data preview, but will still export just fine.)
Select “Extract text”
In the “Add predefined field” area, add two fields and move them to the top of the data preview in this order:
Add Current Page Info -> Web page URL
Current Time
NOTE: You should have four fields at this point:
Page_URL
Current Time
Field1_Text (titles)
Field2_Link (urls)
Field3_Text (search terms)
Click on Save and Run
Click on Local Extraction
Click on Export Data
Save file as “Encore_Batch#_Item#-#_Date”
Click on Finish or the hyperlinked filename
Number the results
Open the Excel Sheet
Delete all the results for Ackermann (because this was only used to calibrate Octoparse)
Insert a column between the Current Time and the Title fields and call it Result #
Add 1, 2, 3, etc. to indicate which position each URL held in the list (note that not all URLs will have 25 options but NONE should have more than 25.)
Copy the data into Airtable (Batch X Search Results tab, “Adding Records” view)
Page_URL = URL
Current Time = Data Link Extracted
Result # = Result #
Field1_Text = Item Title
Field2_Link = Item URL
Record the Batch and ID numbers in the following fields:
SID – Link to the Search ID from the URLS tab
This is the most tedious part – you will need to be sure you are linking to the correct URL. Searching by the Page URL will not work. TIP: Open the Airtable base in two windows side by side. One window set to “URLs” tab and the other to the appropriate batch being pulled (ex. “Batch 1 Search Results”). Search the URLS tab for the Page URL, identify the URL ID (UID) and then use that UID in the SID column on the appropriate batch tab.
Batch # - Link to the batch # in the Stats tab
Copy and paste down. All items on single tab will share the same Batch # - (i.e. Batch 1 Search Results = Batch-1 or COVIDBatch-1)