Basic BeautifulSoup Crawler


The media files of this site were moved to an Amazon S3 bucket. Forestry added Amazon S3 support, but it doesn't handle migration at this point, so replacing every image link by hand was a pain. Without diving into the S3 JavaScript SDK, a simple Python crawler can make the process easier.


##### Import library #####
from bs4 import BeautifulSoup


##### Prepare Data #####
html = """<paste html here>""" # Inspect the S3 console and paste the table element that contains all the objects in the folder
soup = BeautifulSoup(html, "html.parser") # Parse the HTML into a soup object
imgs = soup.find_all("a", {'class': 'list-view-item-name'}) # Find all <a> tags with the object-name class

public_path = "<public bucket URL>" # Base URL of the bucket (placeholder)
project_path = "<project folder>"   # Folder the objects live in (placeholder)


##### Print Data #####
print('images:')
for img in imgs: # Print a link for every object in the folder pasted above
    print("- " + public_path + project_path + "/" + img.text.strip())  # The objects need to be publicly accessible
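
The printed links only work if each object is actually public. As a quick sanity check, a minimal sketch like the one below (not part of the original script, and assuming the third-party `requests` package is installed) can send a HEAD request to every generated URL and flag any object that isn't reachable.


##### Check Data (optional) #####
import requests  # assumption: the requests package is available

for img in imgs:
    url = public_path + project_path + "/" + img.text.strip()
    resp = requests.head(url)  # HEAD avoids downloading the file itself
    if resp.status_code != 200:  # a non-200 status usually means the object is not publicly readable
        print("not accessible: " + url)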