Serverless Web Scraping with AWS Lambda and Chrome
I recently came across the following problem in a hobby project, and it turned out much harder than I anticipated.
I wanted to scrape a webpage on a regular interval and store a specific number calculated from this webpage's contents as a timeseries for later analysis. Technically, this was a pretty simple problem, and there are certainly a lot of resources out there about effective web scraping. But the complication arises not with scraping itself, but with the time dimension. This website changed a few times a day, and I was interested in how this value changed over time. Obviously, manually executing a program or setting up a cron job on my local computer would not work well if I wanted to reliably track this information over months and didn't want to check that my computer was running all the time.
The natural solution is either a VM instance on the cloud or a serverless function. I wanted to use a serverless function since they are free at small scale and immensely flexible. A managed VM instance would require more maintenance and more deployment overhead, and although it would be cheap, it would not be free. However, actually porting a working local solution to a serverless environment was harder than I expected, and the difficulty was in browser automation. The serverless environment needs to install chrome, include its binary dependencies, actually support running such a heavy binary in addition to the function code, etc. I thought this would be a relatively simple problem to solve, but it turned out much more complicated than expected, so I wanted to share how to solve it.
Note that this solution is probably the last one you should reach for, because there are several pre-packaged options that solve the issue if your scenario allows. Node.js, for example, has puppeteer, and Python has similar 3rd party packages which allow you to easily install different web browsers from within your script. This approach has its own impact on cold starts for serverless functions to watch out for, but may easily be worth avoiding the hassle of the alternative. Lastly, there are even SaaS solutions like browserless.io which basically offer web browser management as a service and may be more appropriate for an enterprise solution, or where the page being scraped has automation blocks (à la Cloudflare, CAPTCHA, etc.). So the approach below is applicable where the language environment has no pre-existing solution (imagine Rust or C++), where possibly long and flaky cold starts would be an issue, or where SaaS solutions should be avoided for some reason.
Create Serverless Environment with Chrome
Serverless functions usually run in some pre-configured vanilla environment defined by the cloud provider. This environment probably has the language runtime and core binary dependencies (like some POSIX subset) but not much else. Most importantly, this environment does not include chrome, and therefore there needs to be control over the environment in which the function code is executed. AWS provides the ability to use a custom docker image for a function (at the time of writing Microsoft Azure does not), so a docker image that includes chrome and our function code needs to be created, the image needs to be hosted on AWS, and then the serverless function needs to point at that container. There are plenty of resources on the last two parts, so I will focus on the first.
Of course, all the code that runs in the docker image creation could also be run in the function itself if you want to avoid docker or are stuck with a provider that doesn't support custom docker images for serverless functions, but overall, this seems like a bad idea™ that would need some downstream mitigations, like reserved instances and increased maximum function run time.
Dockerfile
AWS's latest configurable docker image is AL2023. This is a Fedora-like image but has a lot of things missing or swapped (the yum package manager is replaced by dnf, for example). The older AL2 seems to have more features, but is therefore a bigger base image. I could never get that image to fit within the free limits of AWS image hosting, and there was another technical issue I could not resolve, so something in this script would have to change for AL2.
FROM public.ecr.aws/lambda/provided:al2023
ARG CHROME_VERSION
# Download chrome. The wget and unzip commands download and place the chrome
# binaries, while the remaining commands keep the docker image as small as
# possible: the original zip files are deleted, language packs other than
# English are removed (since the UI is never really used), and wget and unzip
# are uninstalled afterwards.
RUN dnf install -y unzip wget findutils && \
mkdir /downloads && \
wget -O /downloads/driver.zip https://storage.googleapis.com/chrome-for-testing-public/${CHROME_VERSION}/linux64/chromedriver-linux64.zip && \
wget -O /downloads/main.zip https://storage.googleapis.com/chrome-for-testing-public/${CHROME_VERSION}/linux64/chrome-linux64.zip && \
unzip /downloads/driver.zip -d /programs/ && \
unzip /downloads/main.zip -d /programs/ && \
rm -rf /downloads && \
readarray -d '' to_delete < <(find /programs/chrome-linux64/locales/ -type f ! -name 'en-US.pak' -print0) && \
rm "${to_delete[@]}" && \
dnf remove -y unzip wget findutils && \
dnf clean all
# Set up chrome so the binary is in the usual location. This can be overridden as necessary.
RUN ln -s /programs/chrome-linux64/chrome /usr/bin/google-chrome
# Install all of chrome's dependencies. AL2023 is not able to install an RPM package directly, so the dependencies listed in chrome's rpm.deps file are installed manually by the following.
RUN readarray -t deps < <(cat /programs/chrome-linux64/rpm.deps | grep -v "rpmlib\|^$") && \
dnf install -y "${deps[@]}" && \
dnf clean all
This should basically create the smallest docker image that can run on AWS and which can run chrome.
The binaries in the docker image come from Chrome for Testing. Historically, the normal consumer version of chrome was used for automation, but this ran into issues with the requirement that the chromedriver be the exact same version as chrome. Since the consumer version auto-updates in the background as new releases ship, this easily resulted in strange version-mismatch errors. The Chrome for Testing builds are meant for automation and come in several flavors depending on the exact platform.
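To make the CHROME_VERSION build argument concrete, a build and push might look like the following sketch. The account ID, region, repository name, and pinned Chrome version here are all placeholders; substitute your own:

```shell
# Hypothetical values: substitute your own account, region, repo, and version.
ACCOUNT=123456789012
REGION=us-east-1
REPO=scraper
CHROME_VERSION=120.0.6099.109

# Build the image, passing the pinned Chrome for Testing version.
docker build --build-arg CHROME_VERSION="$CHROME_VERSION" -t "$REPO" .

# Push to ECR so the Lambda function can reference the image.
aws ecr get-login-password --region "$REGION" | \
  docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"
docker tag "$REPO:latest" "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:latest"
```

The ECR repository must already exist, and the Lambda function is then created from (or updated to) this image URI.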
Chrome Command Line Arguments
Running a vanilla chrome binary in an AWS Lambda will likely crash. I found the following arguments necessary to avoid issues.
--no-sandbox
--disable-dev-shm-usage
--no-zygote
--single-process
--headless
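With Selenium in Python, for example, these flags can be collected in one place and applied when constructing the driver. This is a sketch: the helper name is my own, and the chromedriver path assumes the layout produced by the Dockerfile above.

```python
# Flags needed for Chrome to survive inside an AWS Lambda container.
LAMBDA_CHROME_FLAGS = [
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--no-zygote",
    "--single-process",
    "--headless",
]


def make_driver():
    """Hypothetical helper: build a Selenium Chrome driver for Lambda."""
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    for flag in LAMBDA_CHROME_FLAGS:
        options.add_argument(flag)
    # chromedriver was unzipped to /programs by the Dockerfile above.
    service = Service("/programs/chromedriver-linux64/chromedriver")
    return webdriver.Chrome(service=service, options=options)
```

The chrome binary itself is found via the /usr/bin/google-chrome symlink created earlier, so no binary_location override is needed.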
Avoiding Detection
Additionally, I have found the following process helpful to bypass certain automation controls (mostly cloudflare checks). Note that bypassing these checks in general can be quite complicated, so while these worked for my specific automation scenario, other reasonable scenarios may need a slightly different set of configurations to properly work. In either case, these are likely to be a good starting point.
In fact, most of the value added by services like browserless.io probably comes from solving this specific issue. It's a difficult problem, but one that comes up frequently, and solving it is never really the end goal of any automation scenario.
These additional flags should be added.
--disable-blink-features=AutomationControlled
--disable-infobars
The excludeSwitches option on the Selenium driver should include enable-automation, and the useAutomationExtension option should be false.
Finally, when chrome is driven by automation, navigator.webdriver reports true, which can be a giveaway of automation. To handle this, execute the following JS via the driver library: Object.defineProperty(navigator, 'webdriver', {get: () => undefined}).
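Putting these detection tweaks together with Selenium in Python might look like the following sketch; the helper names are my own, while the option keys are Selenium's. Injecting the patch via Page.addScriptToEvaluateOnNewDocument makes it run before any page script, rather than only on the current page.

```python
# JS patch that hides the webdriver flag from page scripts.
WEBDRIVER_PATCH_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)


def harden_options(options):
    """Apply the anti-detection switches to a Selenium ChromeOptions object."""
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_argument("--disable-infobars")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    return options


def patch_navigator(driver):
    """Register the patch so it runs before any page script on every new page."""
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument",
        {"source": WEBDRIVER_PATCH_JS},
    )
```

harden_options is called on the ChromeOptions before the driver is created; patch_navigator is called once on the live driver.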
User Agent Changes
Additionally, with the --headless flag, chrome's user-agent string will contain "HeadlessChrome" rather than just "Chrome", which some sites will automatically reject. In this case, you can change the user-agent to something static by using the --user-agent switch. Simply open a Chrome DevTools console locally, read the value of navigator.userAgent, and use that.
In my case I expanded a bit on this because I wanted to be able to easily update the chrome version by changing only one config value in my docker build command, and a static UA string within my code doesn't easily allow that. Additionally, UA strings are complicated, and just changing the chrome version may create an invalid UA string. Since the only issue with the UA string was that it included "Headless", it had to be dynamically rewritten to remove that. Pseudo-code to do this via CDP (Chrome DevTools Protocol), which should be accessible via any decent web browser driving library, is shown below:
let user_agent = execute_js("return navigator.userAgent;")
user_agent = user_agent.replace("HeadlessChrome", "Chrome")
execute_cdp("Emulation.setUserAgentOverride", {"userAgent": user_agent})
Conclusion
Hopefully this unblocks your automation scenario and lets you combine the flexibility of serverless functions with the power of chrome. This is a small guide that just reflects what I was able to gather from my own experience, so it may not be complete, but please leave comments or questions and we can fill in any gaps discovered along the way.
