Over the last few years, web scraping has become very common as mature headless browser automation tools such as Puppeteer and Playwright have grown in popularity.
However, such automation can be detected in numerous ways: request throttling is very common, and Google reCAPTCHA v3 silently calculates a score to decide whether the user is a human or a bot. reCAPTCHA has been cracked frequently in recent years, and you can find open-source solvers fairly easily by searching for something like 'puppeteer recaptcha solver'. Throttling, however, is hard to work around in most cases, perhaps because resources such as IP addresses are limited.
I'm sure many people who have run into this problem have already tried the obvious solutions, such as distributing requests across various cloud providers. Apart from resource limits, though, they eventually get banned, because the IP address ranges of cloud providers are well known and continuously updated in public.
So, I thought to myself, what if I could use lesser-known IP address ranges?
4G LTE on Linux
Using the cellular network, you can easily get real, rotating, reputable IP addresses that carry legitimate traffic, which means the chances of being blocked are relatively low.
To set up a cellular-based proxy, you can either use Personal Hotspot or get a USB cellular modem. I chose the latter because it's more reliable and faster, and because my cellular plan lets me share the monthly data allowance among up to 2 devices.
I plugged a Sierra Wireless EM7565 cellular modem into the server and attached it to a Linux VM. On the Linux side, I set up the modem using good old-fashioned network-manager (nmcli) and modem-manager (mmcli), which gave me a new network interface called wwan0.
$ lsusb
Bus 001 Device 003: ID 1199:9091 Sierra Wireless, Inc. Sierra Wireless EM7565 Qualcomm® Snapdragon™ X16 LTE-A
$ mmcli -b 0
------------------------
General | dbus path: /org/freedesktop/ModemManager1/Bearer/0
| type: default-attach
------------------------
Status | connected: yes
| suspended: no
| ip timeout: 20
------------------------
Properties | apn: lte.ktfwing.com
| ip type: ipv4v6
$ ip -4 a show dev wwan0
3: wwan0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
inet 10.79.188.186/30 brd 10.79.188.187 scope global noprefixroute wwan0
valid_lft forever preferred_lft forever
$ sudo speedtest -I wwan0
Speedtest by Ookla
Server: kdatacenter.com - Seoul (id = 6527)
ISP: Korea Telecom
Latency: 41.96 ms (11.61 ms jitter)
Download: 80.37 Mbps (data used: 88.2 MB)
Upload: 7.89 Mbps (data used: 14.1 MB)
Packet Loss: 0.0%
What's left is running a socks5 forward proxy in the VM so that I can selectively route traffic through it. There are several widely known solutions, and Dante is one of the best known, but I failed to configure it for the use case I tried in the Using Multiple NICs on Linux post. I couldn't make it relay traffic between the internal interface and the external WWAN interface, maybe because the Linux kernel doesn't allow sending packets in the wrong (that is, the desired) direction.
After some struggling with sockets, I ended up finding an alternative to Dante: https://github.com/nadoo/glider. Glider is written in Go, and I knew it's easy to set a socket's outgoing interface using net.Dialer, which I had tried when making a hobby project called unofficial-benchbee-speedtest. I checked, and Glider already does this. Unexpectedly, though, it didn't work on Ubuntu Linux 21.04, so I dug into the implementation of net.Dialer and found that I can manipulate socket options using the net.Dialer.Control function, which gives access to the underlying file descriptor, so that I can explicitly select which network interface should be used.
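The post does not say which socket option was set from the Control hook. On Linux a common choice is SO_BINDTODEVICE, so that is the assumption in the sketch below, which shows the same idea in Python rather than Go. The helper names interface_option and connect_via are made up for illustration, and depending on the kernel the call may need root or CAP_NET_RAW.

```python
import socket

# Linux value of SO_BINDTODEVICE, used as a fallback if this Python build
# does not export the constant.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)
IFNAMSIZ = 16  # Linux limit for interface names, including the trailing NUL


def interface_option(ifname: str) -> bytes:
    """Encode an interface name as a NUL-terminated byte string."""
    raw = ifname.encode("ascii")
    if not raw or len(raw) >= IFNAMSIZ:
        raise ValueError(f"invalid interface name: {ifname!r}")
    return raw + b"\0"


def connect_via(ifname: str, host: str, port: int, timeout: float = 5.0) -> socket.socket:
    """Open a TCP connection bound to `ifname` (for example wwan0)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Assumption: SO_BINDTODEVICE is the option that selects the interface.
        # This may raise PermissionError without sufficient privileges.
        sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE, interface_option(ifname))
        sock.settimeout(timeout)
        sock.connect((host, port))
    except OSError:
        sock.close()
        raise
    return sock
```

Something like connect_via("wwan0", "api.ipify.org", 80) would then send its packets through the modem regardless of the default route, provided the interface exists and is up.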
After that, I just opened up port 1088.
$ curl --proxy "socks5h://localhost:1088" https://ipapi.co/json
{
"ip": "110.70.15.253",
"version": "IPv4",
"city": "Seoul",
"region": "Seoul",
"region_code": "11",
"country": "KR",
"country_name": "South Korea",
"country_code": "KR",
"country_code_iso3": "KOR",
"country_capital": "Seoul",
"country_tld": ".kr",
"continent_code": "AS",
"in_eu": false,
"postal": "02878",
"latitude": 37.5944,
"longitude": 126.9864,
"timezone": "Asia/Seoul",
"utc_offset": "+0900",
"country_calling_code": "+82",
"currency": "KRW",
"currency_name": "Won",
"languages": "ko-KR,en",
"country_area": 98480.0,
"country_population": 51635256.0,
"asn": "AS4766",
"org": "Korea Telecom"
}
Cloudflare WARP on Linux
Cloudflare WARP is a VPN service provided by Cloudflare and built on WireGuard. iCloud Private Relay is known to run partly on Cloudflare's network as well. WARP is relatively new and less known, so I thought it would be suitable as a web scraping proxy before it becomes too well known. One more thing: I've been using Cloudflare Workers (not WARP) for scraping for over a year, and I've seen a fairly low error rate compared to other cloud providers. (It gets worse as time goes on, though.)
I'm not sure about WARP's internals, but I guess it routes traffic to the closest Cloudflare POP (though it sometimes routes to a faraway POP because of bandwidth costs), and it rotates among the same or similar IP address ranges when I access the same origin. This is a problem that could be solved by chaining proxies, that is, forwarding through multiple hops, but I think the existing local pool is good enough, so I decided to handle this later.
Fortunately, Cloudflare has recently released an official WARP client that supports a socks5 local proxy mode, so all I had to do was install it and activate the mode with warp-cli set-mode proxy.
Now it works. Note that websites behind Cloudflare will get your real IP address. For example, ifconfig.io uses Cloudflare so it returns the real IP address instead of the edge node's IP.
$ curl -w "\n" --proxy "socks5://localhost:40000" https://api.ipify.org
8.26.182.104 #
$ curl -w "\n" --proxy "socks5://localhost:40000" https://ifconfig.io/ip
220.xxx.xxx.xxx # My IP Address
There is still one small problem: the bind address is fixed to localhost and cannot be changed. I guess this is because they don't want warp-cli to be used as a shared proxy.
$ sudo lsof -i TCP:40000
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
warp-svc 5030 root 11u IPv4 817167 0t0 TCP localhost:40000 (LISTEN)
I solved this issue using socat.
$ cat /etc/systemd/system/warp-relay.service
[Unit]
Description=socat
After=network.target
[Service]
Type=simple
Restart=always
StandardOutput=syslog
StandardError=syslog
ExecStart=socat -d -d TCP4-LISTEN:2088,bind=192.168.100.199,reuseaddr,fork TCP4:127.0.0.1:40000
[Install]
WantedBy=multi-user.target
Finally, it works even remotely.
$ curl --proxy "socks5h://192.168.100.199:2088" https://api.ipify.org
8.26.182.104
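For readers who want to see what the socat line does, here is a rough Python equivalent: accept connections on one address and copy bytes in both directions to a fixed target. It is only a sketch of the relay idea, not what runs on my server (that is socat), and the function names are hypothetical.

```python
import asyncio


async def pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes one way until EOF, then signal EOF to the other side."""
    try:
        while True:
            data = await reader.read(65536)
            if not data:
                break
            writer.write(data)
            await writer.drain()
        if writer.can_write_eof():
            writer.write_eof()
    except ConnectionError:
        pass  # the peer went away; the caller closes both sockets


async def handle(client_r, client_w, target_host: str, target_port: int) -> None:
    """Connect to the target and relay traffic in both directions."""
    try:
        upstream_r, upstream_w = await asyncio.open_connection(target_host, target_port)
    except OSError:
        client_w.close()
        return
    try:
        await asyncio.gather(pipe(client_r, upstream_w), pipe(upstream_r, client_w))
    finally:
        client_w.close()
        upstream_w.close()


async def start_relay(listen_host: str, listen_port: int,
                      target_host: str, target_port: int) -> asyncio.AbstractServer:
    """Start listening; each accepted client gets its own upstream connection."""
    async def on_client(reader, writer):
        await handle(reader, writer, target_host, target_port)

    return await asyncio.start_server(on_client, listen_host, listen_port)


async def run_forever(listen_host: str, listen_port: int,
                      target_host: str, target_port: int) -> None:
    """Serve until cancelled, like the systemd unit keeps socat running."""
    server = await start_relay(listen_host, listen_port, target_host, target_port)
    async with server:
        await server.serve_forever()
```

With the addresses from the unit file, the call would be asyncio.run(run_forever("192.168.100.199", 2088, "127.0.0.1", 40000)). In practice socat is the simpler choice, since it needs no code to maintain.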
awslambdaproxy
This is a very simple way to use AWS infrastructure, but a somewhat unreliable one because of the well-known IP address ranges. Even so, I used awslambdaproxy just for fallback purposes.
HAProxy
Let's set up a proxy server that rotates among the socks5 proxies. This significantly reduces the amount of hand-written code in the application, because HAProxy covers everything you need, such as traffic splitting and circuit breaking.
Luckily someone already created a socks5 solution for HAProxy:
With the config above, I ran into a problem: the tcp-check response from the WARP proxy was not what the check expected. The check expects the reply 05000001000000000000 (the format is defined in RFC 1928, Section 6), but I got 050000017f0000010000, which I checked using the code below.
import socket
import binascii
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('192.168.100.199', 2088))
# RFC 1928, Section 3
# > version(5) + nmethods(1) + methods(0)
s.send(bytes.fromhex('050100'))
print(binascii.hexlify(s.recv(2)))
# RFC 1928, Section 4
# > version(5) + cmd(1=connect) + rsv(0) + atyp(3=domainname) + dst.addr('google.com') + dst.port(80)
s.send(bytes.fromhex('050100030a676f6f676c652e636f6d0050'))
# RFC 1928, Section 6
print(binascii.hexlify(s.recv(10)))
s.close()
From the RFC, however, I concluded that the difference between the expected and the actual reply (7f000001, which is the BND.ADDR field set to 127.0.0.1) is negligible, because the REP field indicates success and BND.ADDR is variable, so I ended up just removing the trailing part of the expected binary.
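To make the comparison easier to read, the reply can be split into the fields RFC 1928, Section 6 defines (VER, REP, RSV, ATYP, BND.ADDR, BND.PORT). The parser below is a small sketch for that. The names Socks5Reply and parse_reply are mine, not part of any library.

```python
import ipaddress
import struct
from typing import NamedTuple


class Socks5Reply(NamedTuple):
    version: int
    rep: int       # 0x00 means "succeeded"
    atyp: int      # 0x01 IPv4, 0x03 domain name, 0x04 IPv6
    bnd_addr: str
    bnd_port: int


def parse_reply(data: bytes) -> Socks5Reply:
    """Decode a SOCKS5 reply (RFC 1928, Section 6) into its fields."""
    if len(data) < 5:
        raise ValueError("reply too short")
    version, rep, _rsv, atyp = data[:4]
    # Work out where the address ends; its length depends on ATYP.
    if atyp == 0x01:
        start, end = 4, 8
    elif atyp == 0x04:
        start, end = 4, 20
    elif atyp == 0x03:
        start, end = 5, 5 + data[4]  # first byte is the name length
    else:
        raise ValueError(f"unknown ATYP: {atyp:#04x}")
    if len(data) < end + 2:
        raise ValueError("reply truncated")
    raw = data[start:end]
    if atyp == 0x03:
        addr = raw.decode("ascii")
    else:
        addr = str(ipaddress.ip_address(raw))
    (port,) = struct.unpack("!H", data[end:end + 2])
    return Socks5Reply(version, rep, atyp, addr, port)


expected = parse_reply(bytes.fromhex("05000001000000000000"))
actual = parse_reply(bytes.fromhex("050000017f0000010000"))
print(expected.bnd_addr, actual.bnd_addr)  # 0.0.0.0 127.0.0.1
```

Both replies carry REP 0x00, so only the bound address differs, which is why matching on the first four bytes is enough for a health check.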
listen socks
bind 0.0.0.0:33030
mode tcp
option tcp-check
tcp-check connect
tcp-check send-binary 050100
tcp-check expect binary 0500 # means local client working okay
tcp-check send-binary 050100030a676f6f676c652e636f6d0050 # try to access google
#tcp-check expect binary 05000001000000000000
tcp-check expect binary 05000001
tcp-check send GET\ /generate_204\ HTTP/1.0\r\n
tcp-check send Host:\ google.com\r\n
tcp-check send User-Agent:\ curl/7.52.1\r\n
tcp-check send Accept:\ */*\r\n
tcp-check send \r\n
tcp-check expect rstring ^HTTP/1.[01]\ 204
balance leastconn
timeout server 600000
timeout client 600000
timeout connect 2000
server cellular-kt-1 192.168.100.199:1088 check inter 15s downinter 1m fall 4 weight 20
server wired-warp-1 192.168.100.199:2088 check inter 15s downinter 1m fall 4 weight 20
server wired-aws-1 glider-awslambdaproxy:1088 check inter 15s downinter 1m fall 4 weight 10
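The hex payloads in the send-binary lines are just the SOCKS5 greeting and a CONNECT request for google.com:80. If you want to probe a different host, a helper along these lines can regenerate them. This is a sketch with hypothetical function names, and it handles only the domain-name address type used in the config.

```python
import struct


def greeting() -> bytes:
    """VER=5, NMETHODS=1, METHODS=[0x00 no authentication] (RFC 1928, Section 3)."""
    return bytes([0x05, 0x01, 0x00])


def connect_request(host: str, port: int) -> bytes:
    """VER=5, CMD=1 (connect), RSV=0, ATYP=3 (domain), then length, name and port."""
    name = host.encode("ascii")
    if not 0 < len(name) <= 255:
        raise ValueError("host name must be 1 to 255 bytes long")
    return bytes([0x05, 0x01, 0x00, 0x03, len(name)]) + name + struct.pack("!H", port)


print(greeting().hex())                         # 050100
print(connect_request("google.com", 80).hex())  # 050100030a676f6f676c652e636f6d0050
```

If you change the probe host, the Host header in the tcp-check send lines has to change along with it.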
Now HAProxy is up and running.
And the outgoing IP address rotates automatically, by weight, for each request.
I still have a bunch of stuff I want to do, but first, it remains to be seen whether this fits my needs or not.
To sum up, the system design diagram is as follows: