Over the last few years, web scraping has become very common as mature headless browser automation tools, such as Puppeteer and Playwright, have grown in popularity.

However, such automation can be detected and countered in numerous ways: request throttling is very common, and Google reCAPTCHA v3 is known to silently calculate a score to decide whether the user is a real human or a robot. reCAPTCHA has been cracked frequently in recent years, and you can find open-source solvers fairly easily just by googling something like 'puppeteer recaptcha solver'. Throttling, however, is much harder to handle in most cases, mostly because resources such as IP addresses are limited.

I'm sure many people who have run into this problem have already tried the solutions anyone would think of, such as distributing requests across various cloud providers. Eventually, though, apart from hitting resource limits, they get banned after some time, because the IP address ranges of cloud providers are well known and continuously published.

So, I thought to myself, what if I could use lesser-known IP address ranges?

4G LTE on Linux

Using a cellular network, you can easily get real, rotating, reputable IP addresses that legitimate user traffic also passes through, which means the chances of being blocked are relatively low.

To set up a cellular-based proxy, you can either use Personal Hotspot or get a USB-powered cellular modem. I chose the latter because it's far more reliable and faster, and because my cellular plan lets me share the monthly data allowance among up to two devices.

I plugged the Sierra Wireless EM7565 cellular modem into the server and attached it to a Linux VM. On the Linux side, I set up the modem using good old-fashioned network-manager (nmcli) and modem-manager (mmcli), which gave me a new network interface called wwan0.

$ lsusb
Bus 001 Device 003: ID 1199:9091 Sierra Wireless, Inc. Sierra Wireless EM7565 Qualcomm® Snapdragon™ X16 LTE-A
$ mmcli -b 0
  ------------------------
  General    |  dbus path: /org/freedesktop/ModemManager1/Bearer/0
             |       type: default-attach
  ------------------------
  Status     |  connected: yes
             |  suspended: no
             | ip timeout: 20
  ------------------------
  Properties |        apn: lte.ktfwing.com
             |    ip type: ipv4v6
$ ip -4 a show dev wwan0
3: wwan0: <BROADCAST,MULTICAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UNKNOWN group default qlen 1000
    inet 10.79.188.186/30 brd 10.79.188.187 scope global noprefixroute wwan0
       valid_lft forever preferred_lft forever
$ sudo speedtest -I wwan0

   Speedtest by Ookla

     Server: kdatacenter.com - Seoul (id = 6527)
        ISP: Korea Telecom
    Latency:    41.96 ms   (11.61 ms jitter)
   Download:    80.37 Mbps (data used: 88.2 MB)
     Upload:     7.89 Mbps (data used: 14.1 MB)
Packet Loss:     0.0%

Sierra Wireless EM7565 (Qualcomm Snapdragon X16) Cellular Module

What's left is setting up a socks5-based forward proxy in the VM so that I can selectively route traffic through it. There are several widely known solutions. Dante, for example, is one of the best known, but I failed to configure it for the use case I tried in the "Using Multiple NICs on Linux" post: I couldn't get it to relay traffic between the internal interface and the external WWAN interface, maybe because the Linux kernel doesn't allow sending packets in the "wrong" (but desired) direction.

After some struggling with sockets, I ended up finding an alternative to Dante: https://github.com/nadoo/glider. Glider is written in Go, and I knew how easy it is to set a socket's outgoing interface using net.Dialer, which I had tried while making a hobby project called unofficial-benchbee-speedtest. I checked that Glider already uses it. Unexpectedly, though, it didn't work on Ubuntu Linux 21.04, so I dug into the implementation of net.Dialer and found that I could manipulate socket options through the net.Dialer.Control function, which gives access to the underlying file descriptor, so that I can explicitly select which network interface the socket should use.

diff --git a/proxy/direct.go b/proxy/direct.go
index cb96add..e2d45cf 100644
--- a/proxy/direct.go
+++ b/proxy/direct.go
@@ -4,6 +4,7 @@ import (
        "errors"
        "net"
        "time"
+       "syscall"

        "github.com/nadoo/glider/log"
 )
@@ -76,6 +77,14 @@ func (d *Direct) dial(network, addr string, localIP net.IP) (net.Conn, error) {
        }

        dialer := &net.Dialer{LocalAddr: la, Timeout: d.dialTimeout}
+       dialer.Control = func(network, address string, c syscall.RawConn) error {
+               return c.Control(func (fd uintptr)  {
+                       if err := syscall.BindToDevice(int(fd), "wwan0"); err != nil {
+                               log.F("%s", err)
+                       }
+               })
+       }
+

Hard-coded
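
As an illustration of the same socket option outside Go, here is a minimal Python sketch that binds an outgoing socket to a named interface with SO_BINDTODEVICE. It is not part of Glider. The helper names device_option and connect_via are made up for this example, it is Linux-only, and depending on the kernel version the setsockopt call may require root or CAP_NET_RAW.

```python
import socket

IFNAMSIZ = 16  # Linux limit for interface names, including the trailing NUL


def device_option(interface: str) -> bytes:
    """Encode an interface name as the value passed to SO_BINDTODEVICE."""
    raw = interface.encode("ascii")
    if not raw or len(raw) >= IFNAMSIZ:
        raise ValueError(f"invalid interface name: {interface!r}")
    return raw + b"\0"


def connect_via(interface: str, host: str, port: int, timeout: float = 10.0) -> socket.socket:
    """Open an IPv4 TCP connection whose packets leave through `interface`."""
    option = device_option(interface)  # validate before creating the socket
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Same option the Go patch sets through syscall.BindToDevice.
        s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, option)
        s.settimeout(timeout)
        s.connect((host, port))
    except OSError:
        s.close()
        raise
    return s
```

Taking the interface name as a parameter, as in connect_via("wwan0", "example.com", 80), avoids the hard-coding in the patch above.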

After that, I just opened up port 1088.

$ curl --proxy "socks5h://localhost:1088" https://ipapi.co/json
{
    "ip": "110.70.15.253",
    "version": "IPv4",
    "city": "Seoul",
    "region": "Seoul",
    "region_code": "11",
    "country": "KR",
    "country_name": "South Korea",
    "country_code": "KR",
    "country_code_iso3": "KOR",
    "country_capital": "Seoul",
    "country_tld": ".kr",
    "continent_code": "AS",
    "in_eu": false,
    "postal": "02878",
    "latitude": 37.5944,
    "longitude": 126.9864,
    "timezone": "Asia/Seoul",
    "utc_offset": "+0900",
    "country_calling_code": "+82",
    "currency": "KRW",
    "currency_name": "Won",
    "languages": "ko-KR,en",
    "country_area": 98480.0,
    "country_population": 51635256.0,
    "asn": "AS4766",
    "org": "Korea Telecom"
}

Cloudflare WARP on Linux

Cloudflare WARP is a VPN service provided by Cloudflare and built on WireGuard. iCloud Private Relay is known to use Cloudflare's network as well. WARP is relatively new and less well known, so I thought it would make a good web scraping proxy before it becomes widely known. One thing worth mentioning: I've been using Cloudflare Workers (not WARP) for scraping purposes for over a year, and I've seen a fairly low error rate compared to the other cloud providers. (It gets worse as time goes on, though.)

I'm not sure about the details, but I guess WARP routes traffic to the closest Cloudflare POP (although it sometimes routes to a POP located far away because of bandwidth costs), so the exit IP rotates within the same or similar address ranges as long as I connect from the same origin. Yes, this is a problem that could be solved by chaining proxies, that is, forwarding through multiple hops, but I think the existing local pool is good enough for my usage, so I decided to handle this later.

Fortunately, Cloudflare has recently released an official WARP client that supports a socks5-based local proxy mode, so all I had to do was install it and activate the mode with warp-cli set-mode proxy.

$ warp-cli settings
Always On: true
Mode: WarpProxy
Proxy listening on: 127.0.0.1:40000
$ curl --proxy "socks5://localhost:40000" https://1.1.1.1/cdn-cgi/trace
fl=12f798
h=1.1.1.1
ip=
ts=1628838022.016
visit_scheme=https
uag=curl/7.68.0
colo=LAX
http=http/2
loc=KR
tls=TLSv1.3
sni=off
warp=plus
gateway=off

WARP enabled
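
To check this from a script rather than by eye, the trace output can be parsed as plain key=value lines. The sketch below is a minimal example under that assumption. The function names are made up here, and treating warp=on as "enabled" alongside the warp=plus value shown above is my assumption, not something the output above confirms.

```python
SAMPLE_TRACE = """\
fl=12f798
h=1.1.1.1
ip=
ts=1628838022.016
visit_scheme=https
uag=curl/7.68.0
colo=LAX
http=http/2
loc=KR
tls=TLSv1.3
sni=off
warp=plus
gateway=off
"""


def parse_trace(text: str) -> dict:
    """Parse the key=value lines returned by /cdn-cgi/trace into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # skip lines without an '=' separator
            fields[key.strip()] = value.strip()
    return fields


def warp_active(fields: dict) -> bool:
    """True when the trace says the request arrived over WARP.

    'plus' is the value seen in the output above; 'on' is assumed.
    """
    return fields.get("warp", "off") in ("on", "plus")


if __name__ == "__main__":
    trace = parse_trace(SAMPLE_TRACE)
    print(trace["colo"], trace["warp"], warp_active(trace))
```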

Now it works. Note that websites behind Cloudflare will get your real IP address. For example, ifconfig.io uses Cloudflare so it returns the real IP address instead of the edge node's IP.

$ curl -w "\n" --proxy "socks5://localhost:40000" https://api.ipify.org
8.26.182.104 #
$ curl -w "\n" --proxy "socks5://localhost:40000" https://ifconfig.io/ip
220.xxx.xxx.xxx # My IP Address

There is still one more small problem: the bind address is set to localhost and cannot be changed. I guess this is because they don't want warp-cli to be used as a shared proxy.

$ sudo lsof -i TCP:40000
COMMAND    PID USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
warp-svc  5030 root   11u  IPv4  817167      0t0  TCP localhost:40000 (LISTEN)

I solved this issue using socat.

$ cat /etc/systemd/system/warp-relay.service
[Unit]
Description=socat
After=network.target

[Service]
Type=simple
Restart=always
StandardOutput=syslog
StandardError=syslog
ExecStart=socat -d -d TCP4-LISTEN:2088,bind=192.168.100.199,reuseaddr,fork TCP4:127.0.0.1:40000

[Install]
WantedBy=multi-user.target

Finally, it works even remotely.

$ curl --proxy "socks5h://192.168.100.199:2088" https://api.ipify.org
8.26.182.104
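
For readers curious about what the socat relay does, here is a rough Python equivalent: a minimal asyncio sketch that accepts TCP connections on one address and copies bytes in both directions to a fixed target. It is a simplification (it tears down both directions as soon as either side closes) and not a replacement for socat. The names serve_relay and _pipe are made up for this example.

```python
import asyncio


async def _pipe(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    """Copy bytes from reader to writer until EOF, then close the writer."""
    try:
        while data := await reader.read(65536):
            writer.write(data)
            await writer.drain()
    finally:
        writer.close()


async def serve_relay(listen_host: str, listen_port: int,
                      target_host: str, target_port: int) -> asyncio.AbstractServer:
    """Start a TCP relay and return the listening server object."""

    async def handle(client_reader, client_writer):
        try:
            upstream_reader, upstream_writer = await asyncio.open_connection(
                target_host, target_port)
        except OSError:
            client_writer.close()  # target unreachable: drop the client
            return
        # Run both copy directions until each one has finished.
        await asyncio.gather(
            _pipe(client_reader, upstream_writer),
            _pipe(upstream_reader, client_writer),
            return_exceptions=True,
        )

    return await asyncio.start_server(handle, listen_host, listen_port)
```

With the addresses from the unit file above, the call would be serve_relay("192.168.100.199", 2088, "127.0.0.1", 40000), followed by awaiting serve_forever() on the returned server.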

awslambdaproxy

GitHub - dan-v/awslambdaproxy: An AWS Lambda powered HTTP/SOCKS web proxy

This is a very simple way to use AWS infrastructure, but it is somewhat unreliable because AWS IP address ranges are well known. Even so, I use awslambdaproxy as a fallback.

HAProxy

Let's set up a proxy server that rotates among the socks5 proxies. This significantly reduces the amount of logic I would otherwise have to hand-code in the application, because HAProxy covers everything needed, such as traffic splitting and circuit breaking.

Luckily someone already created a socks5 solution for HAProxy:

[Haproxy cfg checking Socks5] Haproxy cfg to check the Socks5 connection #tags: GFW, network, haproxy, config - haproxy.cfg

Using the config above, I ran into a problem: the tcp-check response from the WARP proxy was not what the config expected. The config expects the reply described in RFC 1928, Section 6 to be exactly 05000001000000000000, but I got 050000017f0000010000, which I checked using the code below.

import socket

# Connect to the relayed WARP socks5 proxy; the socket is closed on exit.
with socket.create_connection(('192.168.100.199', 2088), timeout=10) as s:
    # RFC 1928, Section 3
    # > version(5) + nmethods(1) + methods(0 = no authentication)
    s.sendall(bytes.fromhex('050100'))
    print(s.recv(2).hex())

    # RFC 1928, Section 4
    # > version(5) + cmd(1=connect) + rsv(0) + atyp(3=domainname) + len(10) + dst.addr('google.com') + dst.port(80)
    s.sendall(bytes.fromhex('050100030a676f6f676c652e636f6d0050'))
    # RFC 1928, Section 6
    print(s.recv(10).hex())

From the RFC, however, I concluded that the difference between the expected and the actual reply (7f000001 in place of 00000000) is negligible: the REP field indicates success, and the differing part is BND.ADDR, which varies by server (here it is 127.0.0.1). So I ended up just removing the trailing part of the expected binary.

listen socks
  bind 0.0.0.0:33030
  mode tcp
  option tcp-check
  tcp-check connect
  tcp-check send-binary 050100
  tcp-check expect binary 0500 # means local client working okay
  tcp-check send-binary 050100030a676f6f676c652e636f6d0050 # try to access google
  #tcp-check expect binary 05000001000000000000
  tcp-check expect binary 05000001
  tcp-check send GET\ /generate_204\ HTTP/1.0\r\n
  tcp-check send Host:\ google.com\r\n
  tcp-check send User-Agent:\ curl/7.52.1\r\n
  tcp-check send Accept:\ */*\r\n
  tcp-check send \r\n
  tcp-check expect rstring ^HTTP/1.[01]\ 204
  balance leastconn
  timeout server 600000
  timeout client 600000
  timeout connect 2000
  server cellular-kt-1 192.168.100.199:1088    check inter 15s downinter 1m fall 4 weight 20
  server wired-warp-1 192.168.100.199:2088     check inter 15s downinter 1m fall 4 weight 20
  server wired-aws-1 glider-awslambdaproxy:1088 check inter 15s downinter 1m fall 4 weight 10
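
To see why matching only the first four bytes of the reply is enough, it helps to decode both replies field by field. The following is a minimal sketch based on the layouts in RFC 1928, Sections 4 and 6. The function names are made up for this example, and the parser only handles IPv4 and IPv6 bound addresses.

```python
import ipaddress
import struct


def build_connect_request(host: str, port: int) -> bytes:
    """Build a socks5 CONNECT request for a domain name (RFC 1928, Section 4)."""
    name = host.encode("idna")
    if not 0 < len(name) < 256:
        raise ValueError("domain name must be 1-255 bytes")
    # VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain name), length, name, port
    return bytes([0x05, 0x01, 0x00, 0x03, len(name)]) + name + struct.pack("!H", port)


def parse_reply(reply: bytes) -> dict:
    """Decode a socks5 reply (RFC 1928, Section 6) with an IPv4 or IPv6 BND.ADDR."""
    if len(reply) < 4:
        raise ValueError("reply too short")
    ver, rep, _rsv, atyp = reply[:4]
    addr_len = {0x01: 4, 0x04: 16}.get(atyp)
    if addr_len is None:
        raise ValueError(f"unsupported ATYP {atyp}")
    if len(reply) < 4 + addr_len + 2:
        raise ValueError("reply truncated")
    addr = ipaddress.ip_address(reply[4:4 + addr_len])
    (port,) = struct.unpack("!H", reply[4 + addr_len:4 + addr_len + 2])
    return {"ver": ver, "rep": rep, "bnd_addr": str(addr), "bnd_port": port}


if __name__ == "__main__":
    print(build_connect_request("google.com", 80).hex())
    print(parse_reply(bytes.fromhex("05000001000000000000")))  # what the gist expects
    print(parse_reply(bytes.fromhex("050000017f0000010000")))  # what WARP returned
```

Both replies carry REP=0 (succeeded) and differ only in BND.ADDR, which is 0.0.0.0 in the expected reply and 127.0.0.1 in the one from WARP.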

Now HAProxy is up and running.

The exit IP address now rotates automatically across the backends, according to their weights, from one request to the next.
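
One way to observe the rotation is to send a batch of requests through HAProxy and count the exit addresses that come back. The sketch below shells out to curl, as in the earlier examples. The proxy address 127.0.0.1:33030 assumes the script runs on the HAProxy host, and the function names are made up for this example.

```python
import subprocess
from collections import Counter
from typing import Callable


def fetch_exit_ip(proxy: str = "socks5h://127.0.0.1:33030",
                  url: str = "https://api.ipify.org") -> str:
    """Ask an IP echo service for our exit address through the proxy (uses curl)."""
    result = subprocess.run(
        ["curl", "--silent", "--show-error", "--max-time", "15",
         "--proxy", proxy, url],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def tally_exit_ips(requests: int, fetch: Callable[[], str] = fetch_exit_ip) -> Counter:
    """Issue `requests` sequential lookups and count how often each exit IP appears."""
    counts = Counter()
    for _ in range(requests):
        try:
            counts[fetch()] += 1
        except (subprocess.CalledProcessError, OSError):
            counts["<error>"] += 1  # curl failed or is not installed
    return counts
```

Calling tally_exit_ips(30).most_common() gives a quick picture of the spread. The exact distribution depends on the balance algorithm, the weights, and which backends are currently passing their health checks.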

I still have a bunch of stuff I want to do, but first, it remains to be seen whether this fits my needs or not.

To sum up, the system design diagram is as follows: