Automating Proxy Rotation With Bash And Cron

Managing web scrapers or other automated request pipelines often requires rotating proxy servers to avoid IP bans and detection. You can achieve seamless proxy rotation without heavy frameworks by combining bash and cron ([https://hackmd.io/@3-ZW51qYR3KpuRcUae4AZA/4g-rotating-mobile-proxies-and-Proxy-farms read more on hackmd.io]). This approach avoids the complexity of third-party libraries, which makes it well suited to minimal environments and to users who want granular control.

Begin with a plain-text file listing your proxy endpoints. Each line should contain one proxy in the form "ip port", or "ip port username password" if the proxy requires authentication. A proxies.txt file could include lines such as 10.0.0.5 3128 or 172.16.0.100 8080 admin secret. Refresh the list regularly to drop inactive or banned entries and to add new working ones.

Next, create a bash script that picks a random proxy and writes it to a file consumed by your scraping tool. The script reads proxies.txt, counts its lines, selects one at random, and writes it to current_proxy.txt. A working template, rotate_proxy.sh (adjust the two file paths to your setup):

<pre>
#!/bin/bash

# Where the proxy list lives and where the chosen proxy is written.
PROXY_FILE="/path/to/proxies.txt"
OUTPUT_FILE="/tmp/current_proxy.txt"

if [[ ! -f "$PROXY_FILE" ]]; then
    echo "Proxy list not found"
    exit 1
fi

LINE_COUNT=$(wc -l < "$PROXY_FILE")

if [[ $LINE_COUNT -eq 0 ]]; then
    echo "Proxy file contains no entries"
    exit 1
fi

# Pick a random line number between 1 and LINE_COUNT.
RANDOM_LINE=$((RANDOM % LINE_COUNT + 1))

# Copy that line to the output file.
sed -n "${RANDOM_LINE}p" "$PROXY_FILE" > "$OUTPUT_FILE"
</pre>

Make the script executable with chmod +x rotate_proxy.sh and run it manually to verify that it writes exactly one proxy to current_proxy.txt.

Then configure a cron job to trigger the rotation script periodically. Edit your crontab with crontab -e and add an entry such as 0 * * * * /path/to/rotate_proxy.sh, which rotates the proxy every hour. Tune the schedule to match your scraping rate, for example 0,30 * * * * for half-hourly rotation, and avoid rotating too often if your scraper runs continuously, since a mid-session change can drop open connections.

Your scraper must reload the proxy from current_proxy.txt before every request so that it always uses the latest selection. Most tools such as curl or wget can take the proxy from a file or an environment variable, for example proxy=$(cat /tmp/current_proxy.txt).
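To make that concrete, here is a minimal wrapper sketch showing how a scraper might re-read current_proxy.txt before each request and hand it to curl. The target URL, the /tmp/current_proxy.txt path, and the unauthenticated "ip port" entry format are assumptions carried over from the examples above; an authenticated proxy would additionally need curl's --proxy-user option.

<pre>
#!/bin/bash

# Sketch: fetch one page through whatever proxy rotate_proxy.sh last selected.
# Assumes an unauthenticated "ip port" entry in the current-proxy file.
PROXY_FILE="/tmp/current_proxy.txt"   # must match OUTPUT_FILE in rotate_proxy.sh
URL="https://example.com/"            # placeholder target

# Read "ip port" and join the two fields as ip:port for curl's -x option.
read -r ip port < "$PROXY_FILE"
proxy="${ip}:${port}"

curl -s -x "http://${proxy}" -o page.html "$URL"
</pre>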
Finally, keep the proxy list healthy. Continuously validate your proxies and remove those that fail to respond; a simple extension is a second script that sends a lightweight HTTP request through each proxy and purges the entries that time out (a sketch of such a check appears at the end of this page). It also helps to append each selected proxy, with a timestamp, to a file such as rotation_log.txt so you can track changes over time.

This technique is lightweight, dependable, and easy to extend. It avoids the overhead of complex proxy-management libraries and works on any Unix-like system, whether Linux, macOS, or BSD. A short bash script and a single cron entry are all you need for persistent, low-profile automation, with no external tools required.
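The health check mentioned above is not spelled out in the steps themselves, so the following is only a rough sketch under stated assumptions: proxies are stored one per line as unauthenticated "ip port" pairs, the list lives at /path/to/proxies.txt, and a proxy counts as dead if curl cannot fetch a small test URL through it within five seconds.

<pre>
#!/bin/bash

# Sketch: rewrite the proxy list keeping only proxies that still respond.
PROXY_FILE="/path/to/proxies.txt"     # same list used by rotate_proxy.sh
TEST_URL="https://example.com/"       # placeholder URL used only for the connectivity test
TMP_FILE="$(mktemp)"

while read -r ip port; do
    # Skip blank lines in the list.
    [[ -z "$ip" ]] && continue
    # -f fails on HTTP errors; --max-time bounds how long a dead proxy can stall us.
    if curl -fs --max-time 5 -x "http://${ip}:${port}" -o /dev/null "$TEST_URL"; then
        echo "$ip $port" >> "$TMP_FILE"
    else
        echo "Removing unresponsive proxy: $ip $port" >&2
    fi
done < "$PROXY_FILE"

# Replace the list with the proxies that passed the check.
mv "$TMP_FILE" "$PROXY_FILE"
</pre>

Because the check echoes back only the ip and port, authenticated "ip port user pass" entries would lose their credentials here; extend the read and echo lines if your list mixes both formats.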