Here’s a Python code snippet that demonstrates how to rotate proxies using the `requests` library. It incorporates best practices such as error handling and user-agent rotation to reduce the risk of detection while web scraping.
Example Code
```python
import requests
import random

# List of proxies
proxies = [
    {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'},
    {'http': 'http://10.10.1.11:3128', 'https': 'http://10.10.1.11:1080'},
    {'http': 'http://10.10.1.12:3128', 'https': 'http://10.10.1.12:1080'},
]

# List of user-agents
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15',
]

def get_page(url):
    # Choose a random proxy
    proxy = random.choice(proxies)
    # Choose a random user-agent
    user_agent = random.choice(user_agents)
    headers = {'User-Agent': user_agent}
    try:
        response = requests.get(url, proxies=proxy, headers=headers, timeout=10)
        response.raise_for_status()  # Raise HTTPError for bad responses (4xx or 5xx)
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None

# Example usage
if __name__ == "__main__":
    url = 'https://www.example.com'
    content = get_page(url)
    if content:
        print("Successfully fetched the page.")
        # Process the content here
        # print(content)
    else:
        print("Failed to fetch the page.")
```
Explanation
- Proxy List: A list of dictionaries, where each dictionary specifies the HTTP and HTTPS proxies.
- User-Agent List: A list of user-agent strings to mimic different browsers.
- `get_page` Function:
  - Chooses a random proxy and user-agent for each request.
  - Uses `requests.get` to fetch the content from the specified URL, including error handling.
  - Includes a timeout to prevent the script from hanging indefinitely.
  - Raises an HTTPError for bad responses (4xx or 5xx status codes).
- Error Handling: The `try...except` block catches any request exceptions, such as connection errors or timeouts, and prints an error message. A sketch of retrying with a different proxy is shown after this list.
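Building on that error handling, here is a minimal sketch of retry logic that picks a fresh proxy and user-agent on each failed attempt. It reuses the `proxies` and `user_agents` lists from the main example; the `get_page_with_retries` name and the `max_retries` parameter are illustrative additions, not part of the original code.

```python
import random
import requests

def get_page_with_retries(url, max_retries=3):
    """Try up to max_retries times, rotating to a new proxy on each failure."""
    for attempt in range(max_retries):
        proxy = random.choice(proxies)
        headers = {'User-Agent': random.choice(user_agents)}
        try:
            response = requests.get(url, proxies=proxy, headers=headers, timeout=10)
            response.raise_for_status()
            return response.text
        except requests.exceptions.RequestException as e:
            # A different proxy is chosen on the next loop iteration
            print(f"Attempt {attempt + 1} failed: {e}")
    return None
```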
Key Improvements and Best Practices
- Random Proxy Selection: Choosing a proxy at random from the list spreads requests across the pool, so no single proxy carries all of the traffic.
- User-Agent Rotation: Rotating user-agents helps to further disguise the scraper as a normal user.
- Error Handling: The `try...except` block ensures that the script handles any request-related errors gracefully, preventing it from crashing.
- Timeout: Setting a timeout prevents the script from waiting indefinitely for a response from a slow or unresponsive server.
- Status Code Check: The `response.raise_for_status()` method raises an HTTPError for bad responses, allowing you to handle them appropriately, as sketched below.
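For example, you can catch `requests.exceptions.HTTPError` separately and inspect the status code to decide whether to back off or move on to another proxy. This is a hedged sketch: the `fetch_with_status_handling` helper and the 30-second backoff for HTTP 429 are illustrative choices, not part of the original example.

```python
import time
import requests

def fetch_with_status_handling(url, proxy, headers):
    """Fetch url and react to specific HTTP status codes (illustrative helper)."""
    try:
        response = requests.get(url, proxies=proxy, headers=headers, timeout=10)
        response.raise_for_status()
        return response.text
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 429:
            # Rate-limited: back off before the caller retries with another proxy
            time.sleep(30)
        else:
            print(f"HTTP error {e.response.status_code} for {url}")
        return None
    except requests.exceptions.RequestException as e:
        # Connection errors, timeouts, unreachable proxies, etc.
        print(f"Request failed: {e}")
        return None
```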
This code provides a basic example of how to rotate proxies in Python using the `requests` library. For more advanced use cases, you might consider using a more sophisticated proxy management library or service.
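As a lightweight step in that direction, a round-robin rotator built on `itertools.cycle` walks through the pool in order instead of choosing randomly, so every proxy is used equally often. The `ProxyRotator` class below is a sketch under that assumption, not an established library API.

```python
import itertools
import requests

class ProxyRotator:
    """Cycle through proxies and user-agents in order (illustrative sketch)."""

    def __init__(self, proxy_list, user_agent_list):
        self._proxies = itertools.cycle(proxy_list)
        self._user_agents = itertools.cycle(user_agent_list)

    def get(self, url, timeout=10):
        proxy = next(self._proxies)
        headers = {'User-Agent': next(self._user_agents)}
        response = requests.get(url, proxies=proxy, headers=headers, timeout=timeout)
        response.raise_for_status()
        return response

# Usage with the lists defined earlier:
# rotator = ProxyRotator(proxies, user_agents)
# html = rotator.get('https://www.example.com').text
```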