Python: scraping web pages with headless Chrome / Firefox on Windows and Linux

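On Windows

I personally recommend Firefox: in my experience Chrome is hard to get working on Linux and throws too many errors during initialization.

Chrome

Download address for the headless Chrome driver (chromedriver):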

http://npm.taobao.org/mirrors/chromedriver/

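Chrome on Windows is usually 64-bit while the driver is 32-bit. That is not a problem: it works as long as the driver version matches the Chrome version.
After downloading and unzipping, put chromedriver.exe in python27\Scripts.
If typing chromedriver on the command line starts the driver, it is ready to use!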
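
The listings in this post start the driver by name, so the driver executable has to be discoverable on PATH, which is presumably why it goes into python27\Scripts. A minimal sketch for checking that from Python, assuming Python 3.3+ for shutil.which; find_driver is a hypothetical helper name, not part of Selenium:

```python
import shutil


def find_driver(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


# Report where (or whether) each driver used in this post is found.
for driver in ("chromedriver", "geckodriver"):
    path = find_driver(driver)
    print(driver, "->", path if path else "not found on PATH")
```

If a driver shows up as not found, either move it into a directory that is on PATH or pass its location explicitly when creating the browser.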
Basic operations for scraping a page:

# -*- coding: utf-8 -*-
'''
demo1.py
Created on 2020-07-08 22:07
Copyright (c) 2020-07-08, 忘尘. All rights reserved.
@author: 忘尘
'''
from __future__ import print_function  # lets the print() calls run on Python 2.7 and 3

from selenium import webdriver
from bs4 import BeautifulSoup

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# If chromedriver is not on PATH, also pass its location, e.g. executable_path='/home/chromedriver'
client = webdriver.Chrome(options=chrome_options)

client.get("https://news.e23.cn/wanxiang/index.html?spm=0.0.0.0.70XvmN")
# Get the full page source
content = client.page_source
soup = BeautifulSoup(content, 'lxml')
# Keep only the <a> tags on the page
a_docs = soup.find_all('a')
# Binary append mode, because a_doc.encode() returns bytes
out_file = open('html', 'ab')
for a_doc in a_docs:
    print(a_doc)
    # Get the href of the <a> tag (None if it has none); other attributes work the same way
    print(a_doc.get('href'))
    # Get the text of the <a> tag
    print(a_doc.string)
    out_file.write(a_doc.encode('utf-8'))
# Collapse the page content into a single line
# content = content.replace('\r','').replace('\n','\002').encode('utf-8')
# print(content)
out_file.close()
client.quit()

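The commented-out lines near the end of the listing above collapse the page into a single line by dropping carriage returns and swapping newlines for the \002 control character, presumably so that one page can be stored as one record per line. A small sketch of that idea on an invented HTML string, assuming Python 3; flatten_page and restore_page are hypothetical names:

```python
def flatten_page(content):
    """Drop carriage returns and replace newlines with the \\x02 control character."""
    return content.replace("\r", "").replace("\n", "\002")


def restore_page(flat):
    """Turn the \\x02 markers back into newlines."""
    return flat.replace("\002", "\n")


# Invented sample standing in for client.page_source.
page = "<html>\r\n<body>\r\n<a href='/x'>x</a>\r\n</body>\r\n</html>"
flat = flatten_page(page)
print(repr(flat))
```

The carriage returns are not recoverable afterwards, so restore_page gives back the page with plain \n line endings.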

Firefox

Download address for the headless Firefox driver (geckodriver):

https://github.com/mozilla/geckodriver

After unzipping, put it in python27\Scripts, the same as the Chrome driver.

from __future__ import print_function  # lets the print() calls run on Python 2.7 and 3

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.FirefoxOptions()
options.add_argument('-headless')
# On Linux:
# browser = webdriver.Firefox(executable_path="/usr/bin/geckodriver", options=options)
# On Windows:
browser = webdriver.Firefox(options=options)
browser.get("http://www.chinapeace.org.cn/gupiao/")
content = browser.page_source
soup = BeautifulSoup(content, 'lxml')
a_docs = soup.find_all('a')
# Binary append mode, because a_doc.encode() returns bytes
out_file = open('html.html', 'ab')
for a_doc in a_docs:
    print(a_doc)
    print(a_doc.get('href'))
    print(a_doc.string)
    out_file.write(a_doc.encode('utf-8'))
# Close the output file and shut the browser down
out_file.close()
browser.quit()

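The executable_path keyword in the commented-out line above is the Selenium 3 way of pointing at a driver. As far as I know, Selenium 4 deprecated it in favour of a Service object, so on a newer install the equivalent would look roughly like the sketch below. It assumes Selenium 4 and Python 3; pick_driver_path and make_headless_firefox are hypothetical helper names, and selenium is imported inside the function so the path helper can be used without it.

```python
import os


def pick_driver_path(candidates):
    """Return the first candidate that exists as a file, else None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def make_headless_firefox(driver_path=None):
    """Start headless Firefox, optionally with an explicit geckodriver path."""
    # Imported lazily so that pick_driver_path() works without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service

    options = webdriver.FirefoxOptions()
    options.add_argument("-headless")
    if driver_path is None:
        # Rely on geckodriver being found on PATH.
        return webdriver.Firefox(options=options)
    return webdriver.Firefox(service=Service(executable_path=driver_path), options=options)
```

A call such as make_headless_firefox(pick_driver_path(["/usr/bin/geckodriver"])) then covers both cases: an explicit path when the file is there, PATH lookup otherwise.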

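On Linux

I originally wanted to use headless Chrome here, but it threw a pile of errors at runtime; after switching to Firefox everything worked right away.
Install Firefox:

yum install firefox

Driver download address:
https://github.com/mozilla/geckodriver

After unzipping, I put it in /usr/bin; placed there, it does not seem to need an explicit path. Also give it execute permission.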
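
Giving the driver execute permission is normally a one-off chmod +x in the shell. For completeness, a Python sketch of the same step, assuming Python 3 on Linux; add_execute_bits is a hypothetical helper, and the demo runs on a throwaway temp file rather than the real driver:

```python
import os
import stat
import tempfile


def add_execute_bits(path):
    """Set the execute bits on path, like `chmod +x`, and return the new mode."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return os.stat(path).st_mode


# Demo on a throwaway file; for the real driver you would pass
# "/usr/bin/geckodriver" and run with enough privileges to change it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    demo_path = tmp.name
new_mode = add_execute_bits(demo_path)
print(oct(new_mode & 0o777))
os.remove(demo_path)
```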

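from __future__ import print_function  # lets the print() calls run on Python 2.7 and 3

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.FirefoxOptions()
options.add_argument('-headless')

# Without an explicit driver path
browser = webdriver.Firefox(options=options)
# With an explicit driver path; use this if the line above does not work
# browser = webdriver.Firefox(executable_path="/usr/bin/geckodriver", options=options)

browser.get("http://www.chinapeace.org.cn/gupiao/")
content = browser.page_source
soup = BeautifulSoup(content, 'lxml')
a_docs = soup.find_all('a')
# Binary append mode, because a_doc.encode() returns bytes
out_file = open('html.html', 'ab')
for a_doc in a_docs:
    print(a_doc)
    print(a_doc.get('href'))
    print(a_doc.string)
    out_file.write(a_doc.encode('utf-8'))
# Close the output file and shut the browser down
out_file.close()
browser.quit()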

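All three listings come down to the same step: take page_source and pull out the a tags. For trying that step without a browser, bs4 or lxml installed, here is a rough standard-library equivalent, assuming Python 3. LinkCollector and extract_links are made-up names and the sample HTML is invented; urljoin is added here to turn relative links into absolute ones, which the listings in this post do not do.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:  # skip anchors without an href attribute
            self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    """Return the absolute URLs of all links found in an HTML string."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Invented sample standing in for browser.page_source.
sample = (
    '<a href="/news/1.html">One</a>'
    '<a name="top">no href</a>'
    '<a href="https://example.com/x">X</a>'
)
links = extract_links(sample, "https://news.e23.cn/wanxiang/index.html")
print(links)
```

This is no replacement for BeautifulSoup on messy real-world pages, but it is enough to test the link-handling logic in isolation.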

Author: 忘尘
