A request came in recently from a long-time client: scrape article content from Toutiao (今日头条) and, given a list of article URLs, save each article as both txt and html. The txt file should contain only the title and body text; the html file only the body text and images, with everything else stripped out. The scraping side looks fine at a glance, so the real question is whether the txt and html output requirements are feasible. Let's analyze one article as an example, https://www.toutiao.com/article/7261139655872053760/?log_from=ebf54cd1999ed_1690700635571, which shows the kind of result the client wants.
First, locate the elements we need. As the screenshot shows, both the text and the images live inside the `article` tag, which makes things easy: grab everything inside `article` with BeautifulSoup and write it out as html. The client also wants styles, scripts, and the like removed, and w3lib handles that cleanly. Here is the full code:
import os
import requests
from lxml import etree
from w3lib import html
from bs4 import BeautifulSoup

# Read the list of article URLs, one per line
def get_url():
    with open('url.txt', 'r') as file:
        url_txt = [i.replace('\n', '') for i in file.readlines()]
    return url_txt

# Read the boilerplate prepended to every txt file (文首.txt = "opening text")
def get_first():
    with open('文首.txt', 'r') as file1:
        first_txt = file1.read()
    return first_txt

# Read the boilerplate appended to every txt file (文末.txt = "closing text")
def get_last():
    with open('文末.txt', 'r') as file2:
        last_txt = file2.read()
    return last_txt

def get_content(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
        'cookie': '__ac_signature=_02B4Z6wo00f01qnEKuQAAIDBGTXGLitFe3Kp5C5AAMmy38; tt_webid=7190277221030151738; ttcid=e1d5ef5437314716827d09f21737a06a56; csrftoken=40b61363de2657e72481a9f8f81c0aef; _ga=GA1.1.202580012.1679885191; s_v_web_id=verify_lheongde_x2wf1hFh_s9wk_46My_BSzy_S4eDXDyyoxpA; local_city_cache=%E6%9D%AD%E5%B7%9E; _ga_QEHZPBE5HH=GS1.1.1683814921.8.1.1683816740.0.0.0; tt_scid=ozMsfquvjg-0JdqBW0xusogXYFZEnSAJ5IJPZkoLk0XYGBRwa-Ab6DegwTTU0SOj54ec; ttwid=1%7CwS8g-N7aA8D-IW5M8mzrHCauoWTwQNq8oOnOVXZ-E6w%7C1683816740%7Ca0ba880c2428087231c9b53da6d413437f268329c36c699ce1f214f9684054af; msToken=AzNumaBxvjVW0CM3hP91nBV3Az9E8SAMMqVpLpARWaPN9zvYVxu1susDE1d35bnIvnHyt1O1oahPlr75hAuAJfTnsjWW2he-ZJZvToUcDe0=',
    }
    r1 = requests.get(url, headers=headers).text
    # The article body (text + images) all sits inside the <article> tag
    soup = BeautifulSoup(r1, 'html.parser')
    article_el = soup.find_all('article')[0]
    # The title comes from the <h1> inside div.article-content
    html1 = etree.HTML(r1)
    title = ''.join(html1.xpath('//div[@class="article-content"]/h1/text()'))
    # Drop scripts/styles/head wholesale; keep only the markup the client wants
    text = '<h1>' + title + '</h1>' + html.remove_tags_with_content(
        str(article_el), which_ones=('script', 'style', 'head'))
    # Plain text for the txt output: every paragraph's text, one line each
    txt_num = ''.join(i + '\n' for i in html1.xpath('//div[@class="article-content"]//p//text()'))
    return txt_num, title, text
if __name__ == '__main__':
    download_model = input('Save mode (0: txt only, 1: html only, 2: both): ')
    urls = get_url()
    first_txt = get_first()
    last_txt = get_last()
    # The output folders must exist before writing, otherwise open() raises
    os.makedirs('./已下载', exist_ok=True)
    os.makedirs('./html', exist_ok=True)
    # "Please follow me" blurb the client wants inserted before every article body
    follow_note = '在阅读文章前,麻烦您点下“关注”,方便您后续讨论和分享,感谢您的支持,我将每天陪伴你左右。\n'
    for num, url in enumerate(urls, 1):
        print('Fetching link {}'.format(num))
        try:
            txt, title, text = get_content(url)
            if txt and title and text:
                print('Title:', title)
                # Anything other than '1' gets a txt; anything other than '0' gets an html
                if download_model != '1':
                    with open('./已下载/{}.txt'.format(title), 'w', encoding='utf-8') as file4:
                        file4.write(url + '\n' + first_txt + '\n' + follow_note + txt + last_txt)
                if download_model != '0':
                    with open('./html/{}.html'.format(title), 'w', encoding='utf-8') as file5:
                        file5.write(text)
            else:
                print('Empty page, skipping')
        except Exception as e:
            # A bare except would hide real bugs; at least report what failed
            print('Failed on {}, skipping: {}'.format(url, e))
    input('\nAll done, press Enter to exit...')
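A couple of practical notes on the script. First, the article title is used directly as a filename, and Toutiao titles regularly contain characters such as `?`, `:` or `/` that are illegal in Windows paths, so `open()` can fail on some articles. A small helper (a hypothetical `safe_title`, not part of the script above) makes the file writes robust:

```python
import re

def safe_title(title, max_len=100):
    # Replace characters that are illegal in Windows/Unix filenames
    # and trim overly long titles so open() cannot fail on the path.
    cleaned = re.sub(r'[\\/:*?"<>|\r\n]', '_', title).strip()
    return cleaned[:max_len] or 'untitled'
```

You would then write `open('./已下载/{}.txt'.format(safe_title(title)), ...)` instead of using the raw title.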
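Second, `w3lib.html.remove_tags_with_content` is what lets us delete `script`/`style`/`head` elements together with everything inside them. If you'd rather not pull in the dependency just for that, a rough stdlib-only equivalent can be sketched with a regex (a quick approximation, not a drop-in replacement; a real HTML parser is safer on messy markup):

```python
import re

def strip_tags_with_content(doc, which_ones=('script', 'style', 'head')):
    # Delete each listed element together with its entire contents,
    # mimicking w3lib's remove_tags_with_content for simple markup.
    for tag in which_ones:
        doc = re.sub(r'<{0}\b.*?</{0}>'.format(tag), '', doc, flags=re.S | re.I)
    return doc
```

For this client's pages the w3lib call is the better choice; the sketch just shows what it is doing under the hood.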
Done. Per the client's request, the txt output just adds a few extras (the opening and closing boilerplate plus the "please follow" line) on top of the original text; nothing more to explain there. ¥100 in hand, job finished!!!