Scraping Toutiao article content with Python, saving as html and txt (text and images only, with scripts and other irrelevant styling stripped)

August 3, 2023 09:39 ry 368

A request came in recently from a long-time client: scrape Toutiao article content and, given a list of article links, save each article as both txt and html. The txt file should contain only the title and body text; the html file only the text and images, nothing else. Before committing, I checked the job in two steps: first that the scraping itself was feasible (a quick look said yes), then that the txt and html outputs were doable. Let's take one article as the working example: https://www.toutiao.com/article/7261139655872053760/?log_from=ebf54cd1999ed_1690700635571. This is the effect the client wanted.

First, locate the elements we need. Inspecting the page shows that both the text and the images sit inside the article tag, which makes things easy: use BeautifulSoup to grab everything inside that tag and write it out as html. The client also wants inline scripts and styles removed, and w3lib handles that cleanly. Here is the full code:

import requests
from lxml import etree
from w3lib import html
from bs4 import BeautifulSoup
# Read the article URLs, one per line
def get_url():
    with open('url.txt', 'r') as file:
        url_txt = [i.replace('\n', '') for i in file.readlines()]

    return url_txt
# Read the header text that gets prepended to each txt file
def get_first():
    with open('文首.txt','r') as file1:
        first_txt = file1.read()
    return first_txt
# Read the footer text that gets appended to each txt file
def get_last():
    with open('文末.txt','r') as file2:
        last_txt = file2.read()
    return last_txt
def get_content(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36',
        'cookie': '__ac_signature=_02B4Z6wo00f01qnEKuQAAIDBGTXGLitFe3Kp5C5AAMmy38; tt_webid=7190277221030151738; ttcid=e1d5ef5437314716827d09f21737a06a56; csrftoken=40b61363de2657e72481a9f8f81c0aef; _ga=GA1.1.202580012.1679885191; s_v_web_id=verify_lheongde_x2wf1hFh_s9wk_46My_BSzy_S4eDXDyyoxpA; local_city_cache=%E6%9D%AD%E5%B7%9E; _ga_QEHZPBE5HH=GS1.1.1683814921.8.1.1683816740.0.0.0; tt_scid=ozMsfquvjg-0JdqBW0xusogXYFZEnSAJ5IJPZkoLk0XYGBRwa-Ab6DegwTTU0SOj54ec; ttwid=1%7CwS8g-N7aA8D-IW5M8mzrHCauoWTwQNq8oOnOVXZ-E6w%7C1683816740%7Ca0ba880c2428087231c9b53da6d413437f268329c36c699ce1f214f9684054af; msToken=AzNumaBxvjVW0CM3hP91nBV3Az9E8SAMMqVpLpARWaPN9zvYVxu1susDE1d35bnIvnHyt1O1oahPlr75hAuAJfTnsjWW2he-ZJZvToUcDe0=',

    }
    r1 = requests.get(url, headers=headers).text
    soup = BeautifulSoup(r1, 'html.parser')

    # The article body (text and images) all sits inside the <article> tag
    article = soup.find('article')

    html1 = etree.HTML(r1)
    title = ''.join(html1.xpath('//div[@class="article-content"]/h1/text()'))
    # Strip <script>, <style> and <head> together with their content via w3lib
    text = "<h1>" + title + "</h1>" + html.remove_tags_with_content(
        str(article), which_ones=("script", "style", "head"))

    # Plain-text version: every text node inside the article's <p> tags
    txt = ''.join([i + '\n' for i in html1.xpath('//div[@class="article-content"]//p//text()')])
    return txt, title, text
if __name__ == '__main__':
    download_model = input('Save mode (0: txt only, 1: html only, 2: both): ')
    urls = get_url()
    first_txt = get_first()
    last_txt = get_last()
    # Promotional line the client wants inserted into every txt file (kept verbatim)
    follow_line = '在阅读文章前,麻烦您点下“关注”,方便您后续讨论和分享,感谢您的支持,我将每天陪伴你左右。\n'
    for num, url in enumerate(urls, start=1):
        print('Fetching link {}'.format(num))
        try:
            txt, title, text = get_content(url)

            if txt and title and text:
                print('Title:', title)
                # Modes 0 and 2 write the txt file
                if download_model != '1':
                    with open('./已下载/{}.txt'.format(title), 'w', encoding='utf-8') as file4:
                        file4.write(url + '\n' + first_txt + '\n' + follow_line + txt + last_txt)
                # Modes 1 and 2 write the html file
                if download_model != '0':
                    with open('./html/{}.html'.format(title), 'w', encoding='utf-8') as file5:
                        file5.write(text)
            else:
                print('Link unavailable, skipping')
        except Exception as e:
            print('Failed to fetch this link, skipping:', e)

    input('\nDone. Press Enter to exit...')
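One caveat worth noting: article titles are used directly as file names above, but Toutiao titles can contain characters that are illegal in file names (slashes, question marks, colons and so on), which would make the open() calls fail or write to the wrong path. A minimal guard could look like the sketch below (the helper name safe_filename is my own, not part of the client script):

```python
import re

def safe_filename(title: str) -> str:
    # Replace characters that are illegal on Windows/Unix file systems,
    # plus stray newlines, and fall back to a placeholder for empty titles
    cleaned = re.sub(r'[\\/:*?"<>|\n\r]', '_', title).strip()
    return cleaned or 'untitled'
```

Wrapping the title at each write site, e.g. open('./已下载/{}.txt'.format(safe_filename(title)), ...), makes the save step robust against odd titles.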

Done. The txt output adds a few extra bits on top of the original text per the client's request; I won't go into the details. ¥100 in hand, job finished!

If this code helped you a lot, please consider leaving a tip to help cover server costs. Many thanks!

