PHP News Scraping Guide: Fetching List and Detail Pages and Solving Relative-Path and Content-Extraction Problems

2025-03-21

How can PHP collect a news list and its detail pages efficiently while handling relative paths and content extraction?

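This article explains how to use PHP, cURL, and regular expressions to collect a news list and the corresponding detail pages from a target site, focusing on two common problems: relative paths and content extraction.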

First, fetch the HTML of the news list page. Pass the list page URL (for example, http://www.example.com/news) to PHP's cURL functions to retrieve the page source.
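The fetch step could be wrapped in a small helper. The sketch below is one possible shape, not the article's own code: the function name `fetch_html`, the timeout values, and the User-Agent string are all assumptions made for illustration.

```php
<?php
// Hypothetical helper (not from the original article):
// fetch a page and return its HTML, or null on failure.
function fetch_html(string $url, int $timeout = 10): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body as a string instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 5,    // assumed value: seconds to wait for the connection
        CURLOPT_TIMEOUT        => $timeout,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; NewsCollector/1.0)', // assumed UA string
    ]);

    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    // Treat transport errors and HTTP error statuses as failures.
    if ($body === false || $status >= 400) {
        return null;
    }
    return $body;
}
```

A call such as `fetch_html('http://www.example.com/news')` would then return the list page source, or `null` if the request failed. Without `CURLOPT_RETURNTRANSFER`, `curl_exec()` prints the response and returns `true`, so there would be no string to parse.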

Challenge 1: Handling relative news links

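News links on the list page are usually relative paths, for example /article/40958.html. To obtain a full URL, the site's domain has to be joined with the relative path. The regular expression href="(.+?)" can extract the link, but it yields only the relative path, so further processing in PHP is needed: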

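```php
$ch = curl_init('http://www.example.com/news'); // list page URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
$html = curl_exec($ch); // fetch the page HTML

// Extract the href of every <a> tag
preg_match_all('/<a[^>]+href="(.+?)"/i', $html, $matches);
$news_links = [];
$base_url = 'http://www.example.com'; // replace with the target site's domain

foreach ($matches[1] as $relative_path) {
    // Join domain and path, adding a leading slash when it is missing
    $absolute_url = $base_url . ($relative_path[0] == '/' ? $relative_path : '/' . $relative_path);
    $news_links[] = $absolute_url;
}
```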

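This code first uses a regular expression to extract the link from every `<a>` tag, then loops over the extracted relative paths, joins each one with the site domain to form a full URL, and stores it in the `$news_links` array. Note that the concatenation assumes every extracted link is relative: an `href` that is already absolute would get the domain prefixed a second time.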
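Simple concatenation can be replaced by a resolver that also handles absolute and protocol-relative links. The following is a minimal sketch under stated assumptions: `resolve_url` is a hypothetical helper name, and it deliberately ignores `../` segments, query-only links, fragments, and non-HTTP schemes such as `mailto:`.

```php
<?php
// Hypothetical helper (not from the original article):
// resolve $href against the URL of the page it was found on.
function resolve_url(string $page_url, string $href): string
{
    // Already absolute (http:// or https://): return unchanged.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }

    $parts  = parse_url($page_url);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Protocol-relative link such as //cdn.example.com/a.html
    if (strpos($href, '//') === 0) {
        return $scheme . ':' . $href;
    }

    // Root-relative link such as /article/40958.html
    if (strpos($href, '/') === 0) {
        return $origin . $href;
    }

    // Document-relative link: resolve against the directory of the current page.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}
```

Resolving against the page URL rather than a fixed domain keeps links correct when the list page sits in a subdirectory, for example `http://www.example.com/news/`.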


**Challenge 2: Extracting detail-page content efficiently**

Next, loop over each link in the `$news_links` array, use cURL again to fetch the HTML of the detail page, and extract the news title and body.

```php
foreach ($news_links as $news_link) {
    curl_setopt($ch, CURLOPT_URL, $news_link);
    $detail_html = curl_exec($ch);
    if ($detail_html === false) {
        continue; // skip pages that failed to download
    }

    // Extract the title (adjust the pattern to the actual HTML structure).
    // "#" is used as the delimiter so the "/" in closing tags needs no escaping.
    preg_match('#<title>(.*?)</title>#is', $detail_html, $title_match);
    $title = isset($title_match[1]) ? trim($title_match[1]) : '';

    // Extract the body (adjust the pattern to the actual HTML structure)
    preg_match('#<div class="content text-xs">(.+?)</div>#is', $detail_html, $content_match);
    $content = isset($content_match[1]) ? strip_tags($content_match[1]) : ''; // strip_tags() removes HTML tags

    // Print the title, full URL, and content
    echo "Title: " . $title . "<br>";
    echo "Link: " . $news_link . "<br>";
    echo "Content: " . $content . "<br><br>";
}
```
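This code loops over each news link, downloads the detail page, and extracts the title and body with regular expressions (both patterns must be adjusted to the actual HTML structure). `strip_tags()` removes the HTML tags and leaves plain text, and the extracted information is then printed. Among the `is` modifiers, `i` makes the match case-insensitive and `s` lets `.` match newlines.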
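The effect of the `s` modifier and of `strip_tags()` can be checked on a small sample. The markup below is invented for illustration. The extra entity decoding and whitespace cleanup are an assumption about what "plain text" should look like, not part of the original method.

```php
<?php
// Hypothetical sample standing in for a fetched detail page.
$html = "<div class=\"content text-xs\">\n<p>First line</p>\n<p>Second &amp; last</p>\n</div>";

// Without the s modifier, "." stops at newlines, so nothing matches here.
$without_s = preg_match('#<div class="content text-xs">(.+?)</div>#i', $html, $m1);

// With the s modifier, "." also matches newlines.
$with_s = preg_match('#<div class="content text-xs">(.+?)</div>#is', $html, $m2);

// strip_tags() removes markup but leaves entities such as &amp; untouched,
// so decode them and collapse runs of whitespace afterwards.
$plain = html_entity_decode(strip_tags($m2[1]), ENT_QUOTES | ENT_HTML5, 'UTF-8');
$text  = trim(preg_replace('/\s+/u', ' ', $plain));

echo $text, PHP_EOL;
```

Because `(.+?)` is non-greedy, the match ends at the first `</div>` it meets, so a body that contains nested `<div>` elements would be cut short.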
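**Summary**

This approach collects news with cURL and regular expressions. For complex HTML structures, however, a DOM parser such as DOMDocument is recommended: it makes the code easier to read and maintain and reduces the risk of regular-expression errors. Always adjust the regular expressions to the target site's HTML structure, and remember to replace `'http://www.example.com'` with the actual domain of the target site.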
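As a rough illustration of the DOM-based alternative, the sketch below extracts the same two fields with `DOMDocument` and `DOMXPath`. The sample markup and the choice to match on the `content` class are assumptions. A real page needs its own XPath expressions, and non-UTF-8 or undeclared encodings may need extra handling.

```php
<?php
// Hypothetical sample markup standing in for a fetched detail page.
$detail_html = <<<'HTML'
<html><head><title>Sample headline</title></head>
<body>
  <div class="content text-xs"><p>First paragraph.</p><div class="quote">Quoted.</div><p>Second paragraph.</p></div>
</body></html>
HTML;

libxml_use_internal_errors(true); // real-world HTML is rarely valid; collect warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($detail_html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Title: text of the first <title> element, if any.
$title_node = $xpath->query('//title')->item(0);
$title = $title_node ? trim($title_node->textContent) : '';

// Body: first <div> whose class list contains the token "content".
$content_node = $xpath->query(
    '//div[contains(concat(" ", normalize-space(@class), " "), " content ")]'
)->item(0);
$content = $content_node ? trim($content_node->textContent) : '';

echo $title, PHP_EOL, $content, PHP_EOL;
```

Unlike the non-greedy regular expression, the DOM query returns the whole element, so text after the nested `<div>` is not lost.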