Fetching Web Pages over HTTPS in Java

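If the site does not require login, you can fetch its pages directly. If it does, log in first as described in the previous article, "Implementing Website Login over HTTPS in Java" (Java使用HTTPS登录网站代码实现), and then fetch the pages with the logged-in client.

The implementation is as follows: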

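    /**
     * Fetches one detail page and returns its HTML as a string.
     *
     * @param httpClient the client to send the request with (already logged in, if the site requires login)
     * @param pageNumber the value sent as the "id" query parameter
     * @return the HTML of the page
     * @throws Exception if the request fails or the response body cannot be read
     */
    private String grabPage(CloseableHttpClient httpClient, int pageNumber) throws Exception {
        HttpGet httpGet = new HttpGet(DETAIL_PAGE_PREFIX + "?id=" + pageNumber);
        httpGet.setHeader("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36");
        // Execute the request; try-with-resources closes the response even if reading the body fails
        try (CloseableHttpResponse response = httpClient.execute(httpGet)) {
            // Read the response body as UTF-8 text
            HttpEntity entity = response.getEntity();
            return EntityUtils.toString(entity, "utf-8");
        }
    }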

The CloseableHttpClient passed to the method above is the one obtained after logging in. If the site does not require login, create one yourself, for example:

CloseableHttpClient httpClient = HttpClients.createDefault();

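For the imports this code needs and the changes required in pom.xml, see the previous article, "Implementing Website Login over HTTPS in Java" (Java使用HTTPS登录网站代码实现).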
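
If adding the Apache HttpClient dependency is not an option and the site needs no login, roughly the same GET request can be sketched with the JDK's built-in java.net.http client (Java 11 or later). This is an alternative, not the approach used in this article. The class name JdkPageFetcher, the detailPagePrefix parameter, and the status check are assumptions added for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper using only the JDK; not part of the article's original code.
final class JdkPageFetcher {

    // The same browser-like User-Agent string that grabPage sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36";

    /** Builds the GET request for one detail page: prefix + "?id=" + pageNumber. */
    static HttpRequest buildRequest(String detailPagePrefix, int pageNumber) {
        return HttpRequest.newBuilder(URI.create(detailPagePrefix + "?id=" + pageNumber))
                .header("User-Agent", USER_AGENT)
                .GET()
                .build();
    }

    /** Sends the request and returns the response body decoded as UTF-8. */
    static String grabPage(HttpClient client, String detailPagePrefix, int pageNumber)
            throws IOException, InterruptedException {
        HttpResponse<String> response = client.send(
                buildRequest(detailPagePrefix, pageNumber),
                HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        // Assumption: anything other than 200 is treated as a failure.
        // The Apache-based grabPage above does not perform this check.
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

A call would look like JdkPageFetcher.grabPage(HttpClient.newHttpClient(), "https://example.com/detail", 1), where the URL is a placeholder. Splitting request construction from sending keeps the URL and header logic checkable without touching the network.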
