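Honestly, having to write a crawler at all is a last resort, not a convenient solution. I won't go into how to use the tools here; tool walkthroughs aren't what these notes are meant to convey~

(What if I get accused of teaching people bad habits!? XD)

When I need to write a crawler, I first test the request once with the curl command-line tool: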

curl -b cookie -c cookie https://www.mxp.tw/login -d "account=admin" -d "password=123456" --referer "https://www.mxp.tw" -A "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5" -i > test.html

Or the PHP version:

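<?php
$username = 'admin';
$password = '123456';

$path = tempnam(sys_get_temp_dir(), "mxp");

$url = "https://www.mxp.tw/login";
// http_build_query() URL-encodes the values so special characters survive.
$postinfo = http_build_query(array("account" => $username, "password" => $password));
$cookie_file_path = $path . "-cookie.txt";

$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_URL, $url);
// Certificate checks are switched off for quick testing only;
// leave them enabled for anything beyond an experiment.
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
// Setting a cookie jar also enables cookie handling on this handle,
// so the login session carries over to the later requests.
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file_path);
curl_setopt($ch, CURLOPT_USERAGENT,
    "Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.12) Gecko/20050915 Firefox/1.0.7");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, 'https://www.mxp.tw');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
// CURLOPT_POST already makes this a POST; no CURLOPT_CUSTOMREQUEST needed.
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postinfo);
curl_exec($ch);

// Switch back to GET so the credentials are not re-posted to every page
// (this assumes the getuser endpoints expect GET).
curl_setopt($ch, CURLOPT_HTTPGET, true);

$html = "";
// While logged in, loop (or recurse) over whatever pages the logic needs.
for ($i = 0; $i < 10; $i++) {
    curl_setopt($ch, CURLOPT_URL, "https://www.mxp.tw/getuser/{$i}");
    $html .= curl_exec($ch);
}

curl_close($ch);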

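Once the result looks right, I decide what the crawler is actually for. If it is purely data-oriented, I'll probably just write a Node.js script to run it, which uses:

  1. Request
  2. CheerIO

Of these two, one wraps the HTTP requests and the other parses and samples the content. Having CSS selectors available in JavaScript is just so satisfying~

If it needs to be packaged as a service that is easy to carry around and easy to reproduce in another environment, I switch to PHP, and then it depends on where it will run. On WordPress, the built-in "wp_remote_request" function works well for wrapping requests; if you want to write your own request client, have a look at the CurlWrapper package.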

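OK, here's the key question: what about parsing the content? Apart from HtmlPageDom, there don't seem to be many good ready-made wheels. If you roll your own, you still have to load the HTML with DOMDocument and then query it, typically via XPath (see also "[PHP] 爬蟲使用 DOMDocument 解析網站時 UTF-8 亂碼", on garbled UTF-8 when parsing sites with DOMDocument). Oh, and even if you're a regular-expression wizard, I doubt you'd want to scrape this way with regexes; performance matters.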
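
For the roll-your-own route, a minimal sketch of loading HTML into DOMDocument and querying it might look like the following. The sample markup and the "//h1" query are made up for illustration, and prepending an XML encoding hint is just one commonly used workaround for the garbled UTF-8 problem, not necessarily the one the linked note describes.

```php
<?php
// Hypothetical sample markup with no charset declaration of its own.
$html = '<html><body><h1 class="title">爬蟲筆記</h1><p>UTF-8 內容</p></body></html>';

$doc = new DOMDocument();
// Collect parser warnings internally instead of printing them;
// real-world HTML is rarely well formed.
libxml_use_internal_errors(true);
// The prepended encoding hint tells libxml to read the bytes as UTF-8.
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);
libxml_clear_errors();

// Query the parsed tree with XPath and read the text of the first match.
$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//h1');
$title = $nodes->item(0)->textContent;

echo $title, PHP_EOL;
```

Dropping the encoding hint is an easy way to see the mojibake for yourself on pages that don't declare a charset.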

Another helper I've come across for the roll-your-own route is this function:

<?php
function selector_to_xpath($selector) {
    // remove spaces around operators
    $selector = preg_replace('/\s*>\s*/', '>', $selector);
    $selector = preg_replace('/\s*~\s*/', '~', $selector);
    $selector = preg_replace('/\s*\+\s*/', '+', $selector);
    $selector = preg_replace('/\s*,\s*/', ',', $selector);
    $selectors = preg_split('/\s+(?![^\[]+\])/', $selector);
    foreach ($selectors as &$selector) {
        // ,
        $selector = preg_replace('/,/', '|descendant-or-self::', $selector);
        // input:checked, :disabled, etc.
        $selector = preg_replace('/(.+)?:(checked|disabled|required|autofocus)/', '\1[@\2="\2"]', $selector);
        // input:autocomplete, :autocomplete
        $selector = preg_replace('/(.+)?:(autocomplete)/', '\1[@\2="on"]', $selector);
        // input:button, input:submit, etc.
        $selector = preg_replace('/:(text|password|checkbox|radio|button|submit|reset|file|hidden|image|datetime|datetime-local|date|month|time|week|number|range|email|url|search|tel|color)/', 'input[@type="\1"]', $selector);
        // foo[id]
        $selector = preg_replace('/(\w+)\[([_\w-]+[_\w\d-]*)\]/', '\1[@\2]', $selector);
        // [id]
        $selector = preg_replace('/\[([_\w-]+[_\w\d-]*)\]/', '*[@\1]', $selector);
        // foo[id=foo]
        $selector = preg_replace('/\[([_\w-]+[_\w\d-]*)=[\'"]?(.*?)[\'"]?\]/', '[@\1="\2"]', $selector);
        // [id=foo]
        $selector = preg_replace('/^\[/', '*[', $selector);
        // div#foo
        $selector = preg_replace('/([_\w-]+[_\w\d-]*)\#([_\w-]+[_\w\d-]*)/', '\1[@id="\2"]', $selector);
        // #foo
        $selector = preg_replace('/\#([_\w-]+[_\w\d-]*)/', '*[@id="\1"]', $selector);
        // div.foo
        $selector = preg_replace('/([_\w-]+[_\w\d-]*)\.([_\w-]+[_\w\d-]*)/', '\1[contains(concat(" ",@class," ")," \2 ")]', $selector);
        // .foo
        $selector = preg_replace('/\.([_\w-]+[_\w\d-]*)/', '*[contains(concat(" ",@class," ")," \1 ")]', $selector);
        // div:first-child
        $selector = preg_replace('/([_\w-]+[_\w\d-]*):first-child/', '*/\1[position()=1]', $selector);
        // div:last-child
        $selector = preg_replace('/([_\w-]+[_\w\d-]*):last-child/', '*/\1[position()=last()]', $selector);
        // :first-child
        $selector = str_replace(':first-child', '*/*[position()=1]', $selector);
        // :last-child
        $selector = str_replace(':last-child', '*/*[position()=last()]', $selector);
        // :nth-last-child
        $selector = preg_replace('/:nth-last-child\((\d+)\)/', '[position()=(last() - (\1 - 1))]', $selector);
        // div:nth-child
        $selector = preg_replace('/([_\w-]+[_\w\d-]*):nth-child\((\d+)\)/', '*/*[position()=\2 and self::\1]', $selector);
        // :nth-child
        $selector = preg_replace('/:nth-child\((\d+)\)/', '*/*[position()=\1]', $selector);
        // :contains(Foo)
        $selector = preg_replace('/([_\w-]+[_\w\d-]*):contains\((.*?)\)/', '\1[contains(string(.),"\2")]', $selector);
        // >
        $selector = preg_replace('/>/', '/', $selector);
        // ~
        $selector = preg_replace('/~/', '/following-sibling::', $selector);
        // +
        $selector = preg_replace('/\+([_\w-]+[_\w\d-]*)/', '/following-sibling::\1[position()=1]', $selector);
        $selector = str_replace(']*', ']', $selector);
        $selector = str_replace(']/*', ']', $selector);
    }
    // ' '
    $selector = implode('/descendant::', $selectors);
    $selector = 'descendant-or-self::' . $selector;
    // :scope
    $selector = preg_replace('/(((\|)?descendant-or-self::):scope)/', '.\3', $selector);
    // $element
    $sub_selectors = explode(',', $selector);
    foreach ($sub_selectors as $key => $sub_selector) {
        $parts = explode('$', $sub_selector);
        $sub_selector = array_shift($parts);
        if (count($parts) && preg_match_all('/((?:[^\/]*\/?\/?)|$)/', $parts[0], $matches)) {
            $results = $matches[0];
            $results[] = str_repeat('/..', count($results) - 2);
            $sub_selector .= implode('', $results);
        }
        $sub_selectors[$key] = $sub_selector;
    }
    $selector = implode(',', $sub_selectors);
    return $selector;
}

It converts the CSS selector you want to scan for into XPath, so you can catch your breath a little XD
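
To get a feel for what the converted expressions do, here is a small sketch that runs one by hand. The XPath string is written out manually in the shape the helper's "div.foo" rule is meant to produce, and the sample markup is made up; treat it as an illustration rather than output captured from the helper.

```php
<?php
// Hypothetical markup: one real match, one near miss on the class name,
// and one element with the right class but the wrong tag.
$html = '<div class="foo bar">A</div><div class="foobar">B</div><p class="foo">C</p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

// Hand-written XPath in the shape of the "div.foo" rule. Padding @class
// with spaces makes "foo" match only as a whole class token.
$query = 'descendant-or-self::div[contains(concat(" ",@class," ")," foo ")]';

$xpath = new DOMXPath($doc);
$matches = array();
foreach ($xpath->query($query) as $node) {
    $matches[] = $node->textContent;
}

echo implode(',', $matches), PHP_EOL;
```

The concat() padding is the interesting part: a plain contains(@class, "foo") would also pick up "foobar", which is not what a CSS class selector means.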

Author: Chun
