Cheerio v1.2.0 Documentation

Cheerio is a fast, flexible, and elegant library for parsing and manipulating HTML and XML on the server. It implements a subset of core jQuery, giving developers a familiar API that is tailored to server-side environments.

Features in the current version (1.2.0)

Core loading methods

The current version offers several ways to load and parse HTML/XML documents:

import * as fs from 'node:fs';
import * as cheerio from 'cheerio';

// Basic loading from a string
const $ = cheerio.load('<h2 class="title">Hello world</h2>');

// Loading from a buffer with encoding detection
const buffer = fs.readFileSync('index.html');
const $fromBuffer = cheerio.loadBuffer(buffer);

// Loading from a URL with automatic encoding detection
const $fromURL = await cheerio.fromURL('https://example.com');

// Stream-based loading: write string chunks to the returned writable stream;
// the callback runs once the stream has ended and the document is parsed
const stream = cheerio.stringStream({}, (err, $doc) => {
  if (!err) {
    console.log($doc('h1').text());
  }
});

Improved TypeScript support

Version 1.2.0 ships comprehensive TypeScript definitions with improved type safety:

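import * as cheerio from 'cheerio';
import type { CheerioAPI, Cheerio } from 'cheerio';
// Assumption: since 1.0, DOM node types are imported from domhandler
import type { Element } from 'domhandler';

// Strongly typed element selection
const $: CheerioAPI = cheerio.load(html);
const elements: Cheerio<Element> = $('.my-class');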

New extract API

A feature for pulling structured data out of an HTML document: you pass a map of keys to selectors and get back an object of the same shape:

const data = $.extract({
  title: 'h1',
  links: [{ selector: 'a', value: 'href' }],
  metadata: {
    selector: '.meta',
    value: {
      author: '.author',
      date: '.date'
    }
  }
});

Advanced URL handling

Improved support for resolving URLs against a baseURI:

const $ = cheerio.load(html, { 
  baseURI: 'https://example.com/page/' 
});

// Automatically resolves relative URLs
$('a').prop('href'); // Returns absolute URL
$('img').prop('src'); // Returns absolute URL

Core features by category

DOM manipulation

jQuery-style element selection with CSS selectors
Comprehensive attribute and property handling
Full support for DOM traversal methods
Inserting, removing, and modifying elements

Form handling

// Serialize forms to URL-encoded strings
$('form').serialize(); // 'name=value&email=test%40example.com'

// Get form data as structured arrays
$('form').serializeArray();
// [{ name: 'username', value: 'john' }, ...]

CSS and styling

// Get/set CSS properties
$('.element').css('color', 'red');
$('.element').css(['margin', 'padding']); // Get multiple properties

// Class manipulation
$('.item').addClass('active selected');
$('.item').removeClass('old-class');
$('.item').toggleClass('visible');

Data Attributes

// HTML5 data-* attribute support with automatic type coercion
$('.widget').data('config'); // Parses JSON automatically
$('.widget').data('count', 42); // Set data programmatically

Migrating from older versions

From 0.x to 1.x

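Breaking: A current Node.js release is required and older versions are no longer supported (1.0.0 requires Node.js 18.17 or later; check the engines field of the version you install)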
Breaking: Some internal APIs have changed
Improved: Better TypeScript support throughout
New: Stream-based parsing for better memory efficiency

Notable changes

// Old way (0.x) - still works but discouraged
const cheerio = require('cheerio');
const $ = cheerio.load(html);

// New way (1.x) - recommended
import * as cheerio from 'cheerio';
const $ = cheerio.load(html);

Performance improvements

Version 1.2.0 includes significant performance improvements:

Faster parsing: an improved HTML parser with better error handling
Memory efficiency: lower memory use for large documents
Streaming support: feed large documents to the parser chunk by chunk rather than buffering the whole source first

// Stream processing for large files: pipe or write raw bytes into the stream
const stream = cheerio.decodeStream({
  encoding: { defaultEncoding: 'utf8' }
}, (err, $) => {
  // Called once, after the stream has ended and the document is parsed
});

Advanced features

Custom parser options

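// htmlparser2 options go under the `xml` key in 1.x
const $xml = cheerio.load(xml, {
  xml: {
    xmlMode: true,         // Parse as XML
    decodeEntities: false, // Don't decode entities
  },
});

// parse5 option for HTML: parse <noscript> contents as markup
const $html = cheerio.load(html, { scriptingEnabled: false });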

Loading over the network with options

const $ = await cheerio.fromURL('https://api.example.com/data', {
  requestOptions: {
    headers: { 'User-Agent': 'MyBot/1.0' }
  },
  encoding: { defaultEncoding: 'utf8' }
});

Best practices

Use TypeScript: take advantage of the comprehensive type definitions
Stream large documents: use the streaming API for better memory management
Use the extract API: use the extract method for structured data extraction
Set baseURI: when scraping websites, set baseURI so relative URLs resolve correctly

Differences between browser and server

Cheerio is designed for server-side use: it drops jQuery's browser-specific features and adds server-oriented capabilities such as:

No browser DOM inconsistencies
Better error handling for malformed HTML
Memory-efficient parsing
Stream processing
Automatic encoding detection

This makes Cheerio a good fit for web scraping, server-side rendering, and any HTML/XML processing task in a Node.js environment.
