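Comprehensive Cheerio Tutorial Guide
Guide 1: How to Extract Data from HTML Tables with Cheerio
Extracting table data is one of the most common tasks when scraping HTML documents. Cheerio makes it straightforward to parse table structures and pull out meaningful data. This guide covers several techniques for extracting data from HTML tables with Cheerio.
Setting Up the Environment
First, install Cheerio in your project: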
npm install cheerio
Understanding the Basic Table Structure
Before getting into extraction techniques, look at a typical HTML table structure:
<table id="products">
<thead>
<tr>
<th>Product Name</th>
<th>Price</th>
<th>Stock</th>
</tr>
</thead>
<tbody>
<tr>
<td>Laptop</td>
<td>$999</td>
<td>15</td>
</tr>
<tr>
<td>Mouse</td>
<td>$25</td>
<td>50</td>
</tr>
</tbody>
</table>
Step 1: Loading the HTML and Selecting the Table
Start by loading the HTML content with Cheerio:
import * as cheerio from 'cheerio';
const html = `
<table id="products">
<thead>
<tr>
<th>Product Name</th>
<th>Price</th>
<th>Stock</th>
</tr>
</thead>
<tbody>
<tr>
<td>Laptop</td>
<td>$999</td>
<td>15</td>
</tr>
<tr>
<td>Mouse</td>
<td>$25</td>
<td>50</td>
</tr>
</tbody>
</table>
`;
const $ = cheerio.load(html);
Step 2: Extracting Table Headers
The table headers describe the structure of the data, so extract them first:
const headers: string[] = [];
$('#products thead th').each((index, element) => {
headers.push($(element).text().trim());
});
console.log('Headers:', headers);
// Output: ['Product Name', 'Price', 'Stock']
Step 3: Extracting Row Data
Next, extract the actual table data. There are several approaches, depending on your needs:
Method 1: Row-by-Row Extraction
const rows: string[][] = [];
$('#products tbody tr').each((index, row) => {
const rowData: string[] = [];
$(row).find('td').each((cellIndex, cell) => {
rowData.push($(cell).text().trim());
});
rows.push(rowData);
});
console.log('Rows:', rows);
// Output: [['Laptop', '$999', '15'], ['Mouse', '$25', '50']]
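Headers and rows are often more useful once they are combined. The sketch below pairs each row from Method 1 with the headers from Step 2 to build keyed records. `toRecords` is a hypothetical helper name, not part of Cheerio, and it assumes each row has at most as many cells as there are headers (missing cells become empty strings).

```typescript
// Hypothetical helper: pair header names with row cells to build keyed records.
function toRecords(headers: string[], rows: string[][]): Record<string, string>[] {
  return rows.map(row => {
    const record: Record<string, string> = {};
    headers.forEach((header, i) => {
      // Fall back to an empty string when a row has fewer cells than headers
      record[header] = row[i] ?? '';
    });
    return record;
  });
}

// Same values as the output of Step 2 and Method 1
const tableHeaders = ['Product Name', 'Price', 'Stock'];
const tableRows = [['Laptop', '$999', '15'], ['Mouse', '$25', '50']];
const records = toRecords(tableHeaders, tableRows);
console.log(records);
```

Keying by header text keeps the extraction independent of column order, at the cost of depending on the exact header labels.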
Method 2: Building Structured Objects
interface Product {
productName: string;
price: string;
stock: number;
}
const products: Product[] = [];
$('#products tbody tr').each((index, row) => {
const cells = $(row).find('td');
const product: Product = {
productName: $(cells[0]).text().trim(),
price: $(cells[1]).text().trim(),
stock: parseInt($(cells[2]).text().trim(), 10)
};
products.push(product);
});
console.log('Products:', products);
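Method 2 keeps `price` as a string such as `$999`. If numeric prices are needed, a small parser can strip the currency symbol and thousands separators first. This is a minimal sketch that assumes US-style formatting (for example `$1,299.50`); `parsePrice` is a hypothetical helper, not part of Cheerio.

```typescript
// Hypothetical helper: convert a price string like "$1,299.50" to a number.
// Assumes US-style formatting; returns null when no number can be parsed.
function parsePrice(text: string): number | null {
  // Drop everything except digits, the decimal point, and a minus sign
  const cleaned = text.replace(/[^0-9.-]/g, '');
  if (cleaned === '') return null;
  const value = Number(cleaned);
  return Number.isNaN(value) ? null : value;
}

console.log(parsePrice('$999'));      // 999
console.log(parsePrice('$1,299.50')); // 1299.5
console.log(parsePrice('N/A'));       // null
```

Returning `null` rather than `NaN` forces callers to handle unparseable cells explicitly.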
Step 4: Handling Complex Tables
Real-world tables may contain merged cells, nested elements, and special formatting. One way to handle them:
const complexHtml = `
<table class="sales-data">
<tr>
<td rowspan="2">Q1</td>
<td>January</td>
<td>$<span class="amount">15000</span></td>
<td class="status success">✓</td>
</tr>
<tr>
<td>February</td>
<td>$<span class="amount">18000</span></td>
<td class="status pending">⏳</td>
</tr>
</table>
`;
const $complex = cheerio.load(complexHtml);
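interface SalesRecord {
quarter: string;
month: string;
amount: string;
status: 'Complete' | 'Pending';
}
const salesData: SalesRecord[] = [];
let currentQuarter = ''; // Carried down to the rows covered by a rowspan cell
$complex('.sales-data tr').each((index, row) => {
const cells = $complex(row).find('td');
// Handle rowspan: a row that starts a group has 4 cells, a covered row has 3
const offset = cells.length === 4 ? 1 : 0;
if (offset === 1) currentQuarter = $complex(cells[0]).text().trim();
salesData.push({
quarter: currentQuarter,
month: $complex(cells[offset]).text().trim(),
amount: $complex(cells[offset + 1]).find('.amount').text(),
status: $complex(cells[offset + 2]).hasClass('success') ? 'Complete' : 'Pending'
});
});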
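The cell-count check above works for this specific table. For tables where any column may use `rowspan`, one option is to expand the spans into a full grid first. The sketch below is a simplified illustration that works on a plain cell model rather than on Cheerio objects; the `Cell` type and `expandRowspans` are hypothetical names, and `colspan` is deliberately not handled. With Cheerio, the `rowspan` value for each cell could be read from `$(cell).attr('rowspan')`, defaulting to 1 when the attribute is absent.

```typescript
// Hypothetical cell model: the cell text plus its rowspan (1 when the attribute is absent).
interface Cell {
  text: string;
  rowspan: number;
}

// Expand rowspan cells so that every output row has a value in every column.
function expandRowspans(rows: Cell[][]): string[][] {
  const grid: string[][] = [];
  // pending[col] holds a value carried down from an earlier row
  // and the number of further rows it still covers
  const pending: ({ text: string; remaining: number } | undefined)[] = [];

  for (const row of rows) {
    const out: string[] = [];
    let next = 0; // index of the next unread cell in the source row
    let col = 0;  // column being filled in the output row
    while (next < row.length || (pending[col]?.remaining ?? 0) > 0) {
      const carried = pending[col];
      if (carried && carried.remaining > 0) {
        // This column is covered by a rowspan cell from a previous row
        out.push(carried.text);
        carried.remaining -= 1;
      } else {
        const cell = row[next++];
        out.push(cell.text);
        pending[col] = { text: cell.text, remaining: cell.rowspan - 1 };
      }
      col += 1;
    }
    grid.push(out);
  }
  return grid;
}

const expanded = expandRowspans([
  [{ text: 'Q1', rowspan: 2 }, { text: 'January', rowspan: 1 }, { text: '15000', rowspan: 1 }],
  [{ text: 'February', rowspan: 1 }, { text: '18000', rowspan: 1 }]
]);
console.log(expanded);
// [ [ 'Q1', 'January', '15000' ], [ 'Q1', 'February', '18000' ] ]
```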
Step 5: Advanced Filtering and Data Processing
Use Cheerio's selectors to filter and process the data:
// Extract only rows with specific conditions
const highStockProducts: { name: string; stock: number }[] = [];
$('#products tbody tr').each((index, row) => {
const stock = parseInt($(row).find('td:nth-child(3)').text(), 10);
if (stock > 20) {
highStockProducts.push({
name: $(row).find('td:first-child').text(),
stock: stock
});
}
});
// Extract data using attribute selectors
$('table[data-category="electronics"] tbody tr').each((index, row) => {
// Process electronics category tables only
});
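The same kind of filtering can also be written as a chain of `filter()` and `map()` calls instead of an `each()` loop with a manual `push`. The following sketch uses a trimmed copy of the products table; the variable name `highStockNames` is only illustrative.

```typescript
import * as cheerio from 'cheerio';

const $ = cheerio.load(`
  <table id="products">
    <tbody>
      <tr><td>Laptop</td><td>$999</td><td>15</td></tr>
      <tr><td>Mouse</td><td>$25</td><td>50</td></tr>
    </tbody>
  </table>
`);

// Keep only rows whose third cell holds a stock value above 20,
// then map the surviving rows to their product names.
const highStockNames: string[] = $('#products tbody tr')
  .filter((_, row) => parseInt($(row).find('td:nth-child(3)').text(), 10) > 20)
  .map((_, row) => $(row).find('td:first-child').text().trim())
  .get();

console.log(highStockNames); // [ 'Mouse' ]
```

Calling `get()` at the end converts the Cheerio collection returned by `map()` into a plain array.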
Complete Working Example
Here is a comprehensive example that demonstrates table data extraction:
import * as cheerio from 'cheerio';
// Sample HTML with a product catalog table
const html = `
<!DOCTYPE html>
<html>
<body>
<table id="product-catalog" class="data-table">
<thead>
<tr>
<th>ID</th>
<th>Product Name</th>
<th>Category</th>
<th>Price</th>
<th>Stock</th>
<th>Rating</th>
</tr>
</thead>
<tbody>
<tr data-id="1">
<td>001</td>
<td><a href="/laptop-pro">Laptop Pro</a></td>
<td>Electronics</td>
<td>$<span class="price">1299</span></td>
<td class="stock-high">25</td>
<td><span class="rating" data-score="4.5">★★★★☆</span></td>
</tr>
<tr data-id="2">
<td>002</td>
<td><a href="/wireless-mouse">Wireless Mouse</a></td>
<td>Electronics</td>
<td>$<span class="price">35</span></td>
<td class="stock-medium">12</td>
<td><span class="rating" data-score="4.2">★★★★☆</span></td>
</tr>
<tr data-id="3">
<td>003</td>
<td><a href="/desk-lamp">LED Desk Lamp</a></td>
<td>Office</td>
<td>$<span class="price">89</span></td>
<td class="stock-low">3</td>
<td><span class="rating" data-score="4.7">★★★★★</span></td>
</tr>
</tbody>
</table>
</body>
</html>
`;
interface Product {
id: string;
name: string;
category: string;
price: number;
stock: number;
rating: number;
url: string;
stockLevel: 'low' | 'medium' | 'high';
}
function extractTableData(html: string): Product[] {
const $ = cheerio.load(html);
const products: Product[] = [];
$('#product-catalog tbody tr').each((index, row) => {
const $row = $(row);
// Extract basic data
const id = $row.find('td:nth-child(1)').text().trim();
const nameElement = $row.find('td:nth-child(2) a');
const name = nameElement.text().trim();
const url = nameElement.attr('href') || '';
const category = $row.find('td:nth-child(3)').text().trim();
// Extract and parse price
const priceText = $row.find('.price').text();
const price = parseFloat(priceText);
// Extract stock with level detection
const stockCell = $row.find('td:nth-child(5)');
const stock = parseInt(stockCell.text().trim(), 10);
let stockLevel: 'low' | 'medium' | 'high' = 'medium';
if (stockCell.hasClass('stock-low')) stockLevel = 'low';
else if (stockCell.hasClass('stock-high')) stockLevel = 'high';
// Extract rating
const rating = parseFloat($row.find('.rating').attr('data-score') || '0');
products.push({
id,
name,
category,
price,
stock,
rating,
url,
stockLevel
});
});
return products;
}
// Usage
const products = extractTableData(html);
console.log('Extracted Products:', JSON.stringify(products, null, 2));
// Additional processing examples
const highRatedProducts = products.filter(p => p.rating >= 4.5);
const lowStockProducts = products.filter(p => p.stockLevel === 'low');
const expensiveProducts = products.filter(p => p.price > 100);
console.log(`Found ${highRatedProducts.length} high-rated products`);
console.log(`Found ${lowStockProducts.length} low-stock products`);
console.log(`Found ${expensiveProducts.length} expensive products`);
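The filters above can be extended into simple aggregations once the data is in plain objects. As a rough sketch, the helper below totals prices per category; `CatalogItem` and `totalsByCategory` are hypothetical names, and the sample values mirror the catalog table used in this example.

```typescript
// Minimal shape for this sketch; the full Product interface has more fields.
interface CatalogItem {
  name: string;
  category: string;
  price: number;
}

// Group items by category and total their prices.
function totalsByCategory(items: CatalogItem[]): Map<string, number> {
  const totals = new Map<string, number>();
  for (const item of items) {
    totals.set(item.category, (totals.get(item.category) ?? 0) + item.price);
  }
  return totals;
}

const catalog: CatalogItem[] = [
  { name: 'Laptop Pro', category: 'Electronics', price: 1299 },
  { name: 'Wireless Mouse', category: 'Electronics', price: 35 },
  { name: 'LED Desk Lamp', category: 'Office', price: 89 }
];
const categoryTotals = totalsByCategory(catalog);
console.log(categoryTotals.get('Electronics')); // 1334
console.log(categoryTotals.get('Office'));      // 89
```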
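This approach handles complex table structures, extracts several data types, and gives you a starting point for most table-scraping tasks with Cheerio.
Guide 2: How to Scrape Form Data and Input Values with Cheerio
Forms are a key component of web pages and hold valuable data such as user input, selections, and settings. This guide demonstrates how to extract form data, input values, and form structure using Cheerio's form-handling methods.
Understanding Form Elements
Forms contain a variety of input types, and each calls for a different extraction approach: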
<form id="user-registration">
<input type="text" name="username" value="john_doe">
<input type="email" name="email" value="john@example.com">
<input type="password" name="password" value="secret123">
<input type="checkbox" name="newsletter" checked>
<select name="country">
<option value="us" selected>United States</option>
<option value="ca">Canada</option>
</select>
<textarea name="bio">Software developer...</textarea>
</form>
Step 1: Setting Up Form Data Extraction
Start by loading the HTML and getting familiar with the form's structure:
import * as cheerio from 'cheerio';
const formHtml = `
<form id="contact-form" method="post" action="/submit">
<div class="form-group">
<label for="name">Full Name:</label>
<input type="text" id="name" name="fullName" value="Jane Smith" required>
</div>
<div class="form-group">
<label for="email">Email:</label>
<input type="email" id="email" name="email" value="jane@company.com">
</div>
<div class="form-group">
<label>Gender:</label>
<input type="radio" name="gender" value="male" id="male">
<label for="male">Male</label>
<input type="radio" name="gender" value="female" id="female" checked>
<label for="female">Female</label>
</div>
<div class="form-group">
<label for="interests">Interests:</label>
<input type="checkbox" name="interests" value="coding" checked>
<label>Coding</label>
<input type="checkbox" name="interests" value="design">
<label>Design</label>
<input type="checkbox" name="interests" value="marketing" checked>
<label>Marketing</label>
</div>
<div class="form-group">
<label for="country">Country:</label>
<select name="country" id="country">
<option value="">Select Country</option>
<option value="us" selected>United States</option>
<option value="uk">United Kingdom</option>
<option value="ca">Canada</option>
</select>
</div>
<div class="form-group">
<label for="message">Message:</label>
<textarea name="message" id="message" rows="4">Hello, I'm interested in your services...</textarea>
</div>
<button type="submit">Submit</button>
</form>
`;
const $ = cheerio.load(formHtml);
Step 2: Extracting Text Input Values
Text inputs, email fields, and password fields can be read with the val() method:
// Extract individual input values
const fullName = $('input[name="fullName"]').val();
const email = $('input[name="email"]').val();
console.log('Full Name:', fullName); // "Jane Smith"
console.log('Email:', email); // "jane@company.com"
// Extract all text-type inputs at once
const textInputs: Record<string, string> = {};
$('input[type="text"], input[type="email"], input[type="password"]').each((index, element) => {
const $input = $(element);
const name = $input.attr('name');
const value = $input.val() as string;
if (name) {
textInputs[name] = value || '';
}
});
console.log('Text Inputs:', textInputs);
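The `as string` cast above hides the fact that `val()` is typed to return a string, an array of strings, or `undefined`. A small normalizing function makes that handling explicit. This is a sketch only; `toSingleValue` is a hypothetical helper name, and keeping just the first array entry is an assumption that suits single-value fields.

```typescript
// Hypothetical helper: narrow the result of val() to a single string.
function toSingleValue(value: string | string[] | undefined): string {
  if (value === undefined) return '';
  // Arrays can appear for multi-value controls; keep the first entry here
  return Array.isArray(value) ? (value[0] ?? '') : value;
}

console.log(toSingleValue('Jane Smith')); // Jane Smith
console.log(toSingleValue(['a', 'b']));   // a
console.log(toSingleValue(undefined) === ''); // true
```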
Step 3: Handling Radio Buttons
For radio buttons, you need to check which option is selected:
// Get selected radio button value
const selectedGender = $('input[name="gender"]:checked').val();
console.log('Selected Gender:', selectedGender); // "female"
// Get all radio button options for a field
const genderOptions: { value: string; label: string; checked: boolean }[] = [];
$('input[name="gender"]').each((index, element) => {
const $radio = $(element);
const value = $radio.val() as string;
const isChecked = $radio.prop('checked') as boolean;
const label = $radio.next('label').text() || $radio.attr('id') || '';
genderOptions.push({
value,
label: label.trim(),
checked: isChecked
});
});
console.log('Gender Options:', genderOptions);
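Looking up the label with `next('label')` depends on the label being the element immediately after the input. When labels carry a `for` attribute, as they do in this form, matching on that attribute is less sensitive to layout. The sketch below assumes that ids contain no quote characters; `genderLabels` is an illustrative name.

```typescript
import * as cheerio from 'cheerio';

const $ = cheerio.load(`
  <form>
    <input type="radio" name="gender" value="male" id="male">
    <label for="male">Male</label>
    <input type="radio" name="gender" value="female" id="female" checked>
    <label for="female">Female</label>
  </form>
`);

// Resolve each radio button's label through the label's "for" attribute
const genderLabels: Record<string, string> = {};
$('input[name="gender"]').each((_, element) => {
  const id = $(element).attr('id');
  const value = $(element).attr('value');
  if (id && value) {
    genderLabels[value] = $(`label[for="${id}"]`).text().trim();
  }
});

console.log(genderLabels); // { male: 'Male', female: 'Female' }
```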
Step 4: Handling Checkboxes
Checkboxes can have multiple selected values:
// Get all checked checkbox values for a field
const selectedInterests: string[] = [];
$('input[name="interests"]:checked').each((index, element) => {
selectedInterests.push($(element).val() as string);
});
console.log('Selected Interests:', selectedInterests); // ["coding", "marketing"]
// Get detailed checkbox information
const interestOptions: { value: string; label: string; checked: boolean }[] = [];
$('input[name="interests"]').each((index, element) => {
const $checkbox = $(element);
const value = $checkbox.val() as string;
const isChecked = $checkbox.prop('checked') as boolean;
const label = $checkbox.next('label').text().trim();
interestOptions.push({
value,
label,
checked: isChecked
});
});
Step 5: Extracting Select Dropdown Values
Select elements need special handling for their options:
// Get selected option value
const selectedCountry = $('select[name="country"]').val();
console.log('Selected Country:', selectedCountry); // "us"
// Get selected option text
const selectedCountryText = $('select[name="country"] option:selected').text();
console.log('Selected Country Text:', selectedCountryText); // "United States"
// Get all select options
const countryOptions: { value: string; text: string; selected: boolean }[] = [];
$('select[name="country"] option').each((index, element) => {
const $option = $(element);
const value = $option.val() as string;
const text = $option.text().trim();
const selected = $option.prop('selected') as boolean;
countryOptions.push({ value, text, selected });
});
console.log('Country Options:', countryOptions);
Step 6: Handling Textarea Elements
A textarea holds its content as text rather than in a value attribute:
// Extract textarea content
const message = $('textarea[name="message"]').text();
console.log('Message:', message); // "Hello, I'm interested in your services..."
// Alternative method using val()
const messageAlt = $('textarea[name="message"]').val();
console.log('Message (alt):', messageAlt);
Step 7: Extracting Form Validation and Metadata
Extract the form's metadata and validation rules:
interface FormMetadata {
action: string;
method: string;
enctype?: string;
fieldCount: number;
requiredFields: string[];
}
function extractFormMetadata(formSelector: string): FormMetadata {
const $form = $(formSelector);
const action = $form.attr('action') || '';
const method = ($form.attr('method') || 'get').toLowerCase();
const enctype = $form.attr('enctype');
// Count form fields
const fieldCount = $form.find('input, select, textarea').length;
// Find required fields
const requiredFields: string[] = [];
$form.find('input[required], select[required], textarea[required]').each((index, element) => {
const name = $(element).attr('name');
if (name) requiredFields.push(name);
});
return {
action,
method,
enctype,
fieldCount,
requiredFields
};
}
const formMetadata = extractFormMetadata('#contact-form');
console.log('Form Metadata:', formMetadata);
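The extracted `action` is often a relative path such as `/submit`. To know where the form actually posts, it can be resolved against the URL of the page it was scraped from. A minimal sketch using the standard `URL` class follows; `resolveAction` is a hypothetical helper and `https://example.com/contact` is a placeholder page URL.

```typescript
// Resolve a form's action attribute against the URL of the page it came from.
// An empty action means the form submits to the page URL itself.
function resolveAction(action: string, pageUrl: string): string {
  return new URL(action || pageUrl, pageUrl).href;
}

console.log(resolveAction('/submit', 'https://example.com/contact'));
// https://example.com/submit
console.log(resolveAction('', 'https://example.com/contact'));
// https://example.com/contact
```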
Step 8: Comprehensive Form Data Serialization
Use Cheerio's built-in serialization methods:
// Using Cheerio's serialize method
const serializedData = $('#contact-form').serialize();
console.log('Serialized Form:', serializedData);
// Using serializeArray for structured data
const formArray = $('#contact-form').serializeArray();
console.log('Form Array:', formArray);
// Convert to object format
const formObject: Record<string, string | string[]> = {};
formArray.forEach(field => {
if (formObject[field.name]) {
// Handle multiple values (checkboxes)
if (Array.isArray(formObject[field.name])) {
(formObject[field.name] as string[]).push(field.value);
} else {
formObject[field.name] = [formObject[field.name] as string, field.value];
}
} else {
formObject[field.name] = field.value;
}
});
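A URL-encoded string like the one `serialize()` returns can be read back with the standard `URLSearchParams` class, which also copes with repeated names such as checkbox groups. The string below is a hand-written sample in that format, not captured output from Cheerio.

```typescript
// Hand-written sample in application/x-www-form-urlencoded format
const sampleSerialized =
  'fullName=Jane+Smith&email=jane%40company.com&interests=coding&interests=marketing';

const params = new URLSearchParams(sampleSerialized);

console.log(params.get('fullName'));     // Jane Smith
console.log(params.get('email'));        // jane@company.com
console.log(params.getAll('interests')); // [ 'coding', 'marketing' ]
```

`get()` returns only the first value for a name, so `getAll()` is the safer choice for fields that may repeat.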
Complete Working Example
Here is a comprehensive form data extraction solution:
import * as cheerio from 'cheerio';
interface FormData {
textFields: Record<string, string>;
radioButtons: Record<string, string>;
checkboxes: Record<string, string[]>;
selects: Record<string, { value: string; text: string }>;
textareas: Record<string, string>;
metadata: {
action: string;
method: string;
fieldCount: number;
requiredFields: string[];
};
}
const complexFormHtml = `
<form id="survey-form" action="/submit-survey" method="post">
<fieldset>
<legend>Personal Information</legend>
<input type="text" name="firstName" value="John" required>
<input type="text" name="lastName" value="Doe" required>
<input type="email" name="email" value="john.doe@email.com" required>
<input type="tel" name="phone" value="+1-555-123-4567">
</fieldset>
<fieldset>
<legend>Preferences</legend>
<div>
<p>Preferred Contact Method:</p>
<input type="radio" name="contactMethod" value="email" checked>
<label>Email</label>
<input type="radio" name="contactMethod" value="phone">
<label>Phone</label>
<input type="radio" name="contactMethod" value="mail">
<label>Mail</label>
</div>
<div>
<p>Interests (select all that apply):</p>
<input type="checkbox" name="interests" value="technology" checked>
<label>Technology</label>
<input type="checkbox" name="interests" value="sports">
<label>Sports</label>
<input type="checkbox" name="interests" value="music" checked>
<label>Music</label>
<input type="checkbox" name="interests" value="travel">
<label>Travel</label>
</div>
</fieldset>
<fieldset>
<legend>Location</legend>
<select name="country" required>
<option value="">Select Country</option>
<option value="us" selected>United States</option>
<option value="ca">Canada</option>
<option value="uk">United Kingdom</option>
<option value="au">Australia</option>
</select>
<select name="timezone">
<option value="est">Eastern Time</option>
<option value="cst" selected>Central Time</option>
<option value="mst">Mountain Time</option>
<option value="pst">Pacific Time</option>
</select>
</fieldset>
<fieldset>
<legend>Additional Information</legend>
<textarea name="comments" rows="4">Looking forward to hearing from you!</textarea>
<textarea name="suggestions" placeholder="Any suggestions?"></textarea>
</fieldset>
<button type="submit">Submit Survey</button>
<button type="reset">Reset Form</button>
</form>
`;
function extractCompleteFormData(html: string, formSelector: string): FormData {
const $ = cheerio.load(html);
const $form = $(formSelector);
// Extract text fields
const textFields: Record<string, string> = {};
$form.find('input[type="text"], input[type="email"], input[type="tel"], input[type="password"]').each((_, element) => {
const $input = $(element);
const name = $input.attr('name');
const value = $input.val() as string;
if (name) textFields[name] = value || '';
});
// Extract radio buttons
const radioButtons: Record<string, string> = {};
const radioGroups = new Set<string>();
$form.find('input[type="radio"]').each((_, element) => {
const name = $(element).attr('name');
if (name) radioGroups.add(name);
});
radioGroups.forEach(groupName => {
const selectedValue = $form.find(`input[name="${groupName}"]:checked`).val() as string;
if (selectedValue) radioButtons[groupName] = selectedValue;
});
// Extract checkboxes
const checkboxes: Record<string, string[]> = {};
const checkboxGroups = new Set<string>();
$form.find('input[type="checkbox"]').each((_, element) => {
const name = $(element).attr('name');
if (name) checkboxGroups.add(name);
});
checkboxGroups.forEach(groupName => {
const values: string[] = [];
$form.find(`input[name="${groupName}"]:checked`).each((_, element) => {
values.push($(element).val() as string);
});
checkboxes[groupName] = values;
});
// Extract selects
const selects: Record<string, { value: string; text: string }> = {};
$form.find('select').each((_, element) => {
const $select = $(element);
const name = $select.attr('name');
if (name) {
const $selectedOption = $select.find('option:selected');
selects[name] = {
value: $selectedOption.val() as string || '',
text: $selectedOption.text().trim()
};
}
});
// Extract textareas
const textareas: Record<string, string> = {};
$form.find('textarea').each((_, element) => {
const $textarea = $(element);
const name = $textarea.attr('name');
if (name) {
textareas[name] = $textarea.val() as string || '';
}
});
// Extract metadata
const metadata = {
action: $form.attr('action') || '',
method: ($form.attr('method') || 'get').toLowerCase(),
fieldCount: $form.find('input, select, textarea').length,
requiredFields: $form.find('[required]').map((_, el) => $(el).attr('name')).get().filter(Boolean)
};
return {
textFields,
radioButtons,
checkboxes,
selects,
textareas,
metadata
};
}
// Usage
const formData = extractCompleteFormData(complexFormHtml, '#survey-form');
console.log('=== EXTRACTED FORM DATA ===');
console.log('Text Fields:', formData.textFields);
console.log('Radio Buttons:', formData.radioButtons);
console.log('Checkboxes:', formData.checkboxes);
console.log('Select Fields:', formData.selects);
console.log('Textareas:', formData.textareas);
console.log('Metadata:', formData.metadata);
// Additional utility functions
function validateFormData(data: FormData): { isValid: boolean; errors: string[] } {
const errors: string[] = [];
// Check required fields
data.metadata.requiredFields.forEach(fieldName => {
if (!data.textFields[fieldName] &&
!data.radioButtons[fieldName] &&
!data.selects[fieldName]?.value &&
!data.textareas[fieldName] &&
!data.checkboxes[fieldName]?.length) {
errors.push(`Required field '${fieldName}' is missing`);
}
});
return {
isValid: errors.length === 0,
errors
};
}
const validation = validateFormData(formData);
console.log('Validation Result:', validation);
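This approach handles all the common form elements and provides structured data extraction with basic validation, which makes it a good fit for form analysis and data processing tasks.
Guide 3: How to Navigate and Modify DOM Elements with Cheerio
DOM navigation and modification are core parts of web scraping and HTML manipulation. This guide demonstrates how to use Cheerio's navigation methods to traverse the DOM tree, find related elements, and modify content.
Understanding the DOM Tree Structure
Before getting into navigation techniques, it is important to understand the hierarchical nature of HTML: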
<div class="container"> <!-- Parent -->
<header class="main-header"> <!-- First child -->
<h1>Page Title</h1> <!-- Grandchild -->
<nav>Navigation</nav> <!-- Sibling to h1 -->
</header>
<main class="content"> <!-- Next sibling to header -->
<article>Article content</article>
<aside>Sidebar</aside>
</main>
<footer>Footer content</footer> <!-- Last child -->
</div>
Step 1: Basic Element Selection and Navigation
Start with the basic navigation methods:
import * as cheerio from 'cheerio';
const html = `
<div class="blog-post" data-id="123">
<header class="post-header">
<h1 class="post-title">Understanding Web Scraping</h1>
<div class="post-meta">
<span class="author">John Smith</span>
<time class="published" datetime="2024-01-15">January 15, 2024</time>
<div class="tags">
<span class="tag">web-scraping</span>
<span class="tag">javascript</span>
<span class="tag">cheerio</span>
</div>
</div>
</header>
<div class="post-content">
<p class="intro">Web scraping is a powerful technique...</p>
<p>In this article, we'll explore...</p>
<blockquote>
<p>"Data is the new oil of the digital economy."</p>
<cite>— Industry Expert</cite>
</blockquote>
<p class="conclusion">To summarize...</p>
</div>
<footer class="post-footer">
<div class="social-share">
<button class="share-btn twitter" data-url="https://example.com/post/123">Twitter</button>
<button class="share-btn facebook" data-url="https://example.com/post/123">Facebook</button>
<button class="share-btn linkedin" data-url="https://example.com/post/123">LinkedIn</button>
</div>
<div class="post-actions">
<button class="like-btn" data-likes="42">❤️ 42</button>
<button class="comment-btn" data-comments="8">💬 8</button>
</div>
</footer>
</div>
`;
const $ = cheerio.load(html);
// Basic selection
const postTitle = $('.post-title').text();
console.log('Post Title:', postTitle);
// Navigate to parent
const postHeader = $('.post-title').parent();
console.log('Parent class:', postHeader.attr('class')); // "post-header"
// Navigate to closest ancestor with specific selector
const blogPostContainer = $('.post-title').closest('.blog-post');
console.log('Container data-id:', blogPostContainer.attr('data-id')); // "123"
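Where `parent()` moves up one level and `closest()` stops at the first match, `parents()` collects every matching ancestor. The sketch below uses a trimmed copy of the blog post markup and filters on `[class]` so that the `html` and `body` elements added by the parser are left out; `ancestorClasses` is an illustrative name.

```typescript
import * as cheerio from 'cheerio';

const $ = cheerio.load(`
  <div class="blog-post">
    <header class="post-header">
      <div class="post-meta">
        <span class="author">John Smith</span>
      </div>
    </header>
  </div>
`);

// Collect the class of every ancestor that has one, nearest ancestor first
const ancestorClasses: string[] = $('.author')
  .parents('[class]')
  .map((_, el) => $(el).attr('class') ?? '')
  .get();

console.log(ancestorClasses); // [ 'post-meta', 'post-header', 'blog-post' ]
```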
Step 2: Navigating Between Sibling Elements
Move between sibling elements efficiently:
// Get next sibling
const titleNextSibling = $('.post-title').next();
console.log('Next sibling class:', titleNextSibling.attr('class')); // "post-meta"
// Get all following siblings
const allFollowingSiblings = $('.post-header').nextAll();
console.log('Following siblings count:', allFollowingSiblings.length); // 2
// Get previous sibling
const metaPrevSibling = $('.post-meta').prev();
console.log('Previous sibling tag:', metaPrevSibling.get(0)?.tagName); // "h1"
// Get all preceding siblings
const allPrecedingSiblings = $('.post-footer').prevAll();
console.log('Preceding siblings count:', allPrecedingSiblings.length); // 2
// Get all siblings
const headerSiblings = $('.post-header').siblings();
headerSiblings.each((index, element) => {
console.log(`Sibling ${index + 1}:`, $(element).attr('class'));
});
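When only the siblings between two known elements are needed, `nextUntil()` can stop the traversal at a selector instead of collecting everything with `nextAll()`. The sketch below uses a simplified version of the post content in which each element has a class, added here purely so the results are easy to print.

```typescript
import * as cheerio from 'cheerio';

const $ = cheerio.load(`
  <div class="post-content">
    <p class="intro">Web scraping is a powerful technique...</p>
    <p class="body">In this article, we'll explore...</p>
    <blockquote class="quote">Data is the new oil.</blockquote>
    <p class="conclusion">To summarize...</p>
  </div>
`);

// Collect the siblings strictly between .intro and .conclusion
const between: string[] = $('.intro')
  .nextUntil('.conclusion')
  .map((_, el) => $(el).attr('class') ?? '')
  .get();

console.log(between); // [ 'body', 'quote' ]
```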
Step 3: Navigating Child and Descendant Elements
Work with children and deeper descendants:
// Get direct children
const postContentChildren = $('.post-content').children();
console.log('Direct children count:', postContentChildren.length);
// Get first child
const firstChild = $('.post-content').children().first();
console.log('First child class:', firstChild.attr('class')); // "intro"
// Get last child