Universal Web Scraping Library | Multi-Environment | Zero Dependencies
Understanding the loading process helps you integrate tOnline.js into any environment:
// Step 1: Create script element
const script = document.createElement('script');
script.src = 'https://lib.mediawrite.cloud/tOnline.js';
// Step 2: Define onload to know when ready
let resolveLoad;
const loadPromise = new Promise(r => {
resolveLoad = r;
});
script.onload = resolveLoad;
// Step 3: Add to page (triggers download)
document.head.appendChild(script);
// Step 4: Wait for completion
await loadPromise;
// Step 5: Now tOnline is available
const result = await tOnline.init(...);
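If you load the library in more than one place, the same five steps can be wrapped in a reusable helper. A minimal sketch based on the steps above (the onerror rejection is an addition for failed downloads):
async function loadTOnline(src = 'https://lib.mediawrite.cloud/tOnline.js') {
  // Reuse the instance if the library is already on the page
  if (window.tOnline && window.tOnline.init) return window.tOnline;
  const script = document.createElement('script');
  script.src = src;
  const loaded = new Promise((resolve, reject) => {
    script.onload = resolve;
    script.onerror = () => reject(new Error('tOnline failed to load'));
  });
  document.head.appendChild(script);
  await loaded;
  return window.tOnline;
}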
What you'll do: Load the library and scrape the current page for headings and paragraphs.
// Self-contained loader: fetch the library, evaluate it, wait for it to register
(async () => {
  // Reuse an existing instance if the library is already loaded
  if (window.tOnline && window.tOnline.init) return window.tOnline;
  const url = 'https://lib.mediawrite.cloud/tOnline.js';
  const txt = await fetch(url).then(r => r.text());
  (new Function(txt))();
  // Poll for up to 5 seconds until window.tOnline appears
  let i = 0;
  while (!(window.tOnline && window.tOnline.init) && i < 50) {
    await new Promise(r => setTimeout(r, 100));
    i++;
  }
  if (!(window.tOnline && window.tOnline.init)) throw new Error('tOnline load failed');
  // Scrape headings and paragraphs from the current page
  const r = await window.tOnline.init('init', 'my-page', 1, window.location.href,
    { rules: [{ selector: 'h1,p', atomize: 'block' }] }, {}, document);
  console.log(r.tContents);
})();
// Node.js: pass an HTML string as the input context
const tOnline = require('./tOnline.js');
const html = '<h1>My Page</h1><p>Content here</p>';
(async () => {
  const r = await tOnline.init('init', 'my-job-id', 1, 'https://example.com',
    { rules: [{ selector: 'h1,p', atomize: 'block' }] }, {}, html);
  console.log(r.tContents);
})();
Try scraping this HTML sample:
<h1>The Future of AI</h1>
<p>Artificial Intelligence is evolving rapidly.</p>
<p>Machine learning models are becoming more powerful.</p>
<a>Read more...</a>
with this configuration:
{
  "tO_ID": "demo-page",
  "rules": [
    {
      "name": "Headings",
      "selector": "h1",
      "atomize": "block"
    },
    {
      "name": "Paragraphs",
      "selector": "p",
      "atomize": "block"
    }
  ]
}
The tOnline.init() function takes 7
parameters. Here's what each one does:
Parameter 1: action
What it is: Tells the library what operation to perform.
Options:
'init' - Load the library and start scraping
'check' - Validate configuration without scraping
'compare' - Compare old vs new content
'update' - Update content in remote storage
Example: tOnline.init('init', ...)
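For example, a quick sketch of validating rules before a full run (this assumes 'check' takes the same seven parameters as 'init'; the response fields beyond status are not shown here):
// Validate the configuration without scraping ('check' action)
const validation = await tOnline.init('check', 'my-page', 1,
  window.location.href,
  { rules: [{ selector: 'h1,p', atomize: 'block' }] },
  {}, document);
console.log(validation.status); // 'success' or error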
Parameter 2: tID
What it is: Unique identifier for the page/document you're scraping.
Purpose: Used to track and deduplicate content in your database.
Naming Convention: Use descriptive names
// Good examples:
'news-cnn-homepage'
'ecommerce-amazon-laptop-page'
'github-repo-tOnline'
'twitter-feed-tech-news'
Parameter 3: tV
What it is: Version number of the scraping configuration/page.
Purpose: Track changes over time. When you update your scraping rules, increment this.
Example Flow: version 1 → 2 → 3 as your rules evolve.
Parameter 4: url
What it is: The source URL of the content you're scraping.
Purpose: Stored in the output for traceability and auditing.
// Examples:
'https://news.cnn.com/article/ai-breakthrough'
'https://amazon.com/dp/B123456789'
window.location.href // Current page in browser
Parameter 5: readingRules
What it is: Defines WHAT to extract and HOW to extract it. This is the most important parameter!
Key Properties:
{
  rules: [
    {
      selector: "h1, p",   // CSS selector
      atomize: "block",    // "block", "inline", or "word"
      name: "Paragraphs",  // Label (optional)
      capture: {           // Advanced extraction (optional)
        keyValue: [...],   // Extract key-value pairs
        attributes: {...}, // Get HTML attributes
        image: {...}       // Extract image URLs
      }
    }
  ]
}
Parameter 6: networkSettings
What it is: Configuration for network/API operations (usually optional).
Use Cases: Proxy settings, authentication, rate limiting
// Typical usage:
{} // Empty object for most cases
// With API credentials:
{
apiKey: 'your-api-key',
proxy: 'https://proxy.example.com',
timeout: 5000
}
Parameter 7: inputContext
What it is: The HTML content to scrape from.
Options:
document - Current browser page (Browser Console)
'<html>...' - HTML string (Node.js)
jsdom.window.document - JSDOM document object
// Browser
tOnline.init('init', 'page-id', 1, url, rules, {}, document);
// Node.js with HTML string
tOnline.init('init', 'page-id', 1, url, rules, {}, htmlString);
// Puppeteer
const page = await browser.newPage();
const html = await page.content();
tOnline.init('init', 'page-id', 1, url, rules, {}, html);
const result = await tOnline.init(
  // 1. Action: What operation to perform
  'init',
  // 2. Document ID: Unique identifier for this content
  'product-listing-page-jan-2026',
  // 3. Version: Track changes to your scraping rules
  1,
  // 4. URL: Where this content came from
  'https://store.example.com/products?category=laptops',
  // 5. Reading Rules: WHAT to extract
  {
    rules: [
      {
        name: "Product Cards",
        selector: ".product-card",
        atomize: "block",
        capture: {
          keyValue: [
            { key: "name", selector: ".product-name" },
            { key: "price", selector: ".price" }
          ]
        }
      }
    ]
  },
  // 6. Network Settings: Auth, proxy, etc.
  {},
  // 7. Input Context: HTML to scrape from
  document // or HTML string in Node.js
);
// Access the extracted content
console.log(result.tContents); // Array of extracted items
console.log(result.status); // 'success' or error
The most common scraping task - get readable text content.
<article>
<h1>Breaking News</h1>
<p>Major developments reported today.</p>
<p>Sources confirm the story.</p>
</article>
const result = await tOnline.init('init',
  'news-page',  // job ID
  1,            // version
  'https://news.example.com',
  {
    rules: [
      {
        name: "Headlines",
        selector: "h1",
        atomize: "block"
      },
      {
        name: "Body",
        selector: "p",
        atomize: "block"
      }
    ]
  },
  {},       // network settings
  document  // or HTML string
);
result.tContents = [
  {
    tO_Content: "Breaking News",
    tO_MetaInfo: { tagName: "H1" }
  },
  {
    tO_Content: "Major developments reported today.",
    tO_MetaInfo: { tagName: "P" }
  },
  ...
]
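To rebuild the readable text, join the tO_Content fields (using the result shape shown above):
// Concatenate the extracted blocks back into plain text
const articleText = result.tContents.map(c => c.tO_Content).join('\n');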
Extract complex objects like product info, profiles, etc.
<div class="product">
<span class="name">Laptop Pro</span>
<span class="price">$1,299</span>
<span class="rating">4.5 stars</span>
</div>
{
  rules: [
    {
      name: "Product",
      selector: ".product",
      atomize: "block",
      capture: {
        keyValue: [
          { key: "product_name", selector: ".name" },
          { key: "price", selector: ".price" },
          { key: "rating", selector: ".rating" }
        ]
      }
    }
  ]
}
{
  tO_Content: "Laptop Pro $1,299 4.5 stars",
  tO_MetaInfo: {
    tagName: "DIV",
    className: "product",
    keyValue: [
      { product_name: "Laptop Pro" },
      { price: "$1,299" },
      { rating: "4.5 stars" }
    ]
  }
}
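Because keyValue arrives as an array of single-key objects, it flattens into one plain object with a spread (a sketch assuming the output shape above):
// Merge [{product_name}, {price}, {rating}] into one object
const product = Object.assign({}, ...result.tContents[0].tO_MetaInfo.keyValue);
// → { product_name: "Laptop Pro", price: "$1,299", rating: "4.5 stars" }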
Automatically iterate over repeating elements.
<ul class="menu">
<li><a href="/home">Home</a></li>
<li><a href="/about">About</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
{
  rules: [
    {
      name: "Menu Items",
      selector: "li a",
      atomize: "block"
    }
  ]
}
[
{ tO_Content: "Home", tO_MetaInfo: { href: "/home" } },
{ tO_Content: "About", tO_MetaInfo: { href: "/about" } },
{ tO_Content: "Contact", tO_MetaInfo: { href: "/contact" } }
]
The tOSpider module automates the discovery of links and can perform deep scans of entire
sites.
const networkSettings = {
  networkIndex: "blog_index_2026",
  deepLinkScaning: true,        // Enable recursion
  deepLinkScaningLevel: 2,      // Scan 2 levels deep
  deepLinkSameDomain: true,     // Stay on the same domain
  deepLinkSameDomainTo: "URL",  // "URL" = auto-pin to current domain
  deepURLsFilters: {
    allowedExtensions: [".html", ".php"],
    disallowedExtensions: [".pdf", ".jpg"]
  }
};
// Trigger autonomous discovery
await tContent.spider.justDoIt(
  "discovery_job_01",
  1,
  window.location.href,
  networkSettings
);
The spider stores what it finds via tODataStore.upsert() for high-performance indexing.
With deepLinkSameDomainTo: "URL", the spider automatically locks itself to the starting domain. You can also hardcode it to "mysite.com" or a path like "mysite.com/blog" to restrict the crawl.
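For instance, a sketch pinning the crawl to a subsection (the host/path value is a placeholder):
// Same settings as above, but restricted to one path
const restrictedSettings = {
  ...networkSettings,
  deepLinkSameDomainTo: "mysite.com/blog" // placeholder host/path
};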
Automatically extract href, src, data-* attributes.
<div class="gallery">
<img src="/image1.jpg" alt="Photo 1" />
<img src="/image2.jpg" alt="Photo 2" />
<a href="https://link.com" class="cta">
Download
</a>
</div>
{
  rules: [
    {
      name: "Images",
      selector: "img",
      atomize: "block"
      // src & alt auto-captured in tO_MetaInfo
    },
    {
      name: "Links",
      selector: "a.cta",
      atomize: "block"
      // href auto-captured in tO_MetaInfo
    }
  ]
}
[
  {
    tO_Content: "",
    tO_MetaInfo: {
      tagName: "IMG",
      src: "/image1.jpg",
      alt: "Photo 1"
    }
  },
  {
    tO_Content: "Download",
    tO_MetaInfo: {
      tagName: "A",
      href: "https://link.com",
      className: "cta"
    }
  }
]
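The auto-captured attributes can then be read straight off tO_MetaInfo (using the output shape above):
// Collect every image URL from the results
const imageUrls = result.tContents
  .filter(item => item.tO_MetaInfo.tagName === 'IMG')
  .map(item => item.tO_MetaInfo.src);
// → ['/image1.jpg', '/image2.jpg']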
The library runs in any JavaScript environment:
Browser Console - Scrape any webpage directly from the browser console.
Node.js - Server-side scraping with file I/O and databases.
Puppeteer - Headless Chrome for JavaScript-heavy pages.
AWS Lambda - Serverless scraping on AWS.
Edge Workers - Edge-based scraping and processing.
REST API - Any HTTP client can invoke the library.
const tOnline = require('./tOnline.js');
const fs = require('fs');
const fetch = require('node-fetch');

(async () => {
  // 1. Fetch page
  const response = await fetch('https://example.com');
  const html = await response.text();

  // 2. Scrape
  const result = await tOnline.init('init',
    'my-page',
    1,
    'https://example.com',
    {
      rules: [
        { selector: 'h1', atomize: 'block' },
        { selector: 'p', atomize: 'block' }
      ]
    },
    {},
    html
  );

  // 3. Save to JSON
  fs.writeFileSync('output.json',
    JSON.stringify(result.tContents, null, 2)
  );
  console.log(`✓ Extracted ${result.tContents.length} items`);
})();
const puppeteer = require('puppeteer');
const tOnline = require('./tOnline.js');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // 1. Navigate to page
  await page.goto('https://example.com');
  // 2. Wait for dynamic content
  await page.waitForSelector('.dynamic-content');
  // 3. Get HTML
  const html = await page.content();

  // 4. Scrape with tOnline
  const result = await tOnline.init('init',
    'dynamic-page',
    1,
    'https://example.com',
    { rules: [{ selector: '.dynamic-content', atomize: 'block' }] },
    {},
    html
  );

  console.log(result.tContents);
  await browser.close();
})();
const tOnline = require('./tOnline.js');

exports.handler = async (event) => {
  try {
    const { url, html, config } = JSON.parse(event.body);
    const result = await tOnline.init('init',
      config.tO_ID || 'lambda-job',
      config.tO_V || 1,
      url,
      config,
      {},
      html
    );
    return {
      statusCode: 200,
      body: JSON.stringify({
        success: true,
        count: result.tContents.length,
        data: result.tContents
      })
    };
  } catch (error) {
    return {
      statusCode: 400,
      body: JSON.stringify({ error: error.message })
    };
  }
};
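A hypothetical client call for this handler (the endpoint URL is a placeholder; the body fields match what the handler destructures):
// POST a page's HTML plus a config to the deployed function
const res = await fetch('https://<api-id>.execute-api.us-east-1.amazonaws.com/scrape', {
  method: 'POST',
  body: JSON.stringify({
    url: 'https://example.com',
    html: '<h1>Hello</h1><p>World</p>',
    config: { tO_ID: 'lambda-job', tO_V: 1, rules: [{ selector: 'h1,p', atomize: 'block' }] }
  })
});
const { count, data } = await res.json();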
await tOnline.init(action, tID, tV, url, readingRules, networkSettings, inputContext)
| Parameter | Type | Description | Example |
|---|---|---|---|
| action | string | 'init' \| 'check' \| 'compare' \| 'update' | 'init' |
| tID | string | Job/document ID | 'article_123' |
| tV | number | Version number | 1 |
| url | string | Source URL | 'https://example.com' |
| readingRules | object | Selectors & parsing config | { rules: [...] } |
| networkSettings | object | Optional network config | {} |
| inputContext | string \| DOM | HTML string or DOM element | document or htmlString |
{
  status: 'success',
  action: 'init',
  tContents: [
    {
      tO_ID: 'article_123',
      tO_V: 1,
      tO_Time: 1672531200000,
      tO_URI: 'https://example.com',
      tO_Content: 'Extracted text...',
      tO_Title: 'Page Title',
      tO_MetaInfo: {
        tagName: 'P',
        className: 'content',
        href: 'https://link.com'
      },
      // ... plus all identity fields
      tO_id_hash_hex: 'a1b2c3...',
      tO_hash_hex: 'd4e5f6...',
      tO_cid_v0: 'QmY4RSx8...'
    }
  ]
}
// CSS selectors work just like in JavaScript's querySelectorAll()
// Single element
"selector": "h1"
// Multiple selectors
"selector": "h1, h2, h3"
// By class
"selector": ".article-body"
// By ID
"selector": "#main-content"
// Nested
"selector": ".product .price"
// Attributes
"selector": "[data-id]"
"selector": "a[href*='example.com']"
// Pseudo-selectors
"selector": "li:first-child"
"selector": "p:not(.hidden)"
// "block" - each matched element is separate "atom"
{
"selector": "p",
"atomize": "block"
}
// Result: Each <p> becomes one tContent
// "inline" - all text in element treated as one
{
"selector": "div",
"atomize": "inline"
}
// Result: Full div content = one tContent
// "sentence" - split by periods (requires Intl.Segmenter)
{
"selector": "article",
"atomize": "sentence"
}
// Result: Each sentence = one tContent
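For reference, this is roughly what sentence segmentation with Intl.Segmenter looks like (a sketch of the mechanism, not the library's internal code):
// Split an article's text into sentences, as "sentence" atomization implies
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
const text = document.querySelector('article').textContent;
const sentences = [...segmenter.segment(text)].map(s => s.segment.trim());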
<div class="products">
<div class="product">
<h3>Laptop</h3>
<span class="price">$999</span>
<span class="rating">⭐ 4.5</span>
<a href="/product/123">Details</a>
</div>
<div class="product">
<h3>Mouse</h3>
...
</div>
</div>
const config = {
  tO_ID: 'shop-products',
  rules: [
    {
      name: 'Product Card',
      selector: '.product',
      atomize: 'block',
      capture: {
        keyValue: [
          { key: 'name', selector: 'h3' },
          { key: 'price', selector: '.price' },
          { key: 'rating', selector: '.rating' },
          { key: 'url', attr: 'href' }
        ]
      }
    }
  ]
};
const result = await tOnline.init('init',
  config.tO_ID,
  1,
  'https://shop.example.com',
  config,
  {},
  document
);
async function scrapeMultiplePages(baseUrl, pages, config) {
  let allResults = [];
  for (let page = 1; page <= pages; page++) {
    const url = `${baseUrl}?page=${page}`;
    const response = await fetch(url);
    const html = await response.text();
    const result = await tOnline.init('init',
      config.tO_ID,
      config.tO_V,
      url,
      config,
      {},
      html
    );
    allResults = allResults.concat(result.tContents);
    console.log(`✓ Page ${page}: ${result.tContents.length} items`);
  }
  return allResults;
}
// Usage
const data = await scrapeMultiplePages(
  'https://shop.com/products',
  5,
  { tO_ID: 'products', rules: [...] }
);
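When looping over pages like this, a short pause between requests keeps the load on the target server reasonable (an optional addition inside the loop):
// Add at the end of each loop iteration, e.g. 500 ms between pages
await new Promise(r => setTimeout(r, 500));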
// Each tContent has built-in deduplication fields:
//
// tO_hash_hex - SHA-256 of content (for detecting duplicates)
// tO_sim_hash - SimHash (for finding similar content)
// tO_cid_v0 - IPFS address (for content-based lookup)
// Example: Find duplicate content
const results = [
{ tO_hash_hex: 'abc123', tO_Content: 'Article A' },
{ tO_hash_hex: 'abc123', tO_Content: 'Article A' },
{ tO_hash_hex: 'def456', tO_Content: 'Article B' }
];
const unique = results.filter((item, index, self) =>
index === self.findIndex(t => t.tO_hash_hex === item.tO_hash_hex)
);
console.log(`Unique: ${unique.length} (removed ${results.length - unique.length} duplicates)`);
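tO_sim_hash supports near-duplicate detection the same way; a sketch assuming the hashes are hex strings (the 3-bit threshold is illustrative):
// Hamming distance between two hex-encoded SimHash values
function simHashDistance(a, b) {
  let diff = BigInt('0x' + a) ^ BigInt('0x' + b);
  let bits = 0;
  while (diff) { bits += Number(diff & 1n); diff >>= 1n; }
  return bits;
}
// itemA and itemB are any two tContents entries
const nearDuplicate = simHashDistance(itemA.tO_sim_hash, itemB.tO_sim_hash) <= 3;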
Selector doesn't match anything
// ✗ Wrong
"selector": ".article-title"
// ✓ Right - use browser DevTools to find actual class
// Open F12, right-click element, inspect
// Copy the class/id from the HTML
Node.js needs JSDOM to parse HTML strings
// Install it:
npm install jsdom
// Or use the DOM object directly (not HTML string)
await tOnline.init(..., document);
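A minimal JSDOM sketch (the rules mirror the earlier Node.js example; passing dom.window.document relies on the DOM-object option in the parameter table):
const { JSDOM } = require('jsdom');
const tOnline = require('./tOnline.js');

const dom = new JSDOM('<h1>My Page</h1><p>Content here</p>');
const result = await tOnline.init('init', 'my-page', 1,
  'https://example.com',
  { rules: [{ selector: 'h1,p', atomize: 'block' }] },
  {}, dom.window.document);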
Use Puppeteer for dynamically rendered pages
// Don't use plain fetch for JS-heavy sites
// Use Puppeteer instead (see environment setup)
Tips for best results
// 1. Test selectors in browser console
document.querySelectorAll('your-selector').length
// 2. Check tO_MetaInfo for captured attributes
console.log(result.tContents[0].tO_MetaInfo)
// 3. Use tO_hash_hex to track document identity
// across systems and time