Universal Web Scraping Library | Multi-Environment | Zero Dependencies
Understanding the loading process helps you integrate tOnline.js into any environment:
// Step 1: Create script element
const script = document.createElement('script');
script.src = 'https://lib.mediawrite.cloud/tOnline.js';
// Step 2: Define onload to know when ready
let resolveLoad;
const loadPromise = new Promise(r => {
resolveLoad = r;
});
script.onload = resolveLoad;
// Step 3: Add to page (triggers download)
document.head.appendChild(script);
// Step 4: Wait for completion
await loadPromise;
// Step 5: Now tOnline is available
const result = await tOnline.init(...);
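If you load the library in more than one place, the same five steps can be wrapped in a reusable helper. A minimal sketch based on the steps above (the onerror rejection is an addition for failed downloads):
async function loadTOnline(src = 'https://lib.mediawrite.cloud/tOnline.js') {
  // Reuse the instance if the library is already on the page
  if (window.tOnline && window.tOnline.init) return window.tOnline;
  const script = document.createElement('script');
  script.src = src;
  const loaded = new Promise((resolve, reject) => {
    script.onload = resolve;
    script.onerror = () => reject(new Error('tOnline failed to load'));
  });
  document.head.appendChild(script);
  await loaded;
  return window.tOnline;
}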
What you'll do: Load the library and scrape the current page for headings and paragraphs.
// Self-contained loader: fetch the library, evaluate it, wait for it to register
(async () => {
  // Reuse an existing instance if the library is already loaded
  if (window.tOnline && window.tOnline.init) return window.tOnline;
  const url = 'https://lib.mediawrite.cloud/tOnline.js';
  const txt = await fetch(url).then(r => r.text());
  (new Function(txt))();
  // Poll for up to 5 seconds until window.tOnline appears
  let i = 0;
  while (!(window.tOnline && window.tOnline.init) && i < 50) {
    await new Promise(r => setTimeout(r, 100));
    i++;
  }
  if (!(window.tOnline && window.tOnline.init)) throw new Error('tOnline load failed');
  // Scrape headings and paragraphs from the current page
  const r = await window.tOnline.init('init', 'my-page', 1, window.location.href,
    { rules: [{ selector: 'h1,p', atomize: 'block' }] }, {}, document);
  console.log(r.tContents);
})();
// Node.js: pass an HTML string as the input context
const tOnline = require('./tOnline.js');
const html = '<h1>My Page</h1><p>Content here</p>';
(async () => {
  const r = await tOnline.init('init', 'my-job-id', 1, 'https://example.com',
    { rules: [{ selector: 'h1,p', atomize: 'block' }] }, {}, html);
  console.log(r.tContents);
})();
Try scraping this HTML sample:
<h1>The Future of AI</h1>
<p>Artificial Intelligence is evolving rapidly.</p>
<p>Machine learning models are becoming more powerful.</p>
<a>Read more...</a>
with this configuration:
{
  "tO_ID": "demo-page",
  "rules": [
    {
      "name": "Headings",
      "selector": "h1",
      "atomize": "block"
    },
    {
      "name": "Paragraphs",
      "selector": "p",
      "atomize": "block"
    }
  ]
}
The tOnline.init() function takes 7
parameters. Here's what each one does:
Parameter 1: action
What it is: Tells the library what operation to perform.
Options:
'init' - Load the library and start scraping
'check' - Validate configuration without scraping
'compare' - Compare old vs new content
'update' - Update content in remote storage
Example: tOnline.init('init', ...)
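For example, a quick sketch of validating rules before a full run (this assumes 'check' takes the same seven parameters as 'init'; the response fields beyond status are not shown here):
// Validate the configuration without scraping ('check' action)
const validation = await tOnline.init('check', 'my-page', 1,
  window.location.href,
  { rules: [{ selector: 'h1,p', atomize: 'block' }] },
  {}, document);
console.log(validation.status); // 'success' or error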
Parameter 2: tID
What it is: Unique identifier for the page/document you're scraping.
Purpose: Used to track and deduplicate content in your database.
Naming Convention: Use descriptive names
// Good examples:
'news-cnn-homepage'
'ecommerce-amazon-laptop-page'
'github-repo-tOnline'
'twitter-feed-tech-news'
Parameter 3: tV
What it is: Version number of the scraping configuration/page.
Purpose: Track changes over time. When you update your scraping rules, increment this.
Example Flow: version 1 → 2 → 3 as your rules evolve.
Parameter 4: url
What it is: The source URL of the content you're scraping.
Purpose: Stored in the output for traceability and auditing.
// Examples:
'https://news.cnn.com/article/ai-breakthrough'
'https://amazon.com/dp/B123456789'
window.location.href // Current page in browser
Parameter 5: readingRules
What it is: Defines WHAT to extract and HOW to extract it. This is the most important parameter!
Key Properties:
{
  rules: [
    {
      selector: "h1, p",   // CSS selector
      atomize: "block",    // "block", "inline", or "word"
      name: "Paragraphs",  // Label (optional)
      capture: {           // Advanced extraction (optional)
        keyValue: [...],   // Extract key-value pairs
        attributes: {...}, // Get HTML attributes
        image: {...}       // Extract image URLs
      }
    }
  ]
}
Parameter 6: networkSettings
What it is: Configuration for network/API operations (usually optional).
Use Cases: Proxy settings, authentication, rate limiting
// Typical usage:
{} // Empty object for most cases
// With API credentials:
{
apiKey: 'your-api-key',
proxy: 'https://proxy.example.com',
timeout: 5000
}
Parameter 7: inputContext
What it is: The HTML content to scrape from.
Options:
document - Current browser page (Browser Console)
'<html>...' - HTML string (Node.js)
jsdom.window.document - JSDOM document object
// Browser
tOnline.init('init', 'page-id', 1, url, rules, {}, document);
// Node.js with HTML string
tOnline.init('init', 'page-id', 1, url, rules, {}, htmlString);
// Puppeteer
const page = await browser.newPage();
const html = await page.content();
tOnline.init('init', 'page-id', 1, url, rules, {}, html);
const result = await tOnline.init(
  // 1. Action: What operation to perform
  'init',
  // 2. Document ID: Unique identifier for this content
  'product-listing-page-jan-2026',
  // 3. Version: Track changes to your scraping rules
  1,
  // 4. URL: Where this content came from
  'https://store.example.com/products?category=laptops',
  // 5. Reading Rules: WHAT to extract
  {
    rules: [
      {
        name: "Product Cards",
        selector: ".product-card",
        atomize: "block",
        capture: {
          keyValue: [
            { key: "name", selector: ".product-name" },
            { key: "price", selector: ".price" }
          ]
        }
      }
    ]
  },
  // 6. Network Settings: Auth, proxy, etc.
  {},
  // 7. Input Context: HTML to scrape from
  document // or HTML string in Node.js
);
// Access the extracted content
console.log(result.tContents); // Array of extracted items
console.log(result.status); // 'success' or error
The most common scraping task - get readable text content.
<article>
<h1>Breaking News</h1>
<p>Major developments reported today.</p>
<p>Sources confirm the story.</p>
</article>
const result = await tOnline.init('init',
  'news-page',  // job ID
  1,            // version
  'https://news.example.com',
  {
    rules: [
      {
        name: "Headlines",
        selector: "h1",
        atomize: "block"
      },
      {
        name: "Body",
        selector: "p",
        atomize: "block"
      }
    ]
  },
  {},       // network settings
  document  // or HTML string
);
result.tContents = [
  {
    tO_Content: "Breaking News",
    tO_MetaInfo: { tagName: "H1" }
  },
  {
    tO_Content: "Major developments reported today.",
    tO_MetaInfo: { tagName: "P" }
  },
  ...
]
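To rebuild the readable text, join the tO_Content fields (using the result shape shown above):
// Concatenate the extracted blocks back into plain text
const articleText = result.tContents.map(c => c.tO_Content).join('\n');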
Extract complex objects like product info, profiles, etc.
<div class="product">
<span class="name">Laptop Pro</span>
<span class="price">$1,299</span>
<span class="rating">4.5 stars</span>
</div>
{
  rules: [
    {
      name: "Product",
      selector: ".product",
      atomize: "block",
      capture: {
        keyValue: [
          { key: "product_name", selector: ".name" },
          { key: "price", selector: ".price" },
          { key: "rating", selector: ".rating" }
        ]
      }
    }
  ]
}
{
  tO_Content: "Laptop Pro $1,299 4.5 stars",
  tO_MetaInfo: {
    tagName: "DIV",
    className: "product",
    keyValue: [
      { product_name: "Laptop Pro" },
      { price: "$1,299" },
      { rating: "4.5 stars" }
    ]
  }
}
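Because keyValue arrives as an array of single-key objects, it flattens into one plain object with a spread (a sketch assuming the output shape above):
// Merge [{product_name}, {price}, {rating}] into one object
const product = Object.assign({}, ...result.tContents[0].tO_MetaInfo.keyValue);
// → { product_name: "Laptop Pro", price: "$1,299", rating: "4.5 stars" }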
Automatically iterate over repeating elements.
<ul class="menu">
<li><a href="/home">Home</a></li>
<li><a href="/about">About</a></li>
<li><a href="/contact">Contact</a></li>
</ul>
{
  rules: [
    {
      name: "Menu Items",
      selector: "li a",
      atomize: "block"
    }
  ]
}
[
{ tO_Content: "Home", tO_MetaInfo: { href: "/home" } },
{ tO_Content: "About", tO_MetaInfo: { href: "/about" } },
{ tO_Content: "Contact", tO_MetaInfo: { href: "/contact" } }
]
The tOSpider module automates the discovery of links and can perform deep scans of entire
sites.
const networkSettings = {
  networkIndex: "blog_index_2026",
  deepLinkScaning: true,        // Enable recursion
  deepLinkScaningLevel: 2,      // Scan 2 levels deep
  deepLinkSameDomain: true,     // Stay on the same domain
  deepLinkSameDomainTo: "URL",  // "URL" = auto-pin to current domain
  deepURLsFilters: {
    allowedExtensions: [".html", ".php"],
    disallowedExtensions: [".pdf", ".jpg"]
  }
};
// Trigger autonomous discovery
await tContent.spider.justDoIt(
  "discovery_job_01",
  1,
  window.location.href,
  networkSettings
);
The spider stores what it finds via tODataStore.upsert() for high-performance indexing.
With deepLinkSameDomainTo: "URL", the spider automatically locks itself to the starting domain. You can also hardcode it to "mysite.com" or a path like "mysite.com/blog" to restrict the crawl.
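For instance, a sketch pinning the crawl to a subsection (the host/path value is a placeholder):
// Same settings as above, but restricted to one path
const restrictedSettings = {
  ...networkSettings,
  deepLinkSameDomainTo: "mysite.com/blog" // placeholder host/path
};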
Automatically extract href, src, data-* attributes.
<div class="gallery">
<img src="/image1.jpg" alt="Photo 1" />
<img src="/image2.jpg" alt="Photo 2" />
<a href="https://link.com" class="cta">
Download
</a>
</div>
{
  rules: [
    {
      name: "Images",
      selector: "img",
      atomize: "block"
      // src & alt auto-captured in tO_MetaInfo
    },
    {
      name: "Links",
      selector: "a.cta",
      atomize: "block"
      // href auto-captured in tO_MetaInfo
    }
  ]
}
[
  {
    tO_Content: "",
    tO_MetaInfo: {
      tagName: "IMG",
      src: "/image1.jpg",
      alt: "Photo 1"
    }
  },
  {
    tO_Content: "Download",
    tO_MetaInfo: {
      tagName: "A",
      href: "https://link.com",
      className: "cta"
    }
  }
]
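The auto-captured attributes can then be read straight off tO_MetaInfo (using the output shape above):
// Collect every image URL from the results
const imageUrls = result.tContents
  .filter(item => item.tO_MetaInfo.tagName === 'IMG')
  .map(item => item.tO_MetaInfo.src);
// → ['/image1.jpg', '/image2.jpg']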
The library runs in any JavaScript environment:
Browser Console - Scrape any webpage directly from the browser console.
Node.js - Server-side scraping with file I/O and databases.
Puppeteer - Headless Chrome for JavaScript-heavy pages.
AWS Lambda - Serverless scraping on AWS.
Edge Workers - Edge-based scraping and processing.
REST API - Any HTTP client can invoke the library.
const tOnline = require('./tOnline.js');
const fs = require('fs');
const fetch = require('node-fetch');

(async () => {
  // 1. Fetch page
  const response = await fetch('https://example.com');
  const html = await response.text();

  // 2. Scrape
  const result = await tOnline.init('init',
    'my-page',
    1,
    'https://example.com',
    {
      rules: [
        { selector: 'h1', atomize: 'block' },
        { selector: 'p', atomize: 'block' }
      ]
    },
    {},
    html
  );

  // 3. Save to JSON
  fs.writeFileSync('output.json',
    JSON.stringify(result.tContents, null, 2)
  );
  console.log(`✓ Extracted ${result.tContents.length} items`);
})();
const puppeteer = require('puppeteer');
const tOnline = require('./tOnline.js');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // 1. Navigate to page
  await page.goto('https://example.com');
  // 2. Wait for dynamic content
  await page.waitForSelector('.dynamic-content');
  // 3. Get HTML
  const html = await page.content();

  // 4. Scrape with tOnline
  const result = await tOnline.init('init',
    'dynamic-page',
    1,
    'https://example.com',
    { rules: [{ selector: '.dynamic-content', atomize: 'block' }] },
    {},
    html
  );

  console.log(result.tContents);
  await browser.close();
})();
const tOnline = require('./tOnline.js');

exports.handler = async (event) => {
  try {
    const { url, html, config } = JSON.parse(event.body);
    const result = await tOnline.init('init',
      config.tO_ID || 'lambda-job',
      config.tO_V || 1,
      url,
      config,
      {},
      html
    );
    return {
      statusCode: 200,
      body: JSON.stringify({
        success: true,
        count: result.tContents.length,
        data: result.tContents
      })
    };
  } catch (error) {
    return {
      statusCode: 400,
      body: JSON.stringify({ error: error.message })
    };
  }
};
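A hypothetical client call for this handler (the endpoint URL is a placeholder; the body fields match what the handler destructures):
// POST a page's HTML plus a config to the deployed function
const res = await fetch('https://<api-id>.execute-api.us-east-1.amazonaws.com/scrape', {
  method: 'POST',
  body: JSON.stringify({
    url: 'https://example.com',
    html: '<h1>Hello</h1><p>World</p>',
    config: { tO_ID: 'lambda-job', tO_V: 1, rules: [{ selector: 'h1,p', atomize: 'block' }] }
  })
});
const { count, data } = await res.json();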
await tOnline.init(action, tID, tV, url, readingRules, networkSettings, inputContext)
| Parameter | Type | Description | Example |
|---|---|---|---|
| action | string | 'init' \| 'check' \| 'compare' \| 'update' | 'init' |
| tID | string | Job/document ID | 'article_123' |
| tV | number | Version number | 1 |
| url | string | Source URL | 'https://example.com' |
| readingRules | object | Selectors & parsing config | { rules: [...] } |
| networkSettings | object | Optional network config | {} |
| inputContext | string \| DOM | HTML string or DOM element | document or htmlString |
{
  status: 'success',
  action: 'init',
  tContents: [
    {
      tO_ID: 'article_123',
      tO_V: 1,
      tO_Time: 1672531200000,
      tO_URI: 'https://example.com',
      tO_Content: 'Extracted text...',
      tO_Title: 'Page Title',
      tO_MetaInfo: {
        tagName: 'P',
        className: 'content',
        href: 'https://link.com'
      },
      // ... plus all identity fields
      tO_id_hash_hex: 'a1b2c3...',
      tO_hash_hex: 'd4e5f6...',
      tO_cid_v0: 'QmY4RSx8...'
    }
  ]
}
// CSS selectors work just like in JavaScript's querySelectorAll()
// Single element
"selector": "h1"
// Multiple selectors
"selector": "h1, h2, h3"
// By class
"selector": ".article-body"
// By ID
"selector": "#main-content"
// Nested
"selector": ".product .price"
// Attributes
"selector": "[data-id]"
"selector": "a[href*='example.com']"
// Pseudo-selectors
"selector": "li:first-child"
"selector": "p:not(.hidden)"
// "block" - each matched element is separate "atom"
{
"selector": "p",
"atomize": "block"
}
// Result: Each <p> becomes one tContent
// "inline" - all text in element treated as one
{
"selector": "div",
"atomize": "inline"
}
// Result: Full div content = one tContent
// "sentence" - split by periods (requires Intl.Segmenter)
{
"selector": "article",
"atomize": "sentence"
}
// Result: Each sentence = one tContent
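For reference, this is roughly what sentence segmentation with Intl.Segmenter looks like (a sketch of the mechanism, not the library's internal code):
// Split an article's text into sentences, as "sentence" atomization implies
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
const text = document.querySelector('article').textContent;
const sentences = [...segmenter.segment(text)].map(s => s.segment.trim());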
<div class="products">
<div class="product">
<h3>Laptop</h3>
<span class="price">$999</span>
<span class="rating">⭐ 4.5</span>
<a href="/product/123">Details</a>
</div>
<div class="product">
<h3>Mouse</h3>
...
</div>
</div>
const config = {
  tO_ID: 'shop-products',
  rules: [
    {
      name: 'Product Card',
      selector: '.product',
      atomize: 'block',
      capture: {
        keyValue: [
          { key: 'name', selector: 'h3' },
          { key: 'price', selector: '.price' },
          { key: 'rating', selector: '.rating' },
          { key: 'url', attr: 'href' }
        ]
      }
    }
  ]
};
const result = await tOnline.init('init',
  config.tO_ID,
  1,
  'https://shop.example.com',
  config,
  {},
  document
);
async function scrapeMultiplePages(baseUrl, pages, config) {
  let allResults = [];
  for (let page = 1; page <= pages; page++) {
    const url = `${baseUrl}?page=${page}`;
    const response = await fetch(url);
    const html = await response.text();
    const result = await tOnline.init('init',
      config.tO_ID,
      config.tO_V,
      url,
      config,
      {},
      html
    );
    allResults = allResults.concat(result.tContents);
    console.log(`✓ Page ${page}: ${result.tContents.length} items`);
  }
  return allResults;
}
// Usage
const data = await scrapeMultiplePages(
  'https://shop.com/products',
  5,
  { tO_ID: 'products', rules: [...] }
);
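When looping over pages like this, a short pause between requests keeps the load on the target server reasonable (an optional addition inside the loop):
// Add at the end of each loop iteration, e.g. 500 ms between pages
await new Promise(r => setTimeout(r, 500));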
// Each tContent has built-in deduplication fields:
//
// tO_hash_hex - SHA-256 of content (for detecting duplicates)
// tO_sim_hash - SimHash (for finding similar content)
// tO_cid_v0 - IPFS address (for content-based lookup)
// Example: Find duplicate content
const results = [
{ tO_hash_hex: 'abc123', tO_Content: 'Article A' },
{ tO_hash_hex: 'abc123', tO_Content: 'Article A' },
{ tO_hash_hex: 'def456', tO_Content: 'Article B' }
];
const unique = results.filter((item, index, self) =>
index === self.findIndex(t => t.tO_hash_hex === item.tO_hash_hex)
);
console.log(`Unique: ${unique.length} (removed ${results.length - unique.length} duplicates)`);
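tO_sim_hash supports near-duplicate detection the same way; a sketch assuming the hashes are hex strings (the 3-bit threshold is illustrative):
// Hamming distance between two hex-encoded SimHash values
function simHashDistance(a, b) {
  let diff = BigInt('0x' + a) ^ BigInt('0x' + b);
  let bits = 0;
  while (diff) { bits += Number(diff & 1n); diff >>= 1n; }
  return bits;
}
// itemA and itemB are any two tContents entries
const nearDuplicate = simHashDistance(itemA.tO_sim_hash, itemB.tO_sim_hash) <= 3;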
Selector doesn't match anything
// ✗ Wrong
"selector": ".article-title"
// ✓ Right - use browser DevTools to find actual class
// Open F12, right-click element, inspect
// Copy the class/id from the HTML
Node.js needs JSDOM to parse HTML strings
// Install it:
npm install jsdom
// Or use the DOM object directly (not HTML string)
await tOnline.init(..., document);
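A minimal JSDOM sketch (the rules mirror the earlier Node.js example; passing dom.window.document relies on the DOM-object option in the parameter table):
const { JSDOM } = require('jsdom');
const tOnline = require('./tOnline.js');

const dom = new JSDOM('<h1>My Page</h1><p>Content here</p>');
const result = await tOnline.init('init', 'my-page', 1,
  'https://example.com',
  { rules: [{ selector: 'h1,p', atomize: 'block' }] },
  {}, dom.window.document);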
Use Puppeteer for dynamically rendered pages
// Don't use plain fetch for JS-heavy sites
// Use Puppeteer instead (see environment setup)
Tips for best results
// 1. Test selectors in browser console
document.querySelectorAll('your-selector').length
// 2. Check tO_MetaInfo for captured attributes
console.log(result.tContents[0].tO_MetaInfo)
// 3. Use tO_hash_hex to track document identity
// across systems and time