
Complete Guide to Batch Downloading Website Directory Files with wget


Batch Download Files from a Website Directory Using wget

The wget command is a powerful, non-interactive file downloader available on Linux, macOS, and Windows (via WSL or Cygwin). It supports recursive downloading, bandwidth limiting, and file type filtering. This guide explains how to use wget to batch download files from a specific directory on a web server.

Basic Batch Download Command

To download all files from http://demo.abc.com/path/to/file/ without creating subdirectories, use:

wget -nd -r -l1 --no-parent http://demo.abc.com/path/to/file/

Parameter breakdown:

  • -nd (--no-directories): Download all files to the current directory, ignoring the server's directory structure.
  • -r (--recursive): Enable recursive downloading.
  • -l1 (--level=1): Set recursion depth to 1, meaning only files in the specified directory are downloaded; subdirectories are not followed.
  • --no-parent: Prevent downloading files from parent directories.
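Before running the command for real, it can help to do a preview pass. The sketch below uses the article's placeholder URL and a hypothetical `downloads/` target directory; the wget lines are shown commented out so the sketch itself touches nothing on the network.

```shell
# Placeholder URL from the article; point this at a server you are permitted to mirror.
URL="http://demo.abc.com/path/to/file/"
DEST="downloads"

# Preview pass: --spider traverses the links without saving any files,
# so you can see what the recursive options would fetch.
# wget -nd -r -l1 --no-parent --spider "$URL"

# Real pass: -P writes the files into $DEST instead of the current directory.
# wget -nd -r -l1 --no-parent -P "$DEST" "$URL"
```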

Filtering by File Extension

To download only specific file types, such as .jpg and .bmp images, use the -A (--accept) option:

wget -nd -r -l1 --no-parent -A ".jpg,.bmp" http://demo.abc.com/path/to/file/

You can specify multiple suffixes or wildcard patterns in a single comma-separated list. To exclude file types instead, use -R (--reject), which takes the same list syntax.
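The accept and reject filters mirror each other. A short sketch of both, using the same placeholder URL (the wget lines are commented out so nothing is actually fetched):

```shell
URL="http://demo.abc.com/path/to/file/"

# Accept only image suffixes -- same effect as the -A ".jpg,.bmp" form above.
ACCEPT="jpg,bmp"
# wget -nd -r -l1 --no-parent -A "$ACCEPT" "$URL"

# Invert the filter: fetch everything except archive files.
REJECT="zip,tar.gz"
# wget -nd -r -l1 --no-parent -R "$REJECT" "$URL"
```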

Advanced Options and Exclusion Rules

This example downloads all content except .html files, enables link conversion, and ignores robots.txt (use with caution):

wget -c -r -np -k -L --reject=html http://demo.abc.com/path/to/file/ -e robots=off

Key parameters:

  • -c (--continue): Resume partial downloads.
  • -np: Same as --no-parent.
  • -k (--convert-links): Convert absolute links in downloaded HTML files to relative links for local viewing.
  • -L (--relative): Follow relative links only.
  • --reject=html: Skip files with the .html suffix. Note that during recursive retrieval wget may still fetch matching HTML pages temporarily to extract their links, then delete them afterward.
  • -e robots=off: Ignore the site's robots.txt file. Respect website policies and use this only when you have permission.
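A gentler variant of the command above adds a transfer-rate cap and a pause between requests to reduce server load. The values here are illustrative, not recommendations from the article; the wget line is commented out so the sketch runs without network access.

```shell
URL="http://demo.abc.com/path/to/file/"
RATE="200k"   # cap transfer speed (matches the 200k example later in the article)
WAIT=1        # seconds to pause between successive retrievals

# wget -c -r -np -k -L --reject=html \
#      --limit-rate="$RATE" --wait="$WAIT" \
#      -e robots=off "$URL"
```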

Important Considerations

  1. Legal and Ethical Use: Ensure you have the right to download and use the files. Always check the website's robots.txt and terms of service.
  2. URL Rewriting (Pretty URLs): If the site uses URL rewriting, wget might not correctly map URLs to file paths. Use --spider to test link validity first.
  3. Server Load: Batch downloads can strain the target server. Limit download speed with --limit-rate=200k and avoid excessive concurrent connections.

By combining wget options, you can handle various batch download scenarios efficiently. Always test your command with --spider first (wget has no --dry-run flag) to verify the expected behavior before writing anything to disk.
