conda-index - Man Page

Name

conda-index — conda-index

conda index, formerly part of conda-build. Create repodata.json for collections of conda packages.

The conda_index command operates on a channel directory. A channel directory contains a noarch subdirectory at a minimum and will almost always contain other subdirectories named for conda's supported platforms (linux-64, win-64, osx-64, etc.). A channel directory cannot have the same name as a supported platform. Place each package into the platform subdirectory it was built for. conda-index extracts metadata from these packages to generate index.html, repodata.json, etc., with summaries of the packages' metadata. conda then uses that metadata to solve dependencies before doing an install.
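The layout rules above can be checked before indexing. The following is a minimal sketch, not part of conda-index: check_channel_layout and KNOWN_SUBDIRS are hypothetical names, and the platform list is only an illustrative subset.

```python
from pathlib import Path
import tempfile

# Illustrative subset of conda's platform subdir names (assumption: not the full list).
KNOWN_SUBDIRS = {"noarch", "linux-64", "linux-aarch64", "osx-64", "osx-arm64", "win-64"}


def check_channel_layout(channel_dir):
    """Return a list of layout problems for a channel directory (hypothetical helper)."""
    channel_dir = Path(channel_dir)
    problems = []
    # A channel directory cannot share its name with a platform subdir.
    if channel_dir.name in KNOWN_SUBDIRS:
        problems.append(f"channel directory is named like a platform: {channel_dir.name}")
    # noarch is the one subdirectory that must always exist.
    if not (channel_dir / "noarch").is_dir():
        problems.append("missing required 'noarch' subdirectory")
    return problems


with tempfile.TemporaryDirectory() as tmp:
    channel = Path(tmp) / "my-channel"
    (channel / "linux-64").mkdir(parents=True)
    before = check_channel_layout(channel)  # noarch has not been created yet
    (channel / "noarch").mkdir()
    after = check_channel_layout(channel)   # layout now satisfies both rules

print(before)
print(after)
```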

By default, the metadata is output to the same directory tree as the channel directory, but it can be output to a separate tree with the --output <output> parameter. The metadata cache is always placed with the packages, in .cache folders under each platform subdirectory.

After conda-index has finished, its output can be used directly as a channel, for example with conda install -c file:///path/to/output ..., or, more typically, placed on a web server.
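A file:// channel URL can be derived from the output directory with the standard library. This is a small sketch; channel_url is a hypothetical helper name, and /path/to/output is the placeholder path used above.

```python
from pathlib import Path


def channel_url(output_dir):
    """Return a file:// URL for a local channel directory (hypothetical helper)."""
    # as_uri() only accepts absolute paths, so make the path absolute first.
    return Path(output_dir).absolute().as_uri()


url = channel_url("/path/to/output")
print(url)
```

The resulting string is what would be passed to conda install -c.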

Run Normally

python -m conda_index <path to channel directory>

Note: running conda index (instead of python -m conda_index) may find the legacy conda-build index command.

Run for Debugging

python -m conda_index --verbose --threads=1 <path to channel directory>
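When conda-index is driven from a script, the same command lines can be assembled as an argument list. The sketch below only builds the list and uses flags documented on this page; conda_index_command is a hypothetical helper, and actually running the command requires conda-index to be installed.

```python
import sys


def conda_index_command(channel_dir, *, verbose=False, threads=None, output=None):
    """Build an argument list for python -m conda_index (hypothetical helper)."""
    # sys.executable keeps the call tied to the current interpreter, which
    # avoids picking up the legacy "conda index" command by accident.
    cmd = [sys.executable, "-m", "conda_index"]
    if verbose:
        cmd.append("--verbose")
    if threads is not None:
        cmd.append(f"--threads={threads}")
    if output is not None:
        cmd += ["--output", str(output)]
    cmd.append(str(channel_dir))
    return cmd


debug_cmd = conda_index_command("/path/to/channel", verbose=True, threads=1)
print(debug_cmd[1:])

# To run it (only works where conda-index is installed):
#   import subprocess
#   subprocess.run(debug_cmd, check=True)
```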

Contributing

conda create -n conda-index "python >=3.9" conda conda-build "pip >=22"

git clone https://github.com/conda/conda-index.git
pip install -e conda-index[test]

cd conda-index
pytest

Summary of Changes from the Previous Conda-Build Index Version

Parallelism

This version of conda-index continues indexing packages from other subdirs while the main thread is writing a repodata.json.

All current_repodata.json files are generated in parallel. This may use a lot of RAM if repodata.json has tens of thousands of entries.

Command-line interface

python -m conda_index

python -m conda_index [OPTIONS] DIR

Options

--output <output>

Output repodata to given directory.

--subdir <subdir>

Subdir to index. Accepts multiple.

-n,  --channel-name <channel_name>

Customize the channel name listed in each channel's index.html.

--patch-generator <patch_generator>

Path to a Python file that outputs metadata patch instructions from its _patch_repodata function, or to a .tar.bz2/.conda file that contains a patch_instructions.json file for each subdir.

--channeldata,  --no-channeldata

Generate channeldata.json.

Default

False

--rss,  --no-rss

Write rss.xml (Only if --channeldata is enabled).

Default

True

--bz2,  --no-bz2

Write repodata.json.bz2.

Default

False

--zst,  --no-zst

Write repodata.json.zst.

Default

False

--run-exports,  --no-run-exports

Write run_exports.json.

Default

False

--compact,  --no-compact

Output JSON as one line, or pretty-printed.

Default

True

-m,  --current-index-versions-file <current_index_versions_file>

YAML file containing name of package as key, and list of versions as values.  The current_repodata.json will contain the newest from this series of versions.  For example:

python:
  - 3.8
  - 3.9

will keep python 3.8.X and 3.9.Y in the current_repodata.json, instead of only the very latest python version.
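The "newest from each series" rule can be illustrated as follows. This is a sketch of the idea only, not conda-index's implementation: it assumes plain dotted numeric versions and simple prefix matching, whereas conda's own version matching is richer. The function names and the version list are made up for the example.

```python
def version_key(version):
    """Sort key for simple dotted numeric versions such as '3.9.18' (assumption)."""
    return tuple(int(part) for part in version.split("."))


def newest_per_series(available, series):
    """Pick the newest available version within each requested series."""
    result = {}
    for prefix in series:
        # "3.8" matches "3.8" itself and "3.8.x", but not "3.80".
        matches = [v for v in available if v == prefix or v.startswith(prefix + ".")]
        if matches:
            result[prefix] = max(matches, key=version_key)
    return result


available = ["3.8.10", "3.8.18", "3.9.7", "3.9.18", "3.10.4"]
kept = newest_per_series(available, ["3.8", "3.9"])
print(kept)
```

With this input, one 3.8 build and one 3.9 build survive, while 3.10.4 is left out because no 3.10 series was requested.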

--threads <threads>

Default

48

--verbose

Enable debug logging.

Arguments

DIR

Required argument

conda_index

conda_index.index

This module provides the main entry point to create indexes from collections of conda packages.

conda_index.index.update_index(dir_path, output_dir=None, check_md5=False, channel_name=None, patch_generator=None, threads: int | None = 48, verbose=False, progress=False, subdirs=None, warn=True, current_index_versions=None, debug=False, write_bz2=True, write_zst=False, write_run_exports=False)

High-level interface to ChannelIndex. Index all subdirs under dir_path. Output to output_dir, or under the input directory if output_dir is not given. Writes updated channeldata.json.

The input dir_path should at least contain a directory named noarch. The path tree therein is treated as a full channel, with a level of subdirs, each subdir having an update to repodata.json. The full channel will also have a channeldata.json file.
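A call to update_index might look like the sketch below. It uses only keyword arguments from the signature shown above; index_channel and make_empty_channel are hypothetical wrappers, and the import is done lazily so the helper can be defined on a machine where conda-index is not installed.

```python
from pathlib import Path
import tempfile


def index_channel(channel_dir, output_dir=None):
    """Index every subdir under channel_dir (needs conda-index installed)."""
    from conda_index.index import update_index  # lazy import

    update_index(
        str(channel_dir),
        output_dir=None if output_dir is None else str(output_dir),
        threads=1,       # single-threaded, which is easier to debug
        write_zst=True,  # also write repodata.json.zst
    )


def make_empty_channel(root):
    """Create the minimal channel layout: a directory containing noarch."""
    noarch = Path(root) / "noarch"
    noarch.mkdir(parents=True, exist_ok=True)
    return noarch


with tempfile.TemporaryDirectory() as tmp:
    noarch = make_empty_channel(Path(tmp) / "my-channel")
    has_noarch = noarch.is_dir()
    # index_channel(noarch.parent)  # uncomment where conda-index is installed

print(has_noarch)
```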

class conda_index.index.ChannelIndex(channel_root, channel_name, subdirs=None, threads: int | None = 48, deep_integrity_check=False, debug=False, output_root=None, cache_class=<class 'conda_index.index.sqlitecache.CondaIndexCache'>, write_bz2=False, write_zst=False, write_run_exports=False, compact_json=True)

Class implementing update_index. Allows for more fine-grained control of output.

See the implementation of conda_index.cli for usage.

index(patch_generator, verbose=False, progress=False, current_index_versions=None)

Examine all changed packages under self.channel_root, updating index.html for each subdir.

update_channeldata(rss=False)

Update channeldata based on re-reading output repodata.json and existing channeldata.json. Call after index() if channeldata is needed.
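Using ChannelIndex directly might look like the following sketch, which relies only on the constructor and method signatures listed above. index_with_channeldata is a hypothetical wrapper; conda_index.cli remains the authoritative usage example.

```python
def index_with_channeldata(channel_root, channel_name, subdirs=None):
    """Index a channel, then refresh channeldata.json (needs conda-index installed)."""
    from conda_index.index import ChannelIndex  # lazy import

    channel_index = ChannelIndex(
        str(channel_root),
        channel_name,
        subdirs=subdirs,   # None means all subdirs, per the default above
        threads=1,
        write_bz2=False,
        write_zst=True,
    )
    # No repodata patches are applied in this sketch.
    channel_index.index(patch_generator=None)
    # channeldata is only updated when asked for, after index().
    channel_index.update_channeldata(rss=False)
    return channel_index


# Example call, where conda-index is installed:
#   index_with_channeldata("/path/to/channel", "my-channel", subdirs=["noarch"])
```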

Database schema

Standalone conda-index uses a per-subdir sqlite database to track package metadata, unlike the older version which used millions of tiny .json files. The new strategy is much faster because we don't have to pay for many individual stat() or open() calls.

The whole schema looks like this:

<subdir>/.cache % sqlite3 cache.db
SQLite version 3.41.2 2023-03-22 11:56:21
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE about (path TEXT PRIMARY KEY, about BLOB);
CREATE TABLE index_json (path TEXT PRIMARY KEY, index_json BLOB);
CREATE TABLE recipe (path TEXT PRIMARY KEY, recipe BLOB);
CREATE TABLE recipe_log (path TEXT PRIMARY KEY, recipe_log BLOB);
CREATE TABLE run_exports (path TEXT PRIMARY KEY, run_exports BLOB);
CREATE TABLE post_install (path TEXT PRIMARY KEY, post_install BLOB);
CREATE TABLE icon (path TEXT PRIMARY KEY, icon_png BLOB);
CREATE TABLE stat (
                stage TEXT NOT NULL DEFAULT 'indexed',
                path TEXT NOT NULL,
                mtime NUMBER,
                size INTEGER,
                sha256 TEXT,
                md5 TEXT,
                last_modified TEXT,
                etag TEXT
            );
CREATE UNIQUE INDEX idx_stat ON stat (path, stage);
CREATE INDEX idx_stat_stage ON stat (stage, path);
sqlite> select stage, path from stat where path like 'libcurl%';
fs|libcurl-7.84.0-hc6d1d07_0.conda
fs|libcurl-7.86.0-h0f1d93c_0.conda
fs|libcurl-7.87.0-h0f1d93c_0.conda
fs|libcurl-7.88.1-h0f1d93c_0.conda
fs|libcurl-7.88.1-h9049daf_0.conda
indexed|libcurl-7.84.0-hc6d1d07_0.conda
indexed|libcurl-7.86.0-h0f1d93c_0.conda
indexed|libcurl-7.87.0-h0f1d93c_0.conda
indexed|libcurl-7.88.1-h0f1d93c_0.conda
indexed|libcurl-7.88.1-h9049daf_0.conda

Most of these tables store JSON-format metadata extracted from each package.

select * from index_json where path = 'libcurl-7.88.1-h9049daf_0.conda';
libcurl-7.88.1-h9049daf_0.conda|{"build":"h9049daf_0","build_number":0,"depends":["krb5 >=1.20.1,<1.21.0a0","libnghttp2 >=1.51.0,<2.0a0","libssh2 >=1.10.0,<2.0a0","libzlib >=1.2.13,<1.3.0a0","openssl >=3.0.8,<4.0a0"],"license":"curl","license_family":"MIT","name":"libcurl","subdir":"osx-arm64","timestamp":1676918523934,"version":"7.88.1","md5":"c86bbee944bb640609670ce722fba9a4","sha256":"37b8d58c05386ac55d1d8e196c90b92b0a63f3f1fe2fa916bf5ed3e1656d8e14","size":321706}
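The same lookup can be done from Python with the standard sqlite3 and json modules. To stay self-contained, this sketch builds an in-memory table with the index_json schema shown above and a shortened copy of the libcurl record; against a real cache you would connect to <subdir>/.cache/cache.db instead.

```python
import json
import sqlite3

# In-memory stand-in for <subdir>/.cache/cache.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_json (path TEXT PRIMARY KEY, index_json BLOB)")

path = "libcurl-7.88.1-h9049daf_0.conda"
record = {
    "name": "libcurl",
    "version": "7.88.1",
    "build": "h9049daf_0",
    "build_number": 0,
    "subdir": "osx-arm64",
    "size": 321706,
}
conn.execute("INSERT INTO index_json VALUES (?, ?)", (path, json.dumps(record)))

# Fetch the stored JSON document and decode it into a dict.
row = conn.execute(
    "SELECT index_json FROM index_json WHERE path = ?", (path,)
).fetchone()
metadata = json.loads(row[0])
print(metadata["name"], metadata["version"], metadata["size"])
```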

To track whether a package is indexed in the cache or not, conda-index uses a table named stat. The main point of this table is to assign a stage value to each artifact filename. The stages are usually 'fs', which is called the upstream stage, and 'indexed'. 'fs' means that the artifact is currently available in the set of packages (assumed by default to be the local filesystem). 'indexed' means that the entry already exists in the database (same filename, same timestamp, same hash), and its package metadata has been extracted to the index_json etc. tables. Paths in 'fs' but not in 'indexed' need to be unpacked to have their metadata added to the database. Paths in 'indexed' but not in 'fs' are ignored and left out of repodata.json.

First, conda-index adds all files in a subdir to the upstream stage. This involves a listdir() and stat() for each file in the index. The default upstream stage is named fs, but this step is designed to be overridden by subclassing CondaIndexCache() and replacing the save_fs_state() and changed_packages() methods. By overriding CondaIndexCache() it is possible to index without calling stat() on each package, or without even having all packages stored on the indexing machine.
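The idea behind recording the upstream stage can be sketched with the standard library alone. This is not the CondaIndexCache.save_fs_state() method itself: record_fs_stage is a hypothetical stand-alone function, the stat table is reduced to the columns it needs, and the file names are made up.

```python
import os
import sqlite3
import tempfile


def record_fs_stage(conn, subdir_path):
    """Replace the 'fs' stage with the current listing of subdir_path."""
    rows = []
    with os.scandir(subdir_path) as entries:
        for entry in entries:
            # Only conda package archives belong in the stage.
            if entry.is_file() and entry.name.endswith((".conda", ".tar.bz2")):
                info = entry.stat()
                rows.append((entry.name, info.st_mtime, info.st_size))
    with conn:  # delete and insert in a single transaction
        conn.execute("DELETE FROM stat WHERE stage = 'fs'")
        conn.executemany(
            "INSERT INTO stat (stage, path, mtime, size) VALUES ('fs', ?, ?, ?)",
            rows,
        )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stat (stage TEXT NOT NULL DEFAULT 'indexed', "
    "path TEXT NOT NULL, mtime NUMBER, size INTEGER)"
)

with tempfile.TemporaryDirectory() as subdir:
    for name in ("a-1.0-0.conda", "b-2.0-0.tar.bz2", "notes.txt"):
        open(os.path.join(subdir, name), "wb").close()
    record_fs_stage(conn, subdir)

fs_paths = [
    row[0]
    for row in conn.execute("SELECT path FROM stat WHERE stage = 'fs' ORDER BY path")
]
print(fs_paths)
```

A replacement upstream stage would fill the same rows from another source, such as an object-store listing, instead of os.scandir().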

Next, conda-index looks for all changed_packages(): paths in the upstream (fs) stage that either don't exist in the indexed stage or have a different modification time there.
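The comparison between the two stages can be expressed as a self-join on the stat table. The query below is an illustrative sketch of that rule, not the query conda-index actually runs; the table is reduced to four columns and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stat (stage TEXT NOT NULL DEFAULT 'indexed', "
    "path TEXT NOT NULL, mtime NUMBER, size INTEGER)"
)
conn.executemany(
    "INSERT INTO stat (stage, path, mtime, size) VALUES (?, ?, ?, ?)",
    [
        ("fs", "a-1.0-0.conda", 100, 10),       # unchanged since indexing
        ("indexed", "a-1.0-0.conda", 100, 10),
        ("fs", "b-1.0-0.conda", 200, 20),       # modified since indexing
        ("indexed", "b-1.0-0.conda", 150, 20),
        ("fs", "c-1.0-0.conda", 300, 30),       # new, never indexed
        ("indexed", "d-1.0-0.conda", 50, 5),    # no longer in fs, so ignored
    ],
)

CHANGED_SQL = """
SELECT fs.path
FROM stat AS fs
LEFT JOIN stat AS ix
  ON ix.path = fs.path AND ix.stage = 'indexed'
WHERE fs.stage = 'fs'
  AND (ix.path IS NULL OR ix.mtime != fs.mtime)
ORDER BY fs.path
"""

changed = [row[0] for row in conn.execute(CHANGED_SQL)]
print(changed)
```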

Finally, a join between the upstream stage, usually 'fs', and the index_json table yields a basic repodata_from_packages.json without any repodata patches.

SELECT path, index_json FROM stat JOIN index_json USING (path) WHERE stat.stage = :upstream_stage
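Run from Python, that join can be folded into a repodata-like dictionary. This sketch assumes the usual split of repodata.json into "packages" (.tar.bz2) and "packages.conda" (.conda); the tables are cut down to the needed columns and the package records are invented.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE index_json (path TEXT PRIMARY KEY, index_json BLOB);
    CREATE TABLE stat (stage TEXT NOT NULL DEFAULT 'indexed', path TEXT NOT NULL);
    """
)

cached = {
    "a-1.0-0.conda": {"name": "a", "version": "1.0"},
    "b-1.0-0.tar.bz2": {"name": "b", "version": "1.0"},
    "gone-1.0-0.conda": {"name": "gone", "version": "1.0"},  # cached but not in fs
}
for path, meta in cached.items():
    conn.execute("INSERT INTO index_json VALUES (?, ?)", (path, json.dumps(meta)))
    conn.execute("INSERT INTO stat (stage, path) VALUES ('indexed', ?)", (path,))
for path in ("a-1.0-0.conda", "b-1.0-0.tar.bz2"):
    conn.execute("INSERT INTO stat (stage, path) VALUES ('fs', ?)", (path,))

QUERY = (
    "SELECT path, index_json FROM stat JOIN index_json USING (path) "
    "WHERE stat.stage = :upstream_stage"
)

repodata = {"packages": {}, "packages.conda": {}}
for path, blob in conn.execute(QUERY, {"upstream_stage": "fs"}):
    key = "packages.conda" if path.endswith(".conda") else "packages"
    repodata[key][path] = json.loads(blob)

print(sorted(repodata["packages"]), sorted(repodata["packages.conda"]))
```

The package that is cached but missing from the 'fs' stage does not appear in the result.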

The steps to create repodata.json, including any repodata patches, and to create current_repodata.json with only the latest versions of each package, are similar to pre-sqlite3 conda-index.

The other cached metadata tables are used to create channeldata.json.

Sample queries

Megabytes added per day:

select
  date(mtime, 'unixepoch') as d,
  printf('%0.2f', sum(size) / 1e6) as MB
from
  stat
group by
  date(mtime, 'unixepoch')
order by
  mtime desc
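A variation of this query can be run from Python against a small in-memory table. Two things differ from the query above and are assumptions of this sketch: it filters on a single stage, so that a package listed under both 'fs' and 'indexed' is counted once, and it orders by the computed date. The rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat (stage TEXT, path TEXT, mtime NUMBER, size INTEGER)")
conn.executemany(
    "INSERT INTO stat VALUES ('indexed', ?, ?, ?)",
    [
        ("a-1.0-0.conda", 0, 1_500_000),     # 1970-01-01
        ("b-1.0-0.conda", 3600, 500_000),    # 1970-01-01
        ("c-1.0-0.conda", 86400, 250_000),   # 1970-01-02
    ],
)

rows = conn.execute(
    """
    select
      date(mtime, 'unixepoch') as d,
      printf('%0.2f', sum(size) / 1e6) as MB
    from
      stat
    where
      stage = 'indexed'
    group by
      d
    order by
      d desc
    """
).fetchall()

for day, megabytes in rows:
    print(day, megabytes)
```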

Author

conda

Info

Jun 29, 2024