# Vamb inputs and outputs
All modes of Vamb take various _inputs_ and produce various _outputs_.
Currently, all modes take the following two central inputs:

* The kmer composition of each contig (the _composition_).
* The abundance of the contigs in each sample (the _abundance_).

For inputs that take significant time to produce, Vamb will serialize the parsed input to a file, such that future runs of Vamb can use that instead of re-computing it.

## Composition
The composition is computed from the input contig file in FASTA format (the 'catalogue').
On the command line, this looks like:

```shell
--fasta contigs.fna.gz
```

The catalogue may be either a gzipped or a plain FASTA file.

Vamb produces the best results when run with the "multi-split" workflow, as demonstrated in the quickstart section in "how to run Vamb".
In this workflow, samples are assembled independently, and the resulting contigs are concatenated to a single FASTA file before binning.
After binning, the bins can be split into sample-wise pure bins.

To do this splitting (termed "binsplitting"), Vamb needs to know which contig came from which sample.
Therefore, it's recommended that the FASTA headers are formatted in the following pattern:
`{samplename}C{contigname}`

Where `{samplename}` is some text that uniquely identifies each sample, and `{contigname}` uniquely identifies each contig within a sample.
For example, if the samples are named S1, S2, S3, etc., and the contigs are named 1, 2, 3, etc., a FASTA header may be `S3C119`.
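
As a minimal sketch of how such headers can be taken apart, the snippet below splits a header into its sample name and contig name. The function name `split_header` is hypothetical, and the sketch assumes the header is split at the first occurrence of the separator `C`; it is not taken from Vamb's source code.

```python
def split_header(header: str, separator: str = "C") -> tuple[str, str]:
    """Split a FASTA header into (samplename, contigname) at the first separator."""
    sample, sep, contig = header.partition(separator)
    # Reject headers with no separator, or with an empty sample or contig name.
    if not sep or not sample or not contig:
        raise ValueError(
            f"Header {header!r} does not match {{samplename}}{separator}{{contigname}}"
        )
    return sample, contig


print(split_header("S3C119"))  # ('S3', '119')
```

Under this assumption, a sample name that itself contains `C` would be split in the wrong place, so it is safest to choose sample names that do not contain the separator.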

After the composition has been computed, Vamb will write it to the output file `composition.npz`.
Future runs of Vamb can then instead use the following option to load the composition directly:

```shell
--composition composition.npz
```

## Abundance
The abundance may be computed from either:
* A TSV file with the header being "contigname" followed by one sample name per sample,
  and the values in the TSV file being precomputed abundances.
  These may be derived from `paste`ing together outputs from the tool `strobealign --aemb`, and concatenating the result to a TSV header.
* A directory of sorted BAM files generated by mapping the reads of each sample to the contig catalogue.

On the command line, abundance input can be specified as:
```shell
--abundance_tsv abundance.tsv
```
or
```shell
--bamdir dir_with_bam_files
```

Once the abundance has been parsed, Vamb will produce the file `abundance.npz`, which can be used in future Vamb runs instead:
```shell
--abundance abundance.npz
```
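
If you want to check what a cached `.npz` file contains before reusing it, one option is to list the arrays stored in it. The sketch below relies only on the fact that NumPy's `.npz` format is a zip archive with one `.npy` member per array. The function name `list_npz_arrays` is hypothetical, and no assumption is made about which array names Vamb stores.

```python
import zipfile


def list_npz_arrays(path: str) -> list[str]:
    """Return the names of the arrays stored in an .npz file, without loading them."""
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as a zip member named "<array name>.npy".
        return [
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        ]
```

For example, `list_npz_arrays("abundance.npz")` returns the array names in that file. With NumPy installed, `numpy.load("abundance.npz").files` gives the same list.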

### Abundance TSV format
This follows the ordinary `.tsv` format, with a mandatory header, and disallowing `\t` in contig names.
The header of the first column must be exactly `contigname`.

The abundance TSV file in the correct format can be created from the output of `strobealign --aemb` using the script in `src/merge_aemb.py`:

```shell
python src/merge_aemb.py input_dir abundance.tsv
```

Example file:
```
contigname	S1	S2	S3
S1C1	1.53	1.11	4.1e2
S1C2	0.94	9.2	5.1e2
S2C1	1.2e-3	0	9.2
S3C1	88.21	51.2	12.1e3
S3C2	14.1	90.1	13.01
```
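
Before passing a hand-made TSV file to Vamb, it can be useful to check it against the rules above. The following is a minimal sketch of such a check, not Vamb's own parser: the function name `check_abundance_tsv` is hypothetical, and skipping empty lines is an assumption of this sketch.

```python
import io


def check_abundance_tsv(lines) -> tuple[list[str], int]:
    """Check an abundance TSV and return (sample names, number of contigs)."""
    header = None
    n_contigs = 0
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption of this sketch: empty lines are ignored
        fields = line.split("\t")
        if header is None:
            if fields[0] != "contigname":
                raise ValueError("first header column must be exactly 'contigname'")
            header = fields
            continue
        if len(fields) != len(header):
            raise ValueError(
                f"line {lineno}: expected {len(header)} fields, got {len(fields)}"
            )
        for value in fields[1:]:
            float(value)  # raises ValueError if the value is not a number
        n_contigs += 1
    if header is None:
        raise ValueError("file has no header")
    return header[1:], n_contigs


example = io.StringIO(
    "contigname\tS1\tS2\tS3\n"
    "S1C1\t1.53\t1.11\t4.1e2\n"
    "S2C1\t1.2e-3\t0\t9.2\n"
)
print(check_abundance_tsv(example))  # (['S1', 'S2', 'S3'], 2)
```

To check a real file, pass an open file handle, e.g. `check_abundance_tsv(open("abundance.tsv"))`.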

### Abundance as BAM files
Computing abundances with `strobealign --aemb` and creating a TSV file from its output is the recommended approach.
If you don't want to do that, Vamb can instead compute abundances from BAM files.

To do this:

* Create the FASTA contig catalogue as described in the "Composition" section.
* Map the reads for each sample to the catalogue, to obtain one BAM file per sample.

Using the aligner [minimap2](https://github.com/lh3/minimap2) as well as [samstrip](https://github.com/jakobnissen/samstrip) and [samtools](https://github.com/samtools/samtools), the commands may be:

```shell
# Index the FASTA file once, so it does not have to be re-indexed on every mapping
minimap2 -I 32G -d catalogue.mmi catalogue.fasta

# Map each sample. Here, using 8 threads, using paired short reads.
minimap2 -t 8 -ax sr catalogue.mmi s1.fw.fq.gz s1.rv.fq.gz | samstrip | samtools view -F 3584 -b - > s1.bam
```

_Note: if you use minimap2, be aware of [this bug in minimap2](https://github.com/lh3/minimap2/issues/37), where the output will be an invalid SAM file
if the index is not large enough. To get around this, use enough RAM when indexing (e.g. set option `-I` appropriately)._

### Reference hash checking
To ensure the integrity of the data, Vamb will compare the identifiers in the composition (ultimately: headers in the FASTA file) with the contig names from the abundance input (TSV file contig names, or BAM sequence names) and, if provided, that of the taxonomic input.

To do this efficiently, the identifiers from each input are _hashed_ to produce a _reference hash_ (refhash). The refhashes are then compared, and an error is raised if they differ.

If you, for some reason, can't create input files with matching identifiers, and you are 100% sure the order of the sequences is identical in the composition input and abundance input, you can disable this reference hashing with the `--norefcheck` option.
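
The idea behind the check can be illustrated with a short sketch. This is not Vamb's actual implementation: the function name `reference_hash`, the use of SHA-256, and the NUL terminator are all assumptions made for the example. It hashes the identifiers in order, so two inputs match only if they contain the same names in the same order.

```python
import hashlib


def reference_hash(identifiers) -> str:
    """Hash an ordered sequence of identifiers into a single hex digest."""
    hasher = hashlib.sha256()
    for name in identifiers:
        # The terminator keeps ["ab", "c"] distinct from ["a", "bc"].
        hasher.update(name.encode() + b"\0")
    return hasher.hexdigest()


fasta_names = ["S1C1", "S1C2", "S2C1"]
abundance_names = ["S1C1", "S2C1", "S1C2"]  # same names, different order

if reference_hash(fasta_names) != reference_hash(abundance_names):
    print("identifiers differ between composition and abundance")
```

Comparing one digest per input is much cheaper than comparing every identifier pairwise, which matters for catalogues with many contigs.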

## Taxonomy
Vamb operates with two kinds of taxonomies:
* _Unrefined_ taxonomies give the taxonomic annotation for each contig
* _Refined_ taxonomies give the taxonomic annotation _plus a probability estimate_ for each contig

Vamb's __Taxometer__ tool can be used to refine a taxonomy.
It takes an unrefined taxonomy as input, and outputs a refined taxonomy.
TaxVamb usually achieves better results if its taxonomy input is refined with Taxometer.

Both refined and unrefined taxonomies can be used for TaxVamb.
By default, if TaxVamb gets an unrefined taxonomy, it will automatically refine it with Taxometer, unless `--no_predict` is passed.

Taxonomy files are TSV files with the following format:
* Header: `contigs\tpredictions` for unrefined taxonomies and `contigs\tpredictions\tscores` for refined ones.
* In the `contigs` column: The FASTA identifier for every contig in the catalogue.
* In the `predictions` column: A semicolon-separated string with taxonomic levels, for each of the following seven ranks, in order:
  domain, phylum, class, order, family, genus, species. Lower ranks may be omitted.
  There is no requirement that the labels are actually meaningful, i.e. that they correspond to any real taxonomic clade.
* In the `scores` column: A semicolon-separated list of floats, one per element in the `predictions` column.

The following are examples of VALID rows in the prediction column:
```
Bacteria;Bacillota;Clostridia
Bacteria;Bacillota;Bacilli;Bacillales
Bacteria;Pseudomonadota;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter sp. TTH0-4
```

The following are examples of INVALID rows in the prediction column:
* Invalid: Begins with class instead of domain: `Clostridia;Eubacteriales;Lachnospiraceae;Roseburia;Roseburia hominis`
* Invalid: Skips the phylum: `Bacteria;Gammaproteobacteria;Moraxellales;Moraxellaceae;Acinetobacter;Acinetobacter sp. TTH0-4`

The following is an example of a valid, unrefined taxonomy file:
```
contigs	predictions
S18C13	Bacteria;Bacillota;Clostridia;Eubacteriales
S18C25	Bacteria;Pseudomonadota
S18C67	Bacteria;Bacillota;Bacilli;Bacillales;Staphylococcaceae;Staphylococcus
```
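
To make the format concrete, here is a minimal sketch that reads such a file and pairs each label with its rank. The names `RANKS` and `parse_taxonomy` are hypothetical, and this is not Vamb's own parser. It reads only the first two columns, so the `scores` column of a refined taxonomy is ignored, and it does not check whether the labels are real clades.

```python
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")


def parse_taxonomy(lines) -> dict[str, dict[str, str]]:
    """Map each contig to {rank: label} from a taxonomy TSV."""
    rows = iter(lines)
    header = next(rows, "").rstrip("\r\n").split("\t")
    if header[:2] != ["contigs", "predictions"]:
        raise ValueError(f"unexpected header: {header}")
    result = {}
    for line in rows:
        line = line.rstrip("\r\n")
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"missing predictions column: {line!r}")
        labels = fields[1].split(";")
        if len(labels) > len(RANKS):
            raise ValueError(f"more than {len(RANKS)} ranks for {fields[0]}")
        # zip stops at the shorter input, so omitted lower ranks are left out.
        result[fields[0]] = dict(zip(RANKS, labels))
    return result


example = ["contigs\tpredictions\n", "S18C25\tBacteria;Pseudomonadota\n"]
print(parse_taxonomy(example)["S18C25"])
# {'domain': 'Bacteria', 'phylum': 'Pseudomonadota'}
```

Because labels are assigned to ranks by position, a row that skips a rank or starts below domain would be silently mislabelled here, which is why such rows are invalid.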

Our tool [__Taxconverter__](https://github.com/RasmussenLab/taxconverter) can be used to create unrefined taxonomy files from MMSeqs2, Centrifuge, Kraken2, Metabuli or MetaMaps output files.

# Outputs

## Vamb
- `log.txt` - A text file with information about the Vamb run. Look here (and at stderr) if you experience errors.
- `composition.npz`: A Numpy .npz file that contains all kmer composition information computed by Vamb from the FASTA file.
  This can be provided to another run of Vamb to skip the composition calculation step.
  This is not produced if an existing `composition.npz` was used to run Vamb.
- `abundance.npz`: Similar to `composition.npz`, but this file contains information calculated from the abundance TSV file (or BAM files).
  Using this as input instead of BAM files will skip re-parsing the BAM files, which takes a significant amount of time.
  This file is not produced if an existing `abundance.npz` was used to run Vamb.
- `model.pt`: A file containing the trained VAE model. When running Vamb from a Python interpreter, the VAE can be loaded from this file to skip training.
- `latent.npz`: This contains the output of the VAE model, the embedding of each of the contigs.
- `bins`: If `--minfasta` is set, this is a directory with one FASTA file per bin, after binsplitting.
- `vae_clusters_unsplit.tsv` - A two-column TSV file with the header `clustername\tcontigname`, then one row per sequence:
  Left column for the cluster (i.e. bin) name, right column for the sequence name.
  You can create the FASTA-file bins themselves using the script in `src/create_fasta.py`.
- (if binsplitting is enabled:) `vae_clusters_split.tsv`, similar to the unsplit version, but after binsplitting.
  See the section on binsplitting on the page "tips for running Vamb".
- `vae_clusters_metadata.tsv`: A file with some metadata about clusters.
    - Name: The name of the cluster
    - Radius: Cosine radius in embedding space. Small clusters are usually more likely to be pure.
    - Peak/valley ratio: A small PVR means the cluster's edges are better defined, and hence the cluster is more likely to be pure
    - Kind: Currently, Vamb produces three kinds of clusters:
        - Normal: Defined by a local density in latent space. Most good clusters are of this type
        - Loner: A contig far away from everything else in latent space.
        - Fallback: After failing to produce good clusters for some time, these (usually poor) clusters are created
          to not get stuck in an infinite loop when clustering
    - Bp: Sum of length of all sequences in the cluster
    - Ncontigs: Number of sequences in the cluster
    - Medoid: Name of contig used as the cluster's medoid, i.e. the center of the cluster
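
The cluster TSV files described above are simple to read in downstream scripts. Below is a minimal sketch that groups contig names by cluster, based only on the two-column layout and header stated above. The function name `read_clusters` and the cluster names in the example are hypothetical.

```python
from collections import defaultdict


def read_clusters(lines) -> dict[str, list[str]]:
    """Group contig names by cluster name from a clusters TSV."""
    rows = iter(lines)
    header = next(rows, "").rstrip("\r\n")
    if header != "clustername\tcontigname":
        raise ValueError(f"unexpected header: {header!r}")
    clusters = defaultdict(list)
    for line in rows:
        line = line.rstrip("\r\n")
        if line:
            cluster, contig = line.split("\t")
            clusters[cluster].append(contig)
    return dict(clusters)


example = ["clustername\tcontigname\n", "1\tS1C1\n", "1\tS2C1\n", "2\tS1C2\n"]
print(read_clusters(example))  # {'1': ['S1C1', 'S2C1'], '2': ['S1C2']}
```

To read a real output file, pass an open file handle, e.g. `read_clusters(open("vae_clusters_unsplit.tsv"))`.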

## TaxVamb
* `log.txt`, `composition.npz` and `abundance.npz`: Same as when running `Vamb`
* `predictor_model.pt` and `results_taxometer.tsv`: Produced if Taxometer was used to automatically refine the taxonomy given to TaxVamb. See the Taxometer output section.
* `vaevae_clusters_{split,unsplit,metadata}.tsv`: Same as when running `Vamb`, but from TaxVamb's VAEVAE model
* `vaevae_model.pt`: A PyTorch model file containing the trained VAEVAE model.

## Taxometer
* `log.txt`, `composition.npz` and `abundance.npz`: Same as when running `Vamb`
* `predictor_model.pt`: A PyTorch model file containing the trained predictor.
* `results_taxometer.tsv`: A refined taxonomy file (see the section on files on the "how to run" page)

## AVAMB
Same as Vamb, but also:
- `aae_y_clusters_{split,unsplit}.tsv`: The clusters obtained from the categorical latent space
- `aae_z_latent.npz`: Like `latent.npz`, but of the adversarial Z latent space
- `aae_z_clusters_{metadata,split,unsplit}.tsv`: Like the corresponding `vae_clusters*` files, but from the adversarial Z latent space

