Data Preprocessing

The data_preprocessing module provides functions for loading sequence files and for quality control and preprocessing of sequencing data.

read_fasta

Reads a FASTA file and returns a dictionary containing the sequence IDs as keys and the corresponding sequences as values.

Usage:

# Load sequencing data
seq = dp.read_fasta("sequencing_data.fasta")

# Print the dictionary of sequence IDs and sequences
print(seq)
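
The ID-to-sequence mapping described above can be sketched in plain Python. This is a minimal illustration, not the module's actual implementation: the name parse_fasta is hypothetical, and the sketch assumes the sequence ID is the first whitespace-delimited token after ">" and that multi-line sequences are concatenated.

```python
def parse_fasta(lines):
    """Parse FASTA-formatted lines into a {sequence_id: sequence} dict."""
    records = {}
    current_id = None
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the ID is the first token of the header line.
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence data may span several lines; collect the pieces.
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


example_lines = [
    ">seq1 sample read",
    "ACGT",
    "ACGT",
    ">seq2",
    "GGCC",
]
parsed = parse_fasta(example_lines)
print(parsed)
```

Because the function only iterates over lines, an open file handle can be passed in place of the list.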

fetch_seq

Retrieves a sequence from a public sequence database (such as NCBI) using its accession number.

Usage:

# Fetch a sequence by accession number
seq = dp.fetch_seq(database, accession_number)

# Print the sequence
print(seq)

read_vcf

Reads a VCF file and returns a DataFrame containing the variant data.

Usage:

# Load data
data = dp.read_vcf(vcf_path)
print(data)
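
To show what the variant table might contain, here is a minimal sketch that parses the eight fixed VCF columns using only the standard library. It is not the module's implementation: the name parse_vcf_lines is hypothetical, sample columns are ignored, and the result is a list of dicts rather than a DataFrame.

```python
# The eight fixed, tab-separated columns defined by the VCF format.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_lines(lines):
    """Return one dict per variant record, skipping header lines."""
    records = []
    for raw_line in lines:
        line = raw_line.rstrip("\n")
        # Lines starting with '#' are meta-information or the column header.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        record = dict(zip(VCF_COLUMNS, fields[:8]))
        record["POS"] = int(record["POS"])  # positions are integers
        records.append(record)
    return records


example_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\trs1\tA\tG\t50\tPASS\tDP=10",
]
variants = parse_vcf_lines(example_vcf)
print(variants)
```

If pandas is installed, passing the list of dicts to pandas.DataFrame gives a table with one row per variant.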

fasta_quality_check

Performs a quality check on a sequence.

Usage:

# Run the quality check
qc = dp.fasta_quality_check(seq)
print(qc)
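
The text does not say which checks are run, so the following is only a guess at what a sequence-level quality check could report: length, GC content, the number of ambiguous N bases, and any characters outside the nucleotide alphabet. The function name and the returned keys are hypothetical.

```python
def sequence_quality_summary(sequence):
    """Summarise basic properties of a nucleotide sequence."""
    seq = sequence.upper()
    length = len(seq)
    gc_count = seq.count("G") + seq.count("C")
    return {
        "length": length,
        # Fraction of G and C bases; 0.0 for an empty sequence.
        "gc_content": gc_count / length if length else 0.0,
        # N marks a base the sequencer could not call.
        "n_count": seq.count("N"),
        # Anything other than A, C, G, T, or N is flagged.
        "invalid_characters": sorted(set(seq) - set("ACGTN")),
    }


summary = sequence_quality_summary("ACGTNNXG")
print(summary)
```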

filter_reads

Removes reads with low overall quality scores or with too many low-quality bases.

Usage:

# Filter out low-quality reads
filtered_data = dp.filter_reads(quality_scores, min_avg_score=20, max_low_quality_bases=5)
print(filtered_data)
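
The two criteria can be sketched as follows. This is an illustration under stated assumptions, not the module's code: quality_scores is taken to be a list of per-read Phred score lists, the function returns the indices of the reads that pass, and low_quality_threshold is a hypothetical parameter, since the text does not say what counts as a low-quality base.

```python
def filter_reads_sketch(quality_scores, min_avg_score=20,
                        max_low_quality_bases=5, low_quality_threshold=20):
    """Return the indices of reads that pass both quality criteria."""
    kept = []
    for index, scores in enumerate(quality_scores):
        if not scores:
            continue  # an empty read carries no usable information
        average = sum(scores) / len(scores)
        # Count bases below the (assumed) low-quality threshold.
        low_bases = sum(1 for q in scores if q < low_quality_threshold)
        if average >= min_avg_score and low_bases <= max_low_quality_bases:
            kept.append(index)
    return kept


example_scores = [
    [40] * 10,            # high quality throughout: kept
    [10] * 10,            # average too low: removed
    [40] * 6 + [15] * 6,  # average is fine, but 6 low-quality bases: removed
    [30] * 8,             # kept
]
kept_indices = filter_reads_sketch(example_scores)
print(kept_indices)
```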

quality_scores

Calculates the quality score for each base in a sequencing read, typically represented as a Phred score.

Usage:

# Compute per-base quality scores
score = dp.quality_scores(seq)
print(score)
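
For reference, a Phred score Q relates to the base-calling error probability P by Q = -10 * log10(P), and FASTQ files store each score as a single ASCII character. The sketch below decodes such a quality string, assuming the common Phred+33 encoding. The function names are hypothetical and the module may obtain its scores differently.

```python
def phred_scores(quality_string, offset=33):
    """Decode an ASCII-encoded quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Convert a Phred score into the probability that the base is wrong."""
    return 10 ** (-phred_score / 10)


decoded = phred_scores("II5#")
print(decoded)
print(error_probability(20))
```

A score of 20 therefore corresponds to a 1 in 100 chance of an incorrect base call, and a score of 30 to 1 in 1000.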

trim_adapters

Identifies and removes adapter sequences that may have been introduced during library preparation.

Usage:

# Trim the adapter from a read
data = dp.trim_adapters(sequence, adapter='AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC')
print(data)
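
One simple way to trim a 3' adapter is exact string matching: cut the read where the full adapter starts, or, if the read ends partway through the adapter, cut the partial match at the end of the read. The sketch below assumes exactly that. The function name and the min_overlap parameter are hypothetical, and mismatches are not tolerated as they would be in dedicated trimming tools.

```python
def trim_adapter_sketch(sequence, adapter, min_overlap=5):
    """Remove a 3' adapter (full or partial) using exact matching."""
    # Case 1: the complete adapter occurs somewhere in the read.
    start = sequence.find(adapter)
    if start != -1:
        return sequence[:start]
    # Case 2: the read ends with a prefix of the adapter.
    # Try the longest possible overlap first.
    longest = min(len(adapter) - 1, len(sequence))
    for length in range(longest, min_overlap - 1, -1):
        if sequence.endswith(adapter[:length]):
            return sequence[:-length]
    # No adapter found: return the read unchanged.
    return sequence


ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
full_match = trim_adapter_sketch("ACGTACGTTT" + ADAPTER + "GGGG", ADAPTER)
partial_match = trim_adapter_sketch("ACGTACGTTT" + ADAPTER[:8], ADAPTER)
no_match = trim_adapter_sketch("ACGTACGT", ADAPTER)
print(full_match, partial_match, no_match)
```

The min_overlap guard keeps very short chance matches at the end of a read from being trimmed.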

remove_duplicates

Identifies and removes duplicate reads that may have been introduced during PCR amplification.

Usage:

# Remove duplicate reads
data = dp.remove_duplicates(seq)
print(data)
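
A minimal sketch of sequence-based deduplication, assuming the input is a list of read strings and that only exact duplicates count. The function name is hypothetical. Tools that work on aligned reads usually define duplicates by mapping position instead, which this sketch does not attempt.

```python
def remove_duplicates_sketch(reads):
    """Keep the first occurrence of each read, preserving input order."""
    seen = set()
    unique_reads = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique_reads.append(read)
    return unique_reads


deduplicated = remove_duplicates_sketch(["ACGT", "GGCC", "ACGT", "TTAA", "GGCC"])
print(deduplicated)
```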

filter_contaminants

Identifies and removes reads that match known contaminant sequences, such as those from bacterial or viral genomes.

Usage:

# Remove reads that match contaminant sequences
data = dp.filter_contaminants(seq, contaminants)
print(data)
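
The text does not define what counts as a match, so the sketch below picks one simple criterion for illustration: a read is discarded if it shares at least one exact k-mer with any contaminant sequence. The function names and the k parameter are hypothetical, and reverse-complement matches are not checked.

```python
def kmers(sequence, k):
    """Return the set of all substrings of length k in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def filter_contaminants_sketch(reads, contaminants, k=8):
    """Keep only reads that share no k-mer with any contaminant."""
    contaminant_kmers = set()
    for contaminant in contaminants:
        contaminant_kmers |= kmers(contaminant, k)
    # Reads shorter than k have no k-mers and are therefore kept.
    return [read for read in reads
            if kmers(read, k).isdisjoint(contaminant_kmers)]


example_contaminants = ["GATTACAGATTACAGGG"]
example_reads = ["TTTTGATTACAGATTTT", "ACGTACGTACGTACGT"]
clean_reads = filter_contaminants_sketch(example_reads, example_contaminants)
print(clean_reads)
```

A larger k makes the filter more specific, while a smaller k catches shorter contaminant fragments at the cost of more chance matches.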

visualise_quality_metrics

Generates plots and summary statistics to assess the quality of sequencing data, such as per-base quality scores and read length distributions.

Usage:

# Plot and summarise quality metrics
data = dp.visualise_quality_metrics(sequences, quality_scores)
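
The two summary statistics named above can be computed with the standard library alone. The sketch below covers only the statistics, not the plotting, and assumes sequences is a list of read strings and quality_scores a list of per-read Phred score lists. The function name is hypothetical.

```python
from collections import Counter


def summarise_quality(sequences, quality_scores):
    """Return the read length distribution and mean quality per position."""
    # Number of reads observed at each read length.
    length_distribution = dict(Counter(len(seq) for seq in sequences))
    longest = max((len(scores) for scores in quality_scores), default=0)
    per_base_mean = []
    for position in range(longest):
        # Only reads long enough to cover this position contribute.
        column = [scores[position] for scores in quality_scores
                  if len(scores) > position]
        per_base_mean.append(sum(column) / len(column))
    return length_distribution, per_base_mean


example_sequences = ["ACGT", "ACG", "ACGT"]
example_qualities = [[40, 30, 20, 10], [40, 30, 20], [10, 30, 20, 30]]
lengths, mean_quality = summarise_quality(example_sequences, example_qualities)
print(lengths)
print(mean_quality)
```

The per-position means are what a per-base quality plot would show on its y-axis, with read position on the x-axis.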