Data Preprocessing

The data_preprocessing module provides functions for loading sequence files, running quality control, and preprocessing sequencing data.

read_fasta

Reads a FASTA file and returns a dictionary containing the sequence IDs as keys and the corresponding sequences as values.

Usage:

# Load sequencing data
seq = dp.read_fasta("sequencing_data.fasta")

# Print the dictionary of sequence IDs and sequences
print(seq)
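
To illustrate the kind of dictionary described above, here is a minimal sketch of a FASTA parser. The name parse_fasta is hypothetical and this is not the module's own implementation; it assumes the sequence ID is the first whitespace-delimited token of each header line.

```python
from typing import Dict, Iterable, List


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence_id: sequence} (illustrative only)."""
    records: Dict[str, List[str]] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Assumption: the ID is the first token after ">"
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines may be wrapped, so collect and join them later
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


example = [">seq1 sample read", "ACGT", "TTGA", ">seq2", "GGCC"]
print(parse_fasta(example))
```

To parse a file this way, pass an open file handle, since iterating over it yields lines.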

fetch_seq

Retrieves a sequence from a public sequence database (such as NCBI) using its accession number.

Usage:

# Fetch a sequence by accession number
seq = dp.fetch_seq(database, accession_number)

# Print the sequence
print(seq)

read_vcf

Reads a VCF file and returns a DataFrame containing the variant data.

Usage:

# Load data
data = dp.read_vcf(vcf_path)
print(data)
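
As a rough picture of what reading a VCF file involves, the sketch below parses VCF text into a list of row dictionaries using only the standard library. The name parse_vcf is hypothetical and this is not the module's implementation; the resulting rows could be handed to a DataFrame constructor to get tabular output like that described above.

```python
from typing import Dict, Iterable, List


def parse_vcf(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse VCF lines into one dict per variant record (illustrative only)."""
    columns: List[str] = []
    rows: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            # The header line starts with "#CHROM" and names the columns
            columns = line[1:].split("\t")
            continue
        fields = line.split("\t")
        rows.append(dict(zip(columns, fields)))
    return rows


vcf_text = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=20",
]
print(parse_vcf(vcf_text))
```

All values are kept as strings here; converting POS or QUAL to numbers is left out for brevity.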

fasta_quality_check

Performs a quality check on a sequence.

Usage:

# Run the quality check
qc = dp.fasta_quality_check(seq)
print(qc)
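
The description above does not say which metrics the check reports. As one possible reading, the sketch below computes three common sequence-level metrics: length, GC fraction, and the number of ambiguous N bases. The function name sequence_summary and the choice of metrics are assumptions, not the module's documented behaviour.

```python
from typing import Dict, Union


def sequence_summary(seq: str) -> Dict[str, Union[int, float]]:
    """Report basic quality indicators for one sequence (illustrative only)."""
    seq = seq.upper()
    length = len(seq)
    gc_count = seq.count("G") + seq.count("C")
    return {
        "length": length,
        # Guard against division by zero for an empty sequence
        "gc_fraction": gc_count / length if length else 0.0,
        "n_count": seq.count("N"),
    }


print(sequence_summary("ACGTNNGC"))
```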

filter_reads

Removes reads with a low average quality score or with too many low-quality bases.

Usage:

# Filter reads by quality
filtered_data = dp.filter_reads(quality_scores, min_avg_score=20, max_low_quality_bases=5)
print(filtered_data)
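
The two filtering criteria can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the name filter_by_quality, the (sequence, scores) input shape, and the low_quality_threshold parameter (the score below which a base counts as low quality) are all hypothetical.

```python
from typing import List, Sequence, Tuple

Read = Tuple[str, Sequence[int]]  # (sequence, per-base Phred scores)


def filter_by_quality(
    reads: Sequence[Read],
    min_avg_score: float = 20,
    max_low_quality_bases: int = 5,
    low_quality_threshold: int = 20,  # assumption: not part of the documented call
) -> List[Read]:
    """Keep reads that pass both the average-score and low-quality-base limits."""
    kept: List[Read] = []
    for seq, scores in reads:
        if not scores:
            continue  # a read without scores cannot be assessed
        average = sum(scores) / len(scores)
        low_bases = sum(1 for s in scores if s < low_quality_threshold)
        if average >= min_avg_score and low_bases <= max_low_quality_bases:
            kept.append((seq, scores))
    return kept


reads = [("ACGT", [30, 30, 30, 30]), ("ACGT", [10, 10, 30, 30])]
print(filter_by_quality(reads, min_avg_score=20, max_low_quality_bases=1))
```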

quality_scores

Calculates the quality score of each base in a sequencing read, typically represented as a Phred score.

Usage:

# Compute per-base quality scores
score = dp.quality_scores(seq)
print(score)
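
For context on Phred scores: in FASTQ files each base's score Q is usually stored as a single ASCII character with code Q + 33, and Q relates to the base-call error probability P by Q = -10 * log10(P). The sketch below decodes such a quality string. The helper names are hypothetical, and the Phred+33 encoding is an assumption about the input rather than something the module documents.

```python
from typing import List


def phred_scores(quality_string: str, offset: int = 33) -> List[int]:
    """Decode an ASCII-encoded quality string into Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(score: float) -> float:
    """Convert a Phred score Q into an error probability, P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


scores = phred_scores("!5I")
print(scores)
print([error_probability(q) for q in scores])
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, which is why 20 is a common filtering cutoff.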

trim_adapters

Identifies and removes adapter sequences that may have been introduced during library preparation.

Usage:

# Trim the adapter from the read
data = dp.trim_adapters(sequence, adapter='AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC')
print(data)
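
A simplified view of 3' adapter trimming is sketched below. It assumes exact matching only: a full adapter occurrence is removed together with everything after it, and a partial adapter at the very end of the read is removed when at least min_overlap bases match. The name trim_adapter and the min_overlap parameter are hypothetical, and dedicated trimming tools generally also tolerate mismatches, which this sketch does not.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Remove a 3' adapter from a read using exact matching (illustrative only)."""
    # Case 1: the full adapter occurs inside the read
    index = read.find(adapter)
    if index != -1:
        return read[:index]
    # Case 2: the read ends with a prefix of the adapter;
    # try the longest candidate overlap first
    longest = min(len(adapter) - 1, len(read))
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[: len(read) - k]
    return read


adapter = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
print(trim_adapter("ACGTACGT" + adapter + "TTT", adapter))
print(trim_adapter("ACGTACGT" + adapter[:10], adapter))
```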

remove_duplicates

Identifies and removes duplicate reads that may have been introduced during PCR amplification.

Usage:

# Remove duplicate reads
data = dp.remove_duplicates(seq)
print(data)
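
The simplest form of duplicate removal treats two reads as duplicates when their sequences are identical. The sketch below does that while preserving the original read order. The name drop_duplicate_reads is hypothetical, and exact sequence identity is an assumption; the module may use a different criterion.

```python
from typing import Iterable, List


def drop_duplicate_reads(reads: Iterable[str]) -> List[str]:
    """Keep the first occurrence of each distinct read, in input order."""
    seen = set()
    unique: List[str] = []
    for read in reads:
        if read not in seen:
            seen.add(read)
            unique.append(read)
    return unique


print(drop_duplicate_reads(["ACGT", "TTGA", "ACGT", "GGCC", "TTGA"]))
```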

filter_contaminants

Identifies and removes reads that match known contaminant sequences, such as those from bacterial or viral genomes.

Usage:

# Remove reads that match known contaminants
data = dp.filter_contaminants(seq, contaminants)
print(data)
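
One way to picture contaminant filtering is exact substring matching: a read is discarded if it, or its reverse complement, occurs verbatim in any contaminant sequence. The sketch below shows this approach. The names are hypothetical and exact matching is an assumption; screening against large genomes is normally done with an aligner or a k-mer index instead.

```python
from typing import Iterable, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def drop_contaminated_reads(reads: Iterable[str], contaminants: Iterable[str]) -> List[str]:
    """Keep reads that do not occur in any contaminant on either strand."""
    references = list(contaminants)
    clean: List[str] = []
    for read in reads:
        candidates = (read, reverse_complement(read))
        hit = any(c in ref for ref in references for c in candidates)
        if not hit:
            clean.append(read)
    return clean


contaminants = ["ACGTTGCAAGGC"]
print(drop_contaminated_reads(["GTTGCA", "GCCTTG", "TTTTTT"], contaminants))
```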

visualise_quality_metrics

Generates plots and summary statistics for assessing the quality of sequencing data, such as per-base quality scores and read length distributions.

Usage:

# Plot quality metrics and summary statistics
data = dp.visualise_quality_metrics(sequences, quality_scores)
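
The two metrics named above can be computed before any plotting takes place. The sketch below derives the mean quality at each read position and the read length distribution using only the standard library. The name summarise_quality and the input shapes (a list of sequences and a parallel list of per-base score lists) are assumptions; the results could then be passed to any plotting library.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def summarise_quality(
    sequences: Sequence[str],
    quality_scores: Sequence[Sequence[int]],
) -> Tuple[List[float], Dict[int, int]]:
    """Return (mean quality per position, read length -> count)."""
    length_counts = Counter(len(seq) for seq in sequences)
    longest = max((len(scores) for scores in quality_scores), default=0)
    per_base_mean: List[float] = []
    for position in range(longest):
        # Only reads long enough to cover this position contribute
        values = [scores[position] for scores in quality_scores if len(scores) > position]
        per_base_mean.append(sum(values) / len(values))
    return per_base_mean, dict(length_counts)


means, lengths = summarise_quality(["ACGT", "AC"], [[30, 30, 20, 10], [10, 20]])
print(means)
print(lengths)
```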