Data Preprocessing
The data_preprocessing module provides functions for loading sequence files, running quality control, and preprocessing sequencing data.
read_fasta
Reads a FASTA file and returns a dictionary containing the sequence IDs as keys and the corresponding sequences as values.
Usage:
# Load sequencing data
seq = dp.read_fasta("sequencing_data.fasta")
# Print the dictionary of sequence IDs and sequences
print(seq)
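To illustrate the ID-to-sequence mapping described above, here is a minimal sketch of FASTA parsing. The function name parse_fasta is hypothetical and this is not the module's implementation. It assumes the sequence ID is the first whitespace-delimited token after ">" and that multi-line sequences are joined.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into {sequence_id: sequence}."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if current_id is not None:
                records[current_id] = "".join(chunks)
            header = line[1:].split()
            current_id = header[0] if header else ""
            chunks = []
        elif current_id is not None:
            chunks.append(line)  # sequence lines may span several rows
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


example = """>seq1 sample read
ACGT
TTGA
>seq2
GGCC
"""
parsed = parse_fasta(example.splitlines())
print(parsed)  # {'seq1': 'ACGTTTGA', 'seq2': 'GGCC'}
```

To parse a file with this sketch, pass an open file handle, since iterating over a handle yields lines.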
fetch_seq
Retrieves a sequence from a public sequence database (such as NCBI) using its accession number.
Usage:
# Fetch a sequence by accession number
seq = dp.fetch_seq(database, accession_number)
# Print the sequence
print(seq)
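The source does not say how fetch_seq contacts the database. As one possible approach, the sketch below builds a request for NCBI's E-utilities EFetch endpoint using only the standard library. The helper names build_efetch_url and fetch_fasta_text are hypothetical. The database name "nuccore" and the accession are example values.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# NCBI E-utilities EFetch endpoint
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(database: str, accession_number: str) -> str:
    """Build an EFetch URL that requests a record as FASTA text."""
    params = {
        "db": database,
        "id": accession_number,
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_fasta_text(database: str, accession_number: str, timeout: float = 30.0) -> str:
    """Download the FASTA text for one accession (requires network access)."""
    with urlopen(build_efetch_url(database, accession_number), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_fasta_text(...) to perform the request.
url = build_efetch_url("nuccore", "NC_045512.2")
print(url)
```

Keeping URL construction separate from the network call makes the request easy to inspect and test offline.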
read_vcf
Reads a VCF file and returns a DataFrame containing the variant data.
Usage:
# Load data
data = dp.read_vcf(vcf_path)
print(data)
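As a rough illustration of how variant rows can be extracted from VCF text, the sketch below skips the "##" meta lines, reads column names from the "#CHROM" header line, and returns one dictionary per variant. The name parse_vcf is hypothetical, and the sketch uses only the standard library rather than a DataFrame. A list of dictionaries like this can be passed to pandas.DataFrame if a DataFrame is needed.

```python
from typing import Dict, Iterable, List


def parse_vcf(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse VCF lines into a list of {column_name: value} rows."""
    columns: List[str] = []
    rows: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#"):
            # Header line, e.g. "#CHROM<TAB>POS<TAB>ID..."
            columns = line.lstrip("#").split("\t")
            continue
        if not columns:
            raise ValueError("VCF header line (#CHROM ...) must precede data lines")
        rows.append(dict(zip(columns, line.split("\t"))))
    return rows


example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=100",
]
variants = parse_vcf(example)
print(variants[0]["CHROM"], variants[0]["POS"], variants[0]["ALT"])
```

All values are kept as strings in this sketch. Converting POS and QUAL to numbers is left to the caller.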
fasta_quality_check
Performs a quality check on a sequence.
Usage:
# Run the quality check
qc = dp.fasta_quality_check(seq)
print(qc)
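The source does not list which checks fasta_quality_check performs. Purely as an assumption, the sketch below shows the kind of metrics such a check might report for a nucleotide sequence: length, GC fraction, number of ambiguous N bases, and any characters outside A, C, G, T, N. The name basic_sequence_qc and the returned keys are hypothetical.

```python
from typing import Dict, List, Union


def basic_sequence_qc(sequence: str) -> Dict[str, Union[int, float, List[str]]]:
    """Report simple quality metrics for one nucleotide sequence."""
    seq = sequence.upper()
    length = len(seq)
    gc_count = seq.count("G") + seq.count("C")
    return {
        "length": length,
        # Guard against division by zero for empty sequences
        "gc_fraction": gc_count / length if length else 0.0,
        "n_count": seq.count("N"),
        "invalid_characters": sorted(set(seq) - set("ACGTN")),
    }


report = basic_sequence_qc("ACGTNNGGCCxx")
print(report)
```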
filter_reads
Remove reads with low overall quality scores or with too many low-quality bases.
Usage:
# Filter reads by quality
filtered_data = dp.filter_reads(quality_scores, min_avg_score=20, max_low_quality_bases=5)
print(filtered_data)
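The two criteria above can be sketched as follows, assuming quality_scores is a list with one list of Phred scores per read. The source does not say which score counts as a low-quality base, so low_quality_threshold is a hypothetical extra parameter, and filter_reads_sketch is not the module's implementation.

```python
from typing import List, Sequence


def filter_reads_sketch(
    quality_scores: Sequence[Sequence[int]],
    min_avg_score: float = 20,
    max_low_quality_bases: int = 5,
    low_quality_threshold: int = 20,  # hypothetical: not in the documented signature
) -> List[Sequence[int]]:
    """Keep reads whose average score and low-quality base count both pass."""
    kept = []
    for scores in quality_scores:
        if not scores:
            continue  # an empty read has no meaningful average
        average = sum(scores) / len(scores)
        low_bases = sum(1 for q in scores if q < low_quality_threshold)
        if average >= min_avg_score and low_bases <= max_low_quality_bases:
            kept.append(scores)
    return kept


reads = [
    [30, 32, 35, 31],                                     # passes both criteria
    [10, 12, 8, 15],                                      # average below 20
    [40, 40, 40, 5, 5, 5, 5, 5, 5, 40, 40, 40, 40, 40],   # average 25, but 6 low bases
]
kept_reads = filter_reads_sketch(reads)
print(kept_reads)  # [[30, 32, 35, 31]]
```

The third read shows why both criteria matter: its average passes, yet it carries more low-quality bases than allowed.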
quality_scores
Calculate the quality scores for each base in a sequencing read, typically represented as a Phred score.
Usage:
# Calculate per-base quality scores
score = dp.quality_scores(seq)
print(score)
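For reference, a Phred score Q relates to the base-calling error probability P by Q = -10 * log10(P). The sketch below decodes a FASTQ quality string into Phred scores, assuming the common Phred+33 ASCII encoding. The source does not state what input quality_scores expects, so the function names and the quality-string input are assumptions.

```python
from typing import List


def phred_scores(quality_string: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into per-base Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(phred: int) -> float:
    """Convert a Phred score Q into an error probability P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


scores = phred_scores("II5#")
print(scores)  # [40, 40, 20, 2]
print(error_probability(20))  # about 0.01, i.e. 1 error in 100 base calls
```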
trim_adapters
Identify and remove adapter sequences that may have been introduced during library preparation.
Usage:
# Trim the adapter from the sequence
data = dp.trim_adapters(sequence, adapter='AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC')
print(data)
remove_duplicates
Identify and remove duplicate reads that may have been introduced during PCR amplification.
Usage:
# Remove duplicate reads
data = dp.remove_duplicates(seq)
print(data)
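One simple way to deduplicate reads is to keep the first occurrence of each distinct sequence. The sketch below does this with exact, case-insensitive matching. The name remove_duplicate_reads is hypothetical, and the source does not say whether the module compares sequences, alignment positions, or something else.

```python
from typing import Iterable, List


def remove_duplicate_reads(reads: Iterable[str]) -> List[str]:
    """Return reads with exact duplicates removed, preserving input order."""
    seen = set()
    unique = []
    for read in reads:
        key = read.upper()  # treat 'acgt' and 'ACGT' as the same read
        if key not in seen:
            seen.add(key)
            unique.append(read)
    return unique


reads = ["ACGT", "TTGA", "acgt", "ACGT", "GGCC"]
deduplicated = remove_duplicate_reads(reads)
print(deduplicated)  # ['ACGT', 'TTGA', 'GGCC']
```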
filter_contaminants
Identify and remove reads that match known contaminant sequences, such as those from bacterial or viral genomes.
Usage:
# Filter out reads that match contaminants
data = dp.filter_contaminants(seq, contaminants)
print(data)
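The source does not specify how reads are matched against contaminants. As one possible approach, the sketch below drops any read that shares at least one exact k-mer with a contaminant sequence. The function names and the k parameter are hypothetical. Production pipelines usually rely on an aligner or a dedicated screening tool instead.

```python
from typing import Iterable, List, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def filter_contaminants_sketch(
    reads: Iterable[str], contaminants: Iterable[str], k: int = 8
) -> List[str]:
    """Keep only reads that share no k-mer with any contaminant sequence."""
    contaminant_kmers: Set[str] = set()
    for contaminant in contaminants:
        contaminant_kmers |= kmers(contaminant.upper(), k)
    # Reads shorter than k produce no k-mers and are therefore kept
    return [read for read in reads if not (kmers(read.upper(), k) & contaminant_kmers)]


contaminants = ["GATTACAGATTACAGGCC"]
reads = ["TTTTGATTACAGATTTT", "ACGTACGTACGTACGT"]
clean_reads = filter_contaminants_sketch(reads, contaminants)
print(clean_reads)  # ['ACGTACGTACGTACGT']
```

A smaller k removes more reads but raises the chance of discarding reads that match by coincidence.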
visualise_quality_metrics
Generate plots and summary statistics to assess the quality of sequencing data, such as per-base quality scores and read length distributions.
Usage:
# Plot quality metrics and summary statistics
data = dp.visualise_quality_metrics(sequences, quality_scores)
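The summary statistics mentioned above can be computed without any plotting library. The sketch below derives the read length distribution and the mean quality at each base position, assuming quality_scores holds one list of Phred scores per read. The name summarise_quality and the returned keys are hypothetical. The resulting values could then be passed to a plotting library of choice.

```python
from collections import Counter
from statistics import mean
from typing import Dict, List, Sequence


def summarise_quality(
    sequences: Sequence[str], quality_scores: Sequence[Sequence[int]]
) -> Dict[str, object]:
    """Compute the read length distribution and per-base mean quality."""
    length_distribution = Counter(len(seq) for seq in sequences)
    longest = max((len(scores) for scores in quality_scores), default=0)
    per_base_mean: List[float] = []
    for position in range(longest):
        # Only reads long enough to cover this position contribute
        column = [scores[position] for scores in quality_scores if len(scores) > position]
        per_base_mean.append(mean(column))
    return {
        "read_length_distribution": dict(length_distribution),
        "per_base_mean_quality": per_base_mean,
    }


sequences = ["ACGT", "ACG", "ACGT"]
quality_scores = [[40, 38, 30, 20], [36, 34, 30], [38, 36, 30, 10]]
summary = summarise_quality(sequences, quality_scores)
print(summary)
```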