2. Basic Computing Skills — Command Line & File Handling
Most bioinformatics tools are run from the command line rather than through point-and-click software. Learning a few basic commands will help you handle large datasets efficiently.
a. Introduction to the Command Line
- Why use it? It’s faster, handles big files, and is required by many bioinformatics tools such as BLAST, BWA, and GROMACS.
- Getting started: Open Terminal (Mac/Linux) or PowerShell/WSL (Windows).
-
Basic commands to learn:
- pwd → Show current location.
- ls → List files in a folder.
- cd foldername → Move into a folder.
- mkdir foldername → Create a folder.
- cp file1 file2 → Copy files.
- mv file1 file2 → Move or rename files.
- rm filename → Delete files (careful!).
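The commands above can be strung into a short practice session. This sketch works anywhere; the folder and file names are just placeholders:

```shell
cd "$(mktemp -d)"           # practice in a throwaway directory
mkdir week1_practice        # create a folder
cd week1_practice           # move into it
pwd                         # print the current location
printf 'hello\n' > notes.txt     # make a small file
cp notes.txt backup.txt          # copy it
mv backup.txt notes_copy.txt     # rename the copy
ls                               # list both files
rm notes_copy.txt                # delete the copy (no undo!)
```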
b. File Handling Skills
📌 Learning Tip: Practice by creating a folder named week1_practice, downloading a small .fasta file from NCBI, and exploring it with head and grep.
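As a sketch of that tip: here we fake a tiny FASTA file with printf so the commands run anywhere; with a real NCBI download you would skip the printf step (the sequences and headers below are made up):

```shell
cd "$(mktemp -d)"   # work in a throwaway folder

# Stand-in for a real NCBI download: a tiny made-up FASTA file
printf '>seq1 demo record\nATGCGTACGT\n>seq2 demo record\nTTGACCGGTA\n' > demo.fasta

head -n 2 demo.fasta     # peek at the first record
grep '>' demo.fasta      # list every header line
grep -c '>' demo.fasta   # count how many sequences the file holds
```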
3. Introduction to Bioinformatics Databases
Databases are the libraries of bioinformatics — they store sequences, annotations, and biological insights. You’ll be using them constantly.
a. NCBI (National Center for Biotechnology Information)
- What it offers: DNA/RNA sequences, protein sequences, genome assemblies, literature (PubMed), BLAST search.
- Beginner activity:
  - Go to NCBI Nucleotide.
  - Search for “human hemoglobin gene.”
  - Download the sequence in FASTA format.
  - Open it in a text editor to see the sequence.
b. UniProt
c. Ensembl (Optional for Week 1)
📌 Learning Tip: Keep a bioinformatics notebook (physical or digital) where you record what you searched for, which files you downloaded, and what each command did.
End of Week 1 Goals
By the end of this week, you should be able to:
- Explain what DNA, RNA, and proteins are and how they relate to each other.
- Navigate and download data from NCBI and UniProt.
- Use basic command-line commands to navigate folders, open files, and search inside them.
Week 2 — Learning Core Tools
This week you move from basic literacy to doing. You’ll learn the file formats you’ll encounter every day, how to search sequences with BLAST, and how to align sequences (pairwise and multiple alignment). Below is a guided, beginner-friendly walk-through with concrete examples, commands you can try, common pitfalls, and short exercises.
1) File formats you must know: FASTA, FASTQ, GFF/GTF
FASTA
- What it stores: DNA, RNA, or protein sequences.
- Structure: a header line starting with > followed by one or more sequence lines.
- Uses: reference genomes, protein databases, individual sequences for BLAST/MSA.
- Tools for quick inspection: head, less, seqkit, Biopython.
- Pitfalls: some tools expect line-wrapped sequences (≤80 chars) while others accept single-line sequences. Keep headers clean (no spaces or special characters if you plan to script).
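For orientation, a minimal FASTA file with two records looks like this (the identifiers and sequences here are invented for illustration):

```text
>seq1 example DNA record
ATGGCGTGAACCGGTTAGC
ATTGCCGGAAATTC
>seq2 another example record
TTGACAGGATCCA
```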
FASTQ
- What it stores: sequencing reads + per-base quality scores (typical output from Illumina).
- Structure: 4 lines per read — an @header, the sequence, a + separator, and a quality string of the same length as the sequence.
- Uses: raw NGS reads (input for QC, trimming, alignment).
- Important: quality encoding — modern Illumina uses Phred+33. Some old datasets use Phred+64; check with FastQC.
- Common commands: head, zcat (for gzipped files), and wc -l (reads = lines ÷ 4) for quick checks.
- Pitfalls: mismatched sequence/quality lengths and truncated files will break tools.
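A quick way to see the 4-line structure is to fabricate a one-read FASTQ and inspect it; the read below is made up, so these commands run without any real data:

```shell
cd "$(mktemp -d)"

# One made-up read in FASTQ format: header, sequence, '+', quality string
printf '@read1\nACGTACGT\n+\nIIIIIIII\n' > demo.fastq

head -n 4 demo.fastq                          # show the first record
echo $(( $(wc -l < demo.fastq) / 4 ))         # read count = line count / 4
awk 'NR%4==2 {print length($0)}' demo.fastq   # length of each read
```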
GFF / GTF (annotation formats)
- What they store: genome feature coordinates (genes, exons, CDS, etc.).
- GFF3 uses 9 tab-separated columns: seqid, source, type, start, end, score, strand, phase, attributes.
- GTF is a GFF variant used by Ensembl and others; it uses a slightly different attribute syntax (e.g., gene_id "XYZ"; transcript_id "ABC";).
- Uses: feeding annotation to read counters (featureCounts), genome browsers, visualization.
- Pitfalls: coordinate bases — GFF/GTF are 1-based inclusive, while BED is 0-based. The attribute field format also differs between tools; check requirements.
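A single hypothetical GFF3 record might look like this (columns are tab-separated, in the order seqid, source, type, start, end, score, strand, phase, attributes; the gene name and coordinates are invented):

```text
chr1	example	gene	1300	9000	.	+	.	ID=gene0001;Name=demoGene
```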
2) Sequence search with BLAST
Why BLAST?
BLAST (Basic Local Alignment Search Tool) finds regions of similarity between sequences — the fastest way to find homologs, infer function, or identify contamination.
BLAST flavors (common)
- blastn — nucleotide query vs nucleotide DB
- blastp — protein query vs protein DB
- blastx — translated nucleotide query → protein DB (useful for transcripts)
- tblastn — protein query vs translated nucleotide DB
- tblastx — translates both query and DB (rare, slow)
Quick online option (beginner-friendly)
- Use the NCBI BLAST web interface: paste your FASTA, choose a database (nr, refseq), run, and inspect the top hits.
Local BLAST (for practice)
- Create a database from a FASTA file with makeblastdb.
- Run BLAST against it and request tabular output (-outfmt 6), which is easy to parse.
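A minimal sketch of those two steps, assuming BLAST+ is installed; reference.fasta and query.fasta are placeholder file names:

```shell
# Build a nucleotide BLAST database from a FASTA file
makeblastdb -in reference.fasta -dbtype nucl -out mydb

# Search it and write tabular output (-outfmt 6) for easy parsing
blastn -query query.fasta -db mydb -evalue 1e-10 -outfmt 6 -out hits.tsv

# Default -outfmt 6 columns: query, subject, %identity, length, mismatches,
# gap opens, qstart, qend, sstart, send, evalue, bitscore
head hits.tsv
```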
How to read results (key columns)
- % identity — percent of identical residues in the alignment
- alignment length — length of the aligned region
- E-value — expected number of alignments by chance; smaller = more significant (e.g., 1e-10)
- bit score — normalized alignment score (higher is better)
- query coverage — how much of your query is covered by the hit (important for partial matches)
Practical tips
- Use -max_target_seqs carefully — it restricts the number of reported hits.
- For function inference, require high identity plus good coverage, and prefer hits to curated sequences (Swiss-Prot) over unreviewed ones.
3) Alignment basics: pairwise vs multiple sequence alignment (MSA)
Pairwise alignment
- Aligns two sequences to find similarities (Smith-Waterman for local, Needleman-Wunsch for global alignment). BLAST computes local alignments internally.
- Good for checking homology between two sequences or validating a BLAST hit.
Multiple Sequence Alignment (MSA)
- Aligns >2 sequences to reveal conserved regions and motifs — essential for identifying functionally important residues.
- Common MSA tools:
  - Clustal Omega — scalable, fast for many sequences
  - MUSCLE — good accuracy for moderately sized sets
  - MAFFT — fast and accurate, with many modes for different dataset sizes
- All three have simple command-line interfaces (clustalo, muscle, mafft).
- Web options: EBI Clustal Omega, EMBL-EBI MAFFT — great for beginners.
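Typical command lines for the three tools, assuming they are installed; seqs.fasta is a placeholder input with several homologous sequences:

```shell
# Clustal Omega
clustalo -i seqs.fasta -o aln_clustalo.fasta --outfmt=fasta

# MUSCLE (v5 syntax; older v3 releases use -in/-out instead)
muscle -align seqs.fasta -output aln_muscle.fasta

# MAFFT (--auto picks a strategy suited to the dataset size)
mafft --auto seqs.fasta > aln_mafft.fasta
```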
Visualizing MSA
- Jalview, AliView, and UGENE display alignments and highlight conserved columns, consensus, and secondary-structure predictions.
- Look for conserved blocks, highly conserved residues (often functional), and variable regions.
4) Extra essential tools & steps
A. Read quality control (if you have FASTQ)
- FastQC — run it on raw reads to inspect per-base quality, GC content, adapter contamination, etc.
- Fix issues with cutadapt or Trimmomatic (adapter trimming & quality trimming).
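A sketch of the QC-then-trim pattern, assuming FastQC and cutadapt are installed; reads.fastq.gz is a placeholder file name, and the adapter shown is the common Illumina TruSeq adapter prefix:

```shell
# Quality report (writes an HTML report next to the input file)
fastqc reads.fastq.gz

# Trim a 3' adapter and low-quality bases (Phred < 20) with cutadapt
cutadapt -a AGATCGGAAGAGC -q 20 -o trimmed.fastq.gz reads.fastq.gz
```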
B. Short-read mapping (intro)
- Mapping tools (BWA, Bowtie2) align NGS reads to a reference genome; the output is SAM/BAM — required for variant calling.
- You don’t have to master mapping this week, but know that it exists and is the next step after QC.
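For reference, a typical BWA-MEM invocation looks like this (all file names are placeholders; bwa and samtools must be installed):

```shell
# One-time: index the reference genome
bwa index reference.fasta

# Align paired-end reads; pipe the SAM output straight into samtools
# to produce a coordinate-sorted BAM, then index it
bwa mem reference.fasta reads_1.fastq.gz reads_2.fastq.gz \
  | samtools sort -o aligned_sorted.bam -
samtools index aligned_sorted.bam
```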
C. Fast sequence utilities
5) Practice exercises (concrete, small wins)
- Inspect a FASTA file:
  - Download a small protein FASTA (e.g., hemoglobin alpha) from UniProt.
  - head -n 5 myproteins.fasta
  - seqkit stats myproteins.fasta
- Run a BLAST search (web).
- Run a local BLAST (optional).
- Do an MSA.
- If you have a small FASTQ, run FastQC on it.
Week 2 goals
- Explain and identify FASTA, FASTQ, and GFF formats and their common pitfalls (Phred encoding, coordinate bases).
- Run a BLAST search (web and basic local) and interpret the results (E-value, identity, coverage).
- Produce an MSA for a small set of homologous sequences and visually inspect conserved motifs.
- Run FastQC on a FASTQ file and understand the basic quality metrics.
- Know the next tools to learn: read mappers (BWA/Bowtie2), SAM/BAM handling (samtools), and variant-calling basics.
Quick troubleshooting & tips
- If BLAST returns many low-quality hits, tighten the E-value threshold (e.g., -evalue 1e-5 or 1e-10) and check coverage.
- For MSA: if the result looks noisy, remove very divergent sequences — they can ruin alignment accuracy.
- Keep raw files backed up, and always work on copies when trimming or transforming files.
By the end of Week 2, you should be comfortable recognizing and inspecting FASTA/FASTQ/GFF files, running BLAST (web & basic local), producing a clean MSA, and interpreting key quality metrics from FastQC. You should also know where read mapping with BWA/Bowtie2 fits in and how alignments are stored in SAM/BAM.
Week 3 — From Raw Reads to Meaningful Results
By Week 3, you will already be working with databases, file formats, and core tools. Now, you are processing raw sequencing reads into analysis-ready results. This means taking FASTQ files, aligning them to a reference genome, converting them into optimized formats, calling variants, and visualizing results.
In Python, we achieve this by using libraries that wrap around bioinformatics tools or directly manipulate sequence/alignment data.
1) Mapping Reads to a Reference Genome
In bioinformatics, mapping is like matching fingerprints — each short DNA/RNA read must be placed exactly where it belongs in the reference genome.
While mapping is typically performed with command-line tools like BWA-MEM or Bowtie2, in Python you can launch those tools with subprocess and inspect their output with libraries such as pysam.
A typical pattern is to run BWA from Python and then read the SAM output, so you can check mapping quality directly inside Python — with pysam, or with a few lines of plain parsing.
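A minimal sketch of that pattern. The BWA call only runs if bwa is actually installed, and the file names are placeholders; the counting helper parses the SAM FLAG field directly (pysam's AlignmentFile offers the same information via its is_unmapped attribute):

```python
import shutil
import subprocess

def mapped_count_from_sam(sam_path):
    """Count mapped vs. unmapped reads by checking bit 0x4 of the SAM FLAG field."""
    mapped = unmapped = 0
    with open(sam_path) as fh:
        for line in fh:
            if line.startswith("@"):          # skip header lines
                continue
            flag = int(line.split("\t")[1])   # FLAG is the second column
            if flag & 0x4:                    # 0x4 = "read unmapped"
                unmapped += 1
            else:
                mapped += 1
    return mapped, unmapped

# Run BWA only if it is installed (reference.fasta / reads.fastq are placeholders)
if shutil.which("bwa"):
    with open("aligned.sam", "w") as out:
        subprocess.run(["bwa", "mem", "reference.fasta", "reads.fastq"],
                       stdout=out, check=True)
    print(mapped_count_from_sam("aligned.sam"))
```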
2) Converting SAM to BAM, Sorting, and Indexing
SAM is a large text-based format; BAM is its binary, compressed equivalent. Sorting by genomic coordinates and indexing speeds up data access.
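A sketch of this step driven from Python, assuming samtools is installed (file names are placeholders). The helper only assembles the command lines, so the logic can be inspected without running anything:

```python
import shutil
import subprocess

def samtools_steps(sam_in, bam_out):
    """Return the samtools commands for SAM→BAM conversion, sorting, and indexing."""
    sorted_bam = bam_out.replace(".bam", "_sorted.bam")
    return [
        ["samtools", "view", "-b", "-o", bam_out, sam_in],  # SAM → BAM
        ["samtools", "sort", "-o", sorted_bam, bam_out],    # sort by coordinate
        ["samtools", "index", sorted_bam],                  # create the .bai index
    ]

# Execute the steps only if samtools is actually installed
if shutil.which("samtools"):
    for cmd in samtools_steps("aligned.sam", "aligned.bam"):
        subprocess.run(cmd, check=True)
```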
3) Variant Calling with Python Wrappers
While bcftools is the most common tool for variant calling, we can still integrate it into a Python workflow using subprocess, and then read results with libraries like cyvcf2 or PyVCF.
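A sketch of that integration, assuming bcftools is installed and the placeholder files reference.fasta and aligned_sorted.bam exist; the cyvcf2 import is optional and guarded:

```python
import shutil
import subprocess

# bcftools mpileup → call pipeline as one shell command (placeholder file names)
CALL_CMD = (
    "bcftools mpileup -f reference.fasta aligned_sorted.bam"
    " | bcftools call -mv -Oz -o variants.vcf.gz"
)

if shutil.which("bcftools"):
    subprocess.run(CALL_CMD, shell=True, check=True)
    try:
        from cyvcf2 import VCF          # optional dependency
        for variant in VCF("variants.vcf.gz"):
            print(variant.CHROM, variant.POS, variant.REF, variant.ALT)
    except ImportError:
        print("cyvcf2 not installed; inspect variants.vcf.gz with bcftools view")
```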
4) Visualizing Alignments & Variants
You can prepare files for visualization in IGV directly from Python, making sure every file is coordinate-sorted and indexed the way IGV expects.
For the VCF, that means bgzip-compressing it and indexing it with tabix. Once done, you open IGV and load:
- reference.fa
- aligned_sorted.bam
- variants_sorted.vcf.gz
Then you can zoom into any gene to visually inspect alignments and SNPs.
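A sketch of the VCF indexing step from Python, assuming htslib's bgzip and tabix are installed; the file name matches the IGV list above:

```python
import shutil
import subprocess

def igv_vcf_commands(vcf_path):
    """Commands to bgzip-compress and tabix-index a VCF so IGV can load it."""
    gz = vcf_path + ".gz"
    return [
        f"bgzip -c {vcf_path} > {gz}",   # compress (tabix needs bgzip, not plain gzip)
        f"tabix -p vcf {gz}",            # creates the .tbi index next to the file
    ]

# Run only if the htslib tools are actually installed
if shutil.which("bgzip") and shutil.which("tabix"):
    for cmd in igv_vcf_commands("variants_sorted.vcf"):
        subprocess.run(cmd, shell=True, check=True)
```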
End of Week 3 Goals
- Automate mapping, BAM processing, and variant calling from Python
  - Instead of manually running bwa, samtools, and bcftools commands in the terminal, you use Python scripts to run them automatically.
  - This ensures reproducibility (the same steps every time) and saves time when processing multiple datasets.
  - Example workflow automated in Python: FASTQ → align with BWA → convert SAM to BAM → sort → index → call variants → output VCF.
- Inspect alignment quality programmatically
  - Using Python libraries like pysam, you can quickly check the number of mapped vs. unmapped reads, mapping-quality scores, and specific read details.
  - This lets you catch problems early (e.g., poor-quality mapping) before continuing to downstream analysis.
- Load BAM + VCF into IGV for manual inspection
  - Even after automation, visual inspection is important.
  - You prepare sorted, indexed BAM and indexed VCF files so they can be loaded into IGV.
  - In IGV, you can zoom in on genes, see how reads align, and visually confirm variant calls.
- Understand how Python scripts can integrate command-line bioinformatics tools into a reproducible pipeline
  - Most bioinformatics tools are command-line based, but Python can wrap these commands (using subprocess) and connect them with file handling, data parsing, and QC checks.
  - This creates a pipeline that can be re-run anytime on new data without manual intervention.
  - The approach combines the power of established tools with the flexibility of Python scripting.
Week 4 — Your First Mini Bioinformatics Project
By now, you’ve practiced the essential bioinformatics skills:
- Searching and downloading data from databases
- Understanding file formats (FASTQ, FASTA, SAM/BAM, VCF)
- Running BLAST and multiple sequence alignments (MSA)
- Performing quality control
- Doing read mapping and basic variant calling
Now it’s time to integrate all of this into a small, reproducible project that you can showcase. This is the kind of project you can put on GitHub and mention in a resume or LinkedIn post.
1) Choose a Simple, Real Dataset
You don’t need to generate new data — there’s a wealth of public sequencing datasets you can work with:
- NCBI SRA (Sequence Read Archive)
  - Contains WGS (whole-genome sequencing), RNA-Seq, and other data for thousands of organisms.
  - Example: download a small bacterial genome dataset for quick processing.
- ENA (European Nucleotide Archive)
- Galaxy Project Shared Data
Beginner Tip:
Start with bacterial datasets — they’re small (a few MBs), process fast, and make interpreting results much easier compared to large eukaryotic genomes.
2) Define a Simple Research Question
Your project needs a clear biological goal. Examples:
- Variant discovery: “Does my bacterial strain have unique SNPs compared to the reference genome?”
- Gene expression: “Which genes are most highly expressed in my RNA-Seq sample?”
- Species identification: “Can I identify the organism from a mystery FASTQ file?”
Why this matters:
A defined question helps you choose the right tools, reduces wasted work, and makes your final report more meaningful.
3) Build Your Mini-Pipeline
Think of this as a step-by-step recipe for your analysis.
Example 1 — DNA Variant Calling Pipeline
1. Download FASTQ files from SRA/ENA.
2. Quality check with FastQC.
3. Trim adapters & low-quality bases with cutadapt or Trimmomatic.
4. Map reads to a reference genome with BWA-MEM.
5. Convert SAM → BAM, sort, and index with samtools.
6. Call variants (SNPs, indels) with bcftools.
7. Annotate variants with SnpEff (optional, adds biological meaning).
8. Visualize in IGV.
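The steps above (minus the download, annotation, and IGV steps) can be sketched as one script. All file names are placeholders, the adapter is the common Illumina TruSeq prefix, and every tool (FastQC, cutadapt, bwa, samtools, bcftools, tabix) must be installed:

```shell
#!/usr/bin/env bash
set -euo pipefail   # stop on the first error

# Placeholder inputs: reads.fastq.gz and reference.fasta
fastqc reads.fastq.gz
cutadapt -a AGATCGGAAGAGC -q 20 -o trimmed.fastq.gz reads.fastq.gz

bwa index reference.fasta
bwa mem reference.fasta trimmed.fastq.gz | samtools sort -o aligned_sorted.bam -
samtools index aligned_sorted.bam

bcftools mpileup -f reference.fasta aligned_sorted.bam \
  | bcftools call -mv -Oz -o variants.vcf.gz
tabix -p vcf variants.vcf.gz
```

Keeping the pipeline in one script like this is what makes it reproducible: re-running it on a new dataset is a one-line change.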
Example 2 — RNA-Seq Differential Expression Pipeline (simplified)
1. Download FASTQ files.
2. Run FastQC and trim if needed.
3. Map reads to a reference genome or transcriptome using HISAT2.
4. Count reads per gene with featureCounts.
5. Perform differential expression analysis with DESeq2 (R).
6. Visualize using heatmaps, volcano plots, or gene-expression bar charts.
Pro tip:
Keep your pipeline modular — each step is a separate command or script so you can easily replace tools or datasets later.
4) Document Your Work
A good bioinformatician is able not only to run analyses but also to make them reproducible.
- Keep all commands in a text file or Jupyter Notebook.
- Record software versions (important for reproducibility).
- Save intermediate outputs (e.g., BAM, VCF), but delete large raw files when space is limited.
- Create a GitHub repository with:
  - A README.md explaining the project’s goal, dataset source, and pipeline.
  - The scripts (Python, shell, R) used in the analysis.
  - A small example dataset so others can test your code.
Why this matters:
This makes your work sharable, reviewable, and re-runnable by others — a key skill for professional bioinformatics work.
5) Share Your Results
The last step is communication — your analysis is only as good as your ability to explain it.
Ways to share:
- Write a short blog post summarizing your findings.
- Post plots (e.g., coverage graphs, variant tables, gene expression heatmaps).
- Share your GitHub link on LinkedIn or in bioinformatics forums.
- If possible, create an interactive Jupyter Notebook so others can explore your results.
Portfolio impact:
A clear, documented, and publicly available project is a strong portfolio item that shows both technical skills and scientific thinking.
Week 4 end results
In Week 4, you bring together everything you’ve learned into a single, reproducible pipeline. This week is less about mastering new commands and more about applying existing skills to a real-world dataset. By selecting a simple but meaningful dataset, defining a clear biological question, running an end-to-end analysis, and documenting every step, you create a tangible project that demonstrates your technical ability and problem-solving mindset. This process mirrors real bioinformatics workflows and gives you something concrete to share with peers, potential collaborators, or employers.
Conclusion: From Raw Data to Real Biological Insights
Over these four weeks, we’ve journeyed from the fundamentals of bioinformatics to building a complete, reproducible analysis pipeline. We began by exploring databases, formats, and essential tools, progressed through sequence alignment and variant calling, and culminated in a Week 4 capstone project that integrates every skill into a real-world application.
This approach mirrors how bioinformatics is practiced in research and industry: starting with raw sequencing reads, applying rigorous quality control, performing precise alignments, and interpreting results in the context of a biological question. Whether it’s identifying genetic variants in a bacterial genome or uncovering expression patterns in RNA-Seq data, you now have both the knowledge and the framework to go from messy raw data to meaningful biological insights.
The real power lies in reproducibility, documentation, and sharing — skills that make your work not only useful for today but valuable for future projects and collaborations. By mastering these workflows, you’re not just learning tools; you’re preparing to contribute to scientific discovery in a way that is transparent, collaborative, and impactful.
Remember — every large scientific breakthrough begins with someone taking the first step, even if it’s just analyzing a single dataset. Your progress so far proves that with curiosity, persistence, and clear methodology, you can tackle complex problems and uncover insights that matter. Keep building, keep exploring, and let your work be the spark for the discoveries of tomorrow.
💬 Let’s Discuss
🧬 If you had access to any real-world dataset, what biological question would you want to answer using your new skills?
OR
🔍 Which part of the bioinformatics workflow — databases, mapping, variant calling, or visualization — do you think will be most crucial for future scientific breakthroughs, and why?
Share your thoughts, project ideas, or favorite tools in the comments — your next great bioinformatics project might start here!