Scripts and Codebooks

Codebooks

Find code notes and scripts on GitHub: https://github.com/pschafran/Notes

These Python scripts were written for tasks that I found useful. They are presented with little documentation or support, but if you have questions I will try to assist.

    PURC Tools
    Scripts for working with PURC output

  • PURC_cleanup.py | Removes likely spurious clusters from PURC results based on a 10% size threshold (clusters whose supporting reads make up less than 10% of the total reads for the sample). Creates a CSV file reporting stats by sample (number of clusters, total reads, reads/cluster as raw number and %).
  • PURC_assignClustersToDiploids.py | Given a user-supplied list of samples, takes a distance matrix of PURC clusters and determines, for each sample in the matrix, its nearest member in the user list. Outputs CSV files with summaries of multiple sequences from samples and each unique combination (e.g. to compare polyploid sequences to diploids).
  • PURC_countBarcodes.py | Counts the number of occurrences of barcodes (or barcode pairs) from PacBio CCS files.
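As an illustration of the 10% size threshold described for PURC_cleanup.py, here is a minimal sketch. It is not the script's actual code: the function name `filter_clusters`, the input layout (a dict keyed by sample and cluster ID), and the `threshold` parameter are all assumptions made for this example.

```python
from collections import defaultdict

def filter_clusters(cluster_sizes, threshold=0.10):
    """Keep clusters supported by at least `threshold` of their sample's reads.

    cluster_sizes: dict mapping (sample, cluster_id) -> number of reads.
    This input layout is hypothetical, not PURC's real output format.
    """
    # Total reads per sample, summed over all of its clusters.
    totals = defaultdict(int)
    for (sample, _cluster), reads in cluster_sizes.items():
        totals[sample] += reads

    kept = {}
    for (sample, cluster), reads in cluster_sizes.items():
        total = totals[sample]
        # Drop clusters that represent less than the threshold fraction.
        if total > 0 and reads / total >= threshold:
            kept[(sample, cluster)] = reads
    return kept

example = {
    ("A", "c1"): 90, ("A", "c2"): 5, ("A", "c3"): 5,
    ("B", "c1"): 50, ("B", "c2"): 50,
}
kept = filter_clusters(example)
```

In this toy input, sample A keeps only its dominant cluster, while both clusters of sample B pass the threshold.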
    Other Tools

  • getScaffoldsFromFasta.py | Extract sequence(s) from a multi-FASTA file based on sequence name, returns a new file with the sequence(s).
    Command line: getScaffoldsFromFasta.py scaffoldFile.fasta ScaffoldID (or a .txt file listing sequence names, one per line)
  • exif2gps.py | Extracts GPS coordinates and elevation data from iPhone photos and outputs them as CSV and GPX files for import into Google Earth or ArcGIS.
    Command line: exif2gps.py Images.jpg
    [Updated Nov. 2017 -- Fixed elevation measurements showing up as fractions in output files (now decimals)]
  • gff2tbl.py | Converts a GFF3 annotation file to the 5 column tbl format for GenBank submissions. Currently transfers only NCBI Feature Key, gene, product, note, protein_id, codon_start, and exceptions.
    Command line: gff2tbl.py annotationFiles.gff
    [Updated Nov. 2017 -- Modified to expect GenBank accessions (e.g. KYxxxxxxx.1) as input filenames. Now handles some variation in GFF format and batch file inputs (also merges multiple files into a single output). Requires unique names for every annotation that won't be joined (i.e. spliced coding sequences). Still requires some manual editing at the end (truncated annotations)]
  • spadesStats.py | Reports the number of scaffolds, total length, and average length of scaffolds produced by the SPAdes assembler.
    Command line: spadesStats.py scaffolds.fasta
    Note: Only works on small assemblies (e.g. bacterial genomes). Reports erroneous stats when run on whole eukaryotic genomes.
  • TRIM_to_PEAR.py | Wrapper that runs paired-end Illumina reads through Trimmomatic and PEAR. All files must be in the same directory as trimmomatic-0.3.3.jar. File names must be formatted: UniqueID1_forward.fastq, UniqueID1_reverse.fastq. The adapter file for Trimmomatic must be in ./adapters/ and named MiSeq.fa.
    Command line: TRIM_to_PEAR.py UniqueID1_forward.fastq UniqueID1_reverse.fastq ... repeat for all samples.
  • phylosplitter.py | Splits a FASTA alignment into separate alignments of chosen length. Also creates shell scripts for submitting each new alignment for MrBayes and RAxML analysis on a computing cluster (qsub).
    Command line: phylosplitter.py alignment.fasta split_length[must be integer]
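
The alignment-splitting step can be sketched roughly as follows. This is an illustration only, not the script itself: it assumes "chosen length" means the number of alignment columns per sub-alignment, and the helper names `read_fasta` and `split_alignment` are hypothetical. Generating the qsub scripts is left out.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of name -> sequence."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(p) for n, p in parts.items()}

def split_alignment(records, split_length):
    """Cut an alignment into consecutive blocks of `split_length` columns.

    Returns a list of dicts (name -> sub-sequence); the last block may be
    shorter if the alignment length is not a multiple of split_length.
    """
    if split_length < 1:
        raise ValueError("split_length must be a positive integer")
    aln_len = max((len(s) for s in records.values()), default=0)
    return [
        {name: seq[start:start + split_length] for name, seq in records.items()}
        for start in range(0, aln_len, split_length)
    ]

records = read_fasta([">a", "ACGTAC", ">b", "AC-TAG"])
chunks = split_alignment(records, 4)
```

Each element of `chunks` could then be written to its own FASTA file and passed to a downstream analysis.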