4. Additional Examples of using the cfDNAPro R Package 

4.1. Fragment Size Data 

In this tutorial, we will be working with fragment size metrics data, which are commonly generated in cfDNA research using Picard.

Picard is a set of command line tools for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF. These file formats are defined in the Hts-specs repository. See the SAM specification and the VCF specification. [1]

Fragment size, also commonly referred to as cfDNA fragment length, can be directly extracted from BAM files using cfDNAPro, or alternatively, the package can read insert size metric files generated by the Picard tool. This tutorial will specifically focus on processing insert size metrics from .txt files, which can be generated using the CollectInsertSizeMetrics Picard function.

4.2. Processing .txt Files 

To utilize the cfDNAPro package, organize all .txt files generated by Picard into sub-folders named according to each cohort even if there is only one cohort. Example .txt files are included with the installation of this package.

Below are the steps to access the example data:

library(cfDNAPro)
# Get the path to example data in the cfDNAPro package library path.
data_path <- examplePath("groups_picard")

The example data is organized into four groups, each stored in its own sub-folder named cohort_1, cohort_2, cohort_3, and cohort_4. Each folder contains specific samples, represented as .txt files.

To review the files within these directories, use the list.files function with the following parameters: list.files(data_path, full.names = TRUE, recursive = TRUE)

Note: The cohorts labeled as cohort_1, cohort_2, cohort_3, and cohort_4 are not associated with any biological significance. These are simulated datasets not linked to any specific cancer types. In the following sections, you will encounter several plots demonstrating biological cfDNA feature patterns. Please note that these plots and data are intended solely to illustrate the functionality of cfDNAPro for data analysis and visualization purposes.

In your own analysis, you can apply these functions to analyze cfDNA NGS sequencing data from cancer samples, helping you uncover fragmentation features that are relevant to your research questions.

4.3. Creating a Plot for an Individual Group/Cohort 

We have data from four groups and would like to plot one specific cohort (e.g., cohort_1). This is how can we plot the three .txt files in cohort_1:

cohort1_plot <- cfDNAPro::callSize(path = data_path) %>%
dplyr::filter(group == as.character("cohort_1")) %>%
cfDNAPro::plotSingleGroup()
# setting default outfmt to df.
# setting default input_type to picard.

In the cohort1_plot list (which is actually an R list), you will find three ggplot2 objects: prop_plot, cdf_plot, and one_minus_cdf_plot. This allows you to modify these plots using ggplot2 syntax.

4.4. Custom Plots 

In this session, We will demonstrate how you can manipulate ggplot2 objects and create multiple panels in a figure. The explanation for calculating fragment size metrics is detailed below:

Upper Panels (Proportion Plot):

x-axis: Fragment length (from 30bp, 31bp, …, 500bp), represented as discrete integers.

x-axis Range: We plot fragment lengths from 30bp to 500bp because the proportion of fragments < 30bp is noisy and the proportion of fragments > 500bp is usually too low to be useful. If users want to expand the x-axis, parameters can be set in the callSizefunctions.

y-axis: Proportion of reads with a specific read length. Note: The line is not a smoothed curve; it is a line connecting different data points.

Lower Panels (Cumulative Plot):

x-axis: Fragment length (from 30bp, 31bp, …, 500bp), represented as discrete integers.

y-axis: Count the number of reads ≤ N bp (e.g., 30bp). Normalize the count to a proportion. Repeat steps 1 and 2 for each fragment size (i.e., 30bp, 31bp, …, 500bp) count. Note: This is not a smoothed curve; it is a line connecting different data points.

library(scales)
library(ggpubr)
library(ggplot2)
library(dplyr)


# Define a list for the groups/cohorts.
grp_list <- list("cohort_1" = "cohort_1",
            "cohort_2" = "cohort_2",
            "cohort_3" = "cohort_3",
            "cohort_4" = "cohort_4")

# Generating the plots and store them in a list.
result <- sapply(grp_list, function(x){
    result <- callSize(path = data_path) %>%
        dplyr::filter(group == as.character(x)) %>%
        plotSingleGroup()
    }, simplify = FALSE
)
# setting default outfmt to df.
# setting default input_type to picard.
# setting default outfmt to df.
# setting default input_type to picard.
# setting default outfmt to df.
# setting default input_type to picard.
# setting default outfmt to df.
# setting default input_type to picard.

# Multiplexing the plots in one figure
suppressWarnings(
multiplex <-
    ggarrange(result$cohort_1$prop_plot +
            theme(axis.title.x = element_blank()),
            result$cohort_4$prop_plot +
            theme(axis.title = element_blank()),
            result$cohort_1$cdf_plot,
            result$cohort_4$cdf_plot +
            theme(axis.title.y = element_blank()),
            labels = c("Cohort 1 (n=5)", "Cohort 4 (n=4)"),
            label.x = 0.2,
            ncol = 2,
            nrow = 2))

multiplex

4.5. Plotting Median Fragment Size 

In the last session, we plotted all samples in each group. Now, we will calculate the median fragment size distribution for each group and then plot these median distributions together.

# Set an order for those groups (i.e. the levels of factors).
order <- c("cohort_1", "cohort_2", "cohort_3", "cohort_4")
# Generate plots.
compare_grps <- callMetrics(data_path) %>% plotMetrics(order=order)
#  setting default input_type to picard.

# Modify plots.
p1 <- compare_grps$median_prop_plot +
ylim(c(0, 0.028)) +
theme(axis.title.x = element_blank(),
        axis.title.y = element_text(size = 12,face = "bold")) +
theme(legend.position = c(0.7, 0.5),
        legend.text = element_text(size = 11),
        legend.title = element_blank())
# Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
# 3.5.0.
# ℹ Please use the `legend.position.inside` argument of `theme()` instead.
# This warning is displayed once every 8 hours.
# Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
# generated.

p2 <- compare_grps$median_cdf_plot +
scale_y_continuous(labels = scales::number_format(accuracy = 0.001)) +
theme(axis.title = element_text(size = 12,face = "bold")) +
theme(legend.position = c(0.7, 0.5),
        legend.text = element_text(size = 11),
        legend.title = element_blank())

# Finalize plots.
suppressWarnings(
median_grps <- ggpubr::ggarrange(p1,
                    p2,
                    label.x = 0.3,
                    ncol = 1,
                    nrow = 2
                    ))


median_grps

4.7. Plotting Inter-peak/valley Distance 

To quantify the 10 bp periodical oscillations observed in cell-free DNA fragmentation patterns, we can calculate the inter-peak and inter-valley distances and plot it.

4.7.1. Inter-Peak Distance 

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")

# Plot and modify inter-peak distances.

inter_peak_dist <- callPeakDistance(path = data_path, limit = c(50, 135)) %>%
plotPeakDistance(order = order) +
labs(y = "Fraction") +
theme(axis.title =  element_text(size = 12,face = "bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
# setting the mincount to 0.
# setting the xlim to c(7,13).
# setting default outfmt to df.
# setting default mincount to 0.
# setting default input_type to picard.


# Show the plot.
suppressWarnings(print(inter_peak_dist))

4.7.2. Inter-Valley Distance 

# Set an order for your groups, it will affect the group order.
order <- c("cohort_1", "cohort_2", "cohort_4", "cohort_3")
# Plot and modify inter-peak distances.
inter_valley_dist<-callValleyDistance(path = data_path,
                                    limit = c(50, 135)) %>%
plotValleyDistance(order = order) +
labs(y="Fraction") +
theme(axis.title =  element_text(size=12,face="bold"),
        legend.title = element_blank(),
        legend.position = c(0.91, 0.5),
        legend.text = element_text(size = 11))
# setting the mincount to 0.
# setting the xlim to c(7,13).
# setting default outfmt to df.
# setting the mincount to 0.
#u setting default input_type to picard.

# Show the plot.
suppressWarnings(print(inter_valley_dist))

4.8. Additional Plots 

Further modifications to the plots generated by the cfDNAPro package can be done using ggplot2 syntax. For example, we can highlight the peaks and valleys in the fragmentation patterns.

library(ggplot2)
library(cfDNAPro)
# Set the path to the example sample.
exam_path <- examplePath("step6")
# Calculate peaks and valleys.
peaks <- callPeakDistance(path = exam_path)
#> setting default limit to c(35,135).
#> setting default outfmt to df.
#> Setting default mincount to 0.
#> setting default input_type to picard.
valleys <- callValleyDistance(path = exam_path)
# setting default limit to c(35,135).
# setting default outfmt to df.
# setting the mincount to 0.
# setting default input_type to picard.
# A line plot showing the fragmentation pattern of the example sample.
exam_plot_all <- callSize(path = exam_path) %>% plotSingleGroup(vline = NULL)
# setting default outfmt to df.
# setting default input_type to picard.
# Label peaks and valleys with dashed and solid lines.
exam_plot_prop <- exam_plot_all$prop +
coord_cartesian(xlim = c(90,135), ylim = c(0,0.0065)) +
geom_vline(xintercept = peaks$insert_size, colour = "red",linetype = "dashed") +
geom_vline(xintercept = valleys$insert_size,colour = "blue")

# Show the plot.
suppressWarnings(print(exam_plot_prop))

We can label the peaks and valleys with dots:

# Label peaks and valleys with dots.
exam_plot_prop_dot <- exam_plot_all$prop +
coord_cartesian(xlim = c(90,135),ylim = c(0,0.0065)) +
geom_point(data= peaks,
            mapping = aes(x= insert_size, y= prop),
            color="blue",alpha=0.5,size=3) +
geom_point(data= valleys,
            mapping = aes(x= insert_size, y= prop),
            color="red",alpha=0.5,size=3)
# Show the plot.
suppressWarnings(print(exam_plot_prop_dot))

This is the end of the tutorial detailing additional examples of how to use the cfDNAPro R Package.

Check out the following tutorials to become a true Pro at using the cfDNAPro package :)