Removing polymorphic probes from Affymetrix microarrays

Posted on 2nd February 2018
by Rupert Overall

Although the microarray heyday may be over, there are plenty of datasets out there that are still relevant and heavily used. The sequencing revolution has had another effect: high-quality, dense maps of single nucleotide polymorphisms (SNPs) and insertion-deletion events (indels) are now readily available. A particular problem with microarrays (if, like me, you work with genetic populations including non-reference strains) is that polymorphisms inside a microarray probe will affect its hybridisation, and thus the reported signal. This will lead to some reported expression differences that are artefacts of sequence differences rather than true strain-specific expression differences. With good SNP maps, we can address this issue.

One solution would be to remove all affected probes and just use those that target non-polymorphic sequences to estimate gene expression. This is particularly relevant for Affymetrix arrays which are designed with several (typically 11) short probes (25mers) per target gene. This means that a) even a single polymorphism will affect binding and b) we can afford to lose a probe or two from the probeset.

This post describes a script I have written, called ‘CDFSnipeR’, which does just this: it identifies probes containing polymorphisms, removes them and packages the remaining probes into a new custom chip definition file (CDF) which can be used directly in an array preprocessing workflow.

How does it work?

The package is pretty straightforward. First, the function remove.probes takes two parameters: a data frame containing the genomic positions of all the probes in the CDF to be altered, and a data frame containing the genomic positions of the undesired sequence features (SNPs, indels or whatever you don’t want). The function then flags all of the probes which contain an undesired sequence feature and returns a list containing whitelisted probes (those to be kept), blacklisted probes (those to be deleted) and a summary of the numbers of altered probesets.

The genomic positions of the probes in the CDF are assumed to come from the Brainarray site (I will extend the code to accept files from different sources in a future version) and must contain the chromosome in a column called “Chr”, the start position of the probe in a column called “Chr.From” and the strand of the probe in a column called “Chr.Strand”. The parameter probe.length (default = 25) sets the length of the probe on the Affymetrix chip. The genomic positions of the undesired features should be in the BED format, which is simply three columns: “Chromosome” (character, with or without a leading ‘chr’), “Start” (numeric, the nucleotide position), and “End” (numeric, the same as Start for SNPs but possibly different for indels).
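To make the flagging step concrete, here is a minimal sketch in base R of how probes overlapping a feature could be identified. This is not the CDFSnipeR source: the helper name flag.probes is hypothetical, and it assumes that Chr.From is the leftmost base of the probe regardless of strand and that Start and End are inclusive coordinates in the same system as Chr.From.

```r
# Hypothetical helper (not part of CDFSnipeR): returns TRUE for every probe
# whose genomic interval overlaps at least one undesired feature.
flag.probes <- function(probes, features, probe.length = 25) {
  # Strip any leading 'chr' so that "chr1" and "1" compare as equal
  probe.chr   <- sub("^chr", "", as.character(probes$Chr))
  feature.chr <- sub("^chr", "", as.character(features$Chromosome))

  # Assumption: Chr.From is the leftmost base; coordinates are inclusive
  probe.start <- probes$Chr.From
  probe.end   <- probes$Chr.From + probe.length - 1

  # A probe is hit if a feature on the same chromosome overlaps its interval
  vapply(seq_len(nrow(probes)), function(i) {
    same.chr <- feature.chr == probe.chr[i]
    any(same.chr &
        features$Start <= probe.end[i] &
        features$End   >= probe.start[i])
  }, logical(1))
}

# Toy data using the column names described above
probes <- data.frame(
  Chr        = c("1", "1", "2"),
  Chr.From   = c(100, 200, 100),
  Chr.Strand = c("+", "-", "+"),
  stringsAsFactors = FALSE
)
features <- data.frame(
  Chromosome = c("chr1", "chr2"),
  Start      = c(124, 125),
  End        = c(124, 130),
  stringsAsFactors = FALSE
)

hit       <- flag.probes(probes, features)
whitelist <- probes[!hit, ]  # probes to keep
blacklist <- probes[hit, ]   # probes to delete
```

The loop compares every probe with every feature, which is fine for a toy example but slow at genome scale. An interval-overlap tool such as findOverlaps from the GenomicRanges package would probably be a better fit for real SNP maps.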

I have used this to remove all known DBA/2J SNPs and indels (I got these from the Sanger ftp site; I’ll put details in the vignette) from the Mouse430 v2 array. After running the remove.probes function, I could see how many probes and probesets were removed and plot the numbers of probes present before and after filtering.

filtered.probes = CDFSnipeR::remove.probes(old.probes, features)
# Have a look at the output to see what effect probe removal had.
colSums(filtered.probes$ProbeCounts, na.rm=TRUE) # Total numbers of probes
colSums(filtered.probes$ProbeCounts > 0, na.rm=TRUE) # Total numbers of probesets
plot(filtered.probes$ProbeCounts, xlab="Brainarray v22", ylab="Filtered probes", main="Number of probes per probeset")

A plot of the number of probes in each probeset before vs. after SNP cleaning.

Once this is done, the resulting filtered probes can be passed to the actual CDF reconstruction code. The function cdf.sniper takes two main arguments: the file path (or file connection) to the CDF to be altered and the whitelisted probes generated above by the remove.probes function (filtered.probes$FilteredProbes in the example above). The CDF is read from file into a string vector and taken apart into its component sections. The files can be quite large and the information is stored in a slightly (to me at least) counterintuitive way but, in essence, each probeset is defined as a collection of probes.

Remapping and removal of probes obviously also affects probeset composition, so that some probesets may have hundreds of probes while others have none, or almost none. A parameter, min.probeset.size, sets how many probes must be in a probeset. If the number of remaining probes falls below this, then the entire probeset is discarded from the new CDF. I set this by default at 3 probes minimum per probeset.

This step is quite time-consuming and I will work to improve efficiency in future versions. Specifically, I have not made use of the ‘affxparser’ package (I did not know of this when I wrote the first version of CDFSnipeR), which may speed up reading and processing of the CDF. However, because each of the probeset definitions is processed separately, this can easily be done in parallel. The cdf.sniper function therefore also accepts a cluster object (e.g. from the parallel package) which, if supplied, will be used to distribute the bulk of the code to multiple CPUs.
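The minimum-size rule and the optional cluster can be illustrated with a small sketch. Again, this is not the code inside cdf.sniper: the helper filter.probesets, its argument names and the representation of a probeset as a character vector of probe IDs are all assumptions made for the example.

```r
# Hypothetical helper (not part of CDFSnipeR): keep only whitelisted probes
# in each probeset, then drop probesets that have become too small.
filter.probesets <- function(probesets, whitelist,
                             min.probeset.size = 3, cl = NULL) {
  # Work done independently for each probeset
  keep.whitelisted <- function(probe.ids, whitelist) {
    probe.ids[probe.ids %in% whitelist]
  }

  filtered <- if (is.null(cl)) {
    # Serial version
    lapply(probesets, keep.whitelisted, whitelist = whitelist)
  } else {
    # Same work, distributed over the workers of a 'parallel' cluster
    parallel::parLapply(cl, probesets, keep.whitelisted,
                        whitelist = whitelist)
  }
  names(filtered) <- names(probesets)

  # Discard any probeset with fewer than min.probeset.size probes left
  filtered[lengths(filtered) >= min.probeset.size]
}

# Toy data: three probesets, some probes removed by the whitelist
probesets <- list(
  geneA = c("p1", "p2", "p3", "p4"),
  geneB = c("p5", "p6", "p7"),
  geneC = c("p8", "p9", "p10")
)
whitelist <- c("p1", "p2", "p3", "p5", "p8", "p9", "p10")

kept <- filter.probesets(probesets, whitelist)
names(kept)  # geneB is dropped because only one of its probes survives
```

To run the same sketch in parallel, a cluster created with parallel::makeCluster can be passed as cl and shut down afterwards with parallel::stopCluster. Because each probeset is handled independently, the result should not depend on how the work is split.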

The output of cdf.sniper is a character vector which can simply be written to file using write. This text file can be bundled into an R package using tools from the makecdfenv package. The code I used to build my DBA/2J-SNP-free CDF package was:

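makecdfenv::make.cdf.package(
  filename     = cdf.file,  # placeholder: name of the CDF text file written above (not given in the original listing)
  packagename  = 'mouse4302mmensgdesnpcdf',
  cdf.path     = getwd(),
  package.path = getwd(),
  compress     = FALSE,
  author       = 'Rupert W Overall',
  maintainer   = 'Rupert Overall <emailaddress>',
  version      = '22.0.0',
  species      = 'Mus_musculus',
  unlink       = TRUE,
  verbose      = TRUE
)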

The code for CDFSnipeR has been built into an R package which is available from my website: CDFSnipeR. I will work on a vignette that walks through a worked example and I will also make this available here.

For those who are interested, the CDF I made can be downloaded from my site: CDF listing. I will add some other pre-built CDFs here as I create them for various projects.

In a future post, I will run some tests to see what tangible differences tainted probe removal actually makes…