A transcription factor binding atlas for photosynthesis in cereals identifies a key role for coding sequence in the regulation of gene expression

The gene regulatory architecture associated with photosynthesis is poorly understood. Most plants use the ancestral C3 pathway, but our most productive cereal crops use C4 photosynthesis. In these C4 cereals, large-scale alterations to gene expression allow photosynthesis to be partitioned between cell types of the leaf. Here we provide a genome-wide transcription factor binding atlas for grasses that operate either C3 or C4 photosynthesis. Most of the >950,000 sites bound by transcription factors are preferentially located in genic sequence rather than promoter regions, and specific families of transcription factors preferentially bind coding sequence. Cell specific patterning of gene expression in C4 leaves is associated with combinatorial modifications to transcription factor binding despite broadly similar patterns of DNA accessibility between cell types. A small number of DNA motifs bound by transcription factors are conserved across 60 million years of grass evolution, and C4 evolution has repeatedly co-opted at least one of these hyper-conserved cis-elements. The grass cistrome is highly divergent from that of the model plant Arabidopsis thaliana.


Summary 30
The gene regulatory architecture associated with photosynthesis is poorly understood. Most plants 31 use the ancestral C 3 pathway, but our most productive cereal crops use C 4 photosynthesis. In 32 these C 4 cereals, large-scale alterations to gene expression allow photosynthesis to be partitioned 33 between cell types of the leaf. Here we provide a genome-wide transcription factor binding atlas for 34 grasses that operate either C 3 or C 4 photosynthesis. Most of the >950,000 sites bound by 35 transcription factors are preferentially located in genic sequence rather than promoter regions, and 36 specific families of transcription factors preferentially bind coding sequence. Cell specific patterning 37 of gene expression in C 4 leaves is associated with combinatorial modifications to transcription 38 factor binding despite broadly similar patterns of DNA accessibility between cell types. A small 39

Introduction 44
Photosynthesis sets maximum crop yield, but despite millions of years of natural selection is not 45 optimised for either current atmospheric conditions or agricultural practices Ort 46 et al., 2015). The majority of photosynthetic organisms, including crops of global importance such 47 as wheat, rice and potato use the C 3 photosynthesis pathway in which Ribulose-Bisphosphate 48 Carboxylase Oxygenase (RuBisCO) catalyses the primary fixation of CO 2 . However, carboxylation 49 by RuBisCO is competitively inhibited by oxygen binding the active site (Bowes et al., 1971). This 50 oxygenation reaction generates toxic waste-products that are recycled by an energy-demanding 51 series of metabolic reactions known as photorespiration (Bauwe et al., 2010;Tolbert, 1971). The 52 ratio of oxygenation to carboxylation increases with temperature (Jordan and Ogren, 1984;53 Sharwood et al., 2016) and so losses from photorespiration are particularly high in the tropics. 54 When oxygenation is reduced through CO 2 enrichment, crops show increased photosynthetic 55 efficiency and higher yields (Leakey et al., 2012). In addition to the inefficiency associated with 56 oxygenation by RuBisCO, due to the rapid rise in atmospheric CO 2 concentrations from ~220 to 57 400ppm, the stoichiometry and kinetics of other photosynthesis enzymes are considered sub-58 optimal. For example, increased activity of Sedoheptulose 1,7-bisphosphatase improves 59 photosynthesis and yield (Lefebvre et al., 2005;Miyagawa et al., 2001). Furthermore, the ability of 60 leaves to harness light and generate chemical energy is neither optimised for current crop canopy 61 structures (Zhu et al., 2010b) or rapid fluctuations in light (Kromdijk et al., 2016). Thus, although it 62 is clear that improving C 3 photosynthesis would drive increased crop yields, we have an 63 incomplete understanding of the genes that underpin this fundamental process. 64 4 found in over 60 independent plant lineages and so represents one of the most remarkable 73 examples of convergent evolution known to biology . In most C 4 plants the initial 74 RuBisCO-independent fixation of CO 2 and the subsequent RuBisCO-dependent reactions take 75 place in distinct cell-types known as mesophyll and bundle sheath cells, and so are associated with 76 strict patterning of gene expression between these cell-types. Although the spatial patterning of DNase I-SEQ on grasses that use either C 3 or C 4 photosynthesis, we provide comprehensive 92 insight into the transcription factor binding repertoire associated with each form of photosynthesis. 93 The data indicate that specific cell types from leaf tissue make use of a markedly distinct cis-94 regulatory code, and that transcription factor binding is more frequent within genes than promoter 95 regions. Despite significant conservation in the transcription factors families binding DNA in 96 grasses, it is also apparent that the binding sites they recognise are subject to high rates of 97 mutation. Comparison of sites bound by transcription factors in both C 3 and C 4 leaves 98 demonstrates that the repeated evolution of C 4 photosynthesis is built on combination of de novo 99 gain of cis-elements and exaptation of highly conserved regulatory elements found in the ancestral 100

A cis-regulatory atlas for monocotyledons 103
Four grass species were selected to provide insight into the regulatory architecture associated 104 with C 3 and C 4 photosynthesis. Brachypodium distachyon uses the ancestral C 3 pathway ( Figure  105 1A). Sorghum bicolor, Zea mays and Setaria italica all use C 4 photosynthesis although 106 phylogenetic reconstructions indicate that S. italica represents an independent evolutionary origin 107 of the C 4 pathway ( Figure 1A). Nuclei from leaves of S. italica (C 4 ), S. bicolor (C 4 ), Z. mays (C 4 ) 108 and B. distachyon (C 3 ) were treated with DNase I ( Figure S1) and subjected to deep sequencing. A 109 total of 799,135,794 reads could be mapped to the respective genome sequences of these species 110 (Table S1). 159,396 DNase-hypersensitive sites (DHS) of between 150-15,060 base pairs 111 representing broad regulatory regions accessible to transcription factor binding were identified from 112 all four genomes ( Figure 1B). Between 20,817 and 27,746 genes were annotated as containing at 113 least one DHS (Table S2). Within these DHS, 533,409 digital genomic footprints (DGF) 114 corresponding to individual transcription factor binding sites of between 11 and 25 base pairs were 115 identified through differential accumulation of reads mapping to positive or negative strands around 116 transcription factor binding sites ( Figure 1B&C). At least one transcription factor footprint was 117 identified in >75% of the broader regions defined by DHS (Table S2). In contrast to the preferential 118 binding of transcription factors to AT-rich DNA observed in A. thaliana all four grasses DGF had a 119 greater GC content compared with the genome average (Table S3). 120 DHS and DGF were primarily located in gene-rich regions, and depleted around centromeres 121 ( Figure 1D). Individual transcription factor binding sequences were resolved in all chromosomes 122 from each species ( Figure 1D). Many genes contained DGF that have previously been associated 123 with specific classes of transcription factors. For example, the SbPPC gene (Sobic.010G160700) 124 encoding phosphoenolpyruvate carboxylase that catalyses the first committed step of C 4 125 photosynthesis, contained sixteen DGF of which six are associated with known transcription factor 126 families ( Figure 1D). On a genome-wide basis, the distribution of DGF was similar between 127 species, with the highest proportion of such sites located in promoter, coding sequence (CDS) and 128 6 coding sequences (CDS) and 3' UTRs ( Figure 1F). In all four species promoter regions contained 131 fewer DGF than genic sequence ( Figure 1F) and distribution plots showed that the density of DGF 132 was highest in exonic sequences downstream of the annotated transcriptional start sites ( Figure  133 S2). 134 135 A distinct cis-regulatory lexicon for specific cells within the leaf 136 The above analysis provides a genome-wide overview of the cis-regulatory architecture 137 associated with photosynthesis in leaves of grasses. However, as with other complex multicellular 138 systems, leaves are composed of many specialised cell types. Because each DGF is defined by 139 fewer sequencing reads mapping to the genome compared with the larger DHS region, depleted 140 signals derived from low abundance cell-types cannot be detected from such tissue-level analysis 141 ( Figure 2A). Since bundle sheath strands can be separated easily (Covshoff et al., 2013) leaves of 142 C 4 species provide a simple system to study transcription factor binding in specific cells ( Figure  143 2B). After bundle sheath isolation from S. bicolor, S. italica and Z. mays a total of 129,137 DHS 144 were identified ( Figure 2B; Table S4) containing 452,263 DGF ( Figure 2B; Table S4; FDR<0.01). 145 Of the 452,263 DGF identified in bundle sheath strands, 170,114 were statistically enriched in the 146 bundle sheath samples compared with whole leaves ( Figure 2B; Table S4). The number of these 147 statistically enriched DGF in bundle sheath strands of C 4 species was large and ranged from 148 15,880 to 85,256 in maize and S. italica respectively ( Figure S3). Genome-wide, the number of 149 broad regulatory regions defined by DHS in the bundle sheath that overlapped with those present 150 in whole leaves ranged from 71 to 84% in S. italica and S. bicolor respectively (Table S5). 151 However, only 8-23% of the narrower DGF found in the bundle sheath were also identified in whole 152 leaves (Table S6). Taken together, these findings indicate that specific cell types of leaves share 153 considerable similarity in the broad regions of DNA that are accessible to transcription factors, but 154 that the short sequences actually bound by transcription factors vary dramatically. 155 To provide evidence that DGF predicted after analysis of separated bundle sheath strands are 156 of functional importance, they were compared with previously validated cis-elements. In C 4 (Giuliano et al., 1988) and a HOMO motif (CCTTTTTTCCTT) is important in driving bundle sheath 160 expression (Xu et al., 2001) ( Figure 2C). Both elements were detected in our pipeline. Interestingly, 161 the HOMO motif was only bound in the bundle sheath strands ( Figure 2C), and whilst the I-box 162 was detected in both bundle sheath strands and whole leaves, the position of the DGF covering it 163 was slightly shifted between the two samples ( Figure 2C). Thus, orthogonal evidence for 164 transcription factor binding in maize supports a functional role for DGF identified by DNaseI-SEQ in 165 this study. identified, and can be exemplified using three co-linear genes found on chromosome seven of S. 171 bicolor. First, in the NADP-malate dehydrogenase (MDH) gene, which is highly expressed in 172 mesophyll cells and encodes a protein of the core C 4 cycle ( Figure S4) a broad DHS site was 173 present in whole leaves, but not in bundle sheath strands ( Figure 2D). Whilst presence of this site 174 indicates accessibility of DNA to transcription factors that could activate expression in mesophyll 175 cells, global analysis of all genes strongly and preferentially expressed in bundle sheath strands 176 indicates that presence/absence of a DHS in the bundle sheath compared with the whole leaf is 177 not sufficient to generate cell specificity ( Figure S5, S6). Second, in the next contiguous gene that 178 encodes an additional isoform of MDH that is also preferentially expressed in mesophyll cells 179 ( Figure S4) a DHS was found in both whole leaf and bundle sheath strands but DGF occupancy 180 within this region differed between cell types ( Figure 2D). Thus, despite similarity in DNA 181 accessibility, the binding of particular transcription factors was different between cell types. 182 However, once again, genome-wide analysis indicated that alterations to individual DGF were not 183 sufficient to explain cell specific gene expression. For example, fewer than 30% of all enriched 184 DGF in the bundle sheath were associated with differentially expressed genes (Table S7). Lastly, 185 in the third gene in this region, which encodes a NAC domain transcription factor preferentially 186 expressed in bundle sheath strands, differentially enriched DGF were associated both with regions 8 leaves compared with bundle sheath strands ( Figure 2D). These three classes of alteration to 189 transcription factor accessibility and binding were detectable in genes encoding core components 190 of the C 4 cycle ( Figure 2E, Figure S7) implying that a complex mosaic of altered transcription factor 191 binding mediates the cell specific expression found in the C 4 leaf. 192 Overall, we conclude that differences in transcription factor binding between cells is associated 193 with both DNA accessibility defined by broad DHS, as well as fine-scale alterations to transcription within individual species fewer motifs were detected and so de novo prediction was used to identify 207 sequences over-represented in DGF compared with those across the whole genome. This resulted 208 in an additional 524 novel motifs being annotated ( Figure 3A). Inspection of these motifs predicted 209 de novo demonstrated clear strand bias in DNase-I cuts ( Figure 3B) as would expected from bone 210 fide transcription factor binding. By combining known and de novo motifs, the percentage of DGF 211 that could be annotated in each species increased to more than 41% ( Figure 3C). The relatively 212 high number of motifs defined by transcription factor binding sites predicted de novo is presumably 213 due to the significant evolutionary time since grasses diverged from A. thaliana. 214 To define the most common motifs actually bound by transcription factors in mature leaves 215 grasses was similar ( Figure 3D). Visualisation of transcription factor families associated with these 218 DGF in word clouds showed that the most prevalent motifs are associated with the AP2-EREBP 219 and C2H2 transcription factor families ( Figure 3D). These findings indicate that across these four 220 grasses the most commonly bound transcription factor motifs are highly conserved. There was 221 much less conservation between transcription factor binding sites in photosynthetic leaves of these 222 monocotyledons compared with the dicotyledon A. thaliana ( Figure 3D). This finding combined with 223 the large number of motifs from A. thaliana not detected in the grasses ( Figure 3A) argue for 224 significant divergence in the cistromes of monocotyledons and dicotyledons. 225 To investigate whether particular classes of transcription factor binding motifs are associated 226 with specific genomic features, the proportion of each motif found in promoter elements, 5' UTRs, 227 coding regions, introns and 3' UTR sequences was defined ( Figure S8). In most cases, the 228 distribution of individual motifs was similar in all genomic features, however it was noticeable that a 229 set of motifs was particularly common in coding sequence ( Figure S8). Clustering analysis 230 indicated that a set of 96 transcription factor motifs were strongly associated with coding 231 sequences in all four grass species ( Figure S9B, S10). The clear strand-bias indicates strong 232 protein-DNA interaction centred on these motifs within coding sequences ( Figure S9C). Sequences 233 that carry out a dual role in both coding for amino acids and in transcription factor binding have 234 been termed duons. Thus, in grasses it appears that duons are recognised by a specific set of 235 transcription factors. 236 To identify regulatory factors associated with gene expression in the C 4 bundle sheath, 237 transcription factor motifs located in DGF enriched in either the bundle sheath or in whole leaf 238 samples of S. bicolor were identified ( Figure 3E). There was little difference in the ranking of the 239 most prevalent commonly used motifs between these cell types ( Figure 3E&F). For example, the 240 AP2-EREBP and C2H2 families were dominant in both bundle sheath and whole leaf samples, 241 indicating that cell-specificity is not associated with large-scale changes in the relative importance 242 of transcription factor binding. However, in terms of prevalence, a small number of transcription 243 factor binding motifs were ranked in the top 50% in whole leaves but the bottom 50% in bundle 244 sheath strands ( Figure 3F). This finding implies that quantitative modifications to the use of 10 particular transcription factor families are associated with the spatial patterning of gene expression 246 that is a hallmark of C 4 photosynthesis. 247 Further analysis revealed that in all three C 4 species, motifs recognised by C2C2GATA, bZIP, 248 bHLH, BZR and TCP transcription factors were enriched in whole leaf samples, whereas those 249 bound by ARID transcription factors were enriched in the bundle sheath ( Figure 3G and Table S9). 250 Moreover, analysis of the cell-specific transcript accumulation of members of the C2C2-GATA 251 family, revealed one orthologue which was consistently mesophyll enriched in all three C 4 species 252 (GRMZM2G379005, Seita.1G358400, Sobic.004G337500; Figure 3H). Thus, these data implicate 253 these transcription factor families in controlling cell-specific gene expression in C 4 leaves, and 254 indicate that in some cases, separate C 4 lineages appear to be using orthologous transcription 255 factors to drive cell specific expression. Thus, comparative analysis of these species should provide insight into the extent to which the cis-264 regulatory architecture is conserved in the grasses, and how it has been modified during the 265 evolution of C 4 photosynthesis. 266 In pairwise comparisons of the four species, DGF fell into three categories: those for which 267 homologous sequences were both present and bound by a transcription factor (conserved and 268 occupied), those for which homologous sequences were present but were only bound by a 269 transcription factor in one species (conserved but not occupied) and those for which no sequence 270 homology could be found (not conserved) ( Figure 4A). Only a small percentage of DGF were both 271 conserved in sequence and bound by transcription factors ( Figure 4B, Table S8). DGF that were 272 conserved but unoccupied were the next most abundant group ( Figure 4B) but the majority of DGF were not conserved ( Figure 4B, Table S8). These data indicate substantial turnover in the cis-code 274 associated with the transcription factor binding repertoire of monocotyledons. 275 Consistent with the rapid turnover of DGF documented genome-wide ( Figure 4B), the majority 276 of C 4 genes did not share DGF (Table S10 and S11). However, three genes associated with the 277 core C 4 and the Calvin-Benson-Bassham cycle that are strongly expressed in either bundle sheath 278 or mesophyll cells contained the same cis-elements bound by a transcription factor in all three C 4 279 species. For example, in the 2-oxoglutarate/malate transporter (OMT1) gene, four sites defined by 280 transcription factor binding were detected in all three C 4 species ( Figure 4C). However, the position 281 of these sites within the gene varied in each species. In the transketolase (TKL) gene that is 282 preferentially expressed in bundle sheath cells, three conserved motifs defined by transcription 283 factor binding were detected in all C 4 species, but they were also all found in the C 3 species B. 284 distachyon ( Figure 4D). Thus, in some cases patterning of C 4 gene expression appears linked to 285 pre-existing regulatory architecture operating in the ancestral C 3 state, but in cases where the cis-286 regulatory code associated with C 4 gene expression is strongly conserved the position of these 287 transcription factor binding sites within any gene is variable. 288 289

Hyper-conserved cis-regulators found in coding sequences of C 4 genes 290
To investigate the extent to which transcription factor binding sites associated with C 4 genes 291 within a C 4 lineage are conserved, genes encoding the core C 4 cycle were compared in S. bicolor 292 and Z. mays ( Figure 5A). 28 genes associated with the C 4 and Calvin-Benson-Bassham Cycles 293 contained a total of 531 DGF. Although many of these transcription factor footprints were 294 conserved in sequence within orthologous genes, only twenty were both conserved and bound by 295 a transcription factor ( Figure 5A). These data therefore indicate that although many cis-elements 296 found in orthologous genes of the C 4 cycle are conserved in sequence, a small proportion were 297 bound by a transcription factor at the time of sampling. 298 Genome-wide, the number of DGF that were conserved in sequence and bound by a 299 transcription factor decayed in a non-linear manner with phylogenetic distance ( Figure 5B). For 300 example, Z. mays and S. bicolor shared 9,446 DGF that were both conserved and occupied. S.

12
C 4 grasses with C 3 B. distachyon yielded 192 DGF that have been conserved over >60Myr of 303 evolution. 95 of these highly conserved DGF were present in whole leaf samples of the C 3 species, 304 but in the C 4 species were restricted to the bundle sheath ( Figure 5B). This set of 192 ancient and 305 highly conserved DGF were located predominantly in 5' UTRs and coding sequence and strikingly, 306 in bundle sheath strands, over fifty percent of these hyper-conserved DGF were in coding 307 sequence ( Figure 5B). 308 One such hyper-conserved DGF is found in the NdhM gene that encodes a subunit of the 309 NADH complex that preferentially assembles in bundle sheath cells of C 4 plants ( Figure 5C) but it 310 is not known how this evolved. In the ancestral C 3 state a hyper-conserved DGF is found in whole 311 leaves of B. distachyon ( Figure 5D). However, in all three C 4 species rather than this DGF being 312 detected in whole leaf material, it is detected in the bundle sheath. It is also noticeable that this 313 motif has proliferated within the gene in the C 4 species compared with C 3 B. distachyon, and in 314 maize and sorghum is also found in the 5' UTR as well as coding sequence. Furthermore, in whole 315 leaf samples of these C 4 species, transcription factor binding is shifted upstream or downstream 316 ( Figure 5D). We therefore propose that preferential expression of NdhM in the bundle sheath is 317 built upon a cis-regulator present in the C 3 state that activates expression in all photosynthetic cells 318 of the leaf. During the evolution of C 4 photosynthesis, whilst accessibility of this ancient and highly 319 conserved cis-element is maintained in the bundle sheath to allow expression of NdhM, in 320 mesophyll cells an additional transcription factor(s) binds flanking sequence that blocks access to 321 this pre-existing architecture. These findings are consistent with hyper-conserved DGF located in 322 coding sequence playing an important role in the cell specific gene expression required in leaves of 323 C 4 grasses. 324 As genome-wide analysis indicated that a specific group of DGF was associated with coding 325 sequence ( Figure S8-S10) we investigated whether motifs associated with the 192 hyper-

Genome-wide transcription factor binding in grasses 336
The data presented here provide insight into genome-wide binding of transcription factors in 337 photosynthetic tissue, but also maize and sorghum which represent two of the world's most 338 productive crops. This transcription factor binding landscape shows both similarities and should be particularly prevalent in 5' UTR and coding sequences, these findings combined with the 361 available literature argue for duons and the cognate transcription factors that bind them being of 362 pervasive importance in grass genomes.

The transcription factor landscape underpinning gene expression in specific cell types 365
Given the central importance of cellular compartmentation to C 4 photosynthesis, there have 366 been significant efforts to identify cis-elements that restrict gene expression to either mesophyll or 367 bundle sheath cells of C 4 leaves (Hibberd and Covshoff, 2010;Sheen, 1999;Wang et al., 2014). 368 Along with many other systems, initial analysis focussed on regulatory elements located in 369 promoters of C 4 genes (Sheen, 1999) but, it has become increasingly apparent that the patterning  genome-wide analysis of transcription factor binding now indicates that parallel evolution of 414 transcription factors has contributed to the repeated appearance of C 4 photosynthesis. This is best 415 exemplified by the fact that in the three C 4 species that are derived from two independent C 4 416 lineages, motifs bound by the ARID and C2C2GATA classes of transcription factor were enriched 417 in bundle sheath and whole leaves respectively. In the case of the C2C2GATA family, transcripts 418 derived from one specific orthologue were more abundant in mesophyll cells of all C 4 species. recruited into functioning preferentially in one cell type, and in the case of the C2C2GATA family 421 this is associated with orthologous genes being preferentially expressed in mesophyll cells. 422 When orthologous genes were compared between genomes the majority of transcription factor 423 binding sites were not conserved. Furthermore, of the DGF that were conserved, we found that 424 their position within orthologous genes varied. This indicates that C 4 photosynthesis in grasses is 425 tolerating both a rapid turnover of the cis-code, and that when motifs are conserved in sequence, 426 their position and number within a gene can vary. It therefore appears that the cell-specific      Sitalica_164_v2; Zmays_284_AGPv3. Reads were mapped to genomes using bowtie2 (Langmead 701 and Salzberg, 2012) with the following parameters: 702 Aligned reads were then processed using samtools (Li et al., 2009) to remove those with a MAPQ 706 score <42. DHS sites were identified using a procedure adapted from the ENCODE 3 pipeline 707 Changing the file format settings to All Files, the csv file from pyDNase was loaded into TreeView, 765 from the dropdown menu entered Settings->Pixel Setting and checked all the Fill boxes, Contrast 766 Average cut density plots were generated using the script 'dnase_average_profile.py' from 768 pyDNase (Piper et al., 2013(Piper et al., , 2015  In order to convert motif files into MEME format for motif scanning a multi-step procedure was 780 necessary. Background frequency files are required when generating motifs (Thijs et al., 2001); to 781 produce background files FASTA sequences for the regions of interest (DGF) were extracted using 782 BEDTools suite (Quinlan and Hall, 2010)  to convert the TRANSFAC file to a suitable format the following code was used: 797 sed 's/0 /0.00/g' [transfac file] | sed 's/1 /1.00/g' | sed 's/2 /2.00/g' | sed 's/3 /3.00/g' | sed 's/4 798 /4.00/g' | sed 's/5 /5.00/g' | sed 's/6 /6.00/g' | sed 's/7 /7.00/g' | sed 's/8 /8.00/g' | sed 's/9 /9.00/g' | 799 sed 's/0$/0.00/g' | sed 's/1$/1.00/g' | sed 's/2$/2.00/g' | sed 's/3$/3.00/g' | sed 's/4$/4.00/g' | sed 800 's/5$/5.00/g' | sed 's/6$/6.00/g' | sed 's/7$/7.00/g' | sed 's/8$/8.00/g' | sed 's/9$/9.00/g' | sed 801 's/\P0.00 /\P0/g' > [transfac_fixed] 802 803 MEME motif files were created from TRANSFAC files using scripts from the MEME suite ( To determine overrepresentation of TF family motifs in samples hypergeometric tests were 829 performed using R with the following parameters: To cross map genomic features between species, mapping files were generated according to 847      Figure 5