# Determining mean bond lengths from crystallographic data

The Cambridge Crystallographic Data Centre (CCDC) is a non-profit organisation devoted to small-molecule crystallography data. It curates, validates and distributes the Cambridge Structural Database (CSD) of over 800,000 organic and metal-organic crystal structures. The CSD has an excellent Python API which can be used to analyse these structures. Unfortunately, access to most of the CCDC data requires a paid-for licence or an institutional subscription. In the short project below I obtained the necessary crystal structures using my UCL credentials. Installation and configuration of the database and software are documented on the CCDC website.

To explore the CSD Python API a bit, I thought I'd determine a few carbon-carbon bond lengths. The API allows one to search the database from Python, but apparently does not support Python 3 (yet?). To minimize the amount of new code I write in Python 2, I downloaded the CSD identifiers of all crystal structures containing carbon atoms using the ConQuest tool. With this file, C-containing_structures.gcd, I can extract the carbon-carbon bond lengths to a one-dimensional NumPy array, saved as CC-bondlengths.npy, with the following script (NB Python 2.7!):

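    from __future__ import print_function
    import numpy as np
    from ccdc import io

    filename = 'C-containing_structures.gcd'
    # Assumption: mol comes from iterating over a ccdc.io.MoleculeReader
    # opened on the identifier file (the loop was missing from the listing).
    reader = io.MoleculeReader(filename)
    bond_lengths = []
    i = 0
    for mol in reader:
        for component in mol.components:
            for bond in component.bonds:
                # Skip anything that is not an ordinary two-atom bond.
                if len(bond.atoms) != 2:
                    continue
                atom1, atom2 = bond.atoms
                if atom1.atomic_symbol == atom2.atomic_symbol == 'C':
                    # Bond length = Euclidean distance between the two atoms.
                    bond_lengths.append(np.linalg.norm(
                        np.array(atom1.coordinates)
                        - np.array(atom2.coordinates)))
        # Report progress every 100 structures.
        i += 1
        if not i % 100:
            print(i, mol.identifier)

    bond_lengths = np.array(bond_lengths)
    np.save('CC-bondlengths.npy', bond_lengths)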


Note that this is pretty crude: it examines all the components of each crystal structure, so the results are weighted towards structures with a higher number of molecules per unit cell (e.g. phenol has a unit cell consisting of three $\mathrm{C_6H_5OH}$ components, all of which have pretty much the same structure).
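One possible way to reduce this bias, sketched here purely as an illustration, is to down-weight each bond by the number of components in its structure, using the `weights` argument of `np.histogram`. This assumes the extraction script is changed to record which component of which structure each bond came from. The `structures` dictionary, its identifiers and its values are all made up for the example, and `component_weighted_histogram` is a hypothetical helper, not part of the CSD Python API.

```python
import numpy as np

def component_weighted_histogram(structures, bins=20):
    """Histogram bond lengths, weighting each bond by 1/(number of components).

    structures maps an identifier to a list of components, each component
    being a list of C-C bond lengths in Å (a hypothetical data layout).
    """
    lengths, weights = [], []
    for components in structures.values():
        # A structure with n components contributes as if it had one.
        w = 1.0 / len(components)
        for component in components:
            lengths.extend(component)
            weights.extend([w] * len(component))
    return np.histogram(lengths, bins=bins, weights=weights)

# Made-up values: STRUCT_A has three near-identical components, STRUCT_B one.
structures = {
    'STRUCT_A': [[1.39, 1.40, 1.39], [1.39, 1.40, 1.39], [1.39, 1.40, 1.39]],
    'STRUCT_B': [[1.53, 1.54]],
}
counts, bins = component_weighted_histogram(structures)
```

With these weights the nine bonds of the three-component structure count for three, the same as a single copy of that molecule would. Chemically distinct components in the same structure are down-weighted too, so this is only a rough correction.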

The bond lengths can be visualized as a histogram:

    import numpy as np
    import matplotlib.pyplot as plt

    # Load the bond lengths saved by the extraction script.
    bond_lengths = np.load('CC-bondlengths.npy')

    nbins = 500
    bond_dist, bins = np.histogram(bond_lengths, bins=nbins)
    bin_centres = (bins[:-1] + bins[1:]) / 2

    fig, ax = plt.subplots()
    bin_width = bins[1] - bins[0]
    # Divide by 1000.0 so the counts are not truncated by integer division.
    ax.bar(bin_centres, bond_dist / 1000.0, ec='none', width=bin_width, fc='m',
           alpha=0.5)
    ax.set_xlabel('Bond length /Å')
    ax.set_ylabel('Number of bonds (1000s)')
    plt.savefig('CC-histogram.png')
    plt.show()


The two peaks correspond to double and single C–C bonds: the triple bonds are apparently much rarer and cannot be seen on a linear scale. On a log scale, however, the three main types of bond are easily seen.
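A log-scale version of the histogram might be produced along the following lines. This is a minimal sketch rather than the script used for the original figure: `plot_log_histogram` is a hypothetical helper, and the only real change from the linear plot is the call to `ax.set_yscale('log')`.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_log_histogram(bond_lengths, nbins=500):
    """Plot a histogram of bond lengths with a logarithmic count axis."""
    counts, bins = np.histogram(bond_lengths, bins=nbins)
    fig, ax = plt.subplots()
    # Draw each bar from its left bin edge; empty bins simply do not show.
    ax.bar(bins[:-1], counts, width=np.diff(bins), align='edge',
           ec='none', fc='m', alpha=0.5)
    ax.set_yscale('log')
    ax.set_xlabel('Bond length /Å')
    ax.set_ylabel('Number of bonds')
    return fig, ax

# Typical use, assuming CC-bondlengths.npy has already been created:
#   fig, ax = plot_log_histogram(np.load('CC-bondlengths.npy'))
#   plt.show()
```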

Average values for the different types of carbon-carbon bond in the database can be estimated from the location of the three main maxima to be as follows.

| Bond | Length /Å |
|---|---|
| $\mathrm{C-C}$ | 1.53 |
| $\mathrm{C=C}$ | 1.39 |
| $\mathrm{C\equiv C}$ | 1.20 |
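The positions of the maxima could also be read off the histogram programmatically by looking for the tallest bin within a window around each expected bond type. The sketch below is an illustration only: `peak_in_window` is a hypothetical helper, the window edges are guesses, and the bond lengths are synthetic Gaussian samples standing in for CC-bondlengths.npy so that the example runs without access to the CSD.

```python
import numpy as np

def peak_in_window(bin_centres, counts, lo, hi):
    """Return the centre of the tallest bin with lo <= centre < hi."""
    idx = np.flatnonzero((bin_centres >= lo) & (bin_centres < hi))
    return bin_centres[idx[np.argmax(counts[idx])]]

# Synthetic stand-in data (NOT real CSD values): three Gaussian clusters.
rng = np.random.default_rng(0)
bond_lengths = np.concatenate([
    rng.normal(1.53, 0.01, 50000),
    rng.normal(1.39, 0.01, 50000),
    rng.normal(1.20, 0.01, 2000),
])

counts, bins = np.histogram(bond_lengths, bins=200)
bin_centres = (bins[:-1] + bins[1:]) / 2

# Assumed search windows (Å) for each bond type.
windows = {'single': (1.46, 1.60), 'double': (1.30, 1.46), 'triple': (1.10, 1.30)}
peaks = {name: peak_in_window(bin_centres, counts, lo, hi)
         for name, (lo, hi) in windows.items()}
```

The result depends on the bin width: with very narrow bins the tallest bin is affected by counting noise, so the estimate is only good to a few bin widths.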

Note that this is not a good way to estimate the average length of carbon-carbon bonds in general: it is clearly biased towards the aromatic bonds found in the types of organic molecules heavily represented in the database. The bond lengths found for non-delocalized $\mathrm{C=C}$ bonds are more like 1.35 Å.
