PDB Reader
==========

Background
----------

The Protein Data Bank (PDB) format is a relatively straightforward way to describe biomolecular structures, and the RCSB Protein Data Bank (RCSB PDB) hosts thousands of them. Simulating these structures is not always straightforward, however, because a deposited structure does not always model all relevant information explicitly. Modifying a structure can be especially difficult, even if you know what you are doing.

Most MolCube projects start with a PDB Reader project. PDB Reader parses a ``.pdb`` or ``.cif`` file, determines what structures already exist in the file, and allows manipulating the structure to add or remove features like :ref:`ssbonds`, :ref:`mutations`, :doc:`Glycosylations <glycosylation>`, etc.

A finished PDB Reader project can then be used for more complex operations like :doc:`Solvation <solution_builder>`, embedding in a :doc:`Membrane <membrane_builder>`, and calculating :doc:`free_energy`.

Overview
--------

The general procedure for using PDB Reader in |project-bf| goes like this:

#. Authenticate server connection.
#. Create project.
#. Select chains.
#. Manipulate structure (optional).
#. Finalize model.
#. Download project (optional).

The example below shows how this works in the simplest case by using only the default settings that you'd see on the MolCube Apps site. This is equivalent to entering a PDB ID and just clicking "Next" until you get to the final page, then clicking "Download Project". ::

   import molcube as mc
   from pprint import pprint

   molcube = mc.API('alphaapi.molcube.com', 443)
   molcube.authenticate(api_token=api_access_key)

   #
   # Initialize project by downloading structure from RCSB
   #
   pdbreader = molcube.create_pdb_reader_project()
   assert pdbreader.create_project(title='test-defaults', ff='charmmff', pdbId='2hac')
   pdbreader.set_defaults()

   #
   # if modifying chain selection, do so here
   #
   assert pdbreader.confirm_chains()

   #
   # if modifying manipulation options, do so here
   #
   assert pdbreader.model_pdb()

   pdbreader.download_project('myproject.tgz')

The ``assert`` keyword is prepended to each command that submits a step. This prevents the script from proceeding if a step fails, though it is not required.
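
Python removes ``assert`` statements when it runs with the ``-O`` flag, so a script that must always stop on a failed step may prefer an explicit check. The sketch below assumes that each step method returns a truthy value on success, as ``create_project()`` does. ``require_step`` is a hypothetical helper written for this example and is not part of the MolCube client.

```python
def require_step(ok, step_name):
    """Raise instead of continuing when a submitted step reports failure.

    Hypothetical helper for illustration; not part of the molcube package.
    """
    if not ok:
        raise RuntimeError(f"PDB Reader step failed: {step_name}")
    return ok


# Stand-in results; in a real script these would be the return values of
# calls such as pdbreader.confirm_chains() or pdbreader.model_pdb().
step_results = {"confirm_chains": True, "model_pdb": True}
for name, result in step_results.items():
    require_step(result, name)
print("all steps succeeded")
```

In a real script the call would look like ``require_step(pdbreader.confirm_chains(), 'confirm_chains')``.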

Create PDB Reader project
-------------------------

Creating a PDB Reader project requires setting a force field and project title, and either providing a RCSB ``pdbId`` or uploading a ``customPdb``. Alternatively, if your project already exists, e.g. because you created it interactively in MolCube Apps, you can :ref:`resume_project`.

Besides these, ``create_project()`` accepts the following optional arguments:

* ``correct_topo (bool)``: Correct chains and bonds information using distance between each atom. (default: False)
* ``rename_dupl_atoms (bool)``: Rename hetero atoms if there are duplicate atom names. (default: True)
* ``calc_pka (bool)``: Calculate pKa of protein residues to apply system pH. (default: False)

Two force field options are available: ``charmmff`` and ``amberff``. Although MolCube supports ``martiniff`` and ``drudeff``, these options are not yet supported in the Python API client.

For ``amberff``, you can select a force field for each component type. The default selections are as follows::

   amberOptions = {
       "protein": "FF19SB",
       "dna": "OL15",
       "rna": "OL3",
       "glycan": "GLYCAM_06j",
       "lipid": "Lipid21",
       "water": "OPC"
   }

Here are all available choices for ``amberOptions``::

    Protein: [FF19SB, FF14SB, FF14SBonlysc]
    DNA: [OL15, BSC1]
    RNA: [OL3, YIL, Shaw]
    Glycan: [GLYCAM_06j]
    Lipid: [Lipid21, Lipid17]
    Water: [OPC, TIP3P, TIP4PEW, TIP4PD]

You can also find these options in the ``molcube.pdbreader.enums`` module::

   from molcube.pdbreader import enums

   print(f"Protein options: [{', '.join(enums.Protein)}]")
   print(f"DNA options: [{', '.join(enums.DNA)}]")
   print(f"RNA options: [{', '.join(enums.RNA)}]")
   print(f"Glycan options: [{', '.join(enums.Glycan)}]")
   print(f"Lipid options: [{', '.join(enums.Lipid)}]")
   print(f"Water options: [{', '.join(enums.Water)}]")
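
A mistyped option name is easy to miss, so it can help to check a custom ``amberOptions`` dict locally before creating a project. The sketch below copies the defaults and choices listed above into plain dicts. ``AMBER_CHOICES``, ``AMBER_DEFAULTS``, and ``build_amber_options`` are names made up for this example, not part of the MolCube client. Passing the result to the server is left out, since this section does not show that call.

```python
# Choices and defaults copied from the lists above.
AMBER_CHOICES = {
    "protein": ["FF19SB", "FF14SB", "FF14SBonlysc"],
    "dna": ["OL15", "BSC1"],
    "rna": ["OL3", "YIL", "Shaw"],
    "glycan": ["GLYCAM_06j"],
    "lipid": ["Lipid21", "Lipid17"],
    "water": ["OPC", "TIP3P", "TIP4PEW", "TIP4PD"],
}
AMBER_DEFAULTS = {
    "protein": "FF19SB",
    "dna": "OL15",
    "rna": "OL3",
    "glycan": "GLYCAM_06j",
    "lipid": "Lipid21",
    "water": "OPC",
}


def build_amber_options(**overrides):
    """Return the default options with validated overrides applied."""
    options = dict(AMBER_DEFAULTS)
    for key, value in overrides.items():
        if key not in AMBER_CHOICES:
            raise KeyError(f"unknown amberOptions key: {key!r}")
        if value not in AMBER_CHOICES[key]:
            raise ValueError(
                f"{value!r} is not a valid {key} option; choose from {AMBER_CHOICES[key]}"
            )
        options[key] = value
    return options


amber_options = build_amber_options(protein="FF14SB", water="TIP3P")
print(amber_options)
```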

Fetch PDB from RCSB
^^^^^^^^^^^^^^^^^^^

Using the ``pdbId`` keyword argument will attempt to obtain the PDB from RCSB automatically. ``create_project()`` returns ``True`` on success::

   # Create a PDB Reader project and fetch PDB from RCSB using PDB ID
   pdbreader = molcube.create_pdb_reader_project()
   assert pdbreader.create_project(title='test', ff='charmmff', pdbId="2hac")

Upload a custom PDB file
^^^^^^^^^^^^^^^^^^^^^^^^

If you already have a local copy of your structure, you can pass the path to your structure with the ``customPdb`` keyword argument::

   pdbreader = molcube.create_pdb_reader_project()
   assert pdbreader.create_project(title='test', ff='charmmff', customPdb="files/2hac.cif")

MolCube recognizes structures in PDB (``.pdb``), PDBx/mmCIF (``.cif``), and GROMACS (``.gro``) formats.

.. _resume_project:

Resume an existing project
^^^^^^^^^^^^^^^^^^^^^^^^^^

An existing project can be resumed by passing the project ID to ``resume_project()``::

   pdbreader = molcube.create_pdb_reader_project()
   assert pdbreader.resume_project(project_id='b33384ed-7e4e-48cb-9afd-b2f0fe6456a2')

Search project list
^^^^^^^^^^^^^^^^^^^

If you don't already know the ID, you can use ``search_projects()``.

Accepted arguments:

* ``page (int)``: page to return (default: 1)
* ``perPage (int)``: number of results per page (default: 10)
* ``keyword (str)``: limit results to those containing a keyword
* ``searchKey (title|pk)``: restrict keyword search to either the title string or the project ID (``pk``)

Other args: ``projectStatus (str)``, ``projectStep (int)``, ``projectCategory (designer|builder)``, ``forceField (str)``, ``startDate (datetime)``, ``endDate (datetime)``, ``hasStandaloneLigand (bool)``, ``pdbAmberOption (dict)``. See :doc:`../api_reference/index` for more detail.

Example::

   >>> search_results = molcube.search_projects()
   >>> search_results
   {'projects': [{'pk': '1eebb792-f267-4c01-9f9f-1b179819c3f3',
      'createdAt': '2026-04-02 13:51:33',
      'forcefieldType': 'charmmff',
      'projectType': 'PDB Reader',
      'title': 'My Test Project',
      'step': 2,
      'status': 'Success',
      'fileName': '2hac.cif',
      'sideChainOriented': False,
      'tag': None,
      'user': 'Your User Name',
      'team': None,
      'teamId': None,
      'workspace': 'Personal'},
     {'pk': '205dffb4-2b15-43ce-9228-305e6ae510a6', ...},
     ...,
    ],
    'totalPages': 2,
    'currentPage': 1,
    'totalCount': 11,
    'hasNext': True,
    'hasPrevious': False}

For example, to resume the most recent project, use the first ``pk`` in the returned ``projects`` list::

   my_projects = search_results['projects']
   project_id = my_projects[0]['pk']

   pdbreader = molcube.create_pdb_reader_project()
   assert pdbreader.resume_project(project_id=project_id)
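
When the project you want is not the most recent one, you can pick it out of the results by title. The sketch below works on a trimmed dict shaped like the ``search_projects()`` output above. ``find_project_ids`` is a hypothetical helper, and the second sample entry is a placeholder invented for the example.

```python
def find_project_ids(search_results, title):
    """Return the pk of every project on this page whose title matches exactly."""
    return [
        project["pk"]
        for project in search_results["projects"]
        if project.get("title") == title
    ]


# Trimmed sample shaped like the search_projects() output.
sample_results = {
    "projects": [
        {"pk": "1eebb792-f267-4c01-9f9f-1b179819c3f3",
         "title": "My Test Project", "step": 2, "status": "Success"},
        # Placeholder entry invented for this sketch.
        {"pk": "00000000-0000-0000-0000-000000000000",
         "title": "Another Project", "step": 1, "status": "Success"},
    ],
    "currentPage": 1,
    "hasNext": True,
}

matches = find_project_ids(sample_results, "My Test Project")
print(matches)
```

This only inspects one page of results. If ``hasNext`` is ``True``, the remaining pages can be requested with the ``page`` argument, or the search can be narrowed on the server side with ``keyword`` and ``searchKey``.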

Check available info about PDB
------------------------------

The ``get_chains()`` method returns a list of chains for each chain type and (where applicable) the available terminal caps::

   >>> pdbreader.get_chains()
   {'protein': [{'chainIndex': 'PROT_A',
      'terminal': {'nter': ['NTER', 'NNEU', 'ACE', 'NONE'],
       'cter': ['CTER', 'CNEU', 'CT1', 'CT2', 'CT3', 'NONE']},
      'nsdTerminal': {'nter': ['ACE'], 'cter': ['NONE']},
      'chainId': 'A'},
     {'chainIndex': 'PROT_B',
      'terminal': {'nter': ['NTER', 'NNEU', 'ACE', 'NONE'],
       'cter': ['CTER', 'CNEU', 'CT1', 'CT2', 'CT3', 'NONE']},
      'nsdTerminal': {'nter': ['ACE'], 'cter': ['NONE']},
      'chainId': 'B'}],
    'nucleicAcid': [],
    'standaloneLigand': [],
    'heme': [],
    'ion': [],
    'glycan': [],
    'water': []}
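
The chain indices in this dict are the identifiers that the selection and manipulation methods expect, so a compact per-type summary is often handy. The sketch below uses a trimmed copy of the output above. ``chain_indices_by_type`` is a hypothetical helper, not a client method.

```python
def chain_indices_by_type(chains):
    """Map each chain type to the list of chainIndex values it contains."""
    return {
        chain_type: [entry["chainIndex"] for entry in entries]
        for chain_type, entries in chains.items()
    }


# Trimmed copy of the get_chains() output shown above.
sample_chains = {
    "protein": [
        {"chainIndex": "PROT_A", "chainId": "A"},
        {"chainIndex": "PROT_B", "chainId": "B"},
    ],
    "nucleicAcid": [],
    "standaloneLigand": [],
    "heme": [],
    "ion": [],
    "glycan": [],
    "water": [],
}

summary = chain_indices_by_type(sample_chains)
print(summary["protein"])
```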

The ``get_pdb_info()`` method returns a large dict with all info the MolCube server was able to parse from the structure::

   >>> pdb_info = pdbreader.get_pdb_info()
   >>> pdb_info.keys()

   dict_keys(['ph', 'pdbId', 'source',
      'forceFieldType', 'models', 'availResnames',
      'resnames', 'titrableResidues',
      'ptmResidues', 'protonationStates',
      'ssbondResidues', 'phosphorylatableResidues',
      'phosphorylationStates', 'staplingPatches',
      'missingResidues', 'ssbonds',
      'glycosylations', 'hemes', 'staplings',
      'covalentLigands', 'nonStandards',
      'terminalCappings', 'acidsOptions',
      'surfaceProteinResidues', 'calcPka',
      'invalidCovalentLigands', 'selectedChains',
      'ffGeneration', 'ffGenAtomType'])

Check default settings
----------------------

To see what settings MolCube would use by default, use ``get_defaults()``::

   >>> pdbreader.get_defaults()

   {'projectPk': '1eebb792-f267-4c01-9f9f-1b179819c3f3',
    'ph': 7.0,
    'chain': {'calcPka': False,
     'ffGeneration': None,
     'ssbond': [{'residue1': {'chainIndex': 'PROT_A', 'resid': '2'},
       'residue2': {'chainIndex': 'PROT_B', 'resid': '2'}}],
     'glycosylation': [],
     'heme': [],
     'protein': [{'chainIndex': 'PROT_A', 'missing': [], 'selected': True},
      {'chainIndex': 'PROT_B', 'missing': [], 'selected': True}],
     'nucleicAcid': [],
     'standaloneLigand': [],
     'ion': [],
     'glycan': [],
     'water': [],
     'projectPk': '1eebb792-f267-4c01-9f9f-1b179819c3f3',
     'ph': 7.0},
    'glycosylation': [],
    'ffGeneration': None,
    'ssbond': [{'residue1': {'chainIndex': 'PROT_A', 'resid': '2'},
      'residue2': {'chainIndex': 'PROT_B', 'resid': '2'}}],
    'heme': []}

This is the format of the request that will be sent to the server if you use the defaults. To tell the ``pdbreader`` object to set its internal settings to match the server's defaults, use ``set_defaults()`` with no arguments::

   # you do not need to call get_defaults() if you
   # are not going to edit the defaults dict manually
   pdbreader.set_defaults()

While you could edit the manipulation options dict directly and pass the result as an argument to ``set_defaults()``, it is easier to use the dedicated manipulation methods, which are demonstrated below.
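
Before changing anything, it can be useful to check which chains the defaults select. The sketch below reads a trimmed dict shaped like the ``get_defaults()`` output above. ``selected_chain_indices`` is a hypothetical helper, and it assumes that only the per-type chain lists carry a ``selected`` flag.

```python
def selected_chain_indices(defaults):
    """List the chainIndex of every chain marked as selected."""
    selected = []
    for entries in defaults["chain"].values():
        # Scalar settings such as calcPka or ph are skipped here.
        if not isinstance(entries, list):
            continue
        for entry in entries:
            # ssbond entries are dicts without a 'selected' flag and are ignored.
            if isinstance(entry, dict) and entry.get("selected"):
                selected.append(entry["chainIndex"])
    return selected


# Trimmed copy of the get_defaults() output shown above.
sample_defaults = {
    "ph": 7.0,
    "chain": {
        "calcPka": False,
        "ssbond": [{"residue1": {"chainIndex": "PROT_A", "resid": "2"},
                    "residue2": {"chainIndex": "PROT_B", "resid": "2"}}],
        "protein": [{"chainIndex": "PROT_A", "missing": [], "selected": True},
                    {"chainIndex": "PROT_B", "missing": [], "selected": True}],
        "water": [],
        "ph": 7.0,
    },
}

selected = selected_chain_indices(sample_defaults)
print(selected)
```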

A quick summary of supported manipulations:

* Chain selection: ``toggle_chain()``, ``toggle_chains_by_type()``
* Terminal patching: ``set_terminal_patch()``, ``get_terminal_residues()``
* Mutations: ``add_mutation()``, ``remove_mutation()``
* Phosphorylation: ``add_phosphorylation()``, ``remove_phosphorylation()``
* Protonation: ``add_protonation()``, ``remove_protonation()``
* Disulfide bonds: ``add_ssbond()``, ``remove_ssbond()``
* Peptide stapling: ``add_staple()``, ``remove_staple()``, ``get_valid_staples()``
* Missing residue modeling: ``add_missing_residues()``, ``remove_missing_residues()``, ``get_valid_missing_terminals()``
* Side chain orientation: ``orient_side_chains()``
* Glycosylation: See examples in :doc:`glycosylation`.

Chain selection
---------------

The chain selection functions are ``toggle_chain()`` and ``toggle_chains_by_type()``. The default chain selection when using ``set_defaults()`` is to select all chains except for water. You only need to call one of these methods if you want to deviate from this default.

Both functions take up to two arguments::

    toggle_chain() args:
        enable (str | list[str]): chain ID or list of chain IDs to enable
        disable (str | list[str]): chain ID or list of chain IDs to disable

    toggle_chains_by_type() args:
        enable (str | list[str]): a category or list of categories to enable
        disable (str | list[str]): a category or list of categories to disable

Chain types:

* protein
* nucleicAcid
* standaloneLigand
* heme
* ion
* water
* glycan

See ``molcube.pdbreader.enums.CHAIN_TYPE``::

   >>> from molcube.pdbreader import enums
   >>> print(f"Chain types: [{', '.join(enums.CHAIN_TYPE)}]")
   Chain types: [protein, nucleicAcid,
      standaloneLigand, heme, ion, water, glycan]

Usage example
^^^^^^^^^^^^^

3PQR has several chains, as shown below::

   import molcube as mc
   from pprint import pprint

   molcube = mc.API('api.molcube.com', 443)
   molcube.authenticate(api_token=api_access_key)

   pdbreader = molcube.create_pdb_reader_project()
   pdbreader.create_project(title='test-chains', ff='charmmff', customPdb='files/3pqr.cif')
   chains = pdbreader.get_chains()
   pdbreader.set_defaults()

   >>> pdbreader
   <PdbReaderProject with settings: {
       "projectPk": "5fcd30b3-281c-4962-8dec-3bdda99baaa6",
       "ph": 7.0,
       "chain": {
           "calcPka": false,
           "ssbond": [ ... ],
           "glycosylation": [ ... ],
           "protein": [
               { "chainIndex": "PROT_A", "missing": [], "selected": true, "terminal": { "nter": "NTER", "cter": "CTER" } },
               { "chainIndex": "PROT_B", "missing": [], "selected": true, "terminal": { "nter": "NTER", "cter": "CTER" } } ],
           "nucleicAcid": [],
           "standaloneLigand": [
               { "chainIndex": "HETE_C", "selected": false },
               { "chainIndex": "HETE_D", "selected": false },
               { "chainIndex": "HETE_E", "selected": false } ],
           "ion": [],
           "glycan": [
               { "chainIndex": "GLYC_A", "selected": true },
               { "chainIndex": "GLYC_B", "selected": true },
               { "chainIndex": "GLYC_C", "selected": true },
               { "chainIndex": "GLYC_D", "selected": true },
               { "chainIndex": "GLYC_E", "selected": true } ],
           "water": [
               { "chainIndex": "WATE_A", "selected": false },
               { "chainIndex": "WATE_B", "selected": false } ],
           "ph": 7.0
       },
       "glycosylation": [ ... ], "ffGeneration": null,
       "ssbond": [ ... ],
   }>

Chains can be enabled/disabled individually or by category::

   # enable or disable a single chain
   pdbreader.toggle_chain(enable='PROT_A')
   pdbreader.toggle_chain(disable='GLYC_A')
   # enable and disable in a single call
   pdbreader.toggle_chain(enable='PROT_A', disable='GLYC_B')

   # enable or disable multiple chains
   pdbreader.toggle_chain(enable=['PROT_A', 'GLYC_C'],
                          disable='PROT_B')

   # disable everything except protein
   pdbreader.toggle_chains_by_type(enable='protein',
      disable=['glycan', 'water', 'ion', 'standaloneLigand'])
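
Writing out the ``disable`` list by hand makes it easy to forget a type. As a sketch, the list can be derived from the chain types printed from ``enums.CHAIN_TYPE`` above. ``CHAIN_TYPES`` and ``all_types_except`` are names invented for this example, and whether every type (for instance ``heme``) is accepted by ``toggle_chains_by_type()`` is an assumption.

```python
# Chain types as printed from molcube.pdbreader.enums.CHAIN_TYPE.
CHAIN_TYPES = ["protein", "nucleicAcid", "standaloneLigand",
               "heme", "ion", "water", "glycan"]


def all_types_except(keep):
    """Return every chain type not named in `keep` (a str or a list of str)."""
    keep = [keep] if isinstance(keep, str) else list(keep)
    unknown = sorted(set(keep) - set(CHAIN_TYPES))
    if unknown:
        raise ValueError(f"unknown chain type(s): {unknown}")
    return [chain_type for chain_type in CHAIN_TYPES if chain_type not in keep]


disable = all_types_except("protein")
print(disable)
# The result could then be passed on as:
#   pdbreader.toggle_chains_by_type(enable='protein', disable=disable)
```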

Confirm chain selection (required)
----------------------------------

After using ``set_defaults()`` and (optionally) a toggle function, your settings are still local to your machine. To tell the MolCube server to apply your chain selection, you must use the ``confirm_chains()`` method.

Accepted arguments:

* ``ph (float)``: pH to use (default: 7.0)
* ``model (int)``: PDB model to use (default: 1st model)

This is equivalent to pressing "Submit" on the chain selection page of PDB Reader::

   assert pdbreader.confirm_chains()

Model manipulation (required)
-----------------------------

The sections below demonstrate the individual model manipulations, each of which is optional. The ``model_pdb()`` call itself is required.

Use ``model_pdb()`` to confirm model manipulation. This must be performed *after* ``confirm_chains()``.

In the simplest case, where you want the default chain selection *and* the default manipulations, this is all you need to do::

   import molcube as mc
   from pprint import pprint

   molcube = mc.API('api.molcube.com', 443)
   molcube.authenticate(api_token=api_access_key)

   # simplest possible case: use defaults for everything
   pdbreader = molcube.create_pdb_reader_project()
   assert pdbreader.create_project(title='test-defaults', ff='charmmff', customPdb='files/2hac.cif')
   pdbreader.set_defaults()

   #
   # if modifying chain selection, do so here
   #

   assert pdbreader.confirm_chains()

   #
   # if modifying manipulation options, do so here
   #

   assert pdbreader.model_pdb()

Terminal patching
-----------------

Each protein chain returned by ``get_chains()`` has a ``'terminal'`` key listing the valid N- and C-terminal patches. The default terminal patch is the first one in each list. For example, the defaults for ``PROT_A`` below are ``NTER`` and ``CTER``::

   >>> from pprint import pprint
   >>> pprint(chains['protein'])
   [{'chainId': 'A',
     'chainIndex': 'PROT_A',
     'nsdTerminal': {'cter': ['CT3'], 'nter': ['ACE']},
     'terminal': {'cter': ['CTER', 'CNEU', 'CT1', 'CT2', 'CT3', 'NONE'],
                  'nter': ['NTER', 'NNEU', 'ACE', 'NONE']}},
    {'chainId': 'B',
     'chainIndex': 'PROT_B',
     'nsdTerminal': {'cter': ['NONE'], 'nter': ['ACE']},
     'terminal': {'cter': ['CTER', 'CNEU', 'CT1', 'CT2', 'CT3', 'NONE'],
                  'nter': ['NTER', 'NNEU', 'ACE', 'NONE']}}]
   >>> pprint({chain: pdbreader._option_by_chain[chain] for chain in ('PROT_A', 'PROT_B')})

   {'PROT_A': {'chainIndex': 'PROT_A',
               'missing': [],
               'selected': True,
               'terminal': {'cter': 'CTER', 'nter': 'NTER'}},
    'PROT_B': {'chainIndex': 'PROT_B',
               'missing': [],
               'selected': True,
               'terminal': {'cter': 'CTER', 'nter': 'NTER'}}}

To set a different terminal patch, use ``set_terminal_patch()``.

Expected arguments:

* chain_id (str, required): chain to set
* nter (str, optional): use this patch, if given; else use default patch
* cter (str, optional): use this patch, if given; else use default patch
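
Checking a patch name against the lists from ``get_chains()`` before calling ``set_terminal_patch()`` catches typos early. The sketch below is a local check only. ``terminal_patch_kwargs`` is a hypothetical helper, and it assumes that ``chain_id`` takes the chain index (e.g. ``PROT_A``), as in the other manipulation methods.

```python
def terminal_patch_kwargs(chains, chain_id, nter=None, cter=None):
    """Validate patches against get_chains() output and build keyword arguments."""
    protein = {entry["chainIndex"]: entry for entry in chains["protein"]}
    if chain_id not in protein:
        raise KeyError(f"no protein chain {chain_id!r}")
    kwargs = {"chain_id": chain_id}
    for end, patch in (("nter", nter), ("cter", cter)):
        if patch is None:
            continue  # omitted: the default patch would be used
        valid = protein[chain_id]["terminal"][end]
        if patch not in valid:
            raise ValueError(
                f"{patch!r} is not a valid {end} patch for {chain_id}; choose from {valid}"
            )
        kwargs[end] = patch
    return kwargs


# Trimmed copy of one protein entry from get_chains().
sample_chains = {
    "protein": [
        {"chainIndex": "PROT_A", "chainId": "A",
         "terminal": {"nter": ["NTER", "NNEU", "ACE", "NONE"],
                      "cter": ["CTER", "CNEU", "CT1", "CT2", "CT3", "NONE"]}},
    ],
}

kwargs = terminal_patch_kwargs(sample_chains, "PROT_A", nter="ACE", cter="CT3")
print(kwargs)
# With a live project, the validated arguments could be applied as:
#   pdbreader.set_terminal_patch(**kwargs)
```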

.. _mutations:

Mutations
---------

Point mutations are added with ``add_mutation()`` and removed with ``remove_mutation()``.

Expected arguments::

   add_mutation() args:
       chain_id (str): chain containing the residue to mutate
       resid (str): ID of the residue to mutate
       new_resname (str): name of the residue to mutate to

   remove_mutation() args:
       chain_id (str): chain containing the mutated residue
       resid (str): ID of the mutated residue

Example usage::

   import molcube as mc
   from pprint import pprint

   molcube = mc.API('api.molcube.com', 443)
   molcube.authenticate(api_token=api_access_key)

   # create the project from a local structure file
   pdbreader = molcube.create_pdb_reader_project()
   pdbreader.create_project(title='test-mutation', ff='charmmff', customPdb='files/2klu.cif')

   pdbreader.set_defaults()
   assert pdbreader.confirm_chains()

   pdbreader.add_mutation(chain_id='PROT_A', resid='364', new_resname='ASN')  # GLY 364 -> ASN
   pdbreader.add_mutation(chain_id='PROT_A', resid='365', new_resname='ALA')  # PRO 365 -> ALA

   assert pdbreader.model_pdb()

Phosphorylation and Protonation
-------------------------------

``add_phosphorylation()`` and ``add_protonation()`` take the same arguments and differ only in which patches are considered valid.

Args:

* chain_id (str): chain index, e.g. PROT_A
* resid (str): ID of the residue to phosphorylate or protonate
* patch (str): name of phosphorylation/titration patch to apply

You can find valid options in the dict returned by ``get_pdb_info()``::

   pdb_info = pdbreader.get_pdb_info()
   print('protonations:')
   pprint( pdb_info['titrableResidues'] )
   # protonations:
   # {'ARG': ['RN1', 'RN2', 'RN3'],
   #  'ASP': ['ASPP'],
   #  'CYS': ['CYM'],
   #  'GLU': ['GLUP'],
   #  'HIE': ['HSP', 'HSD', 'HSE'],
   #  'HIP': ['HSP', 'HSD', 'HSE'],
   #  'HIS': ['HSP', 'HSD', 'HSE'],
   #  'HSD': ['HSP', 'HSE'],
   #  'HSE': ['HSP', 'HSD'],
   #  'HSP': ['HSD', 'HSE'],
   #  'LYS': ['LSN']}

   print()
   print('phosphorylations:')
   pprint( pdb_info['phosphorylatableResidues'] )
   # phosphorylations:
   # {'SER': ['SP2', 'SP1'], 'THR': ['THPB',
   #  'THP1'], 'TYR': ['TP2', 'TP1']}
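
Both tables are keyed by residue name, so a patch can be checked locally before it is added. The sketch below uses a trimmed copy of the two tables. ``PATCH_TABLES`` and ``valid_patches`` are hypothetical names, and you need to know the residue name at the target ``resid`` yourself, since this section does not show how to look it up.

```python
# Which pdb_info table applies to which kind of manipulation.
PATCH_TABLES = {
    "protonation": "titrableResidues",
    "phosphorylation": "phosphorylatableResidues",
}


def valid_patches(pdb_info, resname, kind):
    """Return the patches listed for a residue name, or [] if none apply."""
    table = pdb_info[PATCH_TABLES[kind]]
    return list(table.get(resname, []))


# Trimmed copy of the two tables from get_pdb_info().
sample_pdb_info = {
    "titrableResidues": {"ASP": ["ASPP"], "GLU": ["GLUP"], "LYS": ["LSN"]},
    "phosphorylatableResidues": {
        "SER": ["SP2", "SP1"],
        "THR": ["THPB", "THP1"],
        "TYR": ["TP2", "TP1"],
    },
}

print(valid_patches(sample_pdb_info, "SER", "phosphorylation"))
print(valid_patches(sample_pdb_info, "ALA", "protonation"))
```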

Example usage::

   import molcube as mc

   molcube = mc.API('api.molcube.com', 443)
   molcube.authenticate(api_token=api_access_key)

   pdbreader = molcube.create_pdb_reader_project()
   assert pdbreader.create_project(title='test-phosphorylation-2klu', ff='charmmff', customPdb='files/2klu.cif')

   pdbreader.set_defaults()
   assert pdbreader.confirm_chains()

   pdbreader.add_phosphorylation(chain_id='PROT_A', resid='394', patch='SP1')
   pdbreader.add_phosphorylation(chain_id='PROT_A', resid='415', patch='SP1')
   pdbreader.add_phosphorylation(chain_id='PROT_A', resid='431', patch='SP1')
   assert pdbreader.model_pdb()

.. _ssbonds:

Disulfide bonds
---------------

Disulfide bonds present in the PDB are shown in ``pdb_info``::

   >>> pdb_info = pdbreader.get_pdb_info()
   >>> pdb_info['ssbonds']
   [{'residue1': {'chainIndex': 'PROT_A', 'resid': '2'},
     'residue2': {'chainIndex': 'PROT_B', 'resid': '2'}}]

Disulfide bonds are added with ``add_ssbond()`` and removed with ``remove_ssbond()``.

Both require the same arguments::

    residue1 (str | dict): first ssbond residue
    residue2 (str | dict): second ssbond residue

The two acceptable formats are shown below::

   # string format: "chain_id residue_id"
   pdbreader.add_ssbond('PROT_A 50', 'PROT_A 62')

   # if passing dicts, they must be structured as below
   pdbreader.add_ssbond(
       residue1={'chainIndex': 'PROT_A', 'resid': '50'},
       residue2={'chainIndex': 'PROT_A', 'resid': '62'})
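
To make the correspondence between the two formats concrete, the sketch below converts the string form into the dict form. It illustrates the mapping only and is not the client's own parsing code. ``residue_to_dict`` is a hypothetical helper.

```python
def residue_to_dict(residue):
    """Normalize 'CHAIN RESID' strings and dicts to the dict format."""
    if isinstance(residue, dict):
        return {"chainIndex": residue["chainIndex"], "resid": str(residue["resid"])}
    parts = residue.split()
    if len(parts) != 2:
        raise ValueError(f"expected 'chain_id residue_id', got {residue!r}")
    chain_index, resid = parts
    return {"chainIndex": chain_index, "resid": resid}


as_dict = residue_to_dict("PROT_A 50")
print(as_dict)
```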

Stapling
--------

Staples are added with ``add_staple()`` and removed with ``remove_staple()``. Usage is almost exactly like with disulfide bonds, except for an additional argument when adding a staple: the staple type.

To see the valid staple types, use ``get_valid_staples()``::

   >>> pdbreader.get_valid_staples()
   ['META3', 'META4', 'META5', 'META6', 'META7',
    'RMETA3', 'RMETA4', 'RMETA5', 'RMETA6',
    'RMETA7', 'DIBM', 'DIBP', 'CR12', 'CR21']

Example usage::

   import molcube as mc

   molcube = mc.API('api.molcube.com', 443)
   molcube.authenticate(api_token=api_access_key)

   pdbreader = molcube.create_pdb_reader_project()
   assert pdbreader.create_project(title='test-staple', ff='charmmff', customPdb='files/1ubq.pdb')

   pdbreader.set_defaults()
   assert pdbreader.confirm_chains()

   # staple type first, then the two residues in "chain_id residue_id" format
   pdbreader.add_staple('RMETA3', 'PROT_A 1', 'PROT_A 3')

   assert pdbreader.model_pdb()