Column Management in CustomDataFrame

This guide covers how to work with columns in CustomDataFrame, including adding, removing, and managing column types.

Understanding Column Specifications (colspecs)

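CustomDataFrame assigns every column a colspec type. The types fall into the groups below, and the groups overlap: parameters, for example, is both an output type and a parameter type.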

Input Types (Immutable)

Read-only data that shouldn’t be modified after loading. Types: input, data, constants, global_consts

Output Types (Mutable, Written to Files)

Data that will be written to output files. Types: output, writedata, writeseries, parameters, globals

Cache Types (Mutable, Temporary)

Intermediate calculations that don’t need to be saved. Types: cache, series_cache, parameter_cache, global_cache

Parameter Types (Global Values)

Values that apply globally and are not tied to an index value. Types: parameters, globals, parameter_cache, global_cache

Series Types (Multiple Rows Per Index)

Data that may have multiple rows with the same index value. Types: series, writeseries, series_cache
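Because the groups overlap, it can help to look a type up in all of them at once. The sketch below transcribes the five lists above into a plain dictionary; COLSPEC_GROUPS and groups_for are illustrative names, not part of the CustomDataFrame API.

```python
# Colspec groups transcribed from the lists above. The dictionary is an
# illustration only, not something CustomDataFrame is known to expose.
COLSPEC_GROUPS = {
    "input": {"input", "data", "constants", "global_consts"},
    "output": {"output", "writedata", "writeseries", "parameters", "globals"},
    "cache": {"cache", "series_cache", "parameter_cache", "global_cache"},
    "parameter": {"parameters", "globals", "parameter_cache", "global_cache"},
    "series": {"series", "writeseries", "series_cache"},
}


def groups_for(colspec):
    """Return the sorted names of every group that lists `colspec`."""
    return sorted(g for g, members in COLSPEC_GROUPS.items() if colspec in members)


print(groups_for("parameters"))    # ['output', 'parameter']
print(groups_for("series_cache"))  # ['cache', 'series']
print(groups_for("input"))         # ['input']
```

A type that belongs to two groups combines both behaviours: series_cache, for instance, is temporary and may hold multiple rows per index.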

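Adding Columns

Columns can be added in two ways, both used throughout this guide: add_column(name, values, colspec=...) when the colspec type matters, and direct assignment (df['col'] = value) for quick work. add_column() raises ValueError if the column already exists or if the colspec is not a valid type.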

Removing Columns

Use the drop_column() method to remove columns:

# Drop a single column
df.drop_column('unwanted_col')

# The column is removed from both the data and column specifications

Note: This permanently removes the column. There is no undo.

Common Patterns

Pattern 1: Pre-creating Columns for Compute Operations

Some compute operations expect their target columns to exist before they run:

# Pre-create columns that compute operations will populate
df['word_count'] = None
df['sentiment_score'] = None

# Then run compute operations that fill these columns
df.compute.run()
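Assigning None unconditionally would also overwrite a column that an earlier run has already filled. A minimal sketch of a guard against that follows. It assumes only that the frame supports `name in df.columns` and item assignment, as shown elsewhere in this guide; StubFrame is a hypothetical stand-in so the snippet runs without the library, and ensure_columns is an illustrative helper, not a CustomDataFrame method.

```python
class StubFrame:
    """Hypothetical stand-in for CustomDataFrame so the sketch runs alone."""

    def __init__(self):
        self._data = {}

    @property
    def columns(self):
        return list(self._data)

    def __setitem__(self, name, value):
        self._data[name] = value

    def __getitem__(self, name):
        return self._data[name]


def ensure_columns(df, names, fill=None):
    """Create each missing column with `fill`; leave existing ones untouched."""
    created = []
    for name in names:
        if name not in df.columns:
            df[name] = fill
            created.append(name)
    return created


df = StubFrame()
df["word_count"] = [3, 5, 2]  # already populated by an earlier run
created = ensure_columns(df, ["word_count", "sentiment_score"])
print(created)           # ['sentiment_score']
print(df["word_count"])  # [3, 5, 2]
```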

Pattern 2: Converting Column Types

To move a column to a different colspec type:

# Save the data
col_data = df['temp_col']

# Drop from current location
df.drop_column('temp_col')

# Re-add with new type
df.add_column('temp_col', col_data, colspec='output')
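The three steps can be wrapped in one helper. This is a sketch under stated assumptions: move_column is a hypothetical name, and StubFrame is a stand-in that mimics only the calls used above (item access, drop_column, add_column). Its colspecs attribute and its default colspec of 'cache' are invented for the demonstration and are not documented behaviour.

```python
class StubFrame:
    """Hypothetical stand-in for CustomDataFrame; tracks values and colspec."""

    def __init__(self):
        self._data = {}
        self.colspecs = {}  # assumed bookkeeping, not a documented attribute

    def __getitem__(self, name):
        return self._data[name]

    def add_column(self, name, values, colspec="cache"):  # default is assumed
        if name in self._data:
            raise ValueError(f"column {name!r} already exists")
        self._data[name] = values
        self.colspecs[name] = colspec

    def drop_column(self, name):
        del self._data[name]
        del self.colspecs[name]


def move_column(df, name, new_colspec):
    """Re-register `name` under `new_colspec` via save, drop, re-add."""
    col_data = df[name]                                 # save the data
    df.drop_column(name)                                # drop from current location
    df.add_column(name, col_data, colspec=new_colspec)  # re-add with new type


df = StubFrame()
df.add_column("temp_col", [1, 2, 3], colspec="cache")
move_column(df, "temp_col", "output")
print(df.colspecs["temp_col"])  # output
print(df["temp_col"])           # [1, 2, 3]
```

Since drop_column() has no undo, a failed re-add (for example, because of an invalid colspec) would lose the column. Checking the new colspec before calling the helper avoids that.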

Pattern 3: Working with Series Data (Multiple Rows Per Index)

For data with multiple rows per index value:

import pandas as pd

# Create a series with repeated indices
series_data = pd.Series([1, 2, 3, 4], index=['a', 'a', 'b', 'b'])

# This automatically goes to a series-type colspec
df['multi_value_col'] = series_data
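This guide does not describe how CustomDataFrame recognises series data, but the distinguishing feature is that an index label occurs more than once. The standard-library sketch below uses the same labels and values to show what "multiple rows per index" means. With pandas, the equivalent check would be `not series_data.index.is_unique`.

```python
from collections import defaultdict

# Same labels and values as the pd.Series above, kept as plain lists so the
# sketch needs only the standard library.
index = ["a", "a", "b", "b"]
values = [1, 2, 3, 4]

# Collect every value that shares an index label.
rows_per_index = defaultdict(list)
for label, value in zip(index, values):
    rows_per_index[label].append(value)

# Series data is present when at least one label maps to several rows.
has_repeats = any(len(rows) > 1 for rows in rows_per_index.values())

print(dict(rows_per_index))  # {'a': [1, 2], 'b': [3, 4]}
print(has_repeats)           # True
```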

Best Practices

  1. Use add_column() for clarity: When the column type matters, use add_column() with an explicit colspec parameter.

  2. Use direct assignment for quick work: For temporary calculations or when working interactively, direct assignment (df['col'] = value) is more concise.

  3. Choose the right colspec type:

    • Use input for source data that shouldn’t change

    • Use output for results you want to save

    • Use cache for intermediate calculations

    • Use parameters for global configuration values

  4. Validate before dropping: Check whether a column exists before dropping it, since dropping a missing column raises KeyError:

    if 'col_name' in df.columns:
        df.drop_column('col_name')
    

Common Errors and Solutions

Error: Column already exists

# ❌ This raises ValueError
df.add_column('existing_col', [1, 2, 3])

Solution: Use direct assignment to replace the values, or drop the column first and re-add it:

# ✅ Replace values
df['existing_col'] = [1, 2, 3]

# ✅ Or drop and re-add
df.drop_column('existing_col')
df.add_column('existing_col', [1, 2, 3], colspec='output')

Error: Column not found

# ❌ This raises KeyError
df.drop_column('nonexistent_col')

Solution: Check whether the column exists first:

# ✅ Check first
if 'nonexistent_col' in df.columns:
    df.drop_column('nonexistent_col')
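An alternative to checking first is to attempt the drop and catch the KeyError, which also reports whether anything was removed. The sketch assumes drop_column() raises KeyError for a missing column, as stated above. drop_if_present is a hypothetical helper, and StubFrame is a stand-in used only to make the snippet runnable.

```python
class StubFrame:
    """Hypothetical stand-in for CustomDataFrame so the sketch runs alone."""

    def __init__(self, **columns):
        self._data = dict(columns)

    @property
    def columns(self):
        return list(self._data)

    def drop_column(self, name):
        del self._data[name]  # raises KeyError when the column is missing


def drop_if_present(df, name):
    """Drop `name` and return True, or return False if it was not there."""
    try:
        df.drop_column(name)
    except KeyError:
        return False
    return True


df = StubFrame(temp_col=[1, 2, 3])
print(drop_if_present(df, "temp_col"))         # True
print(drop_if_present(df, "nonexistent_col"))  # False
print(df.columns)                              # []
```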

Error: Invalid colspec

# ❌ This raises ValueError
df.add_column('col', [1, 2, 3], colspec='invalid_type')

Solution: Use one of the valid colspec types listed at the top of this guide.
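A typo in the colspec name can be caught before add_column() is called. The sketch below checks a name against the types listed in this guide and suggests the closest match. VALID_COLSPECS and check_colspec are illustrative names; whether CustomDataFrame exposes its own list of valid types is not covered here.

```python
import difflib

# Every colspec type named in this guide (the groups overlap, hence a set).
VALID_COLSPECS = {
    "input", "data", "constants", "global_consts",
    "output", "writedata", "writeseries", "parameters", "globals",
    "cache", "series_cache", "parameter_cache", "global_cache",
    "series",
}


def check_colspec(colspec):
    """Return `colspec` if the guide lists it, else raise ValueError with a hint."""
    if colspec in VALID_COLSPECS:
        return colspec
    # Suggest the single closest valid name, if any is similar enough.
    close = difflib.get_close_matches(colspec, sorted(VALID_COLSPECS), n=1)
    hint = f" Did you mean {close[0]!r}?" if close else ""
    raise ValueError(f"Unknown colspec {colspec!r}.{hint}")


print(check_colspec("output"))  # output
try:
    check_colspec("ouput")
except ValueError as err:
    print(err)  # Unknown colspec 'ouput'. Did you mean 'output'?
```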